ICCS 2015 Main Track (MT) Session 4

Time and Date: 10:15 - 11:55 on 2nd June 2015

Room: M101

Chair: Sascha Hell

411	Point Distribution Tensor Computation on Heterogeneous Systems [abstract] Abstract: Big data in observational and computational sciences impose increasing challenges on data analysis. In particular, data from light detection and ranging (LIDAR) measurements are questioning conventional methods of CPU-based algorithms due to their sheer size and complexity as needed for decent accuracy. These data describing terrains are natively given as big point clouds consisting of millions of independent coordinate locations from which meaningful geometrical information content needs to be extracted. The method of computing the point distribution tensor is a very promising approach, yielding good results to classify domains in a point cloud according to local neighborhood information. However, an existing KD-Tree parallel approach, provided by the VISH visualization framework, may very well take several days to deliver meaningful results on a real-world dataset. Here we present an optimized version based on uniform grids implemented in OpenCL that is able to deliver results of equal accuracy up to 24 times faster on the same hardware. The OpenCL version is also able to benefit from a heterogeneous environment and we analyzed and compared the performance on various CPU, GPU and accelerator hardware platforms. Finally, aware of the heterogeneous computing trend, we propose two low-complexity dynamic heuristics for the scheduling of independent dataset fragments in multi-device heterogenous systems.	Ivan Grasso, Marcel Ritter, Biagio Cosenza, Werner Benger, Günter Hofstetter, Thomas Fahringer
465	Toward a multi-level parallel framework on GPU cluster with PetSC-CUDA for PDE-based Optical Flow computation [abstract] Abstract: In this work we present a multi-level parallel framework for the Optical Flow computation on a GPUs cluster, equipped with a scientific computing middleware (the PetSc library). Starting from a flow-driven isotropic method, which models the optical flow problem through a parabolic partial differential equation (PDE), we have designed a parallel algorithm and its software implementation that is suitable for heterogeneous computing environments (multiprocessor, single GPU and cluster of GPUs). The proposed software has been tested on real SAR images sequences. Experiments highlight the performances obtained and a gain of about 95% with respect to the sequential implementation.	Salvatore Cuomo, Ardelio Galletti, Giulio Giunta, Livia Marcellino
472	Performance Analysis and Optimisation of Two-Sided Factorization Algorithms for Heterogeneous Platform [abstract] Abstract: Many applications, ranging from big data analytics to nanostructure designs, require the solution of large dense singular value decomposition (SVD) or eigenvalue problems. A first step in the solution methodology for these problems is the reduction of the matrix at hand to condensed form by two-sided orthogonal transformations. This step is standardly used to significantly accelerate the solution process. We present a performance analysis of the main two-sided factorizations used in these reductions: the bidiagonalization, tridiagonalization, and the upper Hessenberg factorizations on heterogeneous systems of multicore CPUs and Xeon Phi coprocessors. We derive a performance model and use it to guide the analysis and to evaluate performance. We develop optimized implementations for these methods that get up to $80\%$ of the optimal performance bounds. Finally, we describe the heterogeneous multicore and coprocessor development considerations and the techniques that enable us to achieve these high-performance results. The work here presents the first highly optimized implementation of these main factorizations for Xeon Phi coprocessors. Compared to the LAPACK versions optmized by Intel for Xeon Phi (in MKL), we achieve up to $50\%$ speedup.	Khairul Kabir, Azzam Haidar, Stanimire Tomov, Jack Dongarra
483	High-Speed Exhaustive 3-locus Interaction Epistasis Analysis on FPGAs [abstract] Abstract: Epistasis, the interaction between genes, has become a major topic in molecular and quantitative genetics. It is believed that these interactions play a significant role in genetic variations causing complex diseases. Several algorithms have been employed to detect pairwise interactions in genome-wide association studies (GWAS) but revealing higher order interactions remains a computationally challenging task. State of the art tools are not able to perform exhaustive search for all three-locus interactions in reasonable time even for relatively small input datasets. In this paper we present how a hardware-assisted design can solve this problem and provide fast, efficient and exhaustive third-order epistasis analysis with up-to-date FPGA technology.	Jan Christian Kässens, Lars Wienbrandt, Jorge González-Domínguez, Bertil Schmidt and Manfred Schimmler
487	Evaluating the Potential of Low Power Systems for Headphone-based Spatial Audio Applications [abstract] Abstract: Embedded architectures have been traditionally designed tailored to perform a dedicated (specialized) function, and in general feature a limited amount of processing resources as well as exhibit very low power consumption. In this line, the recent introduction of systems-on-chip (SoC) composed of low power multicore processors, combined with a small graphics accelerator (or GPU), presents a notable increment of the computational capacity while partially retaining the appealing low power consumption of embedded systems. This paper analyzes the potential of these new hardware systems to accelerate applications that integrate spatial information into an immersive audiovisual virtual environment or into video games. Concretely, our work discusses the implementation and performance evaluation of a headphone-based spatial audio application on the Jetson TK1 development kit, a board equipped with a SoC comprising a quad-core ARM processor and an NVIDIA "Kepler" GPU. Our implementations exploit the hardware parallelism of both types of architectures by carefully adapting the underlying numerical computations. The experimental results show that the accelerated application is able to move up to 300 sound sources simultaneously in real time on this platform.	Jose A. Belloch, Alberto Gonzalez, Rafael Mayo, Antonio M. Vidal, Enrique S. Quintana-Orti