ICCS 2017 Main Track (MT) Session 13

Time and Date: 9:00 - 10:40 on 14th June 2017

Room: HG D 1.1

Chair: Michael Kirby

194	cuHines: Solving Multiple (Batched) Hines systems on NVIDIA GPUs. Human Brain Project [abstract] Abstract: The simulation of the behavior of the Human Brain is one of the most important challenges today in computing. The main problem consists of finding efficient ways to manipulate and compute the huge volume of data that this kind of simulation needs, using the current technology. In this sense, this work is focused on one of the main steps of such simulation, which consists of computing the Ca capacitance on neurons’ morphology. This is carried out using the Hines Algorithm. Although this algorithm is the optimum method in terms of number of operations, it needs non-trivial modifications to be efficiently parallelized on NVIDIA GPUs. We proposed several optimizations to accelerate this algorithm on GPU-based architectures, exploring the limitations of both method and architecture, to be able to solve efficiently a high number of Hines systems (neurons). Each of the optimizations is analyzed and described in depth. To evaluate the impact of the optimizations on real inputs, we have used 6 different morphologies in terms of size and branches. Our studies have shown that the optimizations proposed in the present work can achieve high performance on computations with a high number of neurons, with our GPU implementations being about 4× and 8× faster than the OpenMP multicore implementation (16 cores), using one and two K80 NVIDIA GPUs respectively. Also, it is important to highlight that these optimizations can continue scaling even when dealing with a high number of neurons.	Pedro Valero-Lara, Ivan Martínez-Pérez, Antonio J. Peña, Xavier Martorell, Raül Sirvent and Jesús Labarta
213	Exploiting Hybrid Parallelism in the Kinematic Analysis of Multibody Systems Based on Group Equations [abstract] Abstract: Computational kinematics is a fundamental tool for the design, simulation, control, optimization and dynamic analysis of multibody systems - mechanical systems whose bodies are connected by joints which allow relative movement. The analysis of complex multibody systems and the need for real time solutions require the development of kinematic and dynamic formulations that reduce computational cost, the selection and efficient use of the most appropriate solvers, and the exploitation of all the computer resources using parallel computing techniques. The topological approach based on group equations and natural coordinates reduces the computation time in comparison with well-known global formulations and enables the use of parallelism techniques which can be applied at different levels: simultaneous solution of equations, use of multithreading routines for each equation, or a combination of both. This paper studies and compares this topological formulation and these parallel techniques to ascertain which combination performs better in two applications. The first application is the use of dedicated systems for the real time control of small multibody systems, defined by a small number of equations and small linear systems, so shared-memory parallelism in combination with linear algebra routines is analyzed on a small multicore and on a Raspberry Pi. The control of a Stewart platform is used as a case study. The second application is the study of large multibody systems in which the kinematic analysis must be performed several times during the design of multibody systems. A simulator which allows us to control the formulation, the solver, the parallel techniques and size of the problem has been developed and tested in more powerful computational systems with larger multicores and GPU.	Gregorio Bernabe, Jose-Carlos Cano, Domingo Gimenez, Javier Cuenca, Antonio Flores, Mariano Saura-Sanchez and Pablo Segado-Cabezos
209	On the Use of a GPU-Accelerated Mobile Device Processor for Sound Source Localization [abstract] Abstract: The growing interest in incorporating new features into mobile devices has increased the number of signal processing applications running on processors designed for mobile computing. A challenging signal processing field is acoustic source localization, which is attractive for applications such as automatic camera steering systems, human-machine interfaces, video gaming or audio surveillance. In this context, the emergence of systems-on-chip (SoC) that contain a small graphics accelerator (or GPU) provides a notable increase in computational capacity while partially retaining the appealing low power consumption of embedded systems. This is the case, for example, with the Samsung Exynos 5422 SoC, which includes a Mali-T628 MP6 GPU. This work evaluates an OpenCL-based implementation of a method for sound source localization, namely, the Steered-Response Power with Phase Transform (SRP-PHAT) algorithm, on GPUs of this type. The results show that the proposed implementation can work in real time with high-resolution spatial grids using up to 12 microphones.	Jose A. Belloch, Jose M. Badia, Francisco D. Igual, Maximo Cobos and Enrique S. Quintana-Ortí
379	Fast Genome-Wide Third-order SNP Interaction Tests with Information Gain on a Low-cost Heterogeneous Parallel FPGA-GPU Computing Architecture [abstract] Abstract: Complex diseases may result from many genetic variants interacting with each other. For this reason, genome-wide interaction studies (GWIS) are currently performed to detect pairwise SNP interactions. While the computations required here can be completed within reasonable time, it has so far been inconvenient to detect third-order SNP interactions for large-scale datasets due to the cubic complexity of the problem. In this paper we introduce a feasible method for third-order GWIS analysis of genotyping data on a low-cost heterogeneous computing system that combines a Virtex-7 FPGA and a GeForce GTX 780 Ti GPU, with speedups between 70 and 90 against a CPU-only approach and a speedup of approx. 5 against a GPU-only approach. To estimate effect sizes of third-order interactions we employed information gain (IG), a measure that has so far been applied on a genome-wide scale only for pairwise interactions in the literature.	Lars Wienbrandt, Jan Christian Kässens, Matthias Hübenthal and David Ellinghaus
459	Factorization and Inversion of a Million Matrices using GPUs: Challenges and Countermeasures [abstract] Abstract: This paper presents new algorithmic approaches and optimization techniques for LU factorization and matrix inversion of millions of very small matrices using GPUs. These problems appear in many scientific applications including astrophysics and generation of block-Jacobi preconditioners. We show that, for very small problem sizes, design and optimization of GPU kernels require a mindset different from the one usually used when designing LAPACK algorithms for GPUs. Techniques for optimal memory traffic, register blocking, and tunable concurrency are incorporated in our proposed design. We also take advantage of the small matrix sizes to eliminate the intermediate row interchanges in both the factorization and inversion kernels. The proposed GPU kernels achieve performance speedups vs. CUBLAS of up to 6x for the factorization, and 14x for the inversion, using double precision arithmetic on a Pascal P100 GPU.	Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov and Jack Dongarra