ICCS 2016 Main Track (MT) Session 3

Time and Date: 16:40 - 18:20 on 6th June 2016

Room: KonTiki Ballroom

Chair: Andrea Zonca

369 A Performance Characterization of Streaming Computing on Supercomputers
Abstract: Streaming computing models allow for on-the-fly processing of large data sets. With the increasing demand for processing large amounts of data in a reasonable period of time, streaming models are more and more used on supercomputers to solve data-intensive problems. Because supercomputers have mainly been used for compute-intensive workloads, supercomputer performance metrics focus on floating-point operations per second and cannot fully characterize the performance of a streaming application on a supercomputer. We introduce the injection rate and the processing rate as the main metrics for characterizing the performance of streaming computing on supercomputers. We analyze the dynamics of these quantities in a modified STREAM benchmark, developed atop an MPI streaming library, in a series of different configurations. We show that after a brief transient the injection and processing rates converge to sustained rates. We also demonstrate that streaming computing performance depends strongly on the number of connections between data producers and consumers and on the granularity of the processing task. (An illustrative sketch of the two rates appears after the author list below.)
Stefano Markidis, Ivy Bo Peng, Roman Iakymchuk, Erwin Laure, Gokcen Kestor, Roberto Gioiosa
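
The two metrics can be made concrete with a toy MPI producer/consumer, shown below. This is not the authors' MPI streaming library; the message count, payload size, and the trivial per-message processing task are illustrative assumptions. The injection rate is messages sent per second at the producer; the processing rate is messages consumed and processed per second at the consumer.

```cuda
// Minimal host-only MPI sketch of injection vs. processing rates.
// NOT the paper's streaming library; sizes below are assumptions.
// Run with at least two ranks, e.g.: mpirun -np 2 ./stream_rates
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int kMessages = 10000;     // number of stream elements (assumed)
    const int kDoubles  = 1024;      // payload per message (assumed)
    std::vector<double> buf(kDoubles, 1.0);

    double t0 = MPI_Wtime();
    if (rank == 0) {                 // producer: injects messages
        for (int i = 0; i < kMessages; ++i)
            MPI_Send(buf.data(), kDoubles, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        printf("injection rate: %.0f msgs/s\n",
               kMessages / (MPI_Wtime() - t0));
    } else if (rank == 1) {          // consumer: processes messages
        double sum = 0.0;
        for (int i = 0; i < kMessages; ++i) {
            MPI_Recv(buf.data(), kDoubles, MPI_DOUBLE, 0, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            for (int j = 0; j < kDoubles; ++j) sum += buf[j]; // toy task
        }
        printf("processing rate: %.0f msgs/s (checksum %.1f)\n",
               kMessages / (MPI_Wtime() - t0), sum);
    }
    MPI_Finalize();
    return 0;
}
```

With more producer-consumer connections and heavier per-message work, the two rates diverge before settling, which is the transient-to-sustained behavior the paper studies.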
35 High-Performance Tensor Contractions for GPUs
Abstract: We present a computational framework for high-performance tensor contractions on GPUs. High performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions, together with application-specific knowledge, we demonstrate close-to-peak performance. In particular, to accelerate large-scale tensor-formulated high-order finite element method (FEM) simulations, which are the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor in achieving algorithmically many-fold acceleration (vs. not using it) due to the possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU, for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface. (A sketch of the batched-GEMM idea follows the author list below.)
Ahmad Abdelfattah, Marc Baboulin, Veselin Dobrev, Jack Dongarra, Christopher Earl, Joel Falcou, Azzam Haidar, Ian Karlin, Tzanio Kolev, Ian Masliah, Stanimire Tomov
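
As a rough illustration of the "index reordering plus GEMMs" strategy, the hedged sketch below batches many size-8 matrix multiplications into a single cuBLAS call (cublasDgemmBatched). The paper's framework goes well beyond this (custom kernels, autotuning, tensor interfaces); the sizes and the cuBLAS route here are assumptions for illustration only.

```cuda
// Hedged sketch: many tiny independent contractions, already reordered into
// n x n GEMMs, executed as one batched cuBLAS call. Not the paper's kernels.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int n = 8, batch = 1000;   // small matrices, many of them (assumed)
    cublasHandle_t h;
    cublasCreate(&h);

    // One contiguous slab per operand; contraction i uses the i-th n*n slice.
    double *A, *B, *C;
    cudaMalloc(&A, sizeof(double) * n * n * batch);
    cudaMalloc(&B, sizeof(double) * n * n * batch);
    cudaMalloc(&C, sizeof(double) * n * n * batch);

    // cublasDgemmBatched takes a device-resident array of per-matrix pointers.
    std::vector<const double*> hA(batch), hB(batch);
    std::vector<double*> hC(batch);
    for (int i = 0; i < batch; ++i) {
        hA[i] = A + i * n * n;
        hB[i] = B + i * n * n;
        hC[i] = C + i * n * n;
    }
    const double **dA, **dB; double **dC;
    cudaMalloc(&dA, batch * sizeof(double*));
    cudaMalloc(&dB, batch * sizeof(double*));
    cudaMalloc(&dC, batch * sizeof(double*));
    cudaMemcpy(dA, hA.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);

    const double alpha = 1.0, beta = 0.0;
    // C_i = A_i * B_i for all i in one launch: batching is what makes
    // sub-warp-sized contractions efficient on a GPU.
    cublasDgemmBatched(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                       &alpha, dA, n, dB, n, &beta, dC, n, batch);

    cudaDeviceSynchronize();
    cublasDestroy(h);
    // (data initialization and error checking omitted for brevity)
    return 0;
}
```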
52 Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs
Abstract: Solving a large number of relatively small linear systems has recently drawn more attention in the HPC community, due to the importance of such computational workloads in many scientific applications, including sparse multifrontal solvers. Modern hardware accelerators and their architectures require a set of optimization techniques that are very different from those used when solving one relatively large matrix. To expose concurrency on such throughput-oriented architectures, a common practice is to batch the solution of these matrices as one task offloaded to the underlying hardware, rather than solving them individually. This paper presents a high-performance batched Cholesky factorization of large sets of relatively small matrices on Graphics Processing Units (GPUs), and addresses both fixed-size and variable-size batched problems. We investigate various algorithm designs and optimization techniques, and show that it is essential to combine kernel design with performance tuning in order to achieve the best possible performance. We compare our approaches against state-of-the-art CPU solutions as well as GPU-based solutions using existing libraries, and show that, on a K40c GPU for example, our kernels are more than 2x faster. (A sketch of the batching idea follows the author list below.)
Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov, Jack Dongarra
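
A minimal sketch of the batching idea follows, assuming a fixed size N=8 and a naive one-thread-per-matrix mapping. The paper's tuned kernels are far more sophisticated (shared-memory blocking, variable sizes, kernel/tuning co-design); this only shows why a whole batch maps naturally onto a single GPU launch.

```cuda
// Hedged sketch of batched Cholesky: one GPU thread factors one small SPD
// matrix with an unblocked algorithm. N and the thread-per-matrix mapping
// are assumptions; the paper's kernels are considerably more advanced.
#include <cuda_runtime.h>
#include <cmath>

constexpr int N = 8;  // fixed matrix size for the fixed-size batch (assumed)

// A holds `batch` column-major N x N SPD matrices back to back; the lower
// triangle of each is overwritten by its Cholesky factor L (A = L * L^T).
__global__ void batched_cholesky(double* A, int batch) {
    int m = blockIdx.x * blockDim.x + threadIdx.x;  // which matrix
    if (m >= batch) return;
    double* a = A + (size_t)m * N * N;
    for (int j = 0; j < N; ++j) {
        double d = a[j + j * N];
        for (int k = 0; k < j; ++k) d -= a[j + k * N] * a[j + k * N];
        d = sqrt(d);
        a[j + j * N] = d;
        for (int i = j + 1; i < N; ++i) {
            double s = a[i + j * N];
            for (int k = 0; k < j; ++k) s -= a[i + k * N] * a[j + k * N];
            a[i + j * N] = s / d;
        }
    }
}

// Launch example: batched_cholesky<<<(batch + 127) / 128, 128>>>(dA, batch);
```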
143 Performing Unstructured Grid Flux Calculations on GPUs
Abstract: The Finite Volume Method (FVM) is a numerical approach for the approximate solution of Partial Differential Equations (PDEs) on discretized volumetric fields. Accurate solutions of PDEs derived from continuum mechanics, especially over complex fields, require structured or unstructured meshes with an ever increasing number of computational volumes. Computing solutions with the Finite Volume Method, particularly solutions to time-dependent equations, can occupy thousands of supercomputer cores for months. With their increased computational and memory throughput, Graphics Processing Units (GPUs) have the potential to improve on current implementations, decreasing the time to solution of FVM computations. Using a model equation, we show that GPUs can improve the performance of OpenFOAM, an open-source computational continuum mechanics toolbox. It is shown herein that a single NVIDIA Tesla K20 achieves 3-10 times the performance of all 10 cores of an Intel Xeon E5-2670 v2. (A sketch of a face-based flux kernel follows the author list below.)
Matthew Conley, Christian Sarofeen, Hua Shan, Justin Williams
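
The face-based scatter at the heart of unstructured FVM flux calculations can be sketched as below. This is a generic pattern, not OpenFOAM's actual GPU implementation; the field names and the first-order upwind flux are assumptions.

```cuda
// Hedged sketch of an unstructured FVM flux update: one thread per face
// computes a flux and scatters it to the face's owner/neighbour cells.
// Generic pattern only; not OpenFOAM's GPU port.
#include <cuda_runtime.h>

__global__ void face_flux(const int* owner, const int* neigh, // face -> cell
                          const double* phiFace,  // volumetric flux per face
                          const double* T,        // cell-centred scalar field
                          double* residual, int nFaces) {
    int f = blockIdx.x * blockDim.x + threadIdx.x;
    if (f >= nFaces) return;
    int o = owner[f], n = neigh[f];
    // First-order upwind convective flux through face f (assumed scheme).
    double flux = phiFace[f] * (phiFace[f] > 0.0 ? T[o] : T[n]);
    // Unstructured connectivity means neighbouring threads touch arbitrary
    // cells, so accumulation must be atomic; this scatter is a central GPU
    // performance issue for unstructured meshes. Note that double-precision
    // atomicAdd requires sm_60+; older GPUs need a CAS loop or face colouring.
    atomicAdd(&residual[o], -flux);
    atomicAdd(&residual[n],  flux);
}
```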
255 Adaptive Multi-level Blocking Optimization for Sparse Matrix Vector Multiplication on GPU
Abstract: Sparse matrix vector multiplication (SpMV) is the dominant kernel in many scientific simulations. Many-core processors such as GPUs accelerate SpMV computations with higher parallelism and memory bandwidth than CPUs; however, even on many-core processors the performance of SpMV is still strongly limited by memory bandwidth, and the low locality of memory accesses to the input vector causes further performance degradation. We propose a new sparse matrix format, the Adaptive Multi-level Blocking (AMB) format, which aggressively reduces the memory traffic of SpMV computation to improve performance. Through several optimization techniques, such as division and blocking of the given matrix, the column indices are compressed and the reusability of input-vector elements in the cache is greatly improved. An auto-tuning mechanism determines the best set of parameters for each matrix by estimating the memory traffic and predicting the performance of a given SpMV computation. For 32 matrix datasets taken from the University of Florida Sparse Matrix Collection, the AMB format achieves speedups of up to 2.92x over NVIDIA's cuSPARSE library and up to 1.40x over yaSpMV, a recently proposed library that has been the fastest known for SpMV computation to date. (A baseline CSR kernel, for contrast with AMB's traffic reduction, follows the author list below.)
Yusuke Nagasaka, Akira Nukada, Satoshi Matsuoka
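
For context, a textbook CSR SpMV kernel (one thread per row) is sketched below. It is not the AMB format; it merely exposes the two memory streams AMB compresses: the column-index array and the irregular gathers from the input vector.

```cuda
// Hedged baseline: textbook CSR SpMV, one thread per row. Shown only to
// make AMB's target visible; AMB's blocking/compression is in the paper.
#include <cuda_runtime.h>

__global__ void csr_spmv(int nRows, const int* rowPtr, const int* colIdx,
                         const double* val, const double* x, double* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nRows) return;
    double sum = 0.0;
    for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
        // colIdx[j] costs 4 bytes of traffic per nonzero, and x[colIdx[j]]
        // is an irregular, cache-unfriendly gather: compressing indices and
        // blocking for x-reuse is exactly where AMB saves bandwidth.
        sum += val[j] * x[colIdx[j]];
    y[row] = sum;
}
```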