Main Track (MT) Session 4
Time and Date: 14:10 - 15:50 on 11th June 2014
Chair: Y. Cui
|222|| GPU Optimization of Pseudo Random Number Generators for Random Ordinary Differential Equations [abstract]
Abstract: Solving differential equations with stochastic terms involves a massive use of pseudo random numbers. We present an application for the simulation of wireframe buildings under stochastic earthquake excitation. The inherent potential for vectorization of the application is used to its full extent on GPU accelerator hardware. A representative set of pseudo random number generators for uniformly and normally distributed pseudo random numbers has been implemented, optimized, and benchmarked. The resulting optimized variants outperform standard library implementations on GPUs. The techniques and improvements shown in this contribution using the Kanai-Tajimi model can be generalized to other random differential equations or stochastic models as well as other accelerators.
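As a rough illustration of the kind of simulation the abstract describes, the sketch below integrates many sample paths of Kanai-Tajimi filtered ground noise using NumPy's counter-based Philox generator, a PRNG family that vectorizes well and is also common on GPUs. This is a minimal CPU-side sketch; the parameter values, function name, and Euler-Maruyama discretization are illustrative assumptions, not taken from the paper or its GPU implementation.

```python
import numpy as np

def kanai_tajimi_paths(n_paths=1024, n_steps=2000, dt=0.005,
                       omega_g=15.6, zeta_g=0.6, sigma=1.0, seed=42):
    """Simulate sample paths of the Kanai-Tajimi ground-motion filter,
    a linear oscillator driven by white noise (illustrative parameters)."""
    rng = np.random.Generator(np.random.Philox(seed))  # counter-based PRNG
    x = np.zeros(n_paths)                 # filter displacement, one per path
    v = np.zeros(n_paths)                 # filter velocity
    accel = np.empty((n_steps, n_paths))  # filtered ground acceleration
    sqrt_dt = np.sqrt(dt)
    for n in range(n_steps):
        xi = rng.standard_normal(n_paths)            # one normal draw per path
        a = -(2.0 * zeta_g * omega_g * v + omega_g**2 * x)
        accel[n] = a
        x = x + v * dt                               # Euler-Maruyama step
        v = v + a * dt + sigma * sqrt_dt * xi        # stochastic forcing
    return accel

paths = kanai_tajimi_paths()
print(paths.shape)  # (2000, 1024)
```

Because Philox is counter-based, each step's random draws are independent of program state, which is one reason this generator family maps well onto massively parallel hardware.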
|Christoph Riesinger, Tobias Neckel, Florian Rupp, Alfredo Parra Hinojosa, Hans-Joachim Bungartz|
|229|| Design and Implementation of Hybrid and Native Communication Devices for Java HPC [abstract]
Abstract: MPJ Express is a messaging system that allows computational scientists to write and execute parallel Java applications on High Performance Computing (HPC) hardware. The software is capable of executing in two modes, namely the cluster and multicore modes. In the cluster mode, parallel applications execute in a typical cluster environment where multiple processing elements communicate with one another using a fast interconnect such as Gigabit Ethernet or proprietary networks like Myrinet and InfiniBand. In this context, the MPJ Express library provides communication devices for Ethernet and Myrinet. In the multicore mode, the parallel Java application executes on a single system comprising shared-memory or multicore processors. In this paper, we extend the MPJ Express software with two new communication devices, namely the native and hybrid devices. The goal of the native communication device is to interface the MPJ Express software with native MPI libraries, typically written in C. In this setting the bulk of the messaging logic is offloaded to the underlying MPI library. This is attractive because MPJ Express can exploit the latest features of the native MPI library, such as support for new interconnects and efficient collective communication algorithms. The second device, called the hybrid device, is developed to allow efficient execution of parallel Java applications on clusters of shared-memory or multicore processors. In this setting the MPJ Express runtime system runs a single multithreaded process on each node of the cluster; the number of threads in each process equals the number of processing elements within the node. Our performance evaluation reveals that the native device allows MPJ Express to achieve performance comparable to native MPI libraries, for latency and bandwidth of point-to-point and collective communications, which is a significant gain over existing communication devices.
The hybrid communication device, without any modifications at the application level, also helps parallel applications achieve better speedups and scalability. We observed comparable performance for various benchmarks, including the NAS Parallel Benchmarks, with the hybrid device as compared to the existing Ethernet communication device on a cluster of shared-memory/multicore processors.
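The core idea of the hybrid device, one process per node with one thread per core and intra-node messages exchanged through shared memory rather than the network, can be sketched in a toy form. This is a hypothetical model for illustration only; the `Node`, `send`, and `recv` names are invented here and are not MPJ Express APIs.

```python
import queue
import threading

class Node:
    """Toy 'node': threads stand in for cores, queues for shared memory."""
    def __init__(self, n_threads):
        self.inboxes = [queue.Queue() for _ in range(n_threads)]

    def send(self, dst, msg):
        # Intra-node: no serialization or network hop, just a queue put.
        self.inboxes[dst].put(msg)

    def recv(self, rank):
        return self.inboxes[rank].get()

def worker(node, rank, n_threads, results):
    # Simple ring exchange among the node's threads.
    node.send((rank + 1) % n_threads, f"hello from {rank}")
    results[rank] = node.recv(rank)

n_threads = 4
node = Node(n_threads)
results = [None] * n_threads
threads = [threading.Thread(target=worker, args=(node, r, n_threads, results))
           for r in range(n_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # each rank received the message from its left neighbour
```

In the real system, a second tier of devices would carry inter-node traffic over the network, while this in-process path avoids it entirely, which is where the speedups over a pure Ethernet device come from.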
|Bibrak Qamar, Ansar Javed, Mohsan Jameel, Aamir Shafi, Bryan Carpenter|
|231|| Deploying a Large Petascale System: the Blue Waters Experience [abstract]
Abstract: Deployment of a large parallel system is typically a very complex process, involving several steps of preparation, delivery, installation, testing and acceptance. Although various petascale machines are currently in operation, the steps and lessons from their deployments are rarely described in the literature. This paper presents the experience gained during the deployment of Blue Waters, the largest supercomputer ever built by Cray and one of the most powerful machines currently available for open science. The presentation focuses on the final deployment steps, in which the system was intensively tested and accepted by NCSA. After a brief introduction of the Blue Waters architecture, a detailed description of the set of acceptance tests employed is provided, including many of the obtained results. This is followed by the major lessons learned during the process. Those experiences and lessons should be useful in guiding similarly complex deployments in the future.
|Celso Mendes, Brett Bode, Gregory Bauer, Jeremy Enos, Cristina Beldica, William Kramer|
|248|| FPGA-based acceleration of detecting statistical epistasis in GWAS [abstract]
Abstract: Genotype-by-genotype interactions (epistasis) are believed to be a significant source of the unexplained genetic variation underlying complex chronic diseases, but have been ignored in genome-wide association studies (GWAS) due to the computational burden of the analysis. In this work we show how to benefit from FPGA technology for the highly parallel creation of contingency tables in a systolic chain, with a subsequent statistical test. We present the implementation for the FPGA-based hardware platform RIVYERA S6-LX150 containing 128 Xilinx Spartan6-LX150 FPGAs. For performance evaluation we compare against the method iLOCi, which claims to outperform other available tools in terms of accuracy. However, analysis of a dataset from the Wellcome Trust Case Control Consortium (WTCCC) with about 500,000 SNPs and 5,000 samples still takes about 19 hours on a MacPro workstation with two Intel Xeon quad-core CPUs, while our FPGA-based implementation requires only 4 minutes.
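The counting step the paper parallelizes on FPGAs can be sketched in software: for one SNP pair, histogram the nine genotype combinations against case/control status and apply a statistical test. The sketch below uses a plain chi-square test as a stand-in; it is not the paper's systolic implementation nor the iLOCi statistic, and the function name and coding (0/1/2 genotypes) are illustrative assumptions.

```python
import numpy as np

def pairwise_chi2(snp_a, snp_b, pheno):
    """Chi-square statistic for a 9-combination x 2-phenotype contingency
    table of one SNP pair. Genotypes coded 0/1/2; pheno 0=control, 1=case."""
    cell = 3 * snp_a + snp_b                 # 0..8 genotype combination
    table = np.zeros((9, 2))
    np.add.at(table, (cell, pheno), 1)       # histogram the samples
    # Expected counts under independence of combination and phenotype.
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / table.sum()
    mask = expected > 0
    return float(((table - expected) ** 2
                  / np.where(mask, expected, 1))[mask].sum())

rng = np.random.default_rng(0)
a = rng.integers(0, 3, 5000)     # simulated genotypes for SNP A
b = rng.integers(0, 3, 5000)     # simulated genotypes for SNP B
y = rng.integers(0, 2, 5000)     # simulated case/control labels
print(pairwise_chi2(a, b, y))
```

A GWAS-scale run repeats this for billions of SNP pairs, which is why the table-building step dominates and is the natural target for systolic hardware acceleration.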
|Lars Wienbrandt, Jan Christian Kässens, Jorge González-Domínguez, Bertil Schmidt, David Ellinghaus, Manfred Schimmler|