ICCS 2016 Main Track (MT) Session 2

Time and Date: 14:30 - 16:10 on 6th June 2016

Room: KonTiki Ballroom

Chair: Maria Indrawan

115 EMINENT: EMbarrassINgly parallEl mutatioN Testing [abstract]
Abstract: During the last decade, the fast evolution of communication networks has facilitated the development of complex applications that manage vast amounts of data, like Big Data applications. Unfortunately, the high complexity of these applications hampers the testing process. Moreover, generating adequate test suites to properly check these applications is a challenging task due to the large number of potential test cases. Mutation testing is a valuable technique for measuring the quality of the selected test suite and can be used to overcome this difficulty. However, one of the main drawbacks of mutation testing lies in the high computational cost associated with the process. In this paper we propose a dynamic distributed algorithm for HPC systems, called EMINENT, designed to address the performance problems of mutation testing techniques. EMINENT alleviates the computational cost associated with this technique by exploiting parallelism in cluster systems to reduce the final execution time. In addition, several experiments have been carried out on three applications in order to analyse the scalability and performance of EMINENT. The results show that EMINENT provides speed-up gains in most scenarios.
Pablo C. Cañizares, Mercedes G. Merayo, Alberto Núñez
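Mutation testing is embarrassingly parallel because each mutant's test run is independent of every other. A minimal toy sketch of that core idea (the function, mutants, and tests below are made-up illustrations, not EMINENT's actual mutant generation or distribution machinery):

```python
from concurrent.futures import ThreadPoolExecutor

# Original function under test plus hand-written "mutants" (hypothetical
# examples; a real tool generates mutants automatically).
def original(a, b):
    return a + b

MUTANTS = {
    "arith_sub": lambda a, b: a - b,   # '+' mutated to '-'
    "arith_mul": lambda a, b: a * b,   # '+' mutated to '*'
    "swap_args": lambda a, b: b + a,   # equivalent mutant: survives
}

TEST_CASES = [((2, 3), 5), ((0, 7), 7)]

def is_killed(fn):
    """A mutant is 'killed' when at least one test case fails on it."""
    return any(fn(*args) != expected for args, expected in TEST_CASES)

def mutation_score(mutants, workers=4):
    # Each mutant's test run is independent, so runs can be dispatched
    # in parallel -- the embarrassingly parallel structure EMINENT
    # exploits across cluster nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        killed = list(pool.map(is_killed, mutants.values()))
    return sum(killed) / len(killed)

print(mutation_score(MUTANTS))  # 2 of 3 mutants killed -> 0.666...
```

In a cluster setting the thread pool would be replaced by work distribution across nodes, which is where dynamic load balancing matters.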
386 CHiS: Compressed Hierarchical Schur Linear System Solver for 3D FDFD Photonic Device Analysis with Hardware Acceleration [abstract]
Abstract: Finite-difference frequency-domain (FDFD) analysis of wave optics and photonics requires a linear system solver for the discretized vector Helmholtz equation. The linear system can be ill-conditioned when the computation domain is large or perfectly-matched layers (PMLs) are used. Direct factorization of the linear systems for 3D photonic simulation may require a tremendous amount of computational resources. We propose the compressed hierarchical Schur method (CHiS) to save computation time and memory usage. The results show that the CHiS method takes 45% less factorization time and uses 35% less memory compared with the uncompressed hierarchical Schur method in selected tests. The computing procedure also involves many dense linear algebra operations, which can be executed efficiently on modern high-performance hardware such as graphics processing units (GPUs) and multicore/manycore processors. We investigate a GPU acceleration strategy and hardware tuning by rescheduling the factorization. The proposed CHiS is also tested on a dual-GPU server for performance analysis. These new techniques can efficiently utilize modern high-performance environments and greatly accelerate the future development of photonic devices and circuits.
Cheng-Han Du and Weichung Wang
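The Schur complement at the heart of such hierarchical solvers eliminates interior unknowns first, leaving a smaller system on the interface. A scalar toy sketch of one elimination step (scalars stand in for the dense blocks that CHiS factorizes and compresses; this is an illustration of the block-elimination idea, not the authors' algorithm):

```python
# Solve the 2x2 block system
#   [a b][x]   [f]
#   [c d][y] = [g]
# by eliminating the "interior" unknown x, leaving the Schur
# complement system for the "interface" unknown y.
def schur_solve(a, b, c, d, f, g):
    s = d - c * (b / a)        # Schur complement S = D - C A^{-1} B
    y = (g - c * (f / a)) / s  # interface solve on the reduced system
    x = (f - b * y) / a        # interior back-substitution
    return x, y

x, y = schur_solve(4.0, 1.0, 2.0, 3.0, 6.0, 7.0)
# Verify: 4*x + 1*y == 6 and 2*x + 3*y == 7
```

In the hierarchical version this elimination is applied recursively over a tree of subdomains, and CHiS additionally compresses the dense off-diagonal blocks to cut factorization time and memory.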
424 Faster cloud Star Joins with reduced disk spill and network communication [abstract]
Abstract: By combining powerful parallel frameworks and on-demand commodity hardware, cloud computing has made both analytics and decision support systems canonical to enterprises of all sizes. Given the unprecedented volumes of data accumulated by such companies, filtering and retrieving it are pressing challenges. This data is often organized in star schemas, in which Star Joins are ubiquitous and expensive operations. In particular, excessive disk spill and network communication are tight bottlenecks for all current MapReduce or Spark solutions. Here, we propose two efficient solutions that reduce computation time by at least 60%: the Spark Bloom-Filtered Cascade Join (SBFCJ) and the Spark Broadcast Join (SBJ). Conversely, a direct Spark implementation of a sequence of joins delivers poor performance, showcasing the importance of further filtering to minimize disk spill and network communication. Finally, while SBJ is twice as fast when memory per executor is large enough, SBFCJ is remarkably resilient to low-memory scenarios. Both algorithms are very competitive solutions to Star Joins in the cloud.
Jaqueline Joice Brito, Thiago Mosqueiro, Ricardo Rodrigues Ciferri, Cristina Dutra De Aguiar Ciferri
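The broadcast-join idea behind SBJ can be sketched in a few lines of plain Python: small dimension tables are shipped ("broadcast") to every worker as hash maps, so the large fact table is filtered and joined map-side with no shuffle. The table names, keys, and predicate below are made up for illustration; this is not the authors' Spark code:

```python
# Small dimension tables, broadcast to every worker as lookup maps.
date_dim = {1: "2016", 2: "2015"}     # date_key  -> year
store_dim = {10: "CA", 11: "NY"}      # store_key -> state

# Large fact table: (date_key, store_key, sales).
fact = [
    (1, 10, 100.0),
    (2, 10, 50.0),
    (1, 11, 75.0),
    (1, 99, 20.0),    # dangling store key: dropped by the join
]

def broadcast_star_join(fact_rows, wanted_year="2016"):
    # Map-side join: one hash lookup per dimension, no shuffle of the
    # fact table across the network.
    out = []
    for date_key, store_key, sales in fact_rows:
        year = date_dim.get(date_key)
        state = store_dim.get(store_key)
        if year == wanted_year and state is not None:
            out.append((year, state, sales))
    return out

print(broadcast_star_join(fact))
# [('2016', 'CA', 100.0), ('2016', 'NY', 75.0)]
```

SBFCJ replaces the full dimension maps with compact Bloom filters, trading a small false-positive rate for a much lower memory footprint per executor, which is why it tolerates low-memory scenarios better.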
438 Jupyter in High Performance Computing [abstract]
Abstract: High Performance Computing has traditionally been the natural habitat of highly specialized parallel programming experts running large batch jobs. With every field of science becoming richer and richer in the amount of data available, many more scientists are transitioning to supercomputers or cloud computing resources. In this paper I review how the Jupyter project, a suite of scientific computing tools, can help democratize access to supercomputers by lowering the entry barrier for new scientific communities and providing a gradual path to harnessing more distributed computing capabilities. I start from the interactive usage of the Jupyter Notebook, a widespread browser-based data exploration environment, on an HPC cluster, then explain how notebooks can be used as scripts directly or in a workflow environment, and finally how batch data processing such as traditional MPI, Spark, and XSEDE Gateways can benefit from inter-operating with a Jupyter Notebook environment.
Andrea Zonca
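The "notebooks as scripts" point rests on the fact that a notebook is just JSON, so its code cells can be extracted and executed in order. A stdlib-only toy sketch of that idea (real workflows would use tools such as `jupyter nbconvert --execute` or papermill; the notebook content below is invented for illustration):

```python
import json

# A minimal in-memory notebook; on disk this would be a .ipynb JSON file.
notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Analysis"]},
        {"cell_type": "code", "source": ["x = 21\n"]},
        {"cell_type": "code", "source": ["result = 2 * x\n"]},
    ]
}

def run_notebook(nb):
    """Execute all code cells in order, sharing one namespace."""
    ns = {}
    for cell in nb["cells"]:
        if cell["cell_type"] == "code":
            exec("".join(cell["source"]), ns)
    return ns

# Round-trip through JSON to mimic loading a .ipynb from disk.
ns = run_notebook(json.loads(json.dumps(notebook)))
print(ns["result"])  # 42
```

This also shows why notebooks slot naturally into batch workflows: a scheduler can run them non-interactively and collect the resulting state or output files.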
463 High Performance LDA through Collective Model Communication Optimization [abstract]
Abstract: LDA is a widely used machine learning technique for big data analysis. The application includes an inference algorithm that iteratively updates a model until it converges. A major challenge for parallelization is scaling: the model is huge and parallel workers need to communicate it continually. We identify three important features of the model in parallel LDA computation: 1. the volume of model parameters required for local computation is high; 2. the time complexity of local computation is proportional to the required model size; 3. the model size shrinks as it converges. By investigating collective and asynchronous methods for model communication in different tools, we discover that optimized collective communication can improve the model update speed, thus allowing the model to converge faster. The performance improvement derives not only from accelerated communication but also from reduced iteration computation time as the model size shrinks during convergence. To foster faster model convergence, we design new collective communication abstractions and implement two Harp-LDA applications, "lgs" and "rtt". We compare our new approach with Yahoo! LDA and Petuum LDA, two leading implementations favoring asynchronous communication, on a 100-node, 4000-thread Intel Haswell cluster. The experiments show that "lgs" can reach higher model likelihood with shorter or similar execution time compared with Yahoo! LDA, while "rtt" can run up to 3.9 times faster than Petuum LDA when achieving similar model likelihood.
Bingjing Zhang, Bo Peng, Judy Qiu
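The collective model-communication pattern can be sketched as an allreduce over sparse word-topic count deltas: each worker accumulates local updates, a single collective sums them, and every worker then holds the same updated model. This is a toy illustration of the general pattern only; Harp-LDA's actual "lgs" and "rtt" abstractions are more sophisticated:

```python
from collections import Counter

def allreduce(local_deltas):
    """Sum per-worker sparse word-topic deltas into one global delta."""
    total = Counter()
    for delta in local_deltas:
        total.update(delta)   # Counter.update adds (including negatives)
    return total

def sync_model(model, local_deltas):
    # One collective per iteration replaces many point-to-point sends;
    # as the model sparsifies during convergence, the deltas shrink too.
    delta = allreduce(local_deltas)
    for (word, topic), count in delta.items():
        model[(word, topic)] = model.get((word, topic), 0) + count
    return model

model = {("apple", 0): 5}
workers = [
    {("apple", 0): 2, ("pie", 1): 1},    # worker 0's local deltas
    {("apple", 0): -1, ("pie", 1): 3},   # worker 1's local deltas
]
print(sync_model(model, workers))
# {('apple', 0): 6, ('pie', 1): 4}
```

Because only summed deltas travel over the network, the communication volume tracks the shrinking model size, which is the source of the iteration-time savings the abstract describes.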