ICCS 2017 Main Track (MT) Session 16

Time and Date: 15:45 - 17:25 on 12th June 2017

Room: HG D 1.2

Chair: Fabrício Enembreck

135 StoreRush: An Application-Level Approach to Harvesting Idle Storage in a Best Effort Environment [abstract]
Abstract: For a production HPC system where storage devices are shared between multiple applications and managed in a best effort manner, contention is often a major problem, leading to some storage devices being more loaded than others and causing a significant reduction in I/O throughput. In this paper, we describe our latest effort, StoreRush, which resolves this practical issue at the application level without requiring modification to the file and storage system. The proposed scheme uses a two-level messaging system to harvest idle storage by re-routing I/O requests to less congested storage locations, improving write performance while limiting the impact on reads by throttling re-routing when it becomes excessive. An analytical model is derived to guide the selection of the optimal throttling factor. The proposed scheme is verified against the production applications Pixie3D, XGC1 and QMCPack during production windows, demonstrating the effectiveness (e.g., up to 1.8x improvement in write performance) and scalability (up to 131,072 cores) of our approach.
Qing Liu, Norbert Podhorszki, Jong Choi, Jeremy Logan, Matt Wolf, Scott Klasky, Tahsin Kurc and Xubin He
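A minimal sketch of the kind of throttled re-routing decision the abstract describes: writes default to their assigned storage target but may be diverted to a less loaded one, with a cap on the fraction of re-routed requests. All names (pick_target, storage_load, THROTTLE) and the load metric are illustrative assumptions, not StoreRush's actual interface or analytical model.

```c
/* Hypothetical sketch of throttled I/O re-routing; names and the load
 * metric are illustrative, not StoreRush's actual API. */
#include <stdio.h>

#define N_TARGETS 4

/* Observed load (e.g., pending write bytes) per storage target. */
static double storage_load[N_TARGETS] = { 0.9, 0.2, 0.6, 0.4 };

/* Fraction of requests allowed to be re-routed away from the default
 * target; bounds the read-side penalty of scattering data. */
static const double THROTTLE = 0.25;

/* Keep the default target unless a less loaded one exists and the
 * re-routing budget has not been exhausted. */
int pick_target(int default_target, long issued, long rerouted)
{
    int least = 0;
    for (int t = 1; t < N_TARGETS; ++t)
        if (storage_load[t] < storage_load[least])
            least = t;

    int over_budget = (issued > 0) &&
                      ((double)rerouted / (double)issued >= THROTTLE);
    if (least == default_target || over_budget)
        return default_target;
    return least;
}

int main(void)
{
    long issued = 0, rerouted = 0;
    for (int i = 0; i < 10; ++i) {
        int tgt = pick_target(0, issued, rerouted);
        if (tgt != 0) rerouted++;
        issued++;
        printf("request %d -> target %d\n", i, tgt);
    }
    return 0;
}
```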
204 Fast Parallel Construction of Correlation Similarity Matrices for Gene Co-Expression Networks on Multicore Clusters [abstract]
Abstract: Gene co-expression networks are gaining attention as useful representations of biologically interesting interactions among genes. The most computationally demanding step in generating these networks is the construction of the correlation similarity matrix, as all pairwise combinations must be analyzed and complexity increases quadratically with the number of genes. In this paper we present MPICorMat, a hybrid MPI/OpenMP parallel approach to construct similarity matrices based on Pearson's correlation. It builds on a previous tool (RMTGeneNet) that has been used in several biological studies and proved accurate. Our tool obtains the same results as RMTGeneNet but significantly reduces runtime on multicore clusters. For instance, MPICorMat generates the correlation matrix of a dataset with 61,170 genes and 160 samples in less than one minute using 16 nodes with two Intel Xeon Sandy Bridge processors each (256 cores in total), while the original tool needed almost 4.5 hours. The tool is also compared to another available approach to constructing correlation matrices on multicore clusters, showing better scalability and performance. MPICorMat is open-source software and is publicly available at https://sourceforge.net/projects/mpicormat/.
Jorge González-Domínguez and María J. Martín
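For illustration, a self-contained sketch of the core computation the abstract refers to: the Pearson correlation of every gene pair, parallelized here with OpenMP only. MPICorMat additionally distributes rows of the matrix across nodes with MPI; the sizes, random data, and function names below are assumptions for demonstration, not the tool's code.

```c
/* Illustrative sketch (not MPICorMat's code): Pearson correlation for
 * every gene pair using OpenMP; compile with -fopenmp. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Pearson correlation of two expression vectors of length n. */
static double pearson(const double *x, const double *y, int n)
{
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int i = 0; i < n; ++i) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;
    double vx  = sxx - sx * sx / n;
    double vy  = syy - sy * sy / n;
    return cov / sqrt(vx * vy);
}

int main(void)
{
    const int genes = 1000, samples = 160;              /* toy sizes */
    double *expr = malloc((size_t)genes * samples * sizeof *expr);
    double *corr = malloc((size_t)genes * genes * sizeof *corr);
    for (int i = 0; i < genes * samples; ++i)
        expr[i] = rand() / (double)RAND_MAX;

    /* All pairwise combinations: work grows quadratically with genes. */
    #pragma omp parallel for schedule(dynamic)
    for (int g = 0; g < genes; ++g)
        for (int h = g; h < genes; ++h) {
            double r = pearson(expr + (size_t)g * samples,
                               expr + (size_t)h * samples, samples);
            corr[(size_t)g * genes + h] = r;
            corr[(size_t)h * genes + g] = r;
        }

    printf("corr[0][1] = %f\n", corr[1]);
    free(expr); free(corr);
    return 0;
}
```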
261 The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems [abstract]
Abstract: A current trend in high-performance computing is to decompose a large linear algebra problem into batches containing thousands of smaller problems that can be solved independently, before collating the results. To standardize the interface to these routines, the community is developing an extension to the BLAS standard (the batched BLAS), enabling users to perform thousands of small BLAS operations in parallel whilst making efficient use of their hardware. We discuss the benefits and drawbacks of the current batched BLAS proposals and perform a number of experiments, focusing on GEMM, to explore their effect on performance. In particular we analyze the effect of novel data layouts which, for example, interleave the matrices in memory to aid vectorization and prefetching of data. Utilizing these modifications, our code outperforms both MKL and cuBLAS by up to 6 times on the self-hosted Intel KNL (codenamed Knights Landing) and Kepler GPU architectures, respectively, for large numbers of DGEMM operations using matrices of size 2 × 2 to 20 × 20.
Jack Dongarra, Sven Hammarling, Nick Higham, Samuel Relton, Pedro Valero-Lara and Mawussi Zounon
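A hedged sketch of the interleaved data layout mentioned in the abstract: element (i,j) of matrix b in the batch is stored at offset (i*N + j)*BATCH + b, so the innermost loop runs contiguously over the batch index and is easy to vectorize. This is not the proposed batched BLAS interface, only an illustration of the layout idea with assumed sizes.

```c
/* Illustrative interleaved batch layout for many small GEMMs; not the
 * batched BLAS API itself. */
#include <stdio.h>
#include <stdlib.h>

#define N 4          /* matrix dimension (small, e.g. 2..20) */
#define BATCH 1024   /* number of independent GEMMs */

/* Element (i,j) of matrix b in the batch. */
#define IDX(i, j, b) (((size_t)(i) * N + (j)) * BATCH + (b))

/* C += A * B for every matrix in the batch, interleaved layout. */
static void batched_gemm(const double *A, const double *B, double *C)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                /* contiguous, vectorizable sweep over the batch */
                for (int b = 0; b < BATCH; ++b)
                    C[IDX(i, j, b)] += A[IDX(i, k, b)] * B[IDX(k, j, b)];
}

int main(void)
{
    size_t total = (size_t)N * N * BATCH;
    double *A = malloc(total * sizeof *A);
    double *B = malloc(total * sizeof *B);
    double *C = calloc(total, sizeof *C);
    for (size_t i = 0; i < total; ++i) { A[i] = 1.0; B[i] = 2.0; }

    batched_gemm(A, B, C);
    printf("C[0,0] of batch 0 = %f (expect %d)\n", C[IDX(0, 0, 0)], 2 * N);

    free(A); free(B); free(C);
    return 0;
}
```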
333 OUTRIDER: Optimizing the mUtation Testing pRocess In Distributed EnviRonments [abstract]
Abstract: Commodity clusters have been widely adopted due to their cost-effectiveness and the evolution of networks. These systems can be used to reduce the long execution time of applications that require a vast amount of computational resources, and especially of techniques that are usually deployed in centralized environments, like testing. Currently, one of the main challenges in testing is to obtain an appropriate test suite. Mutation testing is a widely used technique aimed at generating high-quality test suites; however, executing it entails a high computational cost. In this work we propose OUTRIDER, an HPC-based optimization that contributes to bridging the gap between the high computational cost of mutation testing and the parallel infrastructure of HPC systems aimed at speeding up the execution of computational applications. This optimization builds on our previous work EMINENT, an algorithm focused on parallelizing the mutation testing process using MPI. However, since EMINENT does not efficiently exploit the computational resources of HPC systems, we propose four strategies to alleviate this issue. A thorough experimental study using different applications shows a performance improvement of up to 70% with these optimizations.
Pablo C. Cañizares, Alberto Núñez and Juan de Lara
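As a rough illustration of the kind of MPI parallelization that EMINENT and OUTRIDER build on, the sketch below statically distributes mutant executions across ranks and reduces the mutation score on rank 0. The function run_mutant and the block-cyclic distribution are placeholders, not the papers' actual scheduling strategies.

```c
/* Hypothetical sketch of distributing mutant executions over MPI ranks;
 * run_mutant() and the distribution scheme are placeholders. */
#include <stdio.h>
#include <mpi.h>

#define N_MUTANTS 100

/* Stand-in for "compile and run the test suite against mutant m";
 * returns 1 if the mutant was killed. */
static int run_mutant(int m) { return m % 3 != 0; }

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Block-cyclic assignment: rank r executes mutants r, r+size, ... */
    int killed = 0, executed = 0;
    for (int m = rank; m < N_MUTANTS; m += size) {
        killed += run_mutant(m);
        executed++;
    }

    int total_killed = 0, total_exec = 0;
    MPI_Reduce(&killed, &total_killed, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(&executed, &total_exec, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("mutation score: %d/%d\n", total_killed, total_exec);

    MPI_Finalize();
    return 0;
}
```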
112 Topology-aware Job Allocation in 3D Torus-based HPC Systems with Hard Job Priority Constraints [abstract]
Abstract: In this paper, we address the topology-aware job allocation problem on 3D torus-based high-performance computing systems, with the objective of reducing system fragmentation. First, we propose a group-based job allocation strategy, which leads to a more global optimization of resource allocation. Second, we propose two shape allocation methods to determine the topological shape of each input job: a zigzag allocation method for communication-insensitive jobs, and a convex allocation method for communication-sensitive jobs. Third, we propose a topology-aware job mapping algorithm, comprising a target bin selection method and a bi-directional job mapping method, to reduce the fragmentation introduced by the job mapping process. The evaluation results validate the efficiency of our approach in reducing system fragmentation and improving system utilization.
Kangkang Li, Maciej Malawski and Jarek Nabrzyski