ICCS 2015 Main Track (MT) Session 1

Time and Date: 10:35 - 12:15 on 1st June 2015

Room: M101

Chair: Jorge Veiga Fachal

53 Diarchy: An Optimized Management Approach for MapReduce Masters [abstract]
Abstract: The MapReduce community is progressively replacing classic Hadoop with Yarn, the second-generation Hadoop (MapReduce 2.0). This transition is motivated by several factors, primarily the scalability drawbacks of classic Hadoop. The new framework addresses this issue appropriately and is being praised for its multi-functionality. In this paper we carry out a probabilistic analysis that highlights reliability concerns in Yarn at the job-master level. This is a critical point, since the failure of a job master entails the failure of all the workers it manages. To address this, we propose Diarchy, a novel system for the management of job masters. Its aim is to increase the reliability of Yarn through the sharing and backup of responsibilities between two masters working as peers. The evaluation results show that Diarchy outperforms Yarn in reliability across different setups, regardless of cluster size, type of job, or average failure rate, and suggest a positive impact of this approach compared to the traditional, single-master Hadoop architecture.
Bunjamin Memishi, María S. Pérez, Gabriel Antoniu
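The reliability benefit of the dual-master idea can be illustrated with a toy calculation (a simplified independence model, not the paper's actual probabilistic analysis): with a single master, the job fails whenever that master fails; with two peer masters backing each other up, both must fail.

```python
# Toy reliability model (illustrative only, not Diarchy's actual analysis):
# assume each master fails independently with probability p during a job.

def single_master_failure(p: float) -> float:
    """With one job master, the job fails whenever the master fails."""
    return p

def diarchy_failure(p: float) -> float:
    """With two peer masters, the job fails only if both fail."""
    return p * p

p = 0.01  # assumed per-master failure rate for the illustration
print(single_master_failure(p), diarchy_failure(p))
```

Under these assumptions the job-level failure probability drops from p to p², e.g. from 1% to 0.01% per job.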
61 MPI-Parallel Discrete Adjoint OpenFOAM [abstract]
Abstract: OpenFOAM is a powerful open-source (GPLv3) Computational Fluid Dynamics toolbox with rising adoption in both academia and industry, thanks to its continuously growing set of features and the lack of license costs. Our previously developed discrete adjoint version of OpenFOAM allows us to calculate derivatives of arbitrary objectives with respect to a potentially very large number of input parameters at a computational cost (relative to a single primal flow simulation) which is independent of that number. Discrete adjoint OpenFOAM enables us to run gradient-based methods such as topology optimization efficiently. Until recently, only a serial version was available, limiting both the computing performance and the amount of memory available for the solution of the problem. In this paper we describe a first parallel version of discrete adjoint OpenFOAM based on our adjoint MPI library.
Markus Towara, Michel Schanen, Uwe Naumann
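The key property claimed for the discrete adjoint, a gradient cost independent of the number of inputs, comes from reverse-mode differentiation. A toy example (not OpenFOAM or the adjoint MPI library, just the underlying idea) for a sum-of-squares objective:

```python
# Toy reverse-mode (adjoint) example: one backward sweep yields the
# gradient with respect to ALL n inputs, so the cost relative to one
# primal evaluation does not grow with n.

def primal(x):
    # forward pass: y = sum of squares of the inputs
    return sum(xi * xi for xi in x)

def adjoint(x, y_bar=1.0):
    # reverse sweep: propagate the output adjoint y_bar back to every
    # input in a single pass; dy/dx_i = 2 * x_i
    return [2.0 * xi * y_bar for xi in x]

x = [1.0, 2.0, 3.0]
print(primal(x))   # 14.0
print(adjoint(x))  # [2.0, 4.0, 6.0]
```

A finite-difference gradient would instead need n extra primal evaluations, which is exactly what the adjoint approach avoids for large parameter counts.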
98 Versioned Distributed Arrays for Resilience in Scientific Applications: Global View Resilience [abstract]
Abstract: Exascale studies project reliability challenges for future HPC systems. We propose the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We describe GVR’s interfaces to distributed arrays, versioning, and cross-layer error recovery. Using several large applications (OpenMC, preconditioned conjugate gradient (PCG) solver, ddcMD, and Chombo), we evaluate the programmer effort to add resilience. The required changes are small (<2% LOC), localized, and machine-independent, requiring no software architecture changes. We also measure the overhead of adding GVR versioning and show that overheads of less than 2% are generally achieved. Thus, we conclude that GVR’s interfaces and implementation are flexible, portable, and create a gentle-slope path to tolerate growing error rates in future systems.
Andrew Chien, Pavan Balaji, Pete Beckman, Nan Dun, Aiman Fang, Hajime Fujita, Kamil Iskra, Zachary Rubenstein, Ziming Zheng, Robert Schreiber, Jeff Hammond, James Dinan, Ignacio Laguna, David Richards, Anshu Dubey, Brian van Straalen, Mark Hoemmen, Michael Heroux, Keita Teranishi, Andrew Siegel
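The versioned-array idea can be sketched in a few lines (a hypothetical single-node API for illustration, not the real GVR library, which operates on distributed arrays): the application commits versions at points it knows to be consistent, and rolls back to an earlier version when an error is detected.

```python
# Conceptual sketch of application-controlled versioning (hypothetical
# API, not GVR itself): commit snapshots at safe points, restore on error.
import copy

class VersionedArray:
    def __init__(self, data):
        self.data = list(data)
        self.versions = []  # committed snapshots, oldest first

    def version_inc(self):
        """Commit the current state as a new version."""
        self.versions.append(copy.deepcopy(self.data))

    def restore(self, version):
        """Roll the live array back to a committed version."""
        self.data = copy.deepcopy(self.versions[version])

arr = VersionedArray([0.0] * 4)
arr.version_inc()            # version 0: initial state
arr.data[2] = 7.5            # application update
arr.version_inc()            # version 1: known-good result
arr.data[2] = float("nan")   # simulated silent corruption
arr.restore(1)               # recover the last good version
print(arr.data)              # [0.0, 0.0, 7.5, 0.0]
```

Because the application decides when to version and when to restore, recovery cost and coverage are under its control, which is the "application-controlled" aspect the abstract emphasizes.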
106 Characterizing a High Throughput Computing Workload: The Compact Muon Solenoid (CMS) Experiment at LHC [abstract]
Abstract: High throughput computing (HTC) has aided the scientific community in the analysis of vast amounts of data and computational jobs in distributed environments. To manage these large workloads, several systems have been developed to efficiently allocate and provide access to distributed resources. Many of these systems rely on estimates of job characteristics (e.g., job runtime) to characterize the workload behavior, which in practice are hard to obtain. In this work, we perform an exploratory analysis of the CMS experiment workload using the statistical recursive partitioning method and conditional inference trees to identify patterns that characterize particular behaviors of the workload. We then propose an estimation process to predict job characteristics based on the collected data. Experimental results show that our process estimates job runtime with 75% accuracy on average, and produces nearly optimal predictions for disk and memory consumption.
Rafael Ferreira Da Silva, Mats Rynge, Gideon Juve, Igor Sfiligoi, Ewa Deelman, James Letts, Frank Wuerthwein, Miron Livny
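The core of recursive partitioning for runtime estimation can be sketched with a single split (stdlib only, with synthetic data; the paper uses conditional inference trees, which choose splits by statistical tests rather than the variance criterion used here): partition jobs on a feature threshold and predict the mean runtime of each group.

```python
# Minimal recursive-partitioning sketch (one split, synthetic data):
# choose the threshold that minimizes within-group runtime variance,
# then predict the mean runtime of the matching group.

def best_split(jobs, feature):
    def sse(group):  # sum of squared errors around the group mean
        if not group:
            return 0.0
        mean = sum(j["runtime"] for j in group) / len(group)
        return sum((j["runtime"] - mean) ** 2 for j in group)
    best = None
    for t in sorted({j[feature] for j in jobs})[:-1]:
        left = [j for j in jobs if j[feature] <= t]
        right = [j for j in jobs if j[feature] > t]
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, t)
    return best[1]

def predict(jobs, feature, threshold, job):
    side = job[feature] <= threshold
    group = [j for j in jobs if (j[feature] <= threshold) == side]
    return sum(j["runtime"] for j in group) / len(group)

# toy jobs: events processed vs. observed runtime (minutes)
jobs = [{"events": 100, "runtime": 10}, {"events": 120, "runtime": 12},
        {"events": 900, "runtime": 95}, {"events": 1000, "runtime": 101}]
t = best_split(jobs, "events")
print(t)                                            # 120
print(predict(jobs, "events", t, {"events": 950}))  # 98.0
```

Applying the split recursively to each group yields a full tree; each leaf's mean (here, of runtime, disk, or memory) becomes the prediction for jobs falling into that leaf.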
182 Performance Tuning of MapReduce Jobs Using Surrogate-Based Modeling [abstract]
Abstract: Modeling workflow performance is crucial for finding optimal configuration parameters and optimizing execution times. We apply the method of surrogate-based modeling to performance tuning of MapReduce jobs. We build a surrogate model defined by a multivariate polynomial containing a variable for each parameter to be tuned. For illustrative purposes, we focus on just two parameters: the number of parallel mappers and the number of parallel reducers. We demonstrate that an accurate performance model can be built by sampling a small subset of the parameter space. We compare the accuracy and cost of building the model when using different sampling methods as well as when using different modeling approaches. We conclude that the surrogate-based approach we describe is both less expensive in terms of sampling time and more accurate than other well-known tuning methods.
Travis Johnston, Mohammad Alsulmi, Pietro Cicotti, Michela Taufer
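The surrogate idea in its simplest form: sample the runtime at a few (mappers, reducers) configurations, fit a cheap polynomial model, and search the model instead of running every configuration. The sketch below uses a synthetic runtime function and a bilinear surrogate (degree 1 in each variable, a much simpler special case of the paper's multivariate polynomials) fit from just the four corner samples:

```python
# Surrogate-based tuning sketch (synthetic runtime, bilinear surrogate):
# sample 4 corner configurations, interpolate, then minimize the model.

def runtime(m, r):
    # stand-in for actually running a MapReduce job (synthetic model:
    # parallelism helps, but each extra mapper/reducer adds overhead)
    return 1000.0 / m + 400.0 / r + 2.0 * m + 1.0 * r

def bilinear_surrogate(m1, m2, r1, r2):
    # sample only the four corners of the (mappers, reducers) range
    f11, f12 = runtime(m1, r1), runtime(m1, r2)
    f21, f22 = runtime(m2, r1), runtime(m2, r2)
    def predict(m, r):
        u = (m - m1) / (m2 - m1)
        v = (r - r1) / (r2 - r1)
        return (f11 * (1 - u) * (1 - v) + f21 * u * (1 - v)
                + f12 * (1 - u) * v + f22 * u * v)
    return predict

predict = bilinear_surrogate(4, 64, 2, 32)
# search the surrogate (cheap) instead of running every configuration
best = min((predict(m, r), m, r)
           for m in range(4, 65, 4) for r in range(2, 33, 2))
print(best)  # (188.125, 64, 32)
```

A real surrogate would use a higher-degree polynomial fit by least squares over more samples; the trade-off the paper studies is precisely how few samples, and which sampling scheme, still give an accurate model.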