Tools for Program Development and Analysis in Computational Science (Tools) Session 1

Time and Date: 10:35 - 12:15 on 12th June 2017

Room: HG E 33.3

Chair: Andreas Knüpfer

450	Performance Analysis of Parallel Python Applications Abstract: Python is progressively consolidating itself within the HPC community with its simple syntax, large standard library, and powerful third-party libraries for scientific computing that are especially attractive to domain scientists. Although Python lowers the bar for accessing parallel computing, utilizing the capacities of HPC systems efficiently remains a challenging task. Yet, at the moment, only a few supporting tools exist, and they provide merely basic information in the form of summarized profile data. In this paper, we present our efforts in developing event-based tracing support for Python within the performance monitor Extrae to provide detailed information and enable a profound performance analysis. We present concepts to record the complete communication behavior as well as to capture entry and exit of functions in Python to provide the corresponding application context. We evaluate our implementation in Extrae by analyzing the well-established electronic structure simulation package GPAW and demonstrate that the recorded traces provide information equivalent to that for traditional C or Fortran applications and therefore offer the same profound analysis capabilities for Python as well.	Michael Wagner, Germán Llort, Estanislao Mercadal, Judit Giménez and Jesús Labarta
159	Scaling Score-P to the next level Abstract: As part of performance measurements with Score-P, a description of the system and the execution locations is recorded into the performance measurement reports. For large-scale measurements using a million or more processes, the global system description can consume all the available memory. While the information stored process-locally during measurement is small, the memory requirement becomes a bottleneck in the process of constructing a global representation of the whole system. To address this problem we implemented a new system description in Score-P that exploits regular structures of the system, and results, on homogeneous systems, in a system description of constant size. Furthermore, we present a parallel algorithm to create a global view from the process-local information. The scalable system description comes at the price that it is no longer possible to assign individual names to each system element, but only enumerate elements of the same type. We have successfully tested the new approach on the full JUQUEEN system with up to nearly two million processes.	Daniel Lorenz and Christian Feld
528	Design Evaluation of a Performance Analysis Trace Repository Abstract: Parallel and high performance computing experts are obsessed with performance and scalability. Performance analysis and tuning are important and complex, but there are a number of software tools to support them. One methodology for such tools is detailed recording of parallel runtime behavior in event traces and their subsequent analysis. This regularly produces very large data sets with their own challenges for handling and data management. This paper evaluates the utilization of the MASi research data management service as a trace repository to store, manage, and find traces in an efficient and usable way. First, we give an introduction to trace technologies in general, metadata in OTF2 traces specifically, and the MASi research data management service. Then, the trace repository is described with its potential for both performance analysts and parallel tool developers, followed by how we implemented it using existing metadata and how it can be utilized. Finally, we give an outlook on how we plan to put the repository into productive use for the benefit of researchers using traces.	Richard Grunzke, Maximilian Neumann, Thomas Ilsche, Volker Hartmann, Thomas Jejkal, Rainer Stotzka, Andreas Knüpfer and Wolfgang E. Nagel
475	Software Framework for Parallel BEM Analyses with H-matrices Using MPI and OpenMP Abstract: A software framework has been developed for use in parallel boundary element method (BEM) analyses. The framework program was parallelized in a hybrid parallel programming model, and both multiple processes and threads were used. Additionally, an H-matrix library for a distributed memory parallel computer was also developed to accelerate the analysis. In this paper, we describe the basic design concept for the framework and details of its implementation. The framework program, which was written with MPI functions and OpenMP directives, is mainly intended to reduce the user’s parallel programming costs. We also show the results of a sample analysis performed with approximately 60,000 unknowns. The numerical results verify the effectiveness of both the parallelization and the H-matrix method. In the test analysis, which was performed using a single core, the H-matrix version of the framework is 17-fold faster than the dense matrix version. The parallel framework program with the H-matrix attains an approximately 50-fold acceleration using 128 cores when compared with sequential computation.	Takeshi Iwashita, Akihiro Ida, Takeshi Mifune and Yasuhito Takahashi