Tools for Program Development and Analysis in Computational Science (TOOLS) Session 1

Time and Date: 14:10 - 15:50 on 7th June 2016

Room: Macaw

Chair: Jie Tao

346 Inclusive Cost Attribution for Cache Use Profiling [abstract]
Abstract: For performance analysis tools to be useful, they need to relate detected bottlenecks to source code. To this end, it often makes sense to use the instruction triggering a problematic event. For cache line utilization, however, usage information only becomes available at eviction time, yet it is better attributed to the instruction that loaded the line. Such attribution is impossible with current processor hardware, but Callgrind, a cache simulator that is part of the open-source Valgrind tool, can do it. However, it only provides self costs. In this paper, we extend the cost attribution of cache use metrics to inclusive costs, which helps with top-down analysis of complex workloads. The technique can be applied to any event type whose collected metrics must be attributed to instructions executed earlier in a program run to be useful.
Josef Weidendorfer, Jens Breitbart
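As a reading aid (not Callgrind's implementation), the sketch below illustrates the difference between self and inclusive costs: a function's inclusive cost adds the self costs of everything executed beneath it in the call tree, which is what enables the top-down analysis the abstract refers to. Call tree and cost numbers are invented for the example.

    # Illustrative only: aggregate per-function self costs into inclusive costs
    # over a call tree, the way a top-down profile view presents them.

    def inclusive_cost(node, self_cost, children):
        """Inclusive cost = own self cost + inclusive costs of all callees."""
        return self_cost[node] + sum(inclusive_cost(c, self_cost, children)
                                     for c in children.get(node, []))

    self_cost = {"main": 10, "solve": 40, "assemble": 25, "io": 5}   # e.g. cache misuse events
    children  = {"main": ["solve", "io"], "solve": ["assemble"]}

    for f in self_cost:
        print(f, inclusive_cost(f, self_cost, children))
    # main's inclusive cost (80) covers all events triggered below it in the call tree.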
18 KGEN: A Python Tool for Automated Fortran Kernel Generation and Verification [abstract]
Abstract: Computational kernels, which are small pieces of software that selectively capture the characteristics of larger applications, have been used successfully for decades. Kernels allow for testing a compiler's ability to optimize code, evaluating the performance of future hardware, and reproducing compiler bugs. Unfortunately, they can be rather time-consuming to create and do not always accurately represent the full complexity of large scientific applications. Furthermore, expert knowledge is often required to create such kernels. In this paper, we present a Python-based tool that greatly simplifies the generation of computational kernels from Fortran-based applications. Our tool automatically extracts partial source code of a larger Fortran application into a stand-alone executable kernel. Additionally, it generates the state data necessary for proper execution and verification of the extracted kernel. We have used our tool to extract more than thirty computational kernels from a million-line climate simulation model. Our extracted kernels have been used for a variety of purposes, including code modernization, identification of limitations in compiler optimizations, numerical algorithm debugging, compiler bug reporting, and procurement benchmarking.
Youngsung Kim, John Dennis, Christopher Kerr, Raghu Raj Prasanna Kumar, Amogh Simha, Allison Baker, Sheri Mickelson
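KGEN's actual interface is not described in the abstract; the toy Python sketch below only illustrates the general idea of kernel extraction: pull one subroutine out of a Fortran source file and wrap it in a stand-alone driver. The function names and the regex-based extraction are invented for illustration, and the state-data capture and verification that KGEN performs are omitted.

    # Toy sketch of kernel extraction (not KGEN's implementation).
    import re

    def extract_subroutine(fortran_src: str, name: str) -> str:
        """Return the text of subroutine `name` from a Fortran source string."""
        pattern = rf"subroutine\s+{name}\b.*?end\s+subroutine\s+{name}"
        match = re.search(pattern, fortran_src, re.IGNORECASE | re.DOTALL)
        if match is None:
            raise ValueError(f"subroutine {name} not found")
        return match.group(0)

    def build_kernel(fortran_src: str, name: str) -> str:
        """Embed the extracted subroutine in a trivial stand-alone driver.
        A real kernel also needs arguments and captured state data."""
        body = extract_subroutine(fortran_src, name)
        return (f"program kernel_driver\n  call {name}()\nend program kernel_driver\n\n"
                f"{body}\n")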
224 HPCmatlab: A Framework for Fast Prototyping of Parallel Applications in Matlab [abstract]
Abstract: The HPCmatlab framework has been developed for distributed memory programming in Matlab/Octave using the Message Passing Interface (MPI). The communication routines in the MPI library are implemented using MEX wrappers. Point-to-point, collective, and one-sided communication are supported. Benchmarking results show better performance than the MathWorks Distributed Computing Server. HPCmatlab has been used to successfully parallelize and speed up Matlab applications developed for scientific computing. The application results show good scalability while preserving ease of programmability. HPCmatlab also enables shared memory programming using Pthreads and parallel I/O using the ADIOS package.
Xinchen Guo, Mukul Dave, Sayeed Mohamed
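HPCmatlab's Matlab-side API is not shown in the abstract. Purely as a rough Python analogue of the same message-passing programming model, the mpi4py sketch below shows the kind of point-to-point and collective calls such a wrapper exposes (the file name is a placeholder; run with, e.g., mpiexec -n 2 python demo.py).

    # Python analogue of the programming model, not HPCmatlab code.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        comm.send([1.0, 2.0, 3.0], dest=1, tag=11)      # point-to-point send
    elif rank == 1:
        data = comm.recv(source=0, tag=11)               # matching receive
        print("rank 1 received", data)

    summary = comm.bcast("converged" if rank == 0 else None, root=0)  # collective broadcast
    print(rank, summary)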
106 Runtime verification of scientific codes using statistics [abstract]
Abstract: Runtime verification of large-scale scientific codes is difficult because they often involve thousands of processes and generate very large data structures. Further, these programs often embody complex algorithms, making them difficult for non-experts to follow. Notably, typical scientific codes implement mathematical models that often possess predictable statistical features. Incorporating statistical analysis techniques into the verification process therefore allows the program's state to reveal unusual details of the computation at runtime. In earlier work, we proposed a statistical framework for debugging large-scale applications. In this paper, we argue that such a framework can be useful in the runtime verification of scientific codes. We demonstrate how two production simulation programs are verified using statistics. The system is evaluated on a 20,000-core Cray XE6.
Minh Ngoc Dinh, David Abramson, Chao Jin
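The paper's framework is not reproduced here; the minimal sketch below only illustrates the statistical idea: summarize a distributed field per process, compare each process against the ensemble, and flag processes whose state looks statistically unusual. The data and the MAD-based outlier rule are chosen for illustration.

    # Minimal sketch of statistics-based runtime checking (not the paper's framework).
    import statistics

    def flag_outliers(per_rank_means, threshold=3.5):
        """Return ranks whose local mean deviates strongly from the ensemble median."""
        med = statistics.median(per_rank_means)
        mad = statistics.median(abs(m - med) for m in per_rank_means) or 1e-30
        return [rank for rank, m in enumerate(per_rank_means)
                if abs(m - med) / (1.4826 * mad) > threshold]

    # e.g. local means of a temperature field gathered from 8 ranks; rank 4 is suspect
    print(flag_outliers([300.1, 299.8, 300.3, 300.0, 512.7, 299.9, 300.2, 300.1]))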
150 Source Transformation of C++ Codes for Compatibility with Operator Overloading [abstract]
Abstract: In C++, new features and semantics can be added to an existing software package without sweeping code changes by introducing a user-defined type that relies on operator overloading. This approach is used, for example, to add capabilities such as algorithmic differentiation. However, the introduction of operator overloading can cause a multitude of compilation errors. In a previous paper, we identified code constructs that violate the C++ language standard after such a type change and presented OO-Lint, a tool based on the Clang compiler that flags these constructs with lint-like messages. In this paper, we extend that work with a tool that automatically transforms such problematic code constructs in order to make an existing code base compatible with a semantic augmentation through operator overloading. We applied our tool to the CFD software OpenFOAM and detected and transformed 23 instances of problematic code constructs in 160,000 lines of code. A significant number of these root causes are included up to 425 times in other files, causing tremendous compiler error amplification. In addition, we show the significance of our work with a case study of the evolution of the ice flow modeling software ISSM, comparing a recent, manually type-changed version with a legacy version. The recent version shows no signs of problematic code constructs, whereas our tool detected and transformed a remarkable number of issues in the legacy version that previously had to be located and fixed by hand.
Alexander Hück, Jean Utke, Christian Bischof
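The paper's transformations target C++, but the underlying operator-overloading idea it builds on can be shown in a few lines of Python: a user-defined dual-number type that propagates derivatives through otherwise unchanged arithmetic code (forward-mode algorithmic differentiation). This is only a sketch of the technique the abstract refers to, not the authors' code.

    # Forward-mode AD via operator overloading (illustrative sketch).
    class Dual:
        def __init__(self, value, deriv=0.0):
            self.value, self.deriv = value, deriv
        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.value + other.value, self.deriv + other.deriv)
        __radd__ = __add__
        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.value * other.value,
                        self.deriv * other.value + self.value * other.deriv)
        __rmul__ = __mul__

    def f(x):                 # existing "numerical" code, left unchanged
        return 3 * x * x + 2 * x + 1

    x = Dual(2.0, 1.0)        # seed dx/dx = 1
    print(f(x).value, f(x).deriv)   # 17.0 and f'(2) = 6*2 + 2 = 14.0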

Tools for Program Development and Analysis in Computational Science (TOOLS) Session 2

Time and Date: 16:20 - 18:00 on 7th June 2016

Room: Macaw

Chair: Jie Tao

447 Online MPI Trace Compression using Event Flow Graphs and Wavelets [abstract]
Abstract: Performance analysis of scientific parallel applications is essential to use High Performance Computing (HPC) infrastructures efficiently. Nevertheless, collecting detailed performance data from large-scale, long-running parallel programs is infeasible due to the huge amount of information generated. Even though there are no technological constraints on storing terabytes of performance data, the constant flushing of such data to disk introduces so much overhead into the application that the measurements become worthless. This paper explores the use of event flow graphs together with wavelet analysis and EZW encoding to produce MPI event traces that are orders of magnitude smaller while preserving accurate information on timestamped events. Our mechanism compresses the performance data online while the application runs, thus reducing the pressure put on the I/O system by buffer flushing. As a result, we achieve lower application perturbation, reduced performance data output, and the possibility to monitor longer application runs.
Xavier Aguilar, Karl Fuerlinger, Erwin Laure
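The event-flow-graph construction and the EZW bit-plane coder are the paper's contribution and are not reproduced here; the toy sketch below (assuming NumPy and PyWavelets are installed) only illustrates the wavelet side of such compression: decompose a series of per-iteration timings, drop small coefficients, and reconstruct while keeping the salient outlier.

    # Toy wavelet-thresholding example on invented timing data (not the paper's pipeline).
    import numpy as np
    import pywt

    durations = np.array([1.00, 1.02, 0.99, 1.01, 1.00, 3.10, 1.01, 0.98])  # one outlier iteration

    coeffs = pywt.wavedec(durations, "haar")                  # multi-level Haar transform
    thresholded = [pywt.threshold(c, value=0.05, mode="hard") for c in coeffs]
    kept = sum(int(np.count_nonzero(c)) for c in thresholded)

    reconstructed = pywt.waverec(thresholded, "haar")
    print(f"kept {kept} of {durations.size} coefficients")
    print(np.round(reconstructed, 2))                          # outlier at index 5 is preserved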
194 WOWMON: A Machine Learning-based Profiler for Self-adaptive Instrumentation of Scientific Workflows [abstract]
Abstract: Performance debugging of scientific workflows using program profiling and tracing can be extremely difficult for two reasons. 1) Existing performance tools lack the ability to automatically produce global performance data based on local information from coupled scientific applications, particularly at runtime. 2) Profiling and tracing with static instrumentation may incur high overhead and significantly slow down science-critical tasks. To gain more insight into workflows, we introduce a lightweight workflow monitoring infrastructure, WOWMON (WOrkfloW MONitor), which gives users access not only to cross-application performance data, such as end-to-end latency and the execution time of individual workflow components at runtime, but also to customized performance events. To reduce profiling overhead, WOWMON adaptively selects performance metrics based on machine learning algorithms, guiding profilers to collect only the metrics with the most impact on workflow performance. Through the study of real scientific workflows (e.g., LAMMPS) with the help of WOWMON, we found that workflow performance can be significantly affected by both software and hardware factors, such as the process mapping policy and the hardware configuration of clusters. Moreover, we experimentally show that WOWMON can reduce the data movement for profiling by up to 54% without missing key metrics for performance debugging.
Xuechen Zhang, Hasan Abbasi, Kevin Huck, Allen Malony
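WOWMON's actual model and metric set are not given in the abstract; the sketch below (assuming scikit-learn and NumPy, with synthetic data and invented metric names) illustrates one way adaptive metric selection can work: rank candidate metrics by how much they explain end-to-end latency and keep only the top-ranked ones for the next instrumented run.

    # Illustrative metric ranking with a random forest (not WOWMON's algorithm).
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    metrics = ["cache_misses", "msgs_sent", "bytes_staged", "ctx_switches", "page_faults"]
    rng = np.random.default_rng(0)
    X = rng.random((200, len(metrics)))                              # per-run metric samples (synthetic)
    latency = 5 * X[:, 1] + 2 * X[:, 2] + rng.normal(0, 0.1, 200)    # latency driven mostly by two metrics

    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, latency)
    ranked = sorted(zip(metrics, model.feature_importances_), key=lambda p: -p[1])
    keep = [name for name, _ in ranked[:2]]                          # instrument only these next run
    print(ranked)
    print("keep:", keep)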
334 A DSL based toolchain for design space exploration in structured parallel programming [abstract]
Abstract: We introduce a DSL-based toolchain supporting the design of parallel applications whose parallelism is structured as compositions of parallel design patterns. The DSL makes it possible to write high-level parallel design pattern expressions representing the structure of parallel applications, to refactor these pattern expressions, to evaluate their non-functional properties (e.g., ideal performance, total parallelism degree), and finally to generate parallel code ready to be compiled and run on different target architectures. We discuss a proof-of-concept prototype implementation of the proposed toolchain that generates FastFlow code and show some preliminary results achieved with this prototype.
Marco Danelutto, Massimo Torquati, Peter Kilpatrick
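The paper's DSL and its FastFlow back end are not reproduced here; the toy Python sketch below only illustrates what a pattern expression and the kind of non-functional properties such a toolchain evaluates (ideal service time, total parallelism degree) can look like before any code is generated.

    # Toy pattern-expression evaluator (not the paper's DSL).
    from dataclasses import dataclass

    @dataclass
    class Seq:                       # sequential stage with a known service time
        name: str
        service_time: float
        def ideal_service_time(self): return self.service_time
        def parallelism(self): return 1

    @dataclass
    class Farm:                      # replicate a worker over n instances
        worker: object
        n: int
        def ideal_service_time(self): return self.worker.ideal_service_time() / self.n
        def parallelism(self): return self.n * self.worker.parallelism()

    @dataclass
    class Pipe:                      # pipeline of stages; service time = slowest stage
        stages: list
        def ideal_service_time(self): return max(s.ideal_service_time() for s in self.stages)
        def parallelism(self): return sum(s.parallelism() for s in self.stages)

    app = Pipe([Seq("read", 1.0), Farm(Seq("filter", 8.0), n=4), Seq("write", 1.5)])
    print(app.ideal_service_time(), app.parallelism())   # 2.0 and 6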