ICCS 2019 Main Track (MT) Session 5

Time and Date: 14:20 - 16:00 on 13th June 2019

Room: 1.5

Chair: Jorge González-Domínguez

367 An On-line Performance Introspection Framework for Task-based Runtime Systems
Abstract: The expected high levels of parallelism together with the heterogeneity of new computing systems pose many challenges to current performance monitoring frameworks. Classical post-mortem approaches will not be sufficient for such dynamic, complex and highly concurrent environments. First, the amount of data that such systems can generate will be impractical to handle. Second, access to real-time performance data to orchestrate program execution will be a necessity. In this paper, we present a lightweight monitoring infrastructure developed within the AllScale Runtime System, a task-based runtime system for extreme scale. This monitoring component provides on-line introspection capabilities that help the runtime scheduler in its decision-making process and adaptation, while introducing minimal overhead. In addition, the monitoring component provides several post-mortem reports as well as real-time data visualisation that can be of great help in the task of performance debugging.
Xavier Aguilar, Herbert Jordan, Thomas Heller, Alexander Hirsch, Thomas Fahringer and Erwin Laure
405 Productivity-aware Design and Implementation of Distributed Tree-based Search Algorithms
Abstract: Parallel tree-based search algorithms are present in different areas, such as operations research, machine learning and artificial intelligence. This class of algorithms is highly compute-intensive, irregular and usually relies on context-specific data structures and hand-made code optimizations. Therefore, C and C++ are the languages often employed, due to their low-level features and performance. In this work, we investigate the use of the Chapel high-productivity language for the design and implementation of distributed tree search algorithms for solving combinatorial problems. The experimental results show that Chapel is a suitable language for this purpose, both in terms of performance and productivity. Despite the use of high-level features, the distributed tree search in Chapel is on average 16% slower and reaches up to 85% of the scalability observed for its MPI+OpenMP counterpart.
Tiago Carneiro Pessoa and Nouredine Melab
462 Development of Element-by-Element Kernel Algorithms in Unstructured Implicit Low-Order Finite-Element Earthquake Simulation for Many-Core Wide-SIMD CPUs
Abstract: Acceleration of the Element-by-Element (EBE) kernel in matrix-vector products is essential for high performance in unstructured implicit finite-element applications. However, attaining high performance with the EBE kernel is not straightforward because of random data access with data recurrence. In this paper, we develop methods to circumvent these data races for high performance on many-core CPU architectures with wide SIMD units. The developed EBE kernel attains 16.3% and 20.9% of FP32 peak on the Intel Xeon Phi Knights Landing based Oakforest-PACS and on an Intel Skylake Xeon Gold processor based system, respectively. This leads to a 2.88-fold speedup over the baseline kernel and a 2.03-fold speedup of the whole finite-element application on Oakforest-PACS. An example of urban earthquake simulation using the developed finite-element application is shown.
Kohei Fujita, Masashi Horikoshi, Tsuyoshi Ichimura, Larry Meadows, Kengo Nakajima, Muneo Hori and Lalith Maddegedara
516 A High-productivity Framework for Adaptive Mesh Refinement on Multiple GPUs
Abstract: Recently, grid-based physical simulations with multiple GPUs require effective methods to adapt grid resolution to certain sensitive regions of simulations. In GPU computation, an adaptive mesh refinement (AMR) method is one of the effective methods to compute certain local regions that demand higher accuracy with higher resolution. However, AMR methods using multiple GPUs demand complicated implementation and require various optimizations suitable for GPU computation in order to obtain high performance. Our AMR framework provides a highly productive programming environment of a block-based AMR for grid-based applications. Programmers just write the stencil functions that update a grid point on a Cartesian grid, which are executed effectively over a tree-based AMR data structure by the framework. It also provides efficient GPU-suitable methods for halo exchange and mesh refinement with a dynamic load balancing technique. In the best case, the framework-based application for compressible flow reduced the computational time to less than 15%, with 10% of the memory footprint, compared to the equivalent computation running on the fine uniform grid. It also demonstrated good weak scalability, with 84% parallel efficiency on the TSUBAME3.0 supercomputer.
Takashi Shimokawabe and Naoyuki Onodera
197 Harmonizing Sequential and Random Access to Datasets in Organizationally Distributed Environments
Abstract: Computational science is rapidly developing, which pushes the boundaries in data management concerning the size and structure of datasets, data processing patterns, geographical distribution of data and performance expectations. In this paper, we present a solution for harmonizing data access performance, i.e. finding a compromise between local and remote read/write efficiency that would fit those evolving requirements. It is based on variable-size logical data-chunks (in contrast to fixed-size blocks), direct storage access and several mechanisms improving remote data access performance. The solution is implemented in the Onedata system and suited to its multi-layer architecture, supporting organizationally distributed environments -- with limited trust between data providers. The solution is benchmarked and compared to XRootD + XCache, which offers similar functionalities. The results show that the performance of both systems is comparable, although overheads in local data access are visibly lower in Onedata.
Michał Wrzeszcz, Łukasz Opioła, Bartosz Kryza, Łukasz Dutka, Renata Słota and Jacek Kitowski