Architecture, Languages, Compilation and Hardware support for Emerging ManYcore systems (ALCHEMY) Session 1

Time and Date: 10:35 - 12:15 on 1st June 2015

Room: M208

Chair: Stephane Louise

743 Alchemy Workshop Keynote: Programming heterogeneous, manycore machines: a runtime system's perspective [abstract]
Abstract: Heterogeneous manycore parallel machines, mixing multicore CPUs with manycore accelerators, provide an unprecedented amount of processing power per node. Dealing with such a large number of heterogeneous processing units -- providing a highly unbalanced computing power -- is one of the biggest challenges that developers of HPC applications have to face. To fully tap into the potential of these heterogeneous machines, pure offloading approaches, which consist in running an application on host cores while offloading parts of the code to accelerators, are not sufficient. In this talk, I will go through the major software techniques that were specifically designed to harness heterogeneous architectures, focusing on runtime systems. I will discuss some of the most critical issues programmers have to consider to achieve performance portability, and how programming languages may evolve to meet such a goal. Finally, I will give some insights into the main challenges designers of programming environments will have to face in the upcoming years.
Raymond Namyst
433 On the Use of a Many-core Processor for Computational Fluid Dynamics Simulations [abstract]
Abstract: The increased availability of modern embedded many-core architectures supporting floating-point operations in hardware makes them interesting targets in traditional high performance computing areas as well. In this paper, the Lattice Boltzmann Method (LBM) from the domain of Computational Fluid Dynamics (CFD) is evaluated on Adapteva’s Epiphany many-core architecture. Although the LBM implementation shows very good scalability and high floating-point efficiency in the lattice computations, current Epiphany hardware does not provide adequate amounts of either local memory or external memory bandwidth to provide a good foundation for simulation of the large problems commonly encountered in real CFD applications.
Sebastian Raase, Tomas Nordström
263 A short overview of executing Γ Chemical Reactions over the ΣC and τC Dataflow Programming Models [abstract]
Abstract: Many-core processors offer high computational power while keeping energy consumption reasonable compared to complex processors. Today, they enter high-performance computing systems as well as embedded systems. However, these processors require dedicated programming models to benefit efficiently from their massively parallel architectures. The chemical programming paradigm was introduced in the late eighties as an elegant way of formally describing distributed programs. Data are seen as molecules that can freely react through operators to create new data. This paradigm has also been used in the context of grid computing and now appears relevant for many-core processors. Very few runtime implementations of chemical programming have been proposed, none of them giving concrete details on how the paradigm can be deployed onto a real architecture. In this paper, we propose to implement parts of the chemical paradigm on top of the ΣC dataflow programming language, which is dedicated to many-core processors. We show how to represent molecules using agents and communication links, and how to iteratively build the dataflow graph following the chemical reactions. A preliminary implementation of the chemical reaction mechanisms is provided using the τC dataflow compilation toolchain, whose language is close to ΣC, in order to demonstrate the relevance of the proposal.
Loïc Cudennec, Thierry Goubier
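The core of the Γ paradigm this abstract builds on -- a multiset of molecules that react until no reaction condition holds -- can be illustrated in a few lines of Python. This is an illustrative sketch only, not the agent-based ΣC/τC implementation described in the paper:

```python
# Minimal Gamma-style chemical machine: molecules live in a multiset and
# react pairwise until no pair satisfies the reaction condition.
# Illustrative sketch, not the SigmaC/tauC agent-based implementation.

def gamma(molecules, condition, action):
    """Repeatedly pick a reacting pair and replace it by action's result."""
    pool = list(molecules)
    changed = True
    while changed:
        changed = False
        for i in range(len(pool)):
            for j in range(len(pool)):
                if i != j and condition(pool[i], pool[j]):
                    a, b = pool[i], pool[j]
                    pool = [m for k, m in enumerate(pool) if k not in (i, j)]
                    pool.extend(action(a, b))
                    changed = True
                    break
            if changed:
                break
    return pool

# Classic example: computing the maximum by letting the smaller of any
# two molecules be consumed by the reaction.
result = gamma([4, 1, 7, 3], lambda a, b: a <= b, lambda a, b: [b])
# result == [7]
```

Replacing the condition and action yields other classic Γ programs, e.g. `gamma(xs, lambda a, b: True, lambda a, b: [a + b])` computes the sum of the multiset.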
435 Threaded MPI Programming Model for the Epiphany RISC Array Processor [abstract]
Abstract: The Adapteva Epiphany RISC array processor offers high computational energy-efficiency and parallel scalability. However, extracting performance with a standard parallel programming model remains a great challenge. We present an effective programming model for the low-power Epiphany architecture based on the Message Passing Interface (MPI) standard. Using MPI exploits the similarities between the Epiphany architecture and a networked parallel distributed cluster. Furthermore, our approach enables codes written with MPI to execute on the RISC array processor with little modification. We present experimental results for the threaded MPI implementation of matrix-matrix multiplication and highlight the importance of fast inter-core data transfers. Our high-level programming methodology achieved an on-chip performance of 9.1 GFLOPS.
David Richie, James Ross, Song Park and Dale Shires
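The key idea above -- realizing MPI-style message passing between cores that actually share on-chip memory -- can be sketched with threads and queues as mailboxes. This is a hedged Python illustration of the programming model, not the paper's Epiphany implementation (which is written in C against the MPI API):

```python
import threading
import queue

# Threaded message passing in the spirit of threaded MPI: each "core" is
# a thread, and send/recv are operations on shared-memory mailboxes.
# Illustrative sketch only, not the Epiphany implementation.

N = 4
mailboxes = [queue.Queue() for _ in range(N)]

def send(dst, data):
    mailboxes[dst].put(data)

def recv(rank):
    return mailboxes[rank].get()   # blocks until a message arrives

def worker(rank, results):
    # Ring reduction: rank 0 seeds a partial sum, each core adds its
    # own rank and forwards it; the total comes back to rank 0.
    if rank == 0:
        send(1, rank)              # inject own contribution (0)
        results[0] = recv(0)       # receive the completed sum
    else:
        acc = recv(rank) + rank    # add own contribution
        send((rank + 1) % N, acc)  # forward along the ring

results = {}
threads = [threading.Thread(target=worker, args=(r, results)) for r in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# results[0] == 0 + 1 + 2 + 3 == 6
```

The same pattern, with queues replaced by fast inter-core writes to local memory, is what makes the cluster-like MPI view of the Epiphany array natural.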

Architecture, Languages, Compilation and Hardware support for Emerging ManYcore systems (ALCHEMY) Session 2

Time and Date: 14:30 - 16:10 on 1st June 2015

Room: M208

Chair: Stephane Louise

529 An Empirical Evaluation of a Programming Model for Context-Dependent Real-time Streaming Applications [abstract]
Abstract: We present a programming model for real-time streaming applications on high-performance embedded multi- and many-core systems. Realistic streaming applications are highly dependent on the execution context (usually the physical world), past learned strategies, and often real-time constraints. The proposed programming model encompasses real-time requirements, determinism of execution, and context dependency. It extends the well-known Cyclo-Static Dataflow (CSDF) model, chosen for its desirable properties (determinism and composability), with two new important dataflow filters: Select-duplicate and Transaction, which retain the main properties of CSDF graphs while providing useful features for implementing real-time embedded applications. We evaluate the performance of our programming model on several real-life case studies and demonstrate that our approach overcomes a range of limitations usually associated with CSDF models.
Xuan Khanh Do, Stephane Louise, Albert Cohen
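For readers unfamiliar with the base model being extended, the cyclo-static firing rule of CSDF -- consumption and production rates that vary cyclically from firing to firing -- can be sketched as follows. The Select-duplicate and Transaction filters introduced by the paper are not reproduced here:

```python
from collections import deque

# Plain CSDF actor: rates cycle through fixed sequences, which makes the
# schedule deterministic and statically analyzable. Sketch of the base
# model only, not the paper's extensions.

class CSDFActor:
    def __init__(self, name, cons_rates, prod_rates, fn):
        self.name = name
        self.cons_rates = cons_rates   # tokens consumed per firing, cyclic
        self.prod_rates = prod_rates   # tokens produced per firing, cyclic
        self.phase = 0
        self.fn = fn

    def fire(self, in_q, out_q):
        n = self.cons_rates[self.phase % len(self.cons_rates)]
        if len(in_q) < n:
            return False               # firing rule not satisfied
        tokens = [in_q.popleft() for _ in range(n)]
        m = self.prod_rates[self.phase % len(self.prod_rates)]
        out_q.extend(self.fn(tokens)[:m])
        self.phase += 1
        return True

# Example: cyclic consumption rates (2, 1), constant production rate 1,
# emitting the sum of each consumed batch.
src, out = deque([1, 2, 3, 4, 5, 6]), deque()
acc = CSDFActor("sum", [2, 1], [1], lambda toks: [sum(toks)])
while acc.fire(src, out):
    pass
# out == deque([3, 3, 9, 6])
```

Because the rate sequences are known statically, tools can compute buffer sizes and periodic schedules offline -- the determinism and composability the abstract refers to.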
617 A Case Study on Using a Proto-Application as a Proxy for Code Modernization [abstract]
Abstract: The current HPC system architecture trend consists in the use of many-core and heterogeneous architectures. Programming and runtime approaches struggle to scale with the growing number of nodes and cores. In order to take advantage of both distributed and shared memory levels, flat MPI seems unsustainable, and hybrid parallelization strategies are required. In a previous work, we demonstrated the efficiency of the D&C approach for the hybrid parallelization of finite element method assembly on unstructured meshes. In this paper, we introduce the concept of a proto-application as a proxy between computer scientists and application developers. The D&C library has been entirely developed on a proto-application, extracted from an industrial application called DEFMESH, and then ported back and validated on the original application. In the meantime, we have ported the D&C library to AETHER, an industrial fluid dynamics code developed by Dassault Aviation. The results show that the speed-up validated on the proto-application can be reproduced on other full-scale applications using similar computational patterns. Nevertheless, this experience draws attention to code modernization issues, such as data layout adaptation and memory management. As the D&C library uses a task-based runtime, we also compare Intel® Cilk™ Plus and OpenMP.
Nathalie Möller, Eric Petit, Loïc Thébault, Quang Dinh
422 A Methodology for Profiling and Partitioning Stream Programs on Many-core Architectures [abstract]
Abstract: Maximizing data throughput is a very common implementation objective for streaming applications. Such a task is particularly challenging for implementations based on many-core and multi-core target platforms because, in general, it implies tackling several NP-complete combinatorial problems. Moreover, an efficient design space exploration requires an accurate evaluation based on dataflow program execution profiling. The focus of the paper is on the methodological challenges of obtaining accurate profiling measures. Experimental results validate a many-core platform built from an array of Transport Triggered Architecture processors for exploring the partitioning search space based on execution trace analysis.
Malgorzata Michalska, Jani Boutellier, Marco Mattavelli
424 Execution Trace Graph Based Multi-Criteria Partitioning of Stream Programs [abstract]
Abstract: One of the problems proven to be NP-hard in the field of many-core architectures is the partitioning of stream programs. In order to maximize execution parallelism and obtain the maximal data throughput for a streaming application, it is essential to find an appropriate assignment of actors. The paper proposes a novel approach for finding a close-to-optimal partitioning configuration based on the execution trace graph of a dataflow network and its analysis. We present some aspects of dataflow programming that make the partitioning problem different in this paradigm and build our heuristic methodology on them. Our optimization criteria include: balancing the total processing workload with regard to data dependencies, minimizing actor idle time, and reducing data exchanges between processing units. Finally, we validate our approach with experimental results for a video decoder design case and compare them with some state-of-the-art solutions.
Malgorzata Michalska, Simone Casale-Brunet, Endri Bezati, Marco Mattavelli
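One ingredient of such partitioning, greedy workload balancing of actors across processing units, can be sketched as follows. This illustrates only the load-balancing criterion; the paper's trace-graph heuristic additionally weighs data dependencies, idle time, and inter-unit traffic (the actor names below are made up for illustration):

```python
import heapq

# Greedy longest-processing-time balancing: assign each actor, in order
# of decreasing workload, to the currently least-loaded processing unit.
# Sketch of one criterion only, not the paper's full heuristic.

def balance(workloads, n_units):
    """workloads: dict actor -> cost. Returns dict actor -> unit id."""
    heap = [(0.0, u) for u in range(n_units)]   # (current load, unit)
    heapq.heapify(heap)
    placement = {}
    for actor in sorted(workloads, key=workloads.get, reverse=True):
        load, unit = heapq.heappop(heap)        # least-loaded unit
        placement[actor] = unit
        heapq.heappush(heap, (load + workloads[actor], unit))
    return placement

# Hypothetical per-actor costs for a small decoder-like network.
loads = {"parse": 5, "idct": 9, "mc": 7, "vlc": 3}
placement = balance(loads, 2)
# Both units end up with a total workload of 12.
```

In the multi-criteria setting, the cost of a candidate placement would also include the communication volume between units, which is exactly what the execution trace graph makes measurable.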
365 A First Step to Performance Prediction for Heterogeneous Processing on Manycores [abstract]
Abstract: In order to maintain the continuous growth of computer performance while keeping energy consumption under control, the microelectronic industry develops architectures capable of processing more and more tasks concurrently. Thus, the next generations of microprocessors may count hundreds of independent cores that may differ in their functions and features. As an extensive knowledge of their internals cannot be a prerequisite to their programming, and for the sake of portability, these forthcoming computers require the compilation flow to evolve and cope with heterogeneity issues. In this paper, we lay a first step toward a possible solution to this challenge by exploring SPMD-type parallelism and predicting the performance of the compiled results, so that our tools can guide a compiler to build an optimal partition of tasks automatically, even on heterogeneous targets. Experimental results show that our tools predict real-world performance with very good accuracy.
Nicolas Benoit, Stephane Louise

Architecture, Languages, Compilation and Hardware support for Emerging ManYcore systems (ALCHEMY) Session 3

Time and Date: 16:40 - 18:20 on 1st June 2015

Room: M208

Chair: Stephane Louise

528 Towards an automatic co-generator for manycores’ architecture and runtime: STHORM case-study [abstract]
Abstract: The increasing design complexity of manycore architectures at the hardware and software levels requires powerful tools capable of validating every functional and non-functional property of the architecture. At the design phase, the chip architect needs to explore several parameters from the design space and iterate on different instances of the architecture in order to meet the defined requirements. Each new architectural instance requires the configuration and generation of a new hardware model/simulator, its runtime, and the applications that will run on the platform, which is a very long and error-prone task. In this context, the IP-XACT standard has become widely used in the semiconductor industry to package IPs and provide a low-level SW stack to ease their integration. In this work, we present preliminary work on a methodology for automatically configuring and assembling an IP-XACT golden model and generating the corresponding manycore architecture HW model, low-level software runtime, and applications. We use the STHORM manycore architecture and the HBDC application as a case study.
Charly Bechara, Karim Ben Chehida, Farhat Thabet
249 Retargeting of the Open Community Runtime to Intel Xeon Phi [abstract]
Abstract: The Open Community Runtime (OCR) is a recent effort in the search for a runtime for extreme scale parallel systems. OCR relies on the concept of a dynamically generated task graph to express the parallelism of a program. Rather than being directly used for application development, the main purpose of OCR is to become a low-level runtime for higher-level programming models and tools. Since manycore architectures like the Intel Xeon Phi are likely to play a major role in future high performance systems, we have implemented the OCR API for shared-memory machines, including the Xeon Phi. We have also implemented two benchmark applications and performed experiments to investigate the viability of the OCR as a runtime for manycores. Our experiments and a comparison with OpenMP indicate that OCR can be an efficient runtime system for current and emerging manycore systems.
Jiri Dokulil, Siegfried Benkner
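The dynamically generated task graph at the heart of OCR -- tasks that become runnable once all of their dependencies are satisfied -- can be sketched as follows. This is a Python illustration of the concept only; the real OCR API is a C interface and is not reproduced here:

```python
from collections import defaultdict, deque

# Tiny event-driven task-graph runtime in the spirit of OCR's
# event-driven tasks: a task runs once all its dependencies completed.
# Conceptual sketch, not the OCR API.

class TaskGraph:
    def __init__(self):
        self.unmet = {}                 # task -> number of unmet dependencies
        self.succ = defaultdict(list)   # task -> dependent tasks
        self.fn = {}

    def add_task(self, name, fn, deps=()):
        self.unmet[name] = len(deps)
        self.fn[name] = fn
        for d in deps:
            self.succ[d].append(name)

    def run(self):
        order = []
        ready = deque(t for t, n in self.unmet.items() if n == 0)
        while ready:
            t = ready.popleft()
            self.fn[t]()
            order.append(t)
            for s in self.succ[t]:      # satisfy dependents
                self.unmet[s] -= 1
                if self.unmet[s] == 0:
                    ready.append(s)
        return order

# Diamond-shaped graph: d depends on b and c, which both depend on a.
g, log = TaskGraph(), []
g.add_task("a", lambda: log.append("a"))
g.add_task("b", lambda: log.append("b"), deps=["a"])
g.add_task("c", lambda: log.append("c"), deps=["a"])
g.add_task("d", lambda: log.append("d"), deps=["b", "c"])
order = g.run()
```

On a real manycore, the ready queue would be served by many workers concurrently; the graph structure alone fixes the ordering constraints, which is what lets OCR act as a low-level target for higher-level models.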
14 Prefetching Challenges in Distributed Memories for CMPs [abstract]
Abstract: Prefetch engines in distributed memory systems behave independently, analyzing the memory accesses addressed to their attached slice of cache. They may generate prefetch requests targeted at any other tile in the system, depending on the computed address. This distributed behavior involves several challenges that are not present when the cache is unified. In this paper, we identify, analyze, and quantify the effects of these challenges, thus paving the way for future research on how to implement prefetching mechanisms at all levels of this kind of system with shared distributed caches.
Marti Torrents, Raul Martínez, Carlos Molina
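The root of the distributed-prefetching challenge described above is that, with a line-interleaved shared cache, the home tile of a predicted address may be remote from the prefetcher that predicted it. A minimal sketch, assuming 64-byte lines and a hypothetical 16-tile layout (both parameters are illustrative, not taken from the paper):

```python
# Line-interleaved home-tile mapping in a tiled CMP: the low bits of the
# cache-line index select the tile. Parameters are assumed for
# illustration, not taken from the paper.

LINE_SIZE = 64   # bytes per cache line (assumed)
N_TILES = 16     # tiles in the CMP (assumed)

def home_tile(addr):
    """Tile that owns the cache line containing addr."""
    return (addr // LINE_SIZE) % N_TILES

def next_line_prefetches(addr, degree=2):
    """Sequential prefetcher: the next `degree` lines with their home tiles.
    Requests whose home tile differs from the local one must cross the
    on-chip network, which a unified cache never has to do."""
    base = (addr // LINE_SIZE) * LINE_SIZE
    return [(base + i * LINE_SIZE, home_tile(base + i * LINE_SIZE))
            for i in range(1, degree + 1)]

# A miss on 0x1000 (home tile 0) predicts lines homed on tiles 1 and 2:
requests = next_line_prefetches(0x1000, degree=2)
# requests == [(0x1040, 1), (0x1080, 2)]
```

Since consecutive lines land on different tiles, even a simple sequential prefetcher scatters its requests across the chip, which motivates the coordination issues the paper quantifies.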