Architecture, Languages, Compilation and Hardware support for Emerging ManYcore systems (ALCHEMY) Session 2

Time and Date: 14:30 - 16:10 on 1st June 2015

Room: M208

Chair: Stephane Louise

529 An Empirical Evaluation of a Programming Model for Context-Dependent Real-time Streaming Applications [abstract]
Abstract: We present a Programming Model for real-time streaming applications on high-performance embedded multi- and many-core systems. Realistic streaming applications depend strongly on the execution context (usually the physical world), on previously learned strategies, and often on real-time constraints. The proposed Programming Model encompasses real-time requirements, determinism of execution, and context dependency. It extends the well-known Cyclo-Static Dataflow (CSDF) model, chosen for its desirable properties (determinism and composability), with two new data-flow filters, Select-duplicate and Transaction, which retain the main properties of CSDF graphs while providing useful features for implementing real-time embedded computational applications. We evaluate the performance of our Programming Model on several real-life case studies and demonstrate that our approach overcomes a range of limitations usually associated with CSDF models.
Xuan Khanh Do, Stephane Louise, Albert Cohen
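The CSDF model underlying this paper can be illustrated with a minimal sketch: each actor fires through a fixed, cyclically repeating sequence of consumption and production rates, which is what makes CSDF schedules statically analyzable. The actor name, the rates, and the placeholder computation below are illustrative assumptions, not taken from the paper, and the sketch does not model the proposed Select-duplicate or Transaction filters.

```python
from collections import deque

class CSDFActor:
    """A Cyclo-Static Dataflow actor: its consumption and production
    rates cycle through a fixed, statically known sequence of phases."""
    def __init__(self, name, consume_rates, produce_rates):
        self.name = name
        self.consume_rates = consume_rates  # tokens read per firing, cyclically
        self.produce_rates = produce_rates  # tokens written per firing, cyclically
        self.phase = 0

    def can_fire(self, in_fifo):
        # Enough tokens available for the current phase's consumption rate?
        return len(in_fifo) >= self.consume_rates[self.phase]

    def fire(self, in_fifo, out_fifo):
        c = self.consume_rates[self.phase]
        p = self.produce_rates[self.phase]
        tokens = [in_fifo.popleft() for _ in range(c)]
        for _ in range(p):
            out_fifo.append(sum(tokens))  # placeholder computation
        self.phase = (self.phase + 1) % len(self.consume_rates)

# Illustrative actor: consumes (1, 2) and produces (2, 1) over a 2-phase cycle.
src = deque(range(6))
out = deque()
actor = CSDFActor("A", consume_rates=[1, 2], produce_rates=[2, 1])
while actor.can_fire(src):
    actor.fire(src, out)
```

Because the rate sequences are known at compile time, buffer sizes and a periodic schedule can be derived statically, which is the determinism property the abstract refers to.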
617 A Case Study on Using a Proto-Application as a Proxy for Code Modernization [abstract]
Abstract: The current trend in HPC system architecture is the use of many-core and heterogeneous architectures. Programming and runtime approaches struggle to scale with the growing number of nodes and cores. Flat MPI seems unsustainable for taking advantage of both the distributed and the shared memory levels, so hybrid parallelization strategies are required. In a previous work we demonstrated the efficiency of the D&C approach for the hybrid parallelization of finite element method assembly on unstructured meshes. In this paper we introduce the concept of a proto-application as a proxy between computer scientists and application developers. The D&C library has been entirely developed on a proto-application, extracted from an industrial application called DEFMESH, and then ported back and validated on the original application. In the meantime, we have ported the D&C library to AETHER, an industrial fluid dynamics code developed by Dassault Aviation. The results show that the speed-up validated on the proto-application can be reproduced on other full-scale applications using similar computational patterns. Nevertheless, this experience draws attention to code modernization issues such as data layout adaptation and memory management. As the D&C library uses a task-based runtime, we also compare Intel(R) Cilk(TM) Plus and OpenMP.
Nathalie Möller, Eric Petit, Loïc Thébault, Quang Dinh
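The divide-and-conquer (D&C) task parallelism the abstract refers to can be sketched in spirit with Python futures standing in for Cilk-style spawn/sync: the work set is split recursively, one half is spawned as a task, and the results are joined. The function names, the recursion cutoff, and the quadratic placeholder work are illustrative assumptions; the actual D&C library operates on unstructured finite element meshes, which this sketch does not reproduce.

```python
from concurrent.futures import ThreadPoolExecutor

THRESHOLD = 4  # below this size, work sequentially (illustrative cutoff)

def assemble(elements):
    """Placeholder for per-element work (e.g., local matrix contributions)."""
    return sum(e * e for e in elements)

def dc_assemble(elements, pool):
    """Divide-and-conquer: split the element set, spawn the left half as a
    task (Cilk-style spawn), recurse on the right half, then join (sync)
    and combine the partial results."""
    if len(elements) <= THRESHOLD:
        return assemble(elements)
    mid = len(elements) // 2
    left = pool.submit(dc_assemble, elements[:mid], pool)  # spawn
    right = dc_assemble(elements[mid:], pool)
    return left.result() + right                           # sync

with ThreadPoolExecutor(max_workers=4) as pool:
    total = dc_assemble(list(range(16)), pool)
```

In Cilk Plus the spawn/sync pair would be `cilk_spawn`/`cilk_sync`; in OpenMP, `#pragma omp task` with `taskwait`. The recursive splitting also tends to improve data locality, which matters for the data layout issues the abstract mentions.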
422 A Methodology for Profiling and Partitioning Stream Programs on Many-core Architectures [abstract]
Abstract: Maximizing the data throughput is a very common implementation objective for streaming applications. This task is particularly challenging for implementations targeting many-core and multi-core platforms because, in general, it implies tackling several NP-complete combinatorial problems. Moreover, an efficient design space exploration requires an accurate evaluation based on profiling the execution of the dataflow program. The focus of this paper is on the methodological challenges of obtaining accurate profiling measures. Experimental results validate a many-core platform built from an array of Transport Triggered Architecture processors for exploring the partitioning search space based on execution trace analysis.
Malgorzata Michalska, Jani Boutellier, Marco Mattavelli
424 Execution Trace Graph Based Multi-Criteria Partitioning of Stream Programs [abstract]
Abstract: One of the problems proven to be NP-hard in the field of many-core architectures is the partitioning of stream programs. In order to maximize execution parallelism and obtain the maximal data throughput for a streaming application, it is essential to find an appropriate assignment of actors. This paper proposes a novel approach for finding a close-to-optimal partitioning configuration based on the execution trace graph of a dataflow network and its analysis. We present some aspects of dataflow programming that make the partitioning problem different in this paradigm and build our heuristic methodology on them. Our optimization criteria include balancing the total processing workload with regard to data dependencies, minimizing actor idle time, and reducing data exchanges between processing units. Finally, we validate our approach with experimental results for a video decoder design case and compare them with state-of-the-art solutions.
Malgorzata Michalska, Simone Casale-Brunet, Endri Bezati, Marco Mattavelli
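The workload-balancing objective shared by the two partitioning papers above can be illustrated with a simple greedy heuristic: assign each actor, heaviest first, to the currently least-loaded processing unit. The actor names and weights below are illustrative assumptions loosely inspired by a video decoder; the papers' actual methods rely on execution trace graph analysis and additional criteria (data dependencies, idle time, inter-unit communication) that this sketch does not capture.

```python
import heapq

def greedy_partition(actor_weights, n_units):
    """Longest-processing-time-first heuristic: repeatedly place the
    heaviest unassigned actor on the least-loaded processing unit."""
    units = [(0, i, []) for i in range(n_units)]  # (load, unit id, actors)
    heapq.heapify(units)
    for actor, w in sorted(actor_weights.items(), key=lambda kv: -kv[1]):
        load, i, members = heapq.heappop(units)  # least-loaded unit
        members.append(actor)
        heapq.heappush(units, (load + w, i, members))
    return sorted(units, key=lambda u: u[1])     # order by unit id

# Illustrative actor workloads (e.g., measured execution times in ms).
weights = {"parser": 8, "decode": 12, "idct": 10, "motion": 9, "display": 5}
parts = greedy_partition(weights, 2)
```

Such a heuristic balances only raw workload; trace-graph-based approaches refine it by accounting for which firings actually depend on each other at run time.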
365 A First Step to Performance Prediction for Heterogeneous Processing on Manycores [abstract]
Abstract: In order to maintain the continuous growth of computer performance while keeping energy consumption under control, the microelectronic industry develops architectures capable of processing more and more tasks concurrently. The next generations of microprocessors may thus contain hundreds of independent cores that differ in their functions and features. Since extensive knowledge of their internals cannot be a prerequisite to programming them, and for the sake of portability, these forthcoming computers require the compilation flow to evolve and cope with heterogeneity issues. In this paper, we take a first step toward a possible solution to this challenge by exploring SPMD-style parallelism and predicting the performance of the compiled results, so that our tools can guide a compiler to build an optimal partitioning of tasks automatically, even on heterogeneous targets. Experimental results show that our tools predict real-world performance with very good accuracy.
Nicolas Benoit, Stephane Louise