Architecture, Languages, Compilation and Hardware support for Emerging ManYcore systems (ALCHEMY) Session 1

Time and Date: 16:30 - 18:10 on 10th June 2014

Room: Bluewater I

Chair: Stéphane Louise

348 τC: C with Process Network Extensions for Embedded Manycores [abstract]
Abstract: Current and future embedded manycore targets bring complex and heterogeneous architectures with a large number of processing cores, making both parallel programming at this scale and understanding the architecture itself a daunting task. Process networks and other dataflow-based Models of Computation (MoC) are a good basis for presenting a universal model of the underlying manycore architecture to the programmer. If a language exposes a simple-to-grasp MoC consistently across architectures, the programmer can concentrate on optimizing the expression of parallelism in the application instead of porting a given code to a given system. Achieving this goal would provide the C-language equivalent for manycores. In this paper, we present τC, a process network extension to C, and its mapping to both a POSIX target and the P2012/STHORM platform, and show how the language offers an architecture-independent solution to this problem.
Thierry Goubier, Damien Couroussé, Selma Azaiez
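For readers unfamiliar with the process-network style this paper builds on, the following plain-C sketch is purely illustrative: the abstract does not show τC syntax, so this is not τC code, only a minimal model of the underlying MoC using POSIX threads. Two agents communicate through a bounded FIFO channel, and only channel reads and writes synchronize them.

/* Hypothetical illustration of a two-agent process network in plain C.
 * Not tau-C syntax: it only models the dataflow idea with POSIX threads. */
#include <pthread.h>
#include <stdio.h>

#define CAP 8            /* channel capacity, in tokens */
#define N_TOKENS 32      /* tokens produced before the network terminates */

typedef struct {
    int buf[CAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} channel_t;

static channel_t chan = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .not_full = PTHREAD_COND_INITIALIZER,
    .not_empty = PTHREAD_COND_INITIALIZER
};

/* Blocking write: the producing agent stalls when the channel is full. */
static void chan_push(channel_t *c, int v)
{
    pthread_mutex_lock(&c->lock);
    while (c->count == CAP)
        pthread_cond_wait(&c->not_full, &c->lock);
    c->buf[c->tail] = v;
    c->tail = (c->tail + 1) % CAP;
    c->count++;
    pthread_cond_signal(&c->not_empty);
    pthread_mutex_unlock(&c->lock);
}

/* Blocking read: the consuming agent stalls when the channel is empty. */
static int chan_pop(channel_t *c)
{
    pthread_mutex_lock(&c->lock);
    while (c->count == 0)
        pthread_cond_wait(&c->not_empty, &c->lock);
    int v = c->buf[c->head];
    c->head = (c->head + 1) % CAP;
    c->count--;
    pthread_cond_signal(&c->not_full);
    pthread_mutex_unlock(&c->lock);
    return v;
}

/* Source agent: produces a stream of tokens. */
static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < N_TOKENS; i++)
        chan_push(&chan, i * i);
    return NULL;
}

/* Sink agent: consumes tokens; no shared state besides the channel. */
static void *consumer(void *arg)
{
    (void)arg;
    for (int i = 0; i < N_TOKENS; i++)
        printf("token %d = %d\n", i, chan_pop(&chan));
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

Because the agents interact only through channels, the same network description can in principle be mapped to a POSIX target or a manycore fabric without changing the application code, which is the portability argument the abstract makes.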
96 Application-Level Performance Optimization: A Computer Vision Case Study on STHORM [abstract]
Abstract: Computer vision applications constitute one of the key drivers for embedded many-core architectures. To exploit the full potential of such systems, a balance between computation and communication is critical, but many computer vision algorithms exhibit highly data-dependent behavior that complicates this task. To enable application performance optimization, the development environment must provide the developer with tools for fast and precise application-level performance analysis. We describe the process of porting and optimizing a face detection application onto the STHORM many-core accelerator using the STHORM OpenCL SDK. We identify the main factors that limit performance and discern the contributions arising from the application itself, the OpenCL programming model, and the STHORM OpenCL SDK. Finally, we show how these issues can be addressed in the future to enable developers to further improve application performance.
Vítor Schwambach, Sébastien Cleyet-Merle, Alain Issard, Stéphane Mancini
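As a rough illustration of the kind of application-level measurement this work relies on, the helper below uses standard OpenCL profiling events to time a kernel. It is a generic sketch, not the STHORM SDK's own analysis tooling, and it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE and the kernel arguments are already set.

/* Generic OpenCL kernel-timing sketch (assumption: not STHORM-specific). */
#include <CL/cl.h>

/* Returns kernel execution time in milliseconds. */
double time_kernel_ms(cl_command_queue queue, cl_kernel kernel,
                      size_t global_size, size_t local_size)
{
    cl_event evt;
    cl_ulong start = 0, end = 0;

    /* Enqueue one 1-D NDRange and keep the event for profiling. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    /* Device-side timestamps, in nanoseconds. */
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof start, &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof end, &end, NULL);
    clReleaseEvent(evt);

    return (double)(end - start) * 1e-6;
}

Comparing such device-side times with host-side wall-clock measurements is one simple way to separate kernel computation from runtime and data-transfer overheads, which is the kind of breakdown the abstract describes.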
387 Generating Code and Memory Buffers to Reorganize Data on Many-core Architectures [abstract]
Abstract: The dataflow programming model has been shown to be a relevant approach for efficiently running massively parallel applications on many-core architectures. In this model, particular built-in agents are in charge of data reorganizations between user agents. Such agents can Split, Join and Duplicate data on their communication ports, and are widely used in signal processing, for example. These system agents, and their associated implementations, are of major importance when it comes to performance, because they can stand on the critical path (recall Amdahl's law). Furthermore, a particular data reorganization can be expressed by the developer in several ways, some of which may lead to inefficient solutions (mostly unneeded data copies and transfers). In this paper, we propose several strategies to manage data reorganization at compile time, with a focus on indexed accesses to shared buffers to avoid data copies. These strategies are complementary: they ensure correctness for every system agent configuration, as well as performance when possible. They have been implemented within the Sigma-C industry-grade compilation toolchain and evaluated on the Kalray MPPA 256-core processor.
Loïc Cudennec, Paul Dubrulle, François Galea, Thierry Goubier, Renaud Sirdey
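To illustrate why indexed accesses to shared buffers matter, the sketch below contrasts a copying Split with a strided view over the producer's buffer. It is a hypothetical example, not the code the Sigma-C toolchain actually generates; the function and type names are made up for the illustration.

/* Hypothetical contrast between a copying Split and an index-based Split. */
#include <stddef.h>

/* Copying Split: round-robin distribution of src into two private buffers.
 * The copy itself sits on the critical path of the application. */
void split_copy(const float *src, size_t n, float *out0, float *out1)
{
    for (size_t i = 0; i < n; i++) {
        if (i % 2 == 0)
            out0[i / 2] = src[i];
        else
            out1[i / 2] = src[i];
    }
}

/* Index-based Split: each consumer reads the shared producer buffer through
 * a strided view, so the reorganization costs no copy at run time. */
typedef struct {
    const float *base;   /* shared producer buffer */
    size_t offset;       /* first element visible to this consumer */
    size_t stride;       /* distance between successive visible elements */
} strided_view_t;

static inline float view_get(const strided_view_t *v, size_t i)
{
    return v->base[v->offset + i * v->stride];
}

/* Example: consumer 0 sees even indices, consumer 1 sees odd indices. */
void make_views(const float *shared, strided_view_t *v0, strided_view_t *v1)
{
    v0->base = shared; v0->offset = 0; v0->stride = 2;
    v1->base = shared; v1->offset = 1; v1->stride = 2;
}

Choosing between the two forms at compile time, depending on the agent configuration and the target memory system, is the kind of decision the strategies in this paper automate.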
359 Self-Timed Periodic Scheduling For Cyclo-Static DataFlow Model [abstract]
Abstract: Real-time and time-constrained applications programmed on many-core systems can suffer from unmet timing constraints even with correct-by-construction schedules. Such unexpected results are usually caused by unaccounted-for delays due to resource sharing (e.g., the communication medium). In this paper we address the three main sources of unpredictable behavior: first, we propose to use a deterministic Model of Computation (MoC), more specifically the well-formed CSDF subset of process networks; second, we propose a run-time management strategy for shared resources to avoid unpredictable timings; third, we promote the use of a new scheduling policy, the so-called Self-Timed Periodic (STP) scheduling, to improve performance and decrease synchronization costs by taking resource sharing and resource constraints into account. This is a quantitative improvement over state-of-the-art scheduling policies, which assume fixed inter-processor communication delays and do not correctly account for the subtle effects of synchronization.
Amira Dkhil Ep. Jemal, Xuan Khanh Do, Stéphane Louise, Paul Dubrulle, Christine Rochange
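As background on the CSDF model underlying this paper, the small made-up example below (not taken from the paper) checks the balance equation for a two-actor graph: actor A produces the cyclic token sequence (1, 2) on a channel that actor B consumes with the cyclic sequence (2, 1). A consistent solution guarantees that a bounded periodic schedule, such as an STP schedule, exists.

% Illustrative CSDF balance check for one channel A -> B, with q_A and q_B
% counting complete firing cycles of each actor.
\[
    q_A \sum_{k} p_A(k) \;=\; q_B \sum_{k} c_B(k)
    \;\Longrightarrow\;
    3\,q_A = 3\,q_B
    \;\Longrightarrow\;
    (q_A, q_B) = (1, 1)\ \text{cycles} \;=\; (2, 2)\ \text{firings}.
\]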