Architecture, Languages, Compilation and Hardware support for Emerging ManYcore systems (ALCHEMY) Session 1

Time and Date: 10:15 - 11:55 on 8th June 2016

Room: Macaw

Chair: Stephane Louise

546 Reinventing computing in the post-Moore era [abstract]
Abstract: After 50 years of unrelenting exponential Moore's Law progress, much of the high-tech industry has built in the assumption that the future will always bring cheaper, faster, better devices. The trillion-dollar question today is: what happens when the music stops? In this talk I will review "how we got here" and propose what will happen to the semiconductor and software industries post-Moore. Insights are based on experiences from the crowd-funded many-core Parallella computing platform and from Epiphany design work in leading-edge processes.
Andreas Olofsson
506 Advances in Run-Time Performance and Interoperability for the Adapteva Epiphany Coprocessor [abstract]
Abstract: The energy-efficient Adapteva Epiphany architecture exhibits massive many-core scalability in a physically compact 2D array of RISC cores with a fast network-on-chip (NoC). The architecture presents many features and constraints that contribute to software design challenges for the application developer. Addressing these challenges within the software stack that supports application development is critical to improving productivity and expanding the range of applications for the architecture. We report on advances made in the COPRTHR-2 software stack targeting the Epiphany architecture that address critical issues identified in previous work. Specifically, we describe improvements that bring greater control and precision to the design of compact compiled binary programs in the context of the architecture's limited per-core local memory. We describe a new design for run-time support that dramatically improves program load and execution performance and capabilities. Finally, we describe developments that advance host-coprocessor interoperability, expanding the functionality available to the application developer.
David Richie, James Ross
347 Implementing OpenSHMEM for the Adapteva Epiphany RISC Array Processor [abstract]
Abstract: The energy-efficient Adapteva Epiphany architecture exhibits massive many-core scalability in a physically compact 2D array of RISC cores with a fast network-on-chip (NoC). With fully divergent cores capable of MIMD execution, the physical topology and memory-mapped capabilities of the core and network translate well to partitioned global address space (PGAS) parallel programming models. Following an investigation into the use of two-sided communication using threaded MPI, one-sided communication using SHMEM is being explored. Here we present work in progress on the development of an OpenSHMEM 1.2 implementation for the Epiphany architecture.
James Ross, David Richie
462 Pattern Based Cache Coherency Architecture for Embedded Manycores [abstract]
Abstract: Modern parallel programming frameworks like OpenMP often rely on shared-memory concepts to harness the processing power of parallel systems. But on embedded devices, memory coherence protocols tend to account for a sizable portion of a chip's power consumption, so any means of lowering this impact is important. Our idea is to exploit the fact that most common workloads exhibit regular memory-access behavior in order to prefetch the relevant memory lines into the local caches of the execution cores of a manycore system. Our contributions are, on the one hand, the specification of a hardware IP for prefetching memory access patterns and, on the other hand, a hybrid protocol which extends the classic MESI/baseline architecture to reduce control- and coherence-related traffic by at least an order of magnitude. Evaluations on two benchmark programs show the potential of this approach.
Jussara Marandola, Stephane Louise, Loic Cudennec
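The pattern-based prefetching idea in the abstract above can be sketched in software. The following Python model is purely illustrative: the paper specifies a hardware IP, and every name, threshold and parameter here (line size, confirmation count, prefetch depth) is a hypothetical stand-in, not taken from the paper.

```python
# Illustrative software model of a constant-stride pattern detector:
# once a regular stride between successive accesses is confirmed, it
# predicts the cache lines a prefetcher would pull into a core's local cache.

class StridePrefetcher:
    """Detects constant-stride access patterns and predicts future lines."""

    def __init__(self, line_size=64, confirm=2, depth=4):
        self.line_size = line_size   # cache-line size in bytes (assumed)
        self.confirm = confirm       # identical strides needed to trigger
        self.depth = depth           # number of lines to prefetch ahead
        self.last_addr = None
        self.last_stride = None
        self.hits = 0

    def access(self, addr):
        """Record one access; return line-aligned addresses to prefetch."""
        prefetches = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.last_stride and stride != 0:
                self.hits += 1
            else:
                self.hits = 0
                self.last_stride = stride
            if self.hits >= self.confirm:
                # Pattern confirmed: predict the next `depth` lines.
                prefetches = [
                    (addr + k * self.last_stride) // self.line_size * self.line_size
                    for k in range(1, self.depth + 1)
                ]
        self.last_addr = addr
        return prefetches
```

A sequential scan with a 64-byte stride, for example, is confirmed after three strides, after which each access yields predictions for the next four lines.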

Architecture, Languages, Compilation and Hardware support for Emerging ManYcore systems (ALCHEMY) Session 2

Time and Date: 13:25 - 15:05 on 8th June 2016

Room: Macaw

Chair: Stephane Louise

488 Using Semantics-Aware Composition and Weaving for Multi-Variant Progressive Parallelization [abstract]
Abstract: When writing parallel software for high performance computing, a common practice is to start from a sequential variant of a program that is consecutively enriched with parallelization directives. This process - progressive parallelization - has the advantage that, at every point in time, a correct version of the program exists. However, progressive parallelization leads to an entanglement of concerns, especially if different variants of the same functional code have to be maintained and evolved concurrently. We propose orchestration style sheets (OSS) as a novel approach to separating parallelization concerns from problem-specific code by placing them in reusable style sheets, so that concerns for different platforms are always separated and never lead to entanglement. A weaving process automatically generates platform-specific code for the required target platforms, taking semantic properties of the source code into account. Based on a scientific computing case study for fluid mechanics, we show that OSS are an adequate way to improve maintainability and reuse of Fortran code parallelized for several different platforms.
Johannes Mey, Sven Karol, Uwe Aßmann, Immo Huismann, Joerg Stiller, Jochen Fröhlich
402 Evaluating Performance and Energy-Efficiency of a parallel Signal Correlation Algorithm on current Multi- and Many-Core Architectures [abstract]
Abstract: The increasing variety and affordability of multi- and many-core embedded architectures pose both a challenge and an opportunity to developers of high performance computing applications. In this paper we present a case study in which we develop and evaluate a unified parallel approach to a signal correlation algorithm currently in use in a commercial/industrial locating system. We utilize both the HPX C++ and CUDA runtimes to achieve scalable code for current embedded multi- and many-core architectures (NVIDIA Tegra, Intel Broadwell-M, ARM Cortex-A15). We also compare our approach against traditional high-performance hardware as well as a native embedded many-core variant. To increase the accuracy of our performance analysis we introduce a dedicated performance model. The results show that our approach is feasible and enables us to harness the advantages of modern micro-server architectures, but they also indicate limitations of some currently existing many-core embedded architectures that can lead to traditional hardware being superior in both efficiency and absolute performance.
Arne Hendricks, Thomas Heller, Andreas Schaefer, Maximilian Kasparek, Dietmar Fey
201 Tabu Search for Partitioning Dynamic Dataflow Programs [abstract]
Abstract: An important challenge of dataflow programming is the problem of partitioning dataflow components onto a target architecture. A common objective function for this problem is maximizing data-processing throughput. This NP-complete problem is very difficult to solve with high-quality, close-to-optimal solutions because of the very large size of the design space and the possibly large variability of input data. This paper introduces four variants of the tabu search metaheuristic developed expressly for partitioning the components of a dataflow program. The approach relies on a simulation tool capable of estimating the performance of any partitioning configuration by exploiting a model of the target architecture and profiling results. The partitioning solutions generated with tabu search are validated against experimental platform executions for consistency and high accuracy.
Malgorzata Michalska, Nicolas Zufferey, Marco Mattavelli
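The basic tabu-search move loop behind the abstract above can be sketched as follows. This is a minimal, hypothetical illustration: the paper's four variants use a simulation-based performance estimator, whereas the toy cost function here (maximum per-partition load plus cross-partition traffic) and all parameter values are assumptions made for the sketch.

```python
import random

def tabu_partition(actors, loads, comm, n_parts, iters=200, tenure=7, seed=0):
    """Assign each actor to a partition, minimizing a toy cost:
    max per-partition load + total cross-partition traffic.
    `loads[a]` is actor a's computation load; `comm[(a, b)]` is traffic a->b."""
    rng = random.Random(seed)
    assign = {a: rng.randrange(n_parts) for a in actors}

    def cost(asg):
        load = [0.0] * n_parts
        for a in actors:
            load[asg[a]] += loads[a]
        cut = sum(w for (a, b), w in comm.items() if asg[a] != asg[b])
        return max(load) + cut

    best, best_cost = dict(assign), cost(assign)
    tabu = {}  # (actor, partition) -> iteration until which the move is tabu
    for it in range(iters):
        # Evaluate all single-actor moves; keep the non-tabu ones
        # (aspiration: a tabu move is allowed if it beats the global best).
        candidates = []
        for a in actors:
            for p in range(n_parts):
                if p == assign[a]:
                    continue
                trial = dict(assign)
                trial[a] = p
                c = cost(trial)
                if tabu.get((a, p), -1) < it or c < best_cost:
                    candidates.append((c, a, p))
        if not candidates:
            continue
        c, a, p = min(candidates)
        tabu[(a, assign[a])] = it + tenure  # forbid moving back for a while
        assign[a] = p
        if c < best_cost:
            best, best_cost = dict(assign), c
    return best, best_cost
```

On a small instance with two tightly communicating actor pairs, the search settles on the balanced two-partition assignment that keeps each pair together; the tabu list is what lets it accept temporarily worse moves to escape a local optimum such as the all-in-one-partition state.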
283 A Partition Scheduler Model for Dynamic Dataflow Programs [abstract]
Abstract: The definition of an efficient scheduling policy is an important, difficult and open design problem for the implementation of applications based on dynamic dataflow programs, for which optimal closed-form solutions do not exist. This paper describes an approach based on studying the execution of a dynamic dataflow program on a target architecture under different scheduling policies. The method is based on a representation of the execution of a dataflow program with its associated dependencies, and on the cost of applying a given scheduling policy, expressed as the number of conditions that must be verified for a successful execution within each partition. The relation between the potential gain of an overall execution satisfying intrinsic data dependencies and the runtime cost of finding an admissible schedule is a key issue in finding close-to-optimal solutions to the scheduling problem of dynamic dataflow applications.
Malgorzata Michalska, Endri Bezati, Simone Casale Brunet, Marco Mattavelli
309 A Fast Evaluation Approach of Data Consistency Protocols within a Compilation Toolchain [abstract]
Abstract: Shared memory is a critical issue for large distributed systems. Although several data consistency protocols have been proposed, selecting the protocol that best suits the application requirements and system constraints remains a challenge. The development of multi-consistency systems, in which different protocols can be deployed at runtime, appears to be an interesting alternative. Exploring the design space of consistency protocols requires a fast and accurate evaluation method. In this work we rely on a compilation toolchain that transparently handles data consistency decisions for a multi-protocol platform. We focus on the analytical evaluation of the consistency configuration that stands within the optimization loop, and we propose using a TLM NoC simulator to get feedback on expected network contention. We evaluate the approach with five workloads and three different data consistency protocols, obtaining a fast and accurate evaluation of the different consistency alternatives.
Loïc Cudennec, Safae Dahmani, Guy Gogniat, Cédric Maignan, Martha Johanna Sepulveda
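The protocol-selection step described in the abstract above can be illustrated with a toy analytical model. This sketch is entirely hypothetical: the paper's toolchain uses a TLM NoC simulator for contention feedback, while the per-operation message costs below are invented textbook-style parameters (update protocols keep remote copies readable but pay per-sharer update traffic on writes; invalidate protocols pay cheaper invalidations but force re-fetches on reads).

```python
def estimate_messages(reads, writes, sharers, proto):
    """Estimated NoC messages for a workload under one protocol's cost model.
    All per-operation costs are hypothetical parameters, not measured values."""
    return (reads * proto["read_cost"]
            + writes * (proto["write_cost"] + sharers * proto["per_sharer_cost"]))

def best_protocol(workloads, protocols):
    """Pick, for each workload, the protocol with the lowest estimated traffic."""
    choice = {}
    for name, (reads, writes, sharers) in workloads.items():
        choice[name] = min(
            protocols,
            key=lambda p: estimate_messages(reads, writes, sharers, protocols[p]))
    return choice

# Hypothetical cost tables for two classic protocol families.
PROTOCOLS = {
    "write-invalidate": {"read_cost": 2, "write_cost": 1, "per_sharer_cost": 1},
    "write-update":     {"read_cost": 1, "write_cost": 1, "per_sharer_cost": 2},
}
```

With these made-up parameters, a read-heavy workload favors the update protocol (reads stay local) while a write-heavy one favors invalidation, which is the kind of per-workload trade-off a multi-protocol platform is meant to exploit.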