### Mathematical Methods and Algorithms for Extreme Scale (MMAES) Session 1

#### Time and Date: 11:20 - 13:00 on 10th June 2014

#### Room: Mossman

#### Chair: Vassil Alexandrov

227 | Fast Iterative Method in solving Eikonal equations : a multi-level parallel approach [abstract]Abstract: The fast marching method is widely used to solve the eikonal equation. By introducing a new way of managing propagation interfaces which avoid the use of expensive data structures, the fast iterative method reveals to be a faster variant with a higher parallel potential compared to the fast marching method. We investigate in this paper a multi-level parallel approach for the Fast Iterative Method which is well fitted for today heterogenous and hierarchical architectures. We show experiment results which focus on the fine-grained parallel level of the algorithm and we give a performance analysis. |
Florian Dang, Nahid Emad |

304 | A Parallel Implementation of Singular Value Decomposition for Video-on-Demand Services Design Using
Principal Component Analysis
[abstract]Abstract: We have developed a mathematical model for video on demand server design based on principal component analysis. Singular value decomposition on the video correlation matrix is used to perform the PCA. The challenge is to counter the computational complexity, which grows proportionally to n^3, where n is the number of video streams. We present a solution from high performance computing, which splits the problem up and computes it in parallel on a distributed memory system. |
Raul Ramirez-Velarde, Martin Roderus, Carlos Barba-Jimenez, Raul Perez-Cazares |

213 | Boosting FDTD Performance by Compiler- Generated Recursive Space-Time Update Schemes [abstract]Abstract: Traditional implementations of the explicit Finite-Difference Time-Domain (FDTD) solvers use layer by layer time update which requires reload of the whole mesh data into memory (and synchronization between processes for parallel solvers) at every time step. This type of update is cache inefficient and renders the task to be memory bound by the slowest bandwidth in the computer memory hierarchy. Depending on the task size and computer architecture, this can be cache, shared memory, interprocessor communication or disk I/O bandwidth. For large scale calculations, especially when mesh data size exceeds the total available processor memory, the traditional FDTD simulation becomes prohibitively inefficient. There exist alternative approaches to implement FDTD solvers that explore the whole time-space problem and utilize additional data locality due to the time dimension. It was shown that it is possible to reach the theoretical peak rate of algorithm efficiency for any given task size and compute architecture, even for problems with very large computational grids (10^12 Yee cells) [1]. Efficient usage of the whole
computer memory bandwidth hierarchy when implementing numerical algorithms is crucial for extreme scale computing since one may expect this hierarchy be rather diverse in the (future)
exascale computers.
In this work we present a systematic way of implementing a space-time update scheme which is based on a recursive decomposition of N+1 dimensional space-time data dependency graph of the whole calculation into smaller subgraphs. We compose a Locally Recursive Nonlocally Asynchronous (LRnLA) update algorithm [1]: each subgraph is split locally into similar subgraphs while processing of independent subgraphs may be performed concurrently.
For explicit stencils the dependency graph is locally governed by the stencil slope in coordinate-time plane. We use primitive triangular up- and down-pointed geometric shapes to represent the projections of the dependency graph on any coordinate-time plane and develop universal mnemonic rules to process the shapes for arbitrary space dimension. In our implementation these shapes and rules are encoded into C++ template class hierarchy by using boost fusion functionality [2], thus the update algorithm is generated mainly at compile time by a standard C++ compiler. This keeps the implementation highly portable and minimizes the overhead for run-time analysis of the data dependency graph. Termination of the recurrence (smallest subgraph) performs the actual FDTD stencil operation corresponding to a time update of a single mesh cell.
The resulting FDTD algorithm is cache oblivious [3] and also allows for concurrent execution of subtasks of different size. Concurrent task scheduling is governed by the dependencies between subgraphs which become known in course of the shape decomposition. Depending on the computer architecture the scheduling may simultaneously take into account different parallel execution levels such as MPI and multithreading. Concurrent execution mechanisms are switched on (programmed) for subgraphs reaching some suitable size (rank) in course of recursion.
In this presentation we discuss the implementation and analyze the performance of the implemented 3D FDTD algorithm for various computer architectures, including multicore systems and large clusters (up to 9000 cores). We demonstrate the FDTD update performance reaching up to 75% of the estimated CPU peak which is 10-30 times higher than that of the traditional FDTD solvers. We also demonstrate an almost perfect parallel scaling of the implemented solver. We discuss the effect of mesh memory layouts such as Z-curve (Morton order) increasing locality of data or interleaved layouts for vectorized updates. The implementation of the algorithm for GPU is discussed with some preliminary results.
[1] Zakirov A V and Levchenko V D 2009 PIERS proceedings, Moscow, Russia 580--584
[2] www.boost.org/libs/fusion/
[3] H. Prokop. Cache-Oblivious Algorithms. Masterâ€™s thesis, MIT. 1999. |
Ilya Valuev and Andrey Zakirov |

409 | Challenges of Big Data Mining [abstract]Abstract: At present, Big Data becomes reality that no one can ignore. Big Data is our environment whenever we need to make a decision. Big Data is a buzz word that makes everyone understands how important it is. Big Data shows a big opportunity for academia, industry and government. Big Data then is a big challenge for all parties. This talk will discuss some fundamental issues of Big Data problems, such as data heterogeneity vs. decision heterogeneity, data stream research and data-driven decision management. Furthermore, this talk will provide a number of real-life Bid Data Applications. In the conclusion, the talk suggests a number of open research problems in Data Science, which is a growing field beyond Big Data. |
Yong Shi |

410 | Scalable Stochastic and Hybrid Methods and Algorithms for Extreme Scale Computing [abstract]Abstract: Novel mathematics and mathematical modelling approaches together with scalable algorithms are needed to enable key applications at extreme-scale. This is especially true as HPC systems continue to scale up in compute node and processor core count. At the moment computational scientists are at the critical point/threshold of novel mathematics development as well as large-scale algorithm development and re-design and implementation that will affect most of the application areas. Thus the paper will focus on the mathematical and algorithmic challenges and approaches towards exascale and beyond and in particular on stochastic and hybrid methods that in turn lead to scalable scientific algorithms with minimal or no global communication, hiding network and memory latency, have very high computation/communication overlap, have no synchronization points. |
Vassil Alexandrov |