Large Scale Computational Physics (LSCP) Session 1

Time and Date: 10:15 - 11:55 on 2nd June 2015

Room: V102

Chair: Fukuko YUASA

757 Workshop on Large Scale Computational Physics - LSCP
Abstract: The LSCP workshop focuses on symbolic and numerical methods and simulations, algorithms and tools (software and hardware) for developing and running large-scale computations in the physical sciences. Special attention is given to parallelism, scalability and high numerical precision. System architectures are also of interest as long as they support physics-related calculations, such as: massively parallel systems, GPUs, many-integrated-core processors, distributed (cluster, grid/cloud) computing, and hybrid systems. Topics are chosen from areas including: theoretical physics (high energy physics, nuclear physics, astrophysics, cosmology, quantum physics, accelerator physics), plasma physics, condensed matter physics, chemical physics, molecular dynamics, bio-physical system modeling, material science/engineering, nanotechnology, fluid dynamics, complex and turbulent systems, and climate modeling.
Elise de Doncker, Fukuko Yuasa
96 The Particle Accelerator Simulation Code PyORBIT
Abstract: The particle accelerator simulation code PyORBIT is presented. The structure, implementation, history, parallel and simulation capabilities, and future development of the code are discussed. The PyORBIT code is a new implementation and extension of algorithms of the original ORBIT code, which was developed for the Spallation Neutron Source accelerator at Oak Ridge National Laboratory. The PyORBIT code has a two-level structure. The upper level uses the Python programming language to control the flow of the intensive calculations performed by the lower-level code implemented in C++. The parallel capabilities are based on MPI communications. PyORBIT is an open-source code accessible to the public through the Google Open Source Projects Hosting service.
Andrei Shishlo
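The two-level pattern described in the abstract, a Python upper level steering a compiled lower level, can be sketched generically. The snippet below is a minimal illustration of that pattern only; it does not use PyORBIT's actual API, and it borrows the standard C math library (loaded via ctypes) as a stand-in for the C++ layer.

    import ctypes
    import ctypes.util

    # lower level: a compiled C library (here libm, standing in for C++)
    libm = ctypes.CDLL(ctypes.util.find_library("m"))
    libm.sin.restype = ctypes.c_double
    libm.sin.argtypes = [ctypes.c_double]

    def track(particles, n_turns):
        """Upper level: Python controls the flow of the calculation,
        delegating each numerically intensive kick to compiled code."""
        for _ in range(n_turns):
            particles = [p + 0.01 * libm.sin(p) for p in particles]
        return particles

    print(track([0.1, 0.2, 0.3], n_turns=100))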
115 Simulations of several finite-sized objects in plasma
Abstract: The interaction of plasma with finite-sized objects is one of the central problems in the physics of plasmas. Since object charging is often nonlinear and involved, it is advisable to address this problem with numerical simulations. First-principles simulations allow studying the trajectories of charged plasma particles in self-consistent force fields. One such approach is the particle-in-cell (PIC) method, in which the use of a spatial grid for the force calculation significantly reduces the computational complexity. Implementing finite-sized objects in PIC simulations is often a challenging task. In this work we present simulation results and discuss the numerical representation of objects in the DiP3D code, which enables studies of several independent objects in various plasma environments.
Wojciech Miloch
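The grid-based force calculation that gives PIC its efficiency can be shown in a few lines. The following is a minimal 1D electrostatic sketch in Python (charge deposition plus a periodic field solve) with schematic units; it is not DiP3D's actual code and contains no object boundaries.

    import numpy as np

    L, M, N = 1.0, 64, 10000       # domain length, grid cells, particles
    dx = L / M
    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, L, N)     # particle positions

    # cloud-in-cell deposition: each unit charge is shared linearly between
    # its two nearest grid nodes; this grid step replaces an O(N^2)
    # pairwise force sum
    xg = x / dx
    j = xg.astype(int)
    w = xg - j                     # fractional distance to the left node
    rho = np.zeros(M)
    np.add.at(rho, j % M, 1.0 - w)
    np.add.at(rho, (j + 1) % M, w)
    rho = rho / dx - N / L         # subtract a neutralizing background

    # periodic Poisson solve on the grid via FFT, then E = -dphi/dx
    k = 2.0 * np.pi * np.fft.fftfreq(M, d=dx)
    rho_k = np.fft.fft(rho)
    phi_k = np.zeros_like(rho_k)
    phi_k[1:] = rho_k[1:] / k[1:] ** 2
    E = np.real(np.fft.ifft(-1j * k * phi_k))

Particle forces are then gathered from E at the particle positions with the same linear weights, and positions and velocities are advanced in time.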
196 DiamondTorre GPU implementation algorithm of the RKDG solver for fluid dynamics and its use for the numerical simulation of the bubble-shock interaction problem
Abstract: In this paper a solver based upon the Runge-Kutta discontinuous Galerkin (RKDG) method for the three-dimensional Euler equations of gas dynamics is considered. For the numerical scheme, the GPU implementation algorithm called DiamondTorre is used, which improves the speed of the calculations. The problem of the interaction of a spherical bubble with a planar shock wave is considered in the three-dimensional setting. The obtained calculations are in agreement with known results of experiments and numerical simulations. The calculation results were obtained on a single PC.
Boris Korneev, Vadim Levchenko
460 Optimal Temporal Blocking for Stencil Computation
Abstract: Temporal blocking is a class of algorithms that reduces the required memory traffic per floating-point operation (the B/F ratio) of a given stencil computation by “blocking” multiple time steps. In this paper, we prove that a lower limit exists for the reduction of the B/F ratio attainable by temporal blocking, under certain conditions. We introduce PiTCH tiling, an example of a temporal blocking method that achieves the optimal B/F ratio. We estimate the performance of PiTCH tiling for various stencil applications on several modern CPUs. We show that PiTCH tiling achieves 1.5 ∼ 2 times better B/F reduction in three-dimensional applications compared to other temporal blocking schemes. We also show that PiTCH tiling can remove the bandwidth bottleneck from most of the stencil applications considered.
Takayuki Muranushi, Junichiro Makino
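The mechanism by which temporal blocking reduces the B/F ratio can be sketched directly: a tile is loaded once and advanced several time steps while it stays in fast memory, so each cell is read from main memory once per T steps instead of once per step. The Python sketch below uses a simple overlapping-tile scheme for a 1D three-point stencil; it illustrates the idea only and is not the PiTCH tiling.

    import numpy as np

    def step(u):                   # one step of a periodic 3-point stencil
        return 0.25 * np.roll(u, 1) + 0.5 * u + 0.25 * np.roll(u, -1)

    def blocked(u, T, W):
        """Advance u by T steps, tile by tile. Each tile of width W is
        read once with T ghost cells per side, stepped T times while it
        stays 'in cache', and written back once."""
        n = len(u)
        out = np.empty_like(u)
        for s in range(0, n, W):
            idx = np.arange(s - T, s + W + T) % n   # tile + ghost cells
            tile = u[idx]
            for _ in range(T):     # trapezoid shrinks by 1 cell/side/step
                tile = 0.25 * tile[:-2] + 0.5 * tile[1:-1] + 0.25 * tile[2:]
            out[s:s + W] = tile
        return out

    u = np.random.default_rng(1).random(1024)
    ref = u.copy()
    for _ in range(4):
        ref = step(ref)
    assert np.allclose(blocked(u, T=4, W=64), ref)

The redundant ghost-cell reads are the price of blocking; schemes such as PiTCH tiling aim at the best achievable trade-off between that overhead and the T-fold reduction in main-memory traffic.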

Large Scale Computational Physics (LSCP) Session 2

Time and Date: 14:10 - 15:50 on 2nd June 2015

Room: V102

Chair: Fukuko YUASA

684 A Case Study of CUDA Fortran and OpenACC for an Atmospheric Climate Kernel
Abstract: The porting of a key kernel in the tracer advection routines of the Community Atmosphere Model - Spectral Element (CAM-SE) to Graphics Processing Units (GPUs) using OpenACC is considered in comparison to an existing CUDA Fortran port. The development of the OpenACC kernel for GPUs was substantially simpler than that of the CUDA port. Also, the OpenACC kernel ran about 1.5x slower than the optimized CUDA version. Particular focus is given to compiler maturity regarding OpenACC implementations for modern Fortran, and it is found that the Cray implementation is currently more mature than the PGI implementation. Still, for the case that ran successfully on PGI, the PGI OpenACC runtime was slightly faster than Cray's. The results show encouraging performance for the OpenACC implementation compared to CUDA, while also exposing some issues that must be addressed before the implementations are suitable for porting all of CAM-SE. Most notably, future OpenACC implementations should be able to use GPU shared memory, and derived-type support should be expanded.
Matthew Norman, Jeffrey Larkin, Aaron Vose and Katherine Evans
585 OpenCL vs OpenACC: lessons from development of lattice QCD simulation code
Abstract: OpenCL and OpenACC are generic frameworks for heterogeneous programming using CPUs and accelerator devices such as GPUs. They have contrasting features: the former explicitly controls the devices through API functions, while the latter generates such procedures guided by directives inserted by the programmer. In this paper, we apply these two frameworks to a general-purpose code set for numerical simulations of lattice QCD, a computational approach to the physics of elementary particles based on the Monte Carlo method. The fermion matrix inversion, which is usually the most time-consuming part of lattice QCD simulations, is offloaded to the accelerator devices. From the viewpoint of constructing reusable components based on object-oriented programming, and of tuning the code to achieve high performance, we discuss the feasibility of these frameworks through practical implementations.
Hideo Matsufuru, Sinya Aoki, Tatsumi Aoyama, Kazuyuki Kanaya, Shinji Motoki, Yusuke Namekawa, Hidekatsu Nemura, Yusuke Taniguchi, Satoru Ueda, Naoya Ukita
515 Application of GRAPE9-MPX for high precision calculation in particle physics and performance results
Abstract: There are scientific applications, such as Feynman loop integrals and orbital integrations, which require calculations with high precision. These calculations also need to be accelerated. We have been developing dedicated accelerator systems which consist of processing elements (PEs) for high-precision arithmetic operations and a programming interface. GRAPE9-MPX is our latest system, with multiple Field Programmable Gate Array (FPGA) boards on which the PEs we developed are implemented. We present performance results for GRAPE9-MPX extended to up to 16 FPGA boards for quadruple/hexuple/octuple precision, with some optimizations. The achieved performance for a Feynman loop integral with 12 FPGA boards is 26.5 Gflops in quadruple precision. We also give an analytical consideration of the performance results.
Hiroshi Daisaka, Naohito Nakasato, Tadashi Ishikawa, Fukuko Yuasa
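The precision dependence that motivates such hardware can be reproduced in software with Python's mpmath package (GRAPE9-MPX implements the arithmetic in FPGA hardware instead; this snippet is purely illustrative and does not use the GRAPE9-MPX programming interface). A textbook cancellation, (1 - cos(x))/x^2 for small x, loses essentially all significant digits at double-precision-like settings but is accurate at quadruple precision and beyond:

    from mpmath import mp, mpf, cos

    # (1 - cos(x)) / x^2 -> 1/2 as x -> 0, but the subtraction 1 - cos(x)
    # cancels catastrophically when too few digits are carried
    for dps in (16, 32, 48):   # roughly double, quadruple, hexuple digits
        mp.dps = dps           # working precision in decimal digits
        x = mpf(10) ** -8
        print(dps, (1 - cos(x)) / x ** 2)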
734 Adaptive Integration for 3-loop Feynman Diagrams with Massless Propagators
Abstract: We apply multivariate adaptive integration to problems arising from self-energy Feynman loop diagrams with massless internal lines. Results are obtained with the ParInt integration software package, which is layered over MPI (Message Passing Interface) and incorporates advanced parallel computation techniques such as load balancing among processes that may be distributed over a network of nodes. To solve the problems numerically we introduce a parameter r into a factor of the integrand function. Some problem categories allow setting r = 0; other cases require an extrapolation as r -> 0. Furthermore, we apply extrapolation with respect to the dimensional regularization parameter by setting the dimension n = 4 - 2*eps and extrapolating as eps -> 0. Timing results show near-optimal parallel speedups with ParInt for the problems at hand.
Elise de Doncker, Fukuko Yuasa, Omofolakunmi Olagbemi
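The two limits taken in the abstract (r -> 0 and eps -> 0) follow the same pattern: the regulated quantity is computed at a geometric sequence of parameter values, and the limit is recovered from an assumed expansion. A minimal Python sketch, assuming f(eps) = a0 + a1*eps + a2*eps^2 + ... and solving for a0 (an illustration of the pattern, not ParInt's extrapolation code):

    import numpy as np

    def f(eps):                          # stand-in for a regulated integral
        return np.sin(eps) / eps         # true eps -> 0 limit is 1

    eps = 0.5 * 2.0 ** -np.arange(6)     # geometric sequence 0.5, 0.25, ...
    V = np.vander(eps, increasing=True)  # columns: 1, eps, eps^2, ...
    a = np.linalg.solve(V, f(eps))       # fit the assumed expansion exactly
    print(a[0])                          # extrapolated limit, close to 1

In practice the number of samples is increased and the stability of a0 across successive fits is monitored, since the computed integral values themselves carry numerical error.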