Large Scale Computational Physics (LSCP) Session 2

Time and Date: 14:10 - 15:50 on 2nd June 2015

Room: V102

Chair: Fukuko YUASA

684 A Case Study of CUDA FORTRAN and OpenACC for an Atmospheric Climate Kernel [abstract]
Abstract: The porting of a key kernel in the tracer advection routines of the Community Atmosphere Model - Spectral Element (CAM-SE) to Graphics Processing Units (GPUs) using OpenACC is considered in comparison to an existing CUDA FORTRAN port. Development of the OpenACC kernel was substantially simpler than that of the CUDA port, while OpenACC performance was about 1.5x slower than the optimized CUDA version. Particular focus is given to compiler maturity regarding OpenACC implementation for modern Fortran, and it is found that the Cray implementation is currently more mature than the PGI implementation. Still, for the case that ran successfully on PGI, the PGI OpenACC runtime was slightly faster than Cray's. The results show encouraging performance for OpenACC compared to CUDA while also exposing some issues that may need to be addressed before the implementations are suitable for porting all of CAM-SE. Most notable are that GPU shared memory should be used by future OpenACC implementations and that derived type support should be expanded.
Matthew Norman, Jeffrey Larkin, Aaron Vose and Katherine Evans
585 OpenCL vs OpenACC: lessons from development of lattice QCD simulation code [abstract]
Abstract: OpenCL and OpenACC are generic frameworks for heterogeneous programming using CPUs and accelerator devices such as GPUs. They have contrasting features: the former controls devices explicitly through API functions, while the latter generates such procedures from directives inserted by the programmer. In this paper, we apply these two frameworks to a general-purpose code set for numerical simulations of lattice QCD, a computational approach to the physics of elementary particles based on the Monte Carlo method. The fermion matrix inversion, usually the most time-consuming part of lattice QCD simulations, is off-loaded to the accelerator devices. We discuss the feasibility of these frameworks through practical implementations, from the viewpoints of constructing reusable components based on object-oriented programming and of tuning the code to achieve high performance.
Hideo Matsufuru, Sinya Aoki, Tatsumi Aoyama, Kazuyuki Kanaya, Shinji Motoki, Yusuke Namekawa, Hidekatsu Nemura, Yusuke Taniguchi, Satoru Ueda, Naoya Ukita
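The fermion matrix inversion mentioned in the abstract above is typically performed with an iterative Krylov solver. A minimal, framework-agnostic sketch of the conjugate gradient method in Python (not the authors' code: a small dense symmetric positive-definite matrix stands in for the huge sparse fermion matrix, and no OpenCL/OpenACC off-loading is shown):

```python
# Minimal conjugate-gradient sketch for solving A x = b, illustrating the
# kind of iterative matrix inversion that lattice QCD codes off-load to
# accelerator devices. NOT the paper's implementation: the 3x3 matrix below
# is a stand-in for the fermion matrix.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def conjugate_gradient(A, b, tol=1e-12, max_iter=100):
    x = [0.0] * len(b)
    r = b[:]                      # residual b - A x (x starts at 0)
    p = r[:]
    rs = dot(r, r)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Symmetric positive-definite stand-in for the fermion matrix:
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = conjugate_gradient(A, b)
```

In a production lattice code, the matrix-vector product dominates the run time, which is why it is the natural candidate for off-loading in either framework.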
515 Application of GRAPE9-MPX for high precision calculation in particle physics and performance results [abstract]
Abstract: Some scientific applications, such as Feynman loop integrals and orbital integrations, require calculations with high precision, and these calculations also need to be accelerated. We have been developing dedicated accelerator systems which consist of processing elements (PEs) for high-precision arithmetic operations and a programming interface. GRAPE9-MPX is our latest system, with multiple Field Programmable Gate Array (FPGA) boards on which our PEs are implemented. We present performance results for GRAPE9-MPX, extended to up to 16 FPGA boards, for quadruple/hexuple/octuple precision with some optimization. The achieved performance for a Feynman loop integral with 12 FPGA boards is 26.5 Gflops in quadruple precision. We also give an analytical account of the performance results.
Hiroshi Daisaka, Naohito Nakasato, Tadashi Ishikawa, Fukuko Yuasa
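Why precision beyond IEEE double matters for such integrals can be illustrated generically with a cancellation example. This is a hedged sketch using Python's standard decimal module, not the dedicated FPGA arithmetic described above:

```python
# Illustration of double vs. extended precision on a cancellation-prone
# expression. Generic sketch with Python's stdlib `decimal`; the paper's
# hardware implements quadruple/hexuple/octuple precision in FPGA PEs.
from decimal import Decimal, getcontext

# Catastrophic cancellation: in IEEE double, the 1e-20 term is lost
# entirely when added to 1.
double_result = (1.0 + 1e-20) - 1.0

# With 34 significant digits (comparable to quadruple precision),
# the small term survives the subtraction.
getcontext().prec = 34
quad_result = (Decimal(1) + Decimal("1e-20")) - Decimal(1)

print(double_result)   # 0.0
print(quad_result)     # 1E-20
```

Feynman loop integrands can exhibit exactly this kind of cancellation near singular regions, which is why software or hardware extended precision becomes necessary.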
734 Adaptive Integration for 3-loop Feynman Diagrams with Massless Propagators [abstract]
Abstract: We apply multivariate adaptive integration to problems arising from self-energy Feynman loop diagrams with massless internal lines. Results are obtained with the ParInt integration software package, which is layered over MPI (Message Passing Interface) and incorporates advanced parallel computation techniques such as load balancing among processes that may be distributed over a network of nodes. To solve the problems numerically, we introduce a parameter r in a factor of the integrand function. Some problem categories allow setting r = 0; other cases require an extrapolation as r -> 0. Furthermore, we apply extrapolation with respect to the dimensional regularization parameter by setting the dimension n = 4 - 2*eps and extrapolating as eps -> 0. Timing results show near-optimal parallel speedups with ParInt for the problems at hand.
Elise de Doncker, Fukuko Yuasa, Omofolakunmi Olagbemi
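The extrapolation step in the abstract above can be illustrated generically: evaluate the quantity at a decreasing sequence of parameter values and accelerate the limit with polynomial (Richardson-type) extrapolation. A minimal sketch in Python, with a made-up model function standing in for the computed integral (this is not ParInt, which uses its own linear and nonlinear extrapolation schemes):

```python
# Generic sketch of extrapolation to eps -> 0, in the spirit of (but not
# identical to) the extrapolation used with ParInt. The model function below
# is hypothetical: f(eps) = 2 + 3*eps + 5*eps**2, so the target limit is 2.

def f(eps):
    return 2.0 + 3.0 * eps + 5.0 * eps ** 2

def neville_limit(xs, ys):
    """Polynomial extrapolation of the points (xs, ys) to x = 0
    using Neville's in-place scheme."""
    t = ys[:]
    n = len(xs)
    for k in range(1, n):
        for i in range(n - k):
            # Interpolating polynomial through points i..i+k, evaluated at 0.
            t[i] = (xs[i + k] * t[i] - xs[i] * t[i + 1]) / (xs[i + k] - xs[i])
    return t[0]

eps_seq = [1.0 / 2 ** j for j in range(1, 6)]   # 1/2, 1/4, ..., 1/32
values = [f(e) for e in eps_seq]
limit = neville_limit(eps_seq, values)
print(limit)   # close to 2.0
```

For a polynomial model the extrapolated value is exact up to rounding; in practice the integral values at small eps (or r) carry integration error, which is why adaptive refinement and extrapolation must be balanced against each other.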