Evaluating and Optimizing the NERSC Workload on Knights Landing

Taylor Barnes, Brandon Cook, Jack Deslippe*, Douglas Doerfler, Brian Friesen, Yun (Helen) He, Thorsten Kurth, Tuomas Koskela, Mathieu Lobet, Tareq Malas, Leonid Oliker, Andrey Ovsyannikov, Abhinav Sarje, Jean-Luc Vay, Henri Vincenti, Samuel Williams (Lawrence Berkeley National Lab)
Pierre Carrier, Nathan Wichmann, Marcus Wagner (Cray Inc.)
Paul Kent (Oak Ridge National Lab)
Christopher Kerr (NCAR Consultant)
John Dennis (National Center for Atmospheric Research)
*Corresponding Author

Abstract—NERSC has partnered with 20 representative application teams to evaluate performance on the Xeon-Phi Knights Landing architecture and develop an application-optimization strategy for the greater NERSC workload on the recently installed Cori system. In this article, we present early case studies and summarized results from a subset of the 20 applications, highlighting the impact of important architecture differences between the Xeon-Phi and traditional Xeon processors. We summarize the status of the applications and describe the greater optimization strategy that has formed.

I. INTRODUCTION

NERSC [1] is the production HPC facility for the U.S. DOE Office of Science. The center supports over 6000 users with more than 600 unique applications in a wide variety of science domains. HPC systems deployed at NERSC must support a diverse workload and broad user base but also need to satisfy an increasing demand for cycles from this community, as spelled out in the requirements studies of the various science domains [2]. In addition to meeting the science demand, NERSC is transitioning to an energy-efficient data center and requires high-performing, energy-efficient architectures. This transition has recently begun at NERSC with the installation of the Cori system, a Cray XC40 system powered by 9300+ Intel Xeon-Phi "Knights Landing" (KNL) processors added to an existing Cori Phase 1 system powered by 1500+ Xeon (Haswell) processors.

In order to prepare user applications for Cori's KNL processors, NERSC has developed an "application-readiness" strategy involving both in-depth work with strategic application partners as well as an extensive training program for the greater NERSC community.

The application-readiness effort is engaging 20 representative application teams (Table I) across significant domain areas and algorithmic paradigms to develop optimization case studies and relevant expertise that can then be used to develop an optimization strategy for general applications at NERSC. Towards this end, NERSC has established the NERSC Exascale Science Application Program (NESAP), a collaboration of NERSC staff along with experts at Cray and Intel, and the representative application teams, with the goal of porting and optimizing the selected applications to the KNL architecture.

For the purposes of testing and optimizing performance on KNL with access to only a small number of KNLs, application teams put together single-node runs that were representative of projected science problems on the full Cori system. In some cases, representative kernels were extracted to match the single-node work of the large science runs at scale.

In this article, we discuss the lessons learned from optimizing a number of these representative applications and kernels and discuss the impact of key KNL hardware features (including missing features). Finally, we discuss how the application-readiness process has informed the creation of an overall optimization strategy at NERSC targeting the broader user community.
TABLE I
NESAP TIER 1 AND 2 APPLICATIONS

  Application        Science Area        Algorithm
  Boxlib             Multiple            AMR
  EBChombo           Multiple            AMR
  CESM               Climate             Grid
  ACME               Climate             Grid
  MPAS-O             Ocean               Grid
  Gromacs            Chemistry/Biology   MD
  Meraculous         Genomics            Assembly
  NWChem             Chemistry           PWDFT
  PARSEC             Material Sci.       RSDFT
  Quantum ESPRESSO   Material Sci.       PWDFT
  BerkeleyGW         Material Sci.       MBPT
  EMGeo              Geosciences         Krylov Solver
  XGC1               Fusion              PIC
  WARP               Accelerators        PIC
  M3D                Fusion              CD/PIC
  HACC               Astrophysics        N-Body
  MILC               HEP                 QCD
  Chroma             Nuclear Physics     QCD
  DWF                HEP                 QCD
  MFDN               Nuclear Physics     Sparse LA

II. NOVEL FEATURES OF KNIGHTS-LANDING RELEVANT TO APPLICATIONS

For the last couple of decades, NERSC has fielded massively parallel distributed-memory systems powered by traditional x86 processors. The previous largest NERSC system, Edison, is powered by over 5500 dual-socket Xeon (Ivy-Bridge) nodes with 12 cores per socket. The majority of CPU cycles on Edison are utilized in applications whose parallelism is expressed by MPI, with a growing minority of applications additionally expressing on-node parallelism via OpenMP. A small but growing number of applications are also written using a PGAS and/or task-based programming model.

In comparison to Edison, the Cori KNL-based system contains a number of important architectural differences that require attention from application developers. We list some of the most important hardware features below:

1. Many cores - The Xeon-Phi 7250 processors powering the Cori system contain 68 physical cores with 4 hardware threads per core, for a total of 272 hardware threads per processor. This is nearly 3 times the number of cores per node as Edison (24) and close to 6 times the number of hardware threads. These cores have an AVX frequency (1.2 GHz) that is half that of Edison's "Ivy-Bridge" processor (2.4 GHz).

2. Configurable on-chip interconnect - The cores are interconnected in a 2D mesh, with each mesh point comprising two cores (a tile) sharing an L2 cache. The mesh can be configured in a number of NUMA configurations. For this study, we concentrate on "quad mode", where the cores and DDR4 are logically contained in one NUMA domain and the MCDRAM in a second, but the distributed tag directory for each memory controller is localized within four quadrants.

3. Wide vector units - Each core contains two 512-bit SIMD units (supporting the AVX512 ISA) with fused multiply-add (FMA) support, enabling 32 double-precision FLOPs per cycle (see the note following this list), compared to Edison's 8 FLOPs per cycle.

4. On-package MCDRAM - The Knights Landing processor has 1 MB of L2 cache per tile (shared between two compute cores) but lacks a shared L3 cache. On the other hand, it contains 16 GB of on-package MCDRAM with a bandwidth of 450 GB/s as measured by STREAM (compared to 85 GB/s from the DDR4 memory subsystem). The MCDRAM can be configured as addressable memory, as a direct-mapped cache, or split between these configurations.
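For reference, the 32 double-precision FLOPs per cycle quoted in item 3 can be decomposed directly from the vector hardware described there:

  2 \;(\text{512-bit VPUs per core}) \times 8 \;(\text{doubles per 512-bit register}) \times 2 \;(\text{multiply and add per FMA}) = 32 \;\text{FLOP/cycle}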
We discuss in the following sections a number of optimization case studies targeting the above hardware features and summarize key results across applications in the final section. Unless otherwise noted, our KNL numbers were collected on single Xeon-Phi 7210 processors, which have a slightly lower AVX frequency of 1.1 GHz and 64 cores rather than the 68 on the 7250 parts in Cori. Our comparisons were made against single-node runs on the Cori Phase 1 system, whose nodes are powered by two 16-core Xeon E5-2698 (Haswell) processors clocked at 2.3 GHz.

III. CASE STUDY 1 - QUANTUM ESPRESSO

Quantum ESPRESSO (QE) is a suite for performing plane-wave density functional theory (DFT) simulations [3], [4] of systems in materials science and chemistry [5]. We focus here on the performance of calculations employing hybrid functionals, which are often important for the accurate calculation of charge distributions [6], [7], band gaps [6], [8], [9], polarizabilities [10], and structural properties [8], [11]. Such calculations are dominated by the cost of computing the action of the exchange operator, K̂, on the occupied electron orbitals, {ψ_i} [12], [13], [14]:

  (\hat{K}\psi_i)(\mathbf{r}) = -\sum_{j=1}^{n_{\mathrm{occ}}} \psi_j(\mathbf{r}) \int \frac{\psi_j(\mathbf{r}')\,\psi_i(\mathbf{r}')}{|\mathbf{r}' - \mathbf{r}|}\, d\mathbf{r}'    (1)

where n_occ is the number of occupied bands. Eq. 1 must be calculated for each occupied orbital, and thus requires the calculation of a total of n_occ^2 integrals. In QE, these integrals are calculated by the subroutine vexx through the use of FFTs, as shown in Algorithm 1. QE facilitates two primary approaches for parallelizing the evaluation of Eq. 1: (1) plane-wave parallelization, in which the work of computing each individual FFT is distributed across many processes, and (2) band parallelization, in which multiple FFTs are simultaneously computed by different sets of processes. In previous work [15] within the NESAP effort, we have expanded upon the existing band parallelization through the introduction of a "band pair" parallelization algorithm, along with other enhancements that enable optimal parallelization of the calculation of exact exchange without negatively impacting the parallelization of other regions of the code. These modifications have led to significant improvements in the strong-scaling efficiency of the code and increase the density of node-level work.

Algorithm 1 Evaluation of Eq. 1
 1: procedure vexx
 2:   ...
 3:   for i in 1:n_occ do
 4:     ...
 5:     c(g) = FFT[1/|r' - r|]
 6:     for j in 1:n_occ do
 7:       ρ_ij(r) = ψ_i(r) ψ_j(r)
 8:       ρ_ij(g) = FFT[ρ_ij(r)]
 9:       v_ij(g) = c(g) ρ_ij(g)
10:       v_ij(r) = FFT^-1[v_ij(g)]
11:       (K̂ψ_i)(r) += ψ_j(r) v_ij(r)

We focus here on our efforts to improve the single-node performance of Algorithm 1 on KNL, which is primarily governed by the efficiency of the hybrid MPI-OpenMP plane-wave parallelization implementation in QE. The many-core nature of the KNL architecture renders on-node OpenMP threading an appealing option for the optimal use of memory resources; however, efficiently scaling to large numbers of threads is challenging, as shown by the red curve in Fig. 1. The difficulty of obtaining good thread scaling in Algorithm 1 derives from the structure of the inner loop, which involves three "vector-multiplication"-like operations (used to compute ρ_ij(r), v_ij(g), and (K̂ψ_i)(r)), interleaved with FFTs. Because of this structure, each vector multiplication is computed within an individual OpenMP region, and several OpenMP fork and join operations are required for each iteration of the inner loop.
We reduce the amount of OpenMP overhead by changing the loop structure such that each OpenMP region encompasses a loop over bands. In the case of the calculation of ρ_ij(r), this corresponds, in simplified terms, to:

!$omp parallel do ...
DO j = 1, n_occ
  DO ir = 1, nr
    rho(ir,j) = psi(ir,i) * psi(ir,j)
  ENDDO
ENDDO
!$omp end parallel do

One might be tempted to add a collapse(2) clause to the OpenMP directive above. However, we noticed that this reduced the performance by nearly 2x. The explanation is that vector loads/stores are replaced by gather/scatter operations because the compiler can no longer guarantee stride-1 access.

In addition, we bundle the computation of the FFTs such that multiple FFTs are performed by each library call. This reduces the number of OpenMP fork and join operations and gives a factor of ∼3.3× speedup compared to sequential execution of single multi-threaded FFTs, which may be additionally attributed to the ability of the library to optimize cache reuse among the many FFTs. These modifications lead to substantial improvements in the threading efficiency of QE, as shown in Fig. 1.

Fig. 1. Improvements to the thread scaling of Algorithm 1 on KNL. Each calculation is performed on a system of 64 water molecules using a single KNL node.
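The batching of transforms described above can be illustrated with the advanced ("many") planner interface of FFTW; QE's internal FFT driver differs in its details, so the routine below is only a minimal sketch of the idea, with the subroutine name, array layout, and cubic-grid assumption chosen for illustration.

subroutine batched_forward_fft(rho_batch, nfft, nbnd)
  ! Sketch only (not QE's FFT driver): execute nbnd 3D transforms of a
  ! cubic nfft^3 grid with a single FFTW "plan_many" call instead of
  ! nbnd separate single-transform calls.
  implicit none
  integer, intent(in) :: nfft, nbnd
  complex(kind(1.d0)), intent(inout) :: rho_batch(nfft*nfft*nfft, nbnd)
  integer, parameter :: FFTW_FORWARD = -1, FFTW_ESTIMATE = 64  ! values from fftw3.f
  integer(8) :: plan
  integer :: n(3)

  n = nfft   ! cubic grid, so C/Fortran dimension-ordering differences do not matter here
  call dfftw_plan_many_dft(plan, 3, n, nbnd,        &
       rho_batch, n, 1, nfft**3,                    &
       rho_batch, n, 1, nfft**3,                    &
       FFTW_FORWARD, FFTW_ESTIMATE)
  ! One execute call covers the whole batch; a threaded FFT library can
  ! then parallelize over and reuse cache among the nbnd transforms.
  call dfftw_execute_dft(plan, rho_batch, rho_batch)
  call dfftw_destroy_plan(plan)
end subroutine batched_forward_fft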
An additional concern when running on KNL is the optimal choice of memory mode. As shown in Fig. 2, running in cache mode is approximately a factor of two faster than running exclusively out of DDR in flat mode, though variability between runs was observed in cache mode, attributed to the fact that the working-set size greatly exceeded 16 GB, leading to potential cache conflicts in the direct-mapped MCDRAM cache. We used FASTMEM directives to place the arrays containing ψ_i, ρ_ij, v_ij, and (K̂ψ_i)(r) into MCDRAM. This approach significantly reduces the walltimes associated with the vector-multiplication operations in Algorithm 1 and enables flat mode to outperform cache mode, as shown in Fig. 2, by 10-15% and without variability.

Fig. 2. Performance of Algorithm 1 on a single KNL node using several different memory modes. From left to right: cache mode, flat mode using only DDR, and flat mode with explicit MCDRAM use via FASTMEM directives.
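The FASTMEM mechanism referred to above is exposed by the Intel Fortran compiler as an attribute on allocatable arrays, backed by the memkind library at link time. The snippet below is a generic sketch of the usage pattern rather than QE's actual source; the routine and array names are illustrative.

subroutine allocate_exx_work(nrxx, nbnd)
  ! Sketch of explicit MCDRAM placement in flat mode via Intel's FASTMEM
  ! attribute (link with memkind, e.g. -lmemkind). Names are illustrative
  ! and do not correspond to QE variables.
  implicit none
  integer, intent(in) :: nrxx, nbnd
  real(8), allocatable, save :: psi_occ(:,:), rho_pair(:,:)
  !DIR$ ATTRIBUTES FASTMEM :: psi_occ, rho_pair

  allocate(psi_occ(nrxx, nbnd))    ! these allocations are served from MCDRAM
  allocate(rho_pair(nrxx, nbnd))   ! all other arrays remain in DDR4
end subroutine allocate_exx_work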
Fig. 3 compares the performance of Algorithm 1 between KNL and Haswell. On both Haswell and KNL, pure MPI mode was originally found to be more efficient than mixed MPI-OpenMP parallelization. However, the implementation modifications described in this section are observed to enable a mixed MPI-OpenMP approach to outperform pure MPI parallelization. The net result of our enhancements is a 2.9× speedup in the best single-node performance observed on KNL. Furthermore, whereas prior to optimization a single-node KNL calculation was found to be 1.6× slower than the corresponding single-node Haswell calculation, it is now found to be 1.8× faster.

Fig. 3. Single-node performance of Algorithm 1 on KNL and Haswell. The MPI results correspond to running in pure MPI mode, while the OMP results correspond to running with 1 MPI task and 16 OpenMP threads per 16 cores. The "Old" results are obtained using a version of QE immediately prior to the addition of the improvements described in this paper.

IV. CASE STUDY 2 - NYX/BOXLIB

BoxLib is a software framework for developing parallel, block-structured, adaptive mesh refinement (AMR) codes. Applications of BoxLib include compressible hypersonic flows for cosmology [18], reactive flows for radiating systems such as supernovae [17], low Mach number flows for stellar convection [23] and terrestrial combustion [26], fluctuating hydrodynamics [24], porous media flows [25], and others. In most of these applications, the physical properties of the simulations are expressed in FORTRAN kernels, while BoxLib itself handles domain decomposition, AMR, memory management, parallelization (both MPI and OpenMP), boundary condition information, inter-node communication, and disk I/O. It also provides multigrid linear solvers on both uniform and adaptively refined meshes. The parallelism in BoxLib codes is hierarchical; inter-node parallelism is expressed via MPI by decomposing the total problem domain into boxes of arbitrary rectangular shape and distributing them among MPI processes. The intra-node parallelism takes the form of OpenMP directives, spawning multiple threads to work on different regions of each box. In this report, we focus on the cosmology code Nyx [18] as a proxy for other BoxLib applications.

A few kernels dominate code execution time in Nyx. These include the piecewise-parabolic method (PPM) reconstruction of the baryon state data [21], the analytical Riemann solver [20], Gauss-Seidel red-black smoothing during the multigrid solve for self-gravity [19], the equation of state (EOS) [22], and the radiative heating and cooling terms [22]. The first three kernels are "horizontal" operations, in that they require information from neighboring cells, while the latter two are "vertical" (point-wise) operations, requiring no neighboring data.

Horizontal operations involve a large amount of data movement (e.g., stencils) and low arithmetic intensity; consequently they are ideal candidates for optimization via cache reuse. In the initial implementation of OpenMP in Nyx (prior to that point, Nyx had been a purely MPI-parallelized application), the general approach was to decorate the triply-nested {x,y,z} loops in these kernels with !$omp parallel do ... collapse(2) directives. While simple and non-invasive to implement, this approach led to frequent last-level cache misses, and thus high memory bandwidth demand. As a result, the strong-scaling thread performance saturated at a small number of threads (∼5) per MPI process, due to memory bandwidth saturation.

To improve cache reuse and thus thread concurrency, we implemented loop tiling [27], splitting large boxes into smaller tiles and distributing them among threads. Specifically, an MPI process builds a list of all tiles spanning all boxes which it owns, and then in a parallel region the OpenMP threads iterate over the list of tiles using static chunk-size scheduling to determine the number of tiles on which each thread must operate. This list replaces the triply-nested loops over {x,y,z}, and thus no collapse clause is necessary. The loop tiling approach improves data locality as well as thread load balance, and as a result most horizontal operations in Nyx now strong scale effectively up to ∼64 threads on Xeon Phi. (Most Nyx kernels have low memory latency, so we see little benefit from using multiple hardware threads per core.) In particular, we choose "pencil"-shaped tiles which are long in the stride-1 dimension, in order to maximize the effectiveness of prefetching, and short in the other two dimensions, allowing us to fit the tiles into the relatively small last-level cache (1 MB shared between two cores) on Xeon Phi. A typical tile size is (64,4,4), while a typical box size is (64,64,64).
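The tile-list scheme can be sketched as follows. This is a simplified, self-contained illustration rather than BoxLib's actual tiling interface: the routine name, the trivial point update in the inner loop, and the hard-coded pencil dimensions are placeholders.

subroutine update_tiled(q, nx, ny, nz)
  ! Sketch of pencil-shaped loop tiling over an explicit tile list.
  ! The flat tile loop replaces the collapse(2) over the {x,y,z} loops.
  implicit none
  integer, intent(in) :: nx, ny, nz
  real(8), intent(inout) :: q(nx, ny, nz)
  integer, parameter :: ty = 4, tz = 4          ! (nx,4,4) pencil tiles
  integer :: ntiles_y, ntiles, it, jlo, jhi, klo, khi, i, j, k

  ntiles_y = (ny + ty - 1) / ty
  ntiles   = ntiles_y * ((nz + tz - 1) / tz)

  !$omp parallel do schedule(static) private(it, jlo, jhi, klo, khi, i, j, k)
  do it = 0, ntiles - 1                         ! whole tiles are handed to threads
     jlo = mod(it, ntiles_y) * ty + 1
     klo = (it / ntiles_y) * tz + 1
     jhi = min(jlo + ty - 1, ny)
     khi = min(klo + tz - 1, nz)
     do k = klo, khi
        do j = jlo, jhi
           do i = 1, nx                         ! long stride-1 sweep aids prefetching
              q(i, j, k) = 0.5d0 * q(i, j, k)   ! placeholder for the real kernel body
           end do
        end do
     end do
  end do
  !$omp end parallel do
end subroutine update_tiled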
Optimizing the point-wise operations in Nyx is more challenging, as they benefit less from loop tiling and each kernel has unique performance characteristics. The equation of state, for example, computes the free electron density and the temperature in each cell via Newton-Raphson iteration. The data dependence of the Newton-Raphson algorithm prohibits vectorization. Instead, we have enjoyed success rewriting the algebraic expressions of various functions to minimize the number of divides required.

Together, these optimization strategies have resulted in a large improvement in thread scaling on the Xeon Phi architecture. In Figure 4 we show the thread strong scaling on a single pre-production 7210 Xeon Phi processor. When using a single MPI process and 64 OpenMP threads, Nyx now runs ∼5× faster on Xeon Phi than it did without the optimizations discussed above. We have found that, even with a large problem (a 256^3 problem domain, with a memory footprint of O(10) GB), the L2 cache miss rate is <1%.

Fig. 4. Thread strong scaling in Nyx on Intel Xeon Phi 7210.

As can be seen in summary Figures 9 and 10, these modifications lead to speedups around 2× on KNL. The code currently performs about 40% faster on a single Haswell node than on a KNL node. Fig. 11 shows that, despite the low L2 miss rate, the number of bytes transferred from DRAM (or MCDRAM) during execution is nearly 5× greater on KNL, presumably due to the lack of an L3 cache.

V. CASE STUDY 3 - CESM

The Community Earth System Model (CESM) [28], developed at the National Center for Atmospheric Research (NCAR), is a coupled earth-system model which consists of multiple component models: atmosphere, ocean, sea ice, land ice, land, river runoff, and coupler. The cost of the atmosphere components (dynamics and physics) typically dominates the total run time (60%). We discuss two of the atmosphere components below.

A. MG2

The MG2 kernel is version 2 of the Morrison-Gettleman microphysics package [29], [30]. This component of the atmospheric physics was chosen as it is representative of the general code in the atmospheric physics packages, and MG2 typically consumes about 10% of total CESM run time.

Previous studies of the MG2 kernel have shown it to be compute bound with poor vectorization due to short loop bounds of O(10), dependent sequences of instructions, and complex branches in the loops. Heavy use is made of the mathematical intrinsics POW, LOG, EXP, and GAMMA, and the lack of vectorization forces the use of scalar implementations of these intrinsics. The scalar performance of POW, LOG, and EXP on KNL was significantly worse than on Haswell. Further, the Intel GAMMA function was 2.6× slower on Haswell than the GAMMA function used in the code. MG2 was written with extensive use of elemental functions, which were found to inhibit vectorization because of limitations in the compiler.

Below are the major optimization steps applied to MG2; the optimizations performed were shown to improve the performance on both Haswell and KNL [31], [32] (items 3 and 5 are illustrated in the sketch following the list):
1) Simplify expressions to minimize the number of mathematical operations
2) Use the internally coded GAMMA function
3) Remove the elemental attribute from subroutines and explicitly define the loops in the routines
4) Replace assumed-shape array declarations with explicitly defined array bounds
5) Replace division with multiplication by the inverse
6) Use aggressive compiler flags, such as those that reduce numerical accuracy; use with caution
7) Use directives and profiling to guide optimizations
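As an illustration of items 3 and 5, consider an elemental update that divides by a slowly varying parameter. The routine and variable names below are invented for the example and are not taken from MG2; the point is the shape of the transformation: an explicitly looped routine with the divide hoisted out as a reciprocal gives the compiler a straightforward stride-1 loop to vectorize.

! Before (illustrative): elemental function; the elemental attribute was
! found to inhibit vectorization, and there is one divide per element.
elemental function proc_rate(q, rho, tau) result(r)
  real(8), intent(in) :: q, rho, tau
  real(8) :: r
  r = q * rho / tau
end function proc_rate

! After (illustrative): explicit loop over the column, with the division
! replaced by multiplication with a precomputed inverse.
subroutine proc_rate_col(n, q, rho, tau, r)
  integer, intent(in)  :: n
  real(8), intent(in)  :: q(n), rho(n), tau
  real(8), intent(out) :: r(n)
  real(8) :: inv_tau
  integer :: i
  inv_tau = 1.0d0 / tau
  do i = 1, n                 ! simple stride-1 loop the compiler can vectorize
     r(i) = q(i) * rho(i) * inv_tau
  end do
end subroutine proc_rate_col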
Directives can be helpful if you understand why they are needed, and they were used only after trying other code modifications. Use of the vectorization and optimization reports from the Intel compiler helps guide the optimization procedure.

Fig. 5 shows the final MG2 performance improvement on a single Haswell node (64 MPI tasks) and a single KNL node (64 MPI tasks × 4 OpenMP threads per task). The speedups from the optimized code on Haswell and KNL are 1.75 and 1.92, respectively. However, single-node KNL still does not outperform single-node Haswell for this kernel.

Fig. 5. MG2 performance improvement on Haswell and KNL.

B. HOMME

The High Order Methods Modeling Environment (HOMME) model is an atmospheric dynamical core available in CESM. HOMME uses a spectral element method to discretize horizontally and a finite-difference approximation vertically [33]. A continuous Galerkin finite-element method [34] is used for the spectral element method. HOMME takes about 35% of the cycles in a typical CESM run.

HOMME is primarily a compute-bound code; however, several known regions of the code are memory bound. Some of the optimization steps performed for HOMME [35] are:
1) Thread memory copies in the boundary exchange
2) Restructure data and loops for vectorization
3) Rewrite the message-passing library and implement specialized communication operations
4) Rearrange calculations to optimize cache reuse
5) Replace division with multiplication by the inverse
6) Compile-time specification of loop and array bounds

A major change made to HOMME was the redesign of the OpenMP threading scheme. In the original scheme, parallelization was performed over elements at a high level and over loops at the lower levels. The revised scheme uses the same high-level parallelization over elements; however, the lower, loop-level parallelization has been replaced with a high-level parallelization in the tracer and vertical dimensions. The new scheme supports element, tracer, and vertical high-level parallelization simultaneously, and it significantly reduces the number of parallel regions in the code from over a hundred to four. As a consequence of the redesign, the code is much easier to maintain, as SHARED and PRIVATE OpenMP attributes no longer have to be declared for the many low-level regions.
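A minimal sketch of this high-level scheme is shown below, assuming nested OpenMP parallelism over the element and tracer dimensions with the thread count per level fixed in the source (2 and 2 here, matching the best KNL configuration reported in the next paragraph). The routine name, array shapes, and the placeholder vertical update are illustrative and are not HOMME source.

subroutine advance_tracers(nelem, ntrac, nlev, q)
  ! Sketch of nested element/tracer-level OpenMP parallelism. Nested
  ! parallelism must be enabled at run time (e.g. OMP_NESTED=true).
  implicit none
  integer, intent(in) :: nelem, ntrac, nlev
  real(8), intent(inout) :: q(nlev, ntrac, nelem)
  integer, parameter :: nthr_elem = 2, nthr_trac = 2   ! e.g. 2x2 threads on KNL
  integer :: ie, it, k

  !$omp parallel do num_threads(nthr_elem) private(ie, it, k)
  do ie = 1, nelem                                     ! high-level loop over elements
     !$omp parallel do num_threads(nthr_trac) private(k)
     do it = 1, ntrac                                  ! high-level loop over tracers
        do k = 1, nlev
           q(k, it, ie) = 2.0d0 * q(k, it, ie)         ! placeholder vertical work
        end do
     end do
     !$omp end parallel do
  end do
  !$omp end parallel do
end subroutine advance_tracers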
Fig. 6 shows the HOMME scaling performance on a single KNL node. Nested OpenMP is used in the optimized version. The best KNL time is achieved with 64 MPI tasks, 2 threads each for the element and tracer dimensions, and 1 thread for the vertical dimension. While performance saturates with 64 cores and 1 hardware thread per core for the original version on KNL, the optimized version achieved maximum performance with 64 cores and 4 hardware threads per core.

Fig. 6. HOMME scaling performance with original and optimized versions on a single KNL node.

Fig. 9 and Fig. 10 show the final HOMME performance improvement on a single Haswell node and a single KNL node.

C. Issues and Concerns

Performance studies reveal that the scalar versions of the Intel mathematical functions LOG, EXP, and POW were over 5 times slower on KNL than on Haswell. This is an open issue with Intel under investigation.

Assumed-shape array declarations are used extensively in CESM. These declarations have been changed to explicitly define the array bounds, and as a result the speedup in the loops associated with those declarations improves. A significant performance improvement in HOMME would be achieved if the compiler were able to collapse and linearize the innermost loops. Intel is working on implementing this feature.

The use of OMP SIMD directives was explored for performance portability. However, a performance gain is only realized when explicitly providing the aligned list of variables to the directive, and providing such a list is cumbersome.

The Cray CCE compiler performance was also investigated for MG2 and HOMME, and in many cases it ran faster than with the Intel compiler. However, more porting and verification work needs to be done to use CCE for the full CESM code.

VI. CASE STUDY 4 - XGC1

The XGC1 code is a full-distribution-function, global 5D gyrokinetic Particle-In-Cell (PIC) code for simulations of turbulent plasma phenomena in a magnetic fusion device. It is particularly well suited for plasma edge simulations due to an unstructured mesh used in the Poisson equation solver that allows the simulation volume to encompass the magnetic separatrix, the Scrape-Off-Layer (SOL), and the edge pedestal. The main building blocks of the code are the particle pusher, the collision operator, and the Poisson solver.

The particle pusher advances a large O(10^8) ensemble of marker particles in cylindrical coordinates in time by integrating the guiding-center equations of motion [36]. The equations are integrated with a standard RK4 algorithm. Electrons are pushed for O(50) time steps between field solves and collisions due to their high velocity compared to ions; this is referred to as electron sub-cycling. The non-linear collision operator [37] operates on the particle distribution function on a velocity-space mesh. In order to operate on the velocity grid, the PDF from the marker particles is mapped to a two-dimensional velocity-space grid, and the Coulomb collision information is mapped back to the marker particles after the collision operation has been completed. The velocity-grid operation is performed on each cell of a regular real-space grid, but these operations can be performed independently, making it a good candidate for parallelization. The Poisson solve is performed using the PETSc library on an unstructured mesh in the R,z-plane and a regular field-line-following mesh in the φ-direction. For this step, the charges of the marker particles are deposited onto the nodes of the unstructured mesh.

After recent optimizations to its collision kernel [38], and due to the sparsity of the calls to the Poisson solver, most of the computing time in XGC1 is spent in the electron push (pushe) kernel, where the current optimization efforts have been focused. The electron push consists of four main steps: B-field interpolation, E-field interpolation, the particle push, and the search for the new particle location; after each RK4 step, a new particle position is obtained and the corresponding unstructured mesh element has to be searched.

In previous work, parallelization with MPI+OpenMP has been implemented in all kernels of XGC1, and the full code scales well up to tens of thousands of cores. However, the pushe kernel performance was limited by almost no automatic compiler vectorization. The E- and B-field interpolation routines consist of loops with trip counts of 3 and 4, respectively, that are executed once per particle per RK4 step. Typically, the compiler would consider these loops too short for vectorization and produce serial code. To enable vectorization, the loops were refactored so that the innermost loop is over a block of particles, whose size can be tuned to the size of the vector register. The compiler has been instructed to vectorize the loops over particle blocks using !$omp simd directives, which results in good vectorization efficiency. The limiting factor in the interpolation loops is retrieving data from the grid points closest to the particle position, a fundamentally random memory access pattern that results in gather/scatter instructions. We have ameliorated this somewhat by employing a sorting algorithm at the beginning of the sub-cycling that improves data memory locality. With sorting, the E- and B-field interpolation kernels gained 1.5× and 1.8× speedups, respectively.
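The block-and-simd refactoring can be sketched as follows. The routine and array names, the block length, and the use of a single shared set of field nodes are simplifications for illustration; in the real pushe kernel the node indices differ per particle, which is the source of the gather/scatter traffic discussed above.

subroutine efield_interp_blocked(np, w, e_node, ex, ey, ez)
  ! Sketch: split the per-particle loop into blocks of vector length and
  ! vectorize over the block, instead of over the 3-element field loop.
  implicit none
  integer, parameter :: vlen = 8            ! tune to the vector register width
  integer, intent(in) :: np
  real(8), intent(in) :: w(3, np)           ! per-particle interpolation weights
  real(8), intent(in) :: e_node(3, 3)       ! field components at 3 nodes (shared here for simplicity)
  real(8), intent(out) :: ex(np), ey(np), ez(np)
  integer :: ibeg, iend, i

  do ibeg = 1, np, vlen                     ! loop over particle blocks
     iend = min(ibeg + vlen - 1, np)
     !$omp simd
     do i = ibeg, iend                      ! innermost trip count matches the vector width
        ex(i) = w(1,i)*e_node(1,1) + w(2,i)*e_node(1,2) + w(3,i)*e_node(1,3)
        ey(i) = w(1,i)*e_node(2,1) + w(2,i)*e_node(2,2) + w(3,i)*e_node(2,3)
        ez(i) = w(1,i)*e_node(3,1) + w(2,i)*e_node(3,2) + w(3,i)*e_node(3,3)
     end do
  end do
end subroutine efield_interp_blocked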
thecode(originalkernel),theredmarkerthekernelwiththetilingandtheblue PICSAR follows the evolution of a collection of charged one the fully optimized code. Marker size is proportional to the simulation time. macro-particlesthatevolveself-consistentlywiththeirelectro- magnetic fields. A macro-particle represents a group of real particles of the same kind with the same kinetic properties (speed and propagation direction). Each particle’s property is reloaded leading to poor cache reuse, especially during in- stored in an aligned and contiguous 1D array for all particles terpolation processes. In order to improve memory locality, (structure of arrays). Field structured grids are stored in 3D cache-blocking of the field arrays has been achieved by arrays for each component. A MPI domain decomposition dividing MPI domains into smaller subsets called tiles. Tile (Cartesian topology in PICSAR) is usually used for the intra- size is carefully chosen to let local field arrays fit in L2 cache node parallelism. The classical PIC loop contains 4 computa- on Haswell and KNL. OpenMP is used to parallelize over tional steps. 1) Field gathering: fields seen by the particles the tiles inside MPI regions. With more tiles than OpenMP is interpolated from the grids. 2) Particle pusher: Particles threads,anaturalload-balancingisachievedwiththeGUIDED are moved forward via the fields. This step is followed or the DYNAMIC schedulers. Furthermore, a particle sorting by the communication of the macro-particles between MPI algorithm has been tested to further improve memory locality. domains. 3) Current/charge deposition: The generated current Because of its cost, the sorting is performed after every set and charge is deposited on the grids. This step is followed by number of iterations. the communication of the guard-cells for the current grids. 4) Tiling has created new particle exchanges between tiles in Maxwellsolver:Thefieldgridsareupdatedfromthecurrents. addition to exchanges between MPI ranks. This operation is Thisstepcontainsthecommunicationoftheguard-cellsforthe done by transferring macro-particle properties between tile field grids. Steps 1) and 3) constitutes interpolation processes arrays all located in the same shared-memory space. A com- between the grids and the particles and uses B-spline shape munication pattern by block using nested parallel regions has factors of given orders. The number of grid cells and vertexes been developed to avoid memory races when several particles involved in these interpolation processes and border guard- are transferred to the same tile. The tiling communication has cells increase with the shape factor order. beenmergedwiththeMPIcommunicationpreprocessingloop Cartesian based PIC codes have a low flop/byte ratio for further gain in performance. that leads non-optimized algorithms to be highly memory- The second phase of optimization focused on vectorization. bound [41]. Large field and particle arrays cannot in cache In the original code, the main hotspots was the interpolation in most simulations. Because two consecutive particles in process whereas the particle pusher was automatically and memory can be localized at very distant physical positions, efficiently vectorized over the particles. Thus, in the cur- different portions of the field arrays are therefore constantly rent/charge deposition, the particle loop has been modified to remove dependencies and the data structure has been The speedups shown in Fig. 
Cartesian-based PIC codes have a low flop/byte ratio, which leads non-optimized algorithms to be highly memory bound [41]. Large field and particle arrays cannot fit in cache in most simulations. Because two consecutive particles in memory can be located at very distant physical positions, different portions of the field arrays are constantly reloaded, leading to poor cache reuse, especially during interpolation processes. In order to improve memory locality, cache blocking of the field arrays has been achieved by dividing MPI domains into smaller subsets called tiles. The tile size is carefully chosen to let local field arrays fit in the L2 cache on Haswell and KNL. OpenMP is used to parallelize over the tiles inside MPI regions. With more tiles than OpenMP threads, natural load balancing is achieved with the GUIDED or DYNAMIC schedulers. Furthermore, a particle sorting algorithm has been tested to further improve memory locality. Because of its cost, the sorting is performed only after a set number of iterations.

Tiling has created new particle exchanges between tiles in addition to the exchanges between MPI ranks. This operation is done by transferring macro-particle properties between tile arrays all located in the same shared-memory space. A communication pattern by block, using nested parallel regions, has been developed to avoid memory races when several particles are transferred to the same tile. The tiling communication has been merged with the MPI communication preprocessing loop for a further gain in performance.

The second phase of optimization focused on vectorization. In the original code, the main hotspot was the interpolation process, whereas the particle pusher was automatically and efficiently vectorized over the particles. Thus, in the current/charge deposition, the particle loop has been modified to remove dependencies and the data structure has been changed to align grid vertexes in memory, enabling an efficient vectorization of the deposition process.

Performance results are shown in Fig. 8(a). For the order-1 shape factor, all performed optimizations lead to a 2.4× speedup on a Haswell node and a 3.7× speedup on KNL in comparison with the scalar version of the code without tiling. Final times are similar between Haswell and KNL. At order 3 (preferred in production runs), a 1.7× speedup on Haswell and a 5× speedup on KNL are observed. KNL outperforms Haswell with a speedup of 1.6×.

The Roofline model [41], [42], [43], [44] applied to PICSAR is shown in Fig. 8(b). The tiling increases the flop/byte ratio, as shown by the transition from the green marker (original code) to the red one. Vectorization and the other optimizations mainly increase the flops/s, leading to the blue marker (fully optimized code).

Fig. 8. (a) PICSAR simulation times per iteration and per particle on a Cori Phase 1 node and a single KNL for the non-optimized kernel (scalar subroutines without tiling) and the present optimized kernel with order-1 and order-3 interpolation shape factors. (b) Roofline performance model applied to the PICSAR kernel with order-1 interpolation. The roofline is represented by the black line. The green marker represents the non-optimized version of the code (original kernel), the red marker the kernel with tiling, and the blue one the fully optimized code. Marker size is proportional to the simulation time.

The roofline model provides a useful way to visualize code performance and frame the optimization conversation with code teams. The approach used here has been used in other cases [41] and is now included in NERSC's overall app-readiness strategy, as a way to review current code performance and discuss optimization avenues with code teams.
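As a concrete illustration of how such a roofline is constructed, using only the nominal figures quoted in Section II for the 64-core, 1.1 GHz 7210 test part and ignoring the additional FMA/vectorization ceilings drawn in Fig. 8(b), the attainable performance P of a kernel with arithmetic intensity I is bounded by

  P(I) = \min\!\big(P_{\text{peak}},\; I \times B\big), \qquad P_{\text{peak}} \approx 64 \times 1.1\,\text{GHz} \times 32\,\tfrac{\text{FLOP}}{\text{cycle}} \approx 2.25\ \text{TFLOP/s},

with B ≈ 450 GB/s for MCDRAM or 85 GB/s for DDR4, giving a ridge point of roughly P_peak/B ≈ 5 flops/byte for MCDRAM. A low-order PIC kernel sits well below this ridge point, so the tiling step (which raises I) and the vectorization step (which raises the achieved flops/s toward the ceiling) appear as the two distinct moves visible in Fig. 8(b).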
VIII. SUMMARY OF NESAP APPLICATION PERFORMANCE

Application speedups obtained during the NESAP period are shown in Fig. 9 for representative problems or kernels on a single KNL and Haswell node. The speedups achieved range from slightly over 1× to well over 10×. In most cases, significant speedups were achieved on both KNL and Haswell. Code improvements targeting the many-core Xeon-Phi architecture nearly always end up in code restructuring (e.g., for vectorization and improved data locality) that essentially improves code performance on all architectures.

Fig. 9. Speedups obtained in NESAP applications over the course of the NESAP program.

The speedups shown in Fig. 9 show the scope of the work performed in the NESAP program as well as the relative potential for speedups in typical science applications on KNL. However, the speedups alone don't provide a measurement of absolute performance on the KNL or of expected relative performance between runs of similar scale on the Cori Phase 1 (Haswell) nodes vs the Cori Phase 2 (KNL) nodes. To address the latter issue, we compare the node-level performance of the applications in single-node runs on Haswell and KNL. We put off for the future a discussion of internode scaling and assume that, to first order, we can expect similar scaling, as the interconnect for the two Cori phases is identical. The comparison is shown in Fig. 10.

Fig. 10. Performance speedup on KNL vs Haswell. The blue bars represent a comparison to the optimized code on Haswell, while orange represents a comparison to the pre-NESAP code on Haswell.

The "speedups" on KNL relative to Haswell shown in Fig. 10 actually range from slowdowns of 50% to speedups of near 300% when comparing against optimized codes (blue bars) on Haswell. We also show, where possible, a comparison (orange bars) against the original un-optimized code on Haswell. In this case, we see speedups of up to 700% (or 8×).

When comparing optimized versions on both KNL and Haswell, we see that EMGeo, MILC, Chroma, and MFDN (SPMM) have the largest gains. These applications all share a commonality: they are all known to be memory-bandwidth bound and achieve near-roofline performance on both architectures. The ratio of performance on KNL vs Haswell closely matches the ratio of the memory bandwidths available on the nodes (400-450 GB/s vs approximately 120 GB/s).

MILC and Chroma are Quantum Chromodynamics (QCD) applications that achieve maximum data reuse and vectorization in their 4D stencil computations through the use of the QPHIX library. As shown in Fig. 11, these codes demonstrate high speedups (3×) when running out of MCDRAM vs DDR on the KNL.

In most cases, however, the impact of the performance improvements shown in Fig. 9 is higher on KNL than on Haswell. This is because the fewer and faster cores on Haswell (with a shared L3 cache) are more forgiving of common code issues, imperfect thread scaling and vectorization for example.

An example of this trend is the WARP code, where the tiling optimization targeting data reuse in the L2 caches has a more significant impact on KNL than on Haswell, where the L3 cache was being used effectively even before the tiling optimization. The BerkeleyGW application provides another scenario for this effect. In this case, the code is marginally CPU bound (an AI around 3), but the key kernel was found to be using scalar instructions. Restructuring the code to enable vectorization yields larger speedups on KNL than Haswell, due primarily to the wider (512-bit vs 256-bit) vector units.

A notable exception to this trend is the Boxlib application. In this case, the speedups obtained are actually larger on Haswell than KNL. The application is memory bandwidth bound before the tiling optimizations described above. The application, being significantly more bandwidth starved on Haswell (which lacks MCDRAM), has more performance to gain on Xeon by tiling/blocking optimizations.

EMGeo and the MFDN SPMM kernel are both sparse matrix-vector or matrix-matrix multiply dominated, with measured effective bandwidth near that of the STREAM benchmark on each platform. In these cases, the main optimization was performing solves on multiple right-hand sides (RHS) concurrently in order to reuse data in the sparse matrices. The codes remain memory-bandwidth limited even after such optimizations, though. Again, Fig. 11 shows 3× or greater speedups running out of MCDRAM vs DDR on the KNL.

The case of MFDN is unique among these four applications in that the scientific use case does not fit within the 16 GB of MCDRAM on a node. In fact, the team anticipates using nearly all the memory on each KNL node. However, the team was able to identify important arrays, those representing the vectors which the sparse matrix operates on, to explicitly place into MCDRAM with appropriate FASTMEM directives. This approach allowed the team to outperform runs with the KNL booted with the MCDRAM configured as a cache.

Applications with higher arithmetic intensities, not fundamentally limited by memory bandwidth, that effectively use the AVX512 ISA are also able to see relatively higher performance on KNL vs Haswell. HACC and BerkeleyGW are examples of applications in this category, with KNL speedups of 50-60% on the KNL nodes over the Cori Phase 1 nodes. In Fig. 12, one can see that, more than any other application considered, BerkeleyGW sees speedups using AVX512 instructions vs scalar instructions.
The applications which perform faster on Haswell than KNL are typically those with additional reuse out of the L3 cache on the Xeon architecture. As described above and shown in Fig. 11, this is evident especially in Boxlib and XGC1. This is not to say that MCDRAM is necessarily an ineffective cache (though its latencies are significantly higher than a Xeon L3), but that the bandwidth advantage of MCDRAM is at least partially mitigated if codes have good reuse out of cache.

No applications showed a significant change in performance when fused multiply-add (FMA) instructions were disabled using the "-no-fma" compiler flag, as shown in Fig. 12. Note, however, that applications that depend significantly on libraries are expected to be unaffected by this flag even though the compiler may heavily utilize FMA instructions.

IX. CONCLUSIONS

NERSC partnered with Cray, Intel, and 20 application development teams to evaluate and optimize applications for the Cori system. While the effort has led to significant performance gains, it has more importantly led to a better understanding of how the NERSC workload is expected to perform on the Cori system and to the development of an overall optimization strategy.

The most straightforward beneficiaries of KNL are applications that are memory bandwidth bound and fit within the MCDRAM (including via the use of FASTMEM directives). However, for many applications with arithmetic intensities