Maximizing Throughput of Overprovisioned HPC Data Centers Under a Strict Power Budget

Osman Sarood, Akhil Langer, Abhishek Gupta, Laxmikant Kale
Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL, USA
{sarood1, alanger, gupta59, kale}@illinois.edu

Abstract—Building future generation supercomputers while constraining their power consumption is one of the biggest challenges faced by the HPC community. For example, the US Department of Energy has set a goal of 20 MW for an exascale (10^18 flops) supercomputer. To realize this goal, a lot of research is being done to revolutionize hardware design to build power efficient computers and network interconnects. In this work, we propose a software-based online resource management system that leverages the hardware facilitated capability to constrain the power consumption of each node in order to optimally allocate power and nodes to a job. Our scheme uses this hardware capability in conjunction with an adaptive runtime system that can dynamically change the resource configuration of a running job, allowing our resource manager to re-optimize allocation decisions to running jobs as new jobs arrive, or a running job terminates.

We also propose a performance modeling scheme that estimates the essential power characteristics of a job at any scale. The proposed online resource manager uses these performance characteristics for making scheduling and resource allocation decisions that maximize the job throughput of the supercomputer under a given power budget. We demonstrate the benefits of our approach by using a mix of jobs with different power-response characteristics. We show that with a power budget of 4.75 MW, we can obtain up to 5.2X improvement in job throughput when compared with the SLURM scheduling policy that is power-unaware. We corroborate our results with real experiments on a relatively small scale cluster, in which we obtain a 1.7X improvement.
I. INTRODUCTION

The US Department of Energy has identified four key areas that require significant research breakthroughs in order to achieve exascale computing over the next decade [1]. This paper explores a global power monitoring strategy for high performance computing (HPC) data centers and addresses one of those four key areas – machine operation under a power envelope of 20 MW. Our work focuses on a hardware-assisted, software-based resource management scheme that intelligently allocates nodes and power to jobs with the aim of maximizing the job throughput under a given power budget. Most of the current and past work in HPC is done from the perspective that the number of nodes is the limiting factor, which can freely draw any amount of power. As we move on to larger machines, power consumption will overtake hardware as the limiting factor. Hence, HPC researchers might be forced to treat power as the limiting factor. This paper proposes an HPC scheduling system that considers power to be the limiting resource and hence assumes that it is economical to add nodes to the supercomputer to efficiently utilize the power budgeted for the data center.

Computer subsystems such as CPU and memory have a vendor-specified Thermal Design Power (TDP) that corresponds to the maximal power draw by the subsystem. Currently, the maximum power consumption of an HPC data center is determined by the sum of the TDP of its subsystems. This is wasteful, as these subsystems seldom run at their TDP limit. Nonetheless, a near-TDP amount of power has to be set aside for the subsystems so that the circuit breakers do not trip in the rare case when the power consumption reaches TDP. Recent advances in processor and memory hardware designs have made it possible for the user to control the power consumption of the CPU and memory through software, e.g., the power consumption of Intel's Sandy Bridge family of processors can be user-controlled through the Running Average Power Limit (RAPL) library [2]. Some other examples of such architectures are IBM Power6, Power7, and AMD Bulldozer. This ability to constrain the maximum power consumption of the subsystems, below the vendor-specified TDP value, allows us to add more machines while ensuring that the total power consumption of the data center does not exceed its power budget. Such a system is called an overprovisioned system [3].

Earlier work [3], [4] shows that an increase in the power allowed to the processor (and/or memory) does not yield a proportional increase in the application's performance. As a result, for a given power budget, it can be better to run an application on a larger number of nodes with each node capped at lower power than on fewer nodes each running at its TDP.

The optimal resource configuration for an application can be determined by profiling the application's performance for varying numbers of nodes, CPU power, and memory power, and then selecting the best performing configuration for the given power budget [4]. In this work, we address the data center scenario in which an additional decision has to be made: how to distribute the available nodes and power amongst the queued jobs. The challenge is to obtain a distribution that is globally optimal and not individually optimal for the jobs (which was the focus of the earlier work [4]). Additionally, new job requests arrive with time and currently running jobs terminate, which requires re-optimization of scheduling and resource allocation decisions. With this context, we believe the major contributions of this paper are:

• An online resource manager, PARM, that uses overprovisioning, power capping and job malleability along with the power-response characteristics of each job for scheduling and resource allocation decisions that significantly improve the job throughput of the data center (§ IV).
• A power aware strong scaling (PASS) performance model that estimates an application's performance for a given number of nodes and CPU power cap (§ V). We demonstrate the use of our model by estimating characteristics of five applications having different power-response characteristics (§ VI-C).
• An evaluation of our online resource manager on a 38 node cluster with two different job data sets. A speedup of 1.7X was obtained when compared with SLURM (§ VI).
• An extensive simulated evaluation of our scheduling policy for larger machines and its comparison with the SLURM baseline scheduling policy. We achieve up to 5.2X speedup operating under a power budget of 4.75 MW (§ VII).
II. RELATED WORK

Performance modeling using Dynamic Voltage and Frequency Scaling (DVFS) has been extensively studied before [5]–[8]. These models estimate execution time based on the CPU frequency. These models cannot be used directly in the context of CPU power capping. This is because applications running under the same CPU power cap can have different CPU frequencies because of the difference in their memory/CPU characteristics. Based on the on-chip activity of the application, the frequency will be adjusted in order to enforce the CPU power cap. The power aware strong scaling model proposed in this paper differs from the previous work as it estimates execution time for a given CPU power cap (that includes the power consumption of cores, caches, and memory controller present on the chip).

Power and energy profiling of parallel scientific workloads [9]–[11] and modeling of their energy efficiency [12] have been extensively researched before. There has been a lot of work on reducing the energy consumption of applications while minimizing the degradation in performance. For instance, Porterfield et al. [13] conserve energy consumption by using an adaptive runtime system to automatically throttle the number of concurrently used threads based on power and energy consumption gathered by using the RAPL interface. Mammela et al. [14] have proposed a scheduling scheme to reduce the energy consumption of an entire data center using DVFS. Other work on energy optimization can be found in [15]–[18]. Our work deals with the current scenario in which performance has to be optimized with power as a hard constraint, as compared to optimizing energy efficiency.

Patki et al. [3] proposed the idea of overprovisioning the compute nodes in power-constrained HPC data centers. The application is profiled at various node levels and CPU power caps. For a given power budget, the best operating configuration is then selected from the sampled configurations. Sarood et al. [4] included memory power capping and proposed a scheme that uses exhaustive profiling of an application to select the optimal configuration for a given power budget, thus improving performance. In this work, we propose an online scheduler for an overprovisioned data center that optimizes the distribution of power and nodes to multiple jobs selected by the scheduler for simultaneous execution. To the best of our knowledge, PARM is the first resource manager that uses CPU power capping and job malleability to show significant improvements in the throughput of the data center.

III. DATA CENTER AND JOB CAPABILITIES

In this section, we describe some of the capabilities or features which, according to our understanding, ought to be present in future HPC data centers. In the following sections, we highlight the role these capabilities play for a scheduler whose goal is to increase the throughput of a data center while ensuring fairness.

Power capping: This feature allows the scheduler to constrain the individual power draw of each node. Intel's Sandy Bridge processor family supports on-board power measurement and capping through the RAPL interface [2]. RAPL is implemented using a series of Machine Specific Registers (MSRs) which can be accessed to read the power usage of each power plane. RAPL supports power capping of the Package and DRAM power planes by writing into the relevant MSRs. Here, 'Package' corresponds to the processor chip that hosts the processing cores, caches and memory controller. In this paper, we use package power interchangeably with CPU power, for ease of understanding. RAPL can cap power at a granularity of milliseconds, which is adequate given that the capacitance on the motherboard and/or power supply smoothes out the power draw at a granularity of seconds.
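As an illustration of the kind of interface such a scheduler can use, the sketch below caps the package power of socket 0 through the Linux powercap sysfs view of RAPL. The sysfs paths and the presence of the constraint files are assumptions that vary by kernel and platform (and writing them normally requires root); the paper's testbed accesses the RAPL MSRs directly, so this is only a stand-in for that mechanism.

```python
# Minimal sketch: cap the package (CPU) power of socket 0 via the Linux
# powercap sysfs interface to RAPL. The path below is an assumption and
# differs across kernels/platforms; writes typically require root.
RAPL_DIR = "/sys/class/powercap/intel-rapl:0"   # package domain of socket 0

def read_microwatts(path):
    with open(path) as f:
        return int(f.read().strip())

def cap_package_power(watts, window_us=1000):
    """Set the long-term package power limit (watts) and its averaging window."""
    with open(f"{RAPL_DIR}/constraint_0_power_limit_uw", "w") as f:
        f.write(str(int(watts * 1e6)))          # limits are expressed in microwatts
    with open(f"{RAPL_DIR}/constraint_0_time_window_us", "w") as f:
        f.write(str(window_us))                 # RAPL enforces caps at ms granularity

if __name__ == "__main__":
    print("current limit (W):",
          read_microwatts(f"{RAPL_DIR}/constraint_0_power_limit_uw") / 1e6)
    cap_package_power(45)                       # e.g., cap this package at 45 W
```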
Overprovisioning: By using RAPL to cap the CPU (same as package) power below the TDP value, it is possible to add more nodes to the data center while staying within the power budget. An overprovisioned system is thus defined as a system that has more nodes than a conventional system operating under the same power budget. Due to the additional nodes, such a system cannot enable all of its nodes to function at their maximum TDP power levels simultaneously.

Moldable jobs: In these jobs, the user specifies the range of nodes (the minimum and the maximum number of nodes) on which the job can run. Typically, the range of nodes on which the job can run is dictated by its memory usage and strong scaling characteristics. The job scheduler decides the number of nodes within the specified range to be allocated to the job. Once decided, the number of nodes cannot be changed during job execution [19].

Malleable jobs: Such jobs can shrink to a smaller number of nodes or expand to a larger number of nodes upon instruction from an external command, within a specified range. The number of nodes allocated to a malleable job can be changed at any time during the execution of the job. To enable malleable jobs, two components are critical – a smart job scheduler, which decides when and which jobs to shrink or expand, and a parallel runtime system, which provides dynamic shrink and expand capability to the job. We rely on existing runtime support for malleable jobs in Charm++ [20]–[22]. In Charm++, malleability is achieved by dynamically redistributing compute objects to processors at runtime. Applications built on top of such an adaptive system have been shown to shrink and expand with small costs [23]. Charm++ researchers are currently working on further improving the support for malleable jobs. Prior research has also shown how MPI applications can be made malleable. Some notable works are dynamic CPUSET mapping and dynamic MPI [24], dynamic malleability of iterative MPI applications using the PCM (Process Checkpoint and Migration) library [25], and migratable threads-based overdecomposition and disk-based checkpoint-restart in Adaptive MPI [26].

IV. POWER AWARE RESOURCE MANAGER

Fig. 1: A high level overview of PARM (profiler, scheduler, and execution framework; the scheduler is triggered when a job arrives, ends, or terminates, and the execution framework launches jobs, shrinks/expands them, and enforces the power caps).

Figure 1 shows the block diagram of our online Power Aware Resource Manager, or PARM. It has two major modules: the scheduler and the execution framework. The scheduler is responsible for identifying which jobs should be scheduled and exactly which resources should be devoted to each job. We refer to the resource allocation for each job by the resource combination tuple (n, p), where n is the number of nodes and p is the CPU power cap for each of the n nodes. The scheduling decision is made based on the Integer Linear Program (ILP) and the job profiles generated by our strong scaling power aware model described in § V.

Objective function:

    \max \sum_{j \in J} \sum_{n \in N_j} \sum_{p \in P_j} w_j \, s_{j,n,p} \, x_{j,n,p}    (1)

Select one resource combination per job:

    \sum_{n \in N_j} \sum_{p \in P_j} x_{j,n,p} \le 1 \quad \forall j \in \bar{I}    (2)

    \sum_{n \in N_j} \sum_{p \in P_j} x_{j,n,p} = 1 \quad \forall j \in I    (3)

Bounding total nodes:

    \sum_{j \in J} \sum_{p \in P_j} \sum_{n \in N_j} n \, x_{j,n,p} \le N    (4)

Bounding power consumption:

    \sum_{j \in J} \sum_{n \in N_j} \sum_{p \in P_j} n \, (p + W_{base}) \, x_{j,n,p} \le W_{max}    (5)

Disable malleability (optional):

    \sum_{n \in N_j} \sum_{p \in P_j} n \, x_{j,n,p} = n_j \quad \forall j \in I    (6)

Fig. 2: Integer Linear Program formulation of the PARM scheduler.

TABLE I: Integer Linear Program Terminology

  N          total number of nodes in the data center
  J          set of all jobs that have already arrived and have not yet completed, i.e., jobs that are either pending or currently running; J = I ∪ Ī
  I          set of jobs that are currently running
  Ī          set of jobs in the pending queue
  N_j        set of node counts on which job j can be run
  P_j        set of power levels at which job j should be run, or in other words, the power levels at which job j's performance is known
  n_j        number of nodes at which job j is currently running
  w_j        weighing factor to set job priorities
  α          a constant in w_j used to trade off job fairness/priority vs. data center throughput
  x_{j,n,p}  binary variable, 1 if job j should run on n nodes at power p, otherwise 0
  t_now      current time
  t^a_j      arrival time of job j
  W_base     base machine power that includes everything other than CPU and memory
  t_{j,n,p}  execution time for job j running on n nodes with a power cap of p
  s_{j,n,p}  strong scaling power aware speedup of job j running on n nodes with a power cap of p
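A minimal sketch of how Eqs. 1–6 can be expressed with an off-the-shelf solver is shown below. It uses the PuLP modeling library, treats the speedups s_{j,n,p} and weights w_j as precomputed inputs, and is only meant to illustrate the structure of the optimization; all data values (jobs, candidate combinations, W_base) are invented and do not reflect PARM's actual implementation.

```python
# Sketch of the PARM ILP (Eqs. 1-6) using PuLP; all data values are made up.
import pulp

N_TOTAL, W_MAX, W_BASE = 38, 3300, 56           # nodes, power budget (W), per-node non-CPU power
jobs     = ["j1", "j2"]                          # J = pending + running jobs
running  = {"j2": 10}                            # I: running jobs -> current node count n_j
levels   = {"j1": [(8, 30), (16, 45)],           # candidate (n, p) combinations per job
            "j2": [(10, 30), (10, 55), (20, 45)]}
s = {("j1", 8, 30): 1.0, ("j1", 16, 45): 2.6,    # s_{j,n,p}: power-aware speedups (from PASS)
     ("j2", 10, 30): 1.0, ("j2", 10, 55): 1.6, ("j2", 20, 45): 2.9}
w = {"j1": 1.5, "j2": 1.0}                       # w_j: fairness weights (Eq. 8)

x = {(j, n, p): pulp.LpVariable(f"x_{j}_{n}_{p}", cat="Binary")
     for j in jobs for (n, p) in levels[j]}

ilp = pulp.LpProblem("PARM", pulp.LpMaximize)
ilp += pulp.lpSum(w[j] * s[j, n, p] * x[j, n, p] for (j, n, p) in x)           # Eq. 1
for j in jobs:                                                                 # Eqs. 2-3
    pick = pulp.lpSum(x[j, n, p] for (n, p) in levels[j])
    ilp += (pick == 1) if j in running else (pick <= 1)
ilp += pulp.lpSum(n * x[j, n, p] for (j, n, p) in x) <= N_TOTAL                # Eq. 4
ilp += pulp.lpSum(n * (p + W_BASE) * x[j, n, p] for (j, n, p) in x) <= W_MAX   # Eq. 5
# Eq. 6 (disable malleability) would additionally pin n for each running job:
#   ilp += pulp.lpSum(n * x[j, n, p] for (n, p) in levels[j]) == running[j]

ilp.solve(pulp.PULP_CBC_CMD(msg=False))
print([(key, var.value()) for key, var in x.items() if var.value() == 1])
```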
The scheduler's decisions are fed as input to the execution framework, which implements/enforces them by launching new jobs, shrinking/expanding running jobs, and/or setting the power caps on the nodes.

The scheduler is triggered whenever a new job arrives or when a running job ends or abruptly terminates due to an error or any other reason ('Triggers' box in Figure 1). At each trigger, the scheduler tries to re-optimize the resource allocation to the set of pending as well as currently running jobs with the objective of maximizing overall throughput. Our scheduler uses both CPU power capping and moldability/malleability features for throughput maximization. We formulate this resource optimization problem as an Integer Linear Program (ILP). The relevant terminology is described in Table I. Our scheduling scheme can be summarized as:

Input: A set of jobs that are currently executing or are ready to be executed (J), with their expected execution times corresponding to a set of resource combinations (n, p), where n ∈ N_j and p ∈ P_j.
Objective: Maximize data center throughput.
Output: Allocation of resources to jobs at each trigger event, i.e., identifying the jobs that should be executed along with their resource combination (n, p).

A. Integer Linear Program Formulation

We make the following assumptions and simplifications in the formulation:

• The network power consumption stays constant. It is a reasonable assumption since network power does not fluctuate much for most interconnect technologies.
• We do not include the cooling power of the data center in our calculations.
• Job characteristics do not change significantly during the course of a job's execution. By relaxing this assumption we can benefit from the different phases in an application. However, that is out of the scope of this study.
• All nodes allocated to a given job operate at the same power.
• The expected wall clock time represents a good estimate of the actual execution time that the scheduler uses for decision making.
• W_base, which includes the power for all the components of a node other than the CPU and memory subsystems, is assumed to be constant.
• A job once selected for execution is not stopped until its completion, although the resources assigned to it can change during its execution.
• All jobs are from a single user (or have the same priority). This is assumed just to keep the focus of the paper on other issues. This assumption can be very easily relaxed by setting w_j proportional to the user/job-queue priority.
Scheduling problems are framed as ILPs, and ILPs are NP-hard problems. Maximizing throughput in the objective function requires introducing variables for the start and end times of jobs. These variables make the ILP computationally very intensive and thus impractical for online scheduling in many cases. Therefore, in the objective function, instead of maximizing the overall throughput of all the jobs currently in the queue, we propose a greedy objective function that maximizes the sum of the power-aware speedups (described later) of the jobs selected for immediate execution. This objective function improves the job throughput while keeping the ILP optimization computationally tractable for online scheduling.

We define the strong scaling power aware speedup of a job j as follows:

    s_{j,n,p} = \frac{t_{j,\min(N_j),\min(P_j)}}{t_{j,n,p}}    (7)

where s_{j,n,p} is the speedup of job j executing with resource combination (n, p) with respect to its execution with resource combination (min(N_j), min(P_j)). The objective function (Eq. 1) of the ILP maximizes the sum of the power aware speedups of the jobs selected for execution at every trigger event. This leads to an improvement in FLOPS/Watt (or power efficiency, as we define it). Improved power efficiency implies better job throughput (results discussed in § VI, § VII). Oblivious maximization of power efficiency may lead to starvation of jobs with low strong scaling power aware speedup. Therefore, to ensure fairness, we introduce a weighing factor (w_j) in the objective function, which is defined as follows:

    w_j = \left( t^{rem}_{j,\min(N_j),\min(P_j)} + (t_{now} - t^a_j) \right)^{\alpha}    (8)

w_j artificially boosts the strong scaling power aware speedup of a job by multiplying it by the job's completion time, where completion time is the sum of the time elapsed since the job's arrival and the job's remaining execution time with resource combination (min(N_j), min(P_j)), i.e., t^{rem}_{j,min(N_j),min(P_j)}. The percentage of a running job completed between two successive triggers is determined by the ratio of the time interval between the two triggers and the total time required to complete the job using its current resource combination. The percentage of the job that has been completed so far can then be used to compute t^{rem}_{j,min(N_j),min(P_j)}. The constant α (α ≥ 0) in Eq. 8 determines the priority given to job fairness against its strong scaling power aware speedup, i.e., a smaller value of α favors job throughput maximization while a larger value favors job fairness.
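To make Eqs. 7 and 8 concrete, the short example below computes the power-aware speedup and the fairness weight for a hypothetical job; all numbers are invented purely for illustration.

```python
# Worked example of Eqs. 7-8 with invented numbers.
def pass_speedup(t_min_config, t_candidate):
    """Eq. 7: speedup of (n, p) relative to (min(N_j), min(P_j))."""
    return t_min_config / t_candidate

def fairness_weight(t_rem_min_config, t_now, t_arrival, alpha):
    """Eq. 8: remaining time at the minimal configuration plus wait time, raised to alpha."""
    return (t_rem_min_config + (t_now - t_arrival)) ** alpha

# Hypothetical job j: 60 min at (min(N_j), min(P_j)), 22 min at a candidate (n, p).
s_jnp = pass_speedup(60.0, 22.0)                   # ~2.73
w_j   = fairness_weight(t_rem_min_config=40.0,     # 40 min of work left at the minimal config
                        t_now=100.0, t_arrival=70.0, alpha=1.0)
print(s_jnp, w_j, w_j * s_jnp)                     # w_j * s_jnp is j's contribution to Eq. 1
```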
We now explain the constraints of the ILP formulation (Figure 2):

• Select one resource combination per job (Eqs. 2, 3): x_{j,n,p} is a binary variable indicating whether job j should run with resource combination (n, p). This constraint ensures that at most one of the variables x_{j,n,p} is set to 1 for any job j. The jobs which are already running (set I) continue to run, although they can be assigned a different resource combination (Eq. 3). The jobs in the pending queue (Ī) for which at least one of the variables x_{j,n,p} is equal to 1 (Eq. 2) are selected for execution and moved to the set of currently running jobs (I).
• Bounding total nodes (Eq. 4): This constraint ensures that the number of active nodes does not exceed the maximum number of nodes available in the overprovisioned data center.
• Bounding power consumption (Eq. 5): This constraint ensures that the power consumption of all the nodes does not exceed the power budget of the data center.
• Disable malleability (Eq. 6): To quantify the benefits of malleable jobs, we consider two versions of our scheduler. The first version supports only moldable jobs and is called noSE (i.e., no Shrink/Expand). The second version allows both moldable and malleable jobs and is called wSE (i.e., with Shrink/Expand). Malleability can be disabled by using Eq. 6. This constraint ensures that the number of nodes assigned to running jobs does not change during the optimization process. However, it allows changing the power allocated to running jobs. In real-world situations, the jobs submitted to a data center will be a mixture of malleable and non-malleable jobs. The scheduler can apply Eq. 6 to disable malleability for non-malleable jobs. In addition to noSE and wSE, we also measure the performance of the noMM (no Moldability and Malleability) version of PARM, in which the jobs are neither moldable nor malleable. In this version, besides job selection, the only degree of freedom available to PARM is the CPU power allocated to the nodes of the selected jobs.

TABLE II: Different versions of PARM

  noMM   Jobs are neither moldable nor malleable
  noSE   Jobs are moldable but not malleable
  wSE    Jobs are both moldable and malleable

The three versions of PARM are summarized in Table II for ease of reference.
V. POWER AWARE STRONG SCALING PERFORMANCE MODEL

PARM's optimal resource allocation decisions depend on the availability of job performance data. Performance data corresponding to a large number of resource combinations (n, p) can be crucial to the quality of the solution PARM provides. Since exhaustive profiling can be impractical for a large number of resource combinations, we need a model to predict job performance. One of the significant contributions of our work is the proposed performance model that can predict an application's performance for any given resource combination (n, p). We call it the Power Aware Strong Scaling performance model, or PASS model. The model parameters are specific to the application and the input data set with which the application will be executed. Applying mathematical regression to an application's profile data for different resource combinations enables PASS to estimate important power characteristics. The PASS model extends Downey's [27] strong scaling model by making it power aware. Table III gives the terminology used in this section.

TABLE III: Power Aware Strong Scaling Model Terminology

  A            average parallelism in the application
  σ            fraction of the duration when application parallelism is not A; parallelism is 2A−1 for σ/2 of the duration and 1 for the remaining σ/2
  T_1          application execution time on 1 node
  f            CPU frequency
  f_h          threshold frequency beyond which application execution time does not reduce
  f_l (f_min)  minimum CPU frequency supported by the vendor
  f_max        maximum CPU frequency supported by the vendor
  T_l          execution time at CPU frequency f_l
  T_h          execution time at CPU frequency f_h
  W_cpu        on-chip workload in terms of CPU cycles
  T_mem        time for off-chip work in the application that is unaffected by CPU frequency

A. Strong Scaling Model

An application can be characterized by an average parallelism of A. The application's parallelism remains equal to A, except for some fraction σ of the duration. The available parallelism is 2A−1 for σ/2 of the duration and just 1 for the remaining σ/2 of the duration. We adjust Downey's [27] model to satisfy the boundary conditions t(1) = T_1 and t(n) = T_1/A for n ≥ 2A−1, where t(n) is the application execution time on n nodes and T_1 is the application time on a single node. According to Downey's model, the execution time t(n) of an application executing on n nodes can be modeled as:

    t(n) = \frac{T_1 (1 - \sigma/2)}{n} + \frac{T_1 \sigma}{2A},  \quad 1 \le n \le A    (9)

    t(n) = \frac{\sigma T_1 (A - 1/2)}{nA} + \frac{T_1}{A} - \frac{T_1 \sigma}{2A},  \quad A < n \le 2A - 1    (10)

    t(n) = \frac{T_1}{A},  \quad n > 2A - 1    (11)

The first equation in this group represents the range of n where the application is most scalable, i.e., when the number of nodes is less than A. The application's scalability declines significantly once n becomes larger than A because of the lack of parallelism for most of the duration. Finally, for n ≥ 2A, the execution time t(n) equals T_1/A and does not decrease further. Given the application characteristics σ, A, and T_1, this model can be used to estimate the execution time for any number of nodes n.

B. Adding Power Awareness to Strong Scaling Model

The effect of changing frequency on the execution time varies from application to application [28]. In this section, we model execution time as a function of CPU frequency. Since CPU frequency can be expressed as a function of CPU power, we can finally express execution time as a function of CPU power.

1) Execution Time as a Function of Frequency: Existing work [4], [28] indicates that an increase in CPU frequency beyond a certain threshold frequency (let us call it f_h) does not reduce the execution time. The value of f_h depends on the memory bandwidth being used by the application. For f < f_h, execution time depends on the CPU-bound and memory (off-chip) bound work of the application and can thus be modeled as [5]–[7], [29]:

    t(f) = \frac{W_{cpu}}{f} + T_{mem},  \quad f < f_h    (12)

    t(f) = T_h,  \quad f \ge f_h    (13)

where W_cpu and T_mem are defined in Table III, and T_h is the execution time at frequency f_h. Let T_l be the execution time at frequency f_l, where f_l is the minimum frequency at which the CPU can operate. The parameter β characterizes the frequency-sensitivity of an application and can be expressed as:

    \beta = \frac{T_l - T_h}{T_l}    (14)

The range of β depends on the frequency range supported by the CPU vendor. Given the frequency range (f_l, f_max), β ≤ 1 − f_l/f_max. Typically, CPU-bound applications have higher values of β whereas memory-intensive applications have smaller β values.

Using Eq. 14 and applying the boundary conditions t(f_l) = T_l and t(f_h) = T_h to Eq. 12, we get:

    W_{cpu} = \frac{T_h \, \beta \, f_l f_h}{(1 - \beta)(f_h - f_l)}    (15)

    T_{mem} = T_h - \frac{T_h \, \beta \, f_l}{(1 - \beta)(f_h - f_l)}    (16)
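The following sketch pieces Eqs. 9–16 together in code. The parameter values would normally come from the regression described in § VI-C; the placeholder numbers used here are assumptions for illustration only.

```python
# Sketch of the strong scaling model (Eqs. 9-11) and the frequency model (Eqs. 12-16).
def downey_time(n, T1, A, sigma):
    """Execution time on n nodes for average parallelism A and variance parameter sigma."""
    if n <= A:
        return T1 * (1 - sigma / 2) / n + T1 * sigma / (2 * A)           # Eq. 9
    if n <= 2 * A - 1:
        return (sigma * T1 * (A - 0.5) / (n * A)
                + T1 / A - T1 * sigma / (2 * A))                         # Eq. 10
    return T1 / A                                                        # Eq. 11

def time_vs_frequency(f, f_l, f_h, T_h, beta):
    """Eqs. 12-16: execution time at frequency f, given T_h and the sensitivity beta."""
    W_cpu = T_h * beta * f_l * f_h / ((1 - beta) * (f_h - f_l))          # Eq. 15
    T_mem = T_h - T_h * beta * f_l / ((1 - beta) * (f_h - f_l))          # Eq. 16
    return W_cpu / f + T_mem if f < f_h else T_h                        # Eqs. 12-13

# Illustrative (assumed) parameters: T1 = 1000 s on one node, A = 64, sigma = 0.1,
# f_l = 1.2 GHz, f_h = 1.8 GHz, beta = 0.3.
T_h_20 = downey_time(20, T1=1000.0, A=64, sigma=0.1)
print(T_h_20, time_vs_frequency(1.5, 1.2, 1.8, T_h_20, beta=0.3))
```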
2) Frequency as a Function of CPU Power: A CPU's power consumption can be capped using a combination of DVFS and CPU throttling [2], [30]. Let p_l denote the CPU power corresponding to f_l, where f_l is the minimum frequency the CPU can operate at using DVFS. To cap power below p_l (p < p_l), other architectural-level mechanisms such as CPU throttling are used. We have empirically observed that for p < p_l, the application performance degrades significantly even for very small savings in power. Therefore, we restrict our study to power caps greater than p_l. The value of p_l can be easily determined by setting the CPU frequency at f_l. The CPU or package power includes the power consumption of its various components such as cores, caches, memory controller, etc. The value of p_l varies depending on an application's usage of these components. In a CPU-bound application, a processor might be able to cap power to lower values using DVFS, since only the cores are consuming power. In contrast, for a memory intensive application, p_l might be higher, since the caches and memory controller are also consuming significant power in addition to the cores.

The major part of the dynamic CPU power consumption can be attributed to the cores, on-chip caches and memory controller. The power consumption of the cores, p_core, is often modeled as p_core = Cf^3 + Df, where C and D are constants [31]. Power consumption due to cache and memory accesses is modeled as \sum_{i=1}^{3} g_i L_i + g_m M, where L_i is the number of accesses per second to the level-i cache, g_i is the cost of a level-i cache access, M is the number of memory accesses per second, and g_m is the cost per memory access. The total CPU power can then be expressed as [32]:

    p = p_{core} + \sum_{i=1}^{3} g_i L_i + g_m M + p_{base}    (17)

where p_base is the base/static package power consumption. Since the number of cache and memory accesses is proportional to the CPU frequency, Eq. 17 can be written as:

    p = F(f) = a f^3 + b f + c    (18)

where a, b, and c are constants. The term bf corresponds to the cores' leakage power and the power consumption of the caches and memory controller. The term af^3 represents the dynamic power of the cores, whereas c = p_base represents the base CPU power. The constants a and b are application dependent since the cache and memory behavior can be different across applications. Eq. 18 can be rewritten as a depressed cubic equation and solved using Cardano's formula to get F^{-1}:

    f = F^{-1}(p) = \sqrt[3]{\frac{p-c}{2a} + \sqrt{\frac{(p-c)^2}{4a^2} + \frac{b^3}{27a^3}}} - \sqrt[3]{\frac{c-p}{2a} + \sqrt{\frac{(p-c)^2}{4a^2} + \frac{b^3}{27a^3}}}    (19)

3) Execution Time as a Function of CPU Power and Number of Nodes: To express t in terms of p, we use Eq. 19 to replace f, f_l, and f_h in Eqs. 12, 15, 16. To obtain the PASS model that estimates execution time as a function of n and p, we combine our power aware model with the strong scaling model described in § V-A, by replacing T_h in Eqs. 12, 13, 15, 16 with t(n) from Eqs. 9, 10, 11.
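A sketch of the power-to-frequency inversion (Eq. 19) and the resulting PASS prediction t(n, p) is given below. It reuses downey_time and time_vs_frequency from the sketch in § V above, and the coefficients a, b, c and the power levels are placeholders standing in for fitted values such as those in Table IV.

```python
# Sketch of Eq. 19 (invert p = a*f^3 + b*f + c via Cardano) and the combined PASS model.
# downey_time and time_vs_frequency are defined in the earlier sketch (Section V).
def freq_from_power(p, a, b, c):
    """Real root of a*f^3 + b*f + (c - p) = 0 (Eq. 19)."""
    A = (p - c) / (2 * a)
    D = ((p - c) ** 2 / (4 * a ** 2) + b ** 3 / (27 * a ** 3)) ** 0.5
    cbrt = lambda x: x ** (1 / 3) if x >= 0 else -((-x) ** (1 / 3))
    return cbrt(A + D) - cbrt(-A + D)

def pass_time(n, p, T1, A, sigma, beta, a, b, c, p_l, p_h):
    """PASS prediction: execution time on n nodes under CPU power cap p (Section V-B-3)."""
    t_n = downey_time(n, T1, A, sigma)                  # t(n) stands in for T_h
    f   = freq_from_power(max(p, p_l), a, b, c)         # model restricted to caps >= p_l
    f_l = freq_from_power(p_l, a, b, c)
    f_h = freq_from_power(p_h, a, b, c)
    return time_vs_frequency(f, f_l, f_h, t_n, beta)    # Eqs. 12-16 with T_h -> t(n)

# Example using Table IV-like values for LeanMD (a=1.65, b=7.74, p_l=30, p_h=52, beta=0.40);
# c=13 W, T1, A, and sigma are assumed for illustration.
print(pass_time(n=32, p=45, T1=1000.0, A=64, sigma=0.1,
                beta=0.40, a=1.65, b=7.74, c=13.0, p_l=30, p_h=52))
```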
VI. EXPERIMENTAL RESULTS

In this section, we first describe our experimental setup, which includes the applications, testbed, and job datasets. Next, we obtain the application characteristics using the PASS performance model and finally compare the performance of different versions of PARM with SLURM. PARM can be used in conjunction with most parallel programming models. While programming models such as Charm++ can benefit by using the wSE scheme that uses job malleability, other models like MPI can use the noSE scheme to benefit from power awareness and job moldability. Usage of PARM is not restricted to data centers that focus on running a large number of applications simultaneously. It can even be used in data centers where the performance of running a very small number (or just 1) of large-scale applications is critical. For example, while running just 1 large job, the PARM optimizer will determine the optimal number of nodes and the CPU power cap of the nodes on which the job should be executed for optimal performance.

A. Applications

We used five applications, namely, Wave2D, Jacobi2D, LeanMD, Lulesh [33], and Adaptive Mesh Refinement or AMR [34], [35]. These applications have different CPU and memory usage:

• Wave2D and Jacobi2D are 5-point stencil applications that are memory-bound. Wave2D has higher FLOPS than Jacobi2D.
• LeanMD is a computationally intensive molecular dynamics application.
• The CPU and memory usage of Lulesh and AMR lies in between the stencil applications and LeanMD.

B. Testbed

We conducted our experiments on a 38-node Dell PowerEdge R620 cluster (which we call the Power Cluster). Each node contains an Intel Xeon E5-2620 Sandy Bridge with 6 physical cores at 2 GHz, 2-way SMT, and 16 GB of RAM. These machines support on-board power measurement and capping through the RAPL interface [36]. The CPU power for our testbed can be capped in the range [25−95]W, while the capping range for memory power is [8−35]W.

C. Obtaining Model Parameters of Applications

Application characteristics depend on the input type, e.g., grid size. We fix the respective input types for each application. Each application needs to be profiled for some (n, p) combinations to obtain data for curve fitting. A single time step (iteration) of an application is sufficient to get the performance for a given resource combination. For applications having time steps (iterations) on the order of milliseconds, the cost of profiling several resource combinations is negligible compared to the overall execution time of the application. Each time step (iteration) will include the different phases of an application such as IO, communication, solvers, etc., and therefore the overall characteristics of the job can be captured in a step/iteration. This approach works best for iterative applications and other applications whose characteristics do not change significantly over time, which is true for the majority of scientific applications.

We use linear and non-linear regression tools provided by MATLAB to determine the application parameters by fitting our performance model proposed in § V to the sampled performance data obtained by running the parallel applications on 20 nodes. The obtained parameter values for all the applications are listed in Table IV and are discussed here:

• The parameter c (CPU base power) lies in the range [13−14]W for all applications.
• p_l was 30W for LeanMD and 32W for the rest of the applications. For LeanMD, it is possible to cap the CPU power to a lower value just by decreasing the frequency using DVFS. This is because LeanMD is a computationally intensive application and therefore most of the power is consumed by the cores rather than the caches and memory controller. On the contrary, for the other applications, CPU throttling kicks in at a higher power level because of their higher cache/memory usage.
• The value of p_h lies in the range [37−54]W for the applications under consideration.
• The value of β lies in the range [0.08−0.40]. A higher value of β means higher sensitivity to CPU power.
• Wave2D and Jacobi2D have the largest memory footprints, which results in high CPU-cache-memory traffic. Therefore the value of b is high for these two applications.

TABLE IV: Obtained model parameters

  Application    a       b      p_l   p_h    β
  LeanMD        1.65    7.74    30    52    0.40
  Wave2D        3.00   10.23    32    40    0.16
  Lulesh        2.63    8.36    32    54    0.30
  AMR           2.45    6.57    32    54    0.33
  Jacobi2D      1.54   10.13    32    37    0.08

Figure 3 shows the modeled (lines) as well as the observed (markers) power-aware speedups for all applications with varying CPU power cap at 20 nodes. Power-aware speedup is calculated with respect to the execution time at p = p_l and the same number of nodes. LeanMD has the highest power-aware speedup whereas Jacobi2D has the lowest.

Fig. 3: Modeled (lines) and observed (markers) power aware speedups of the applications at 20 nodes.
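A sketch of the kind of regression step just described is shown below; it fits the cubic power model of Eq. 18 to sampled (frequency, power) pairs. SciPy stands in for the MATLAB regression tools used in the paper, and the sample data is fabricated for illustration.

```python
# Sketch: fit the cubic CPU power model p = a*f^3 + b*f + c (Eq. 18) to sampled data.
# SciPy stands in for the MATLAB tools used in the paper; the data below is synthetic.
import numpy as np
from scipy.optimize import curve_fit

def power_model(f, a, b, c):
    return a * f ** 3 + b * f + c

freq_ghz  = np.array([1.2, 1.4, 1.6, 1.8, 2.0])          # sampled CPU frequencies
pkg_watts = np.array([31.0, 35.5, 41.0, 47.5, 55.0])     # measured package power (synthetic)

(a, b, c), _ = curve_fit(power_model, freq_ghz, pkg_watts, p0=(1.0, 5.0, 10.0))
print(f"a={a:.2f}  b={b:.2f}  c={c:.2f}")                # c approximates the base CPU power
```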
D. Power Budget

We assume a power budget of 3300W to carry out experiments using our Power Cluster. Although the vendor-specified TDPs of the CPU and memory of the Dell nodes are 95W and 35W, respectively, the actual power consumption of the CPU and memory never went beyond 60W and 18W while running any of the applications. Therefore, instead of the vendor-specified TDP, we consider 60W and 18W as the maximum CPU and memory power consumption and use them to calculate the number of nodes that can be installed in a traditional data center. The maximum power consumption of a node thus adds up to 60W + 18W + 38W = 116W, where 38W is the base power of a node. Therefore, the total number of nodes that can be installed in a traditional data center with a power budget of 3300W will be ⌊3300/116⌋ = 28 nodes. By capping the CPU power below 60W, the overprovisioned data center will be able to power more than 28 nodes.

E. Job Datasets

We constructed two job datasets by choosing a mix of applications from the set described in § VI-A. All these applications are written using the Charm++ parallel programming model and hence support job malleability. An application's power-response characteristics can influence the benefits of PARM. Therefore, in order to better characterize the benefits of PARM, these two job datasets were constructed such that they have very different average values of β. We name these datasets SetL and SetH, with average β values of 0.1 and 0.27, respectively. For instance, SetH has 3 LeanMD, 3 Wave2D, 2 Lulesh, 1 Jacobi, and 1 AMR job, which gives an average β value of 0.27. A mix of short, medium and long jobs was constructed by randomly generating wall clock times with a mean value of 1 hour. Similarly, the job arrival times were generated randomly. Each dataset spans over 5 hours of cluster time and approximately 20 scheduling decisions were taken (a scheduling decision is taken whenever a new job arrives or a running job terminates). The minimum and the maximum number of nodes on which a job can run was determined by the job's memory requirements. We used 8 node levels (i.e., |N_j| = 8) that are uniformly distributed between the minimum and maximum number of nodes on which the job can run. The memory power is capped at the fixed value of 18W, whereas we used 6 CPU power levels: [30, 32, 34, 39, 45, 55]W.
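To make the overprovisioning arithmetic of § VI-D concrete for the power levels just listed, the snippet below computes how many nodes the 3300W budget can power at each CPU cap, assuming the fixed 18W memory cap and 38W per-node base power; the set of caps iterated over is illustrative.

```python
# Nodes that a 3300 W budget can power when every node's CPU is capped at `cpu_w`.
# Assumes the fixed 18 W memory cap and 38 W per-node base power from Section VI-D.
BUDGET_W, MEM_W, BASE_W = 3300, 18, 38

for cpu_w in [60, 55, 45, 39, 34, 32, 30]:       # 60 W = maximum observed (uncapped) CPU draw
    per_node = cpu_w + MEM_W + BASE_W
    print(f"CPU cap {cpu_w:2d} W -> node power {per_node:3d} W -> "
          f"{BUDGET_W // per_node} nodes under the budget")
```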
F. Performance Metric

We compare our scheduler with SLURM [37]: an open-source resource manager that allocates compute nodes to jobs and provides a framework for starting, executing and monitoring jobs on a set of nodes. SLURM provides resource management on many of the most powerful supercomputers in the world, including Tianhe-1A, Tera 100, Dawn, and Stampede. We deployed both PARM and SLURM on the testbed. For comparison purposes, we use SLURM's scheme in which the user specifies the exact number of nodes requested for the job and SLURM uses FIFO + backfilling for making scheduling decisions. We call this the SLURM baseline scheme, or just the baseline scheme. The number of nodes requested for a job submitted to SLURM is the minimum number of nodes on which PARM can run that job.

We use response time and completion time as the metrics for comparing PARM and SLURM. A job's response time, t_res, is the time interval between its arrival and the beginning of its execution. Execution time, t_exe, is the time from start to finish of a job's execution. Completion time, t_comp, is the time between a job's arrival and the time it finishes execution, i.e., t_comp = t_res + t_exe. Job throughput is the inverse of the average completion time of jobs. In this study, we emphasize completion time as the performance comparison metric, even though typically response time is the preferred metric. This is because, unlike conventional data centers, where the resources allocated to a job and hence the job's execution time are fixed, our scheduler dynamically changes the job configuration during execution, which can vary the job execution time significantly. Hence, response time is not a very appropriate metric for comparison in this study. Completion time includes both the response time and the execution time and is therefore the preferred metric of comparison.

G. Results

Figure 4(a) shows the average completion times of the two datasets with SLURM and the noMM, noSE, and wSE versions of PARM. In the noMM experiments, the number of nodes allocated to a job was the minimum number of nodes on which the noSE and wSE versions can run the same job. The maximum number of nodes on which noSE and wSE can run the job was the same as the number of nodes on which SLURM runs the same job. The completion times for noMM, wSE and noSE include all overhead costs, including the ILP optimization time and the costs of constriction and expansion of jobs. All versions of PARM significantly reduce the average completion time for both data sets compared to SLURM (Figure 4(a)). This improvement can mainly be attributed to the reduced average response times shown in Figure 4(b). Our scheduler intelligently selects the best power levels for each job, which allows it to add more nodes to benefit from strong scaling and/or to schedule more jobs simultaneously. The noMM version of PARM significantly reduces the completion times by intelligently selecting jobs and allocating power to them. Allowing job moldability and malleability in noSE and wSE, respectively, further reduces the completion times. For example, there is an improvement of 7.5% and 13.9% in the average completion time of SetL with noSE and wSE, respectively, over the noMM scheme. A speedup of 1.7X is achieved with the wSE scheme over the baseline scheme. The ability to shrink and expand the jobs in the wSE version of PARM gives additional flexibility to the ILP to re-optimize the allocation of nodes to the running and pending jobs. noSE reduces the solution space of the ILP (compared to wSE) by not allowing running jobs to change the number of nodes allocated to them during their execution. The noMM version reduces the search space even further by fixing the number of nodes allocated to a job both at the start time and during its execution. The wSE version further benefits from the fact that it can expand the running jobs to run on the unutilized machines, e.g., when there are not enough jobs to utilize all the nodes. These factors reduce both the average completion and the average response time in wSE (Figures 4(a), 4(b)). As shown by Figure 4(c), the wSE scheme utilizes an average of 36 nodes during the entire dataset execution as compared to an average of 33 nodes used in the case of noSE for SetL.

A smaller value of β means that the effect of decreasing the CPU power on application performance is small. When β is small, the scheduler will prefer to allocate less CPU power and use more nodes. When β is large, the benefits of adding more nodes at the cost of decreasing the CPU power are smaller. The flexibility to increase the number of nodes gives PARM a higher benefit over SLURM when β is small as compared to the case when β is large. This is corroborated by the observation (Figure 4) that the benefits of using PARM as compared to SLURM are much higher with dataset SetL (β = 0.1) as compared to dataset SetH (β = 0.27). PARM's intelligent allocation of power can significantly improve completion and response times. These can be further improved by using the job moldability and malleability features.

Fig. 4: Comparing performance of SLURM with noMM, noSE, and wSE versions of PARM. (a) Average completion time, (b) average response time, (c) average nodes and average CPU power per node.

VII. LARGE SCALE PROJECTIONS

After experimentally showing the benefits of PARM on a real cluster, we now analyze its benefit on very large machines. Since it was practically infeasible for us to do actual job scheduling on a very large machine, we use the SLURM simulator [38], which is a wrapper around SLURM. This simulator gives us information about SLURM's scheduling decisions without actually executing the jobs. To make the analysis of PARM more reliable, we develop a model to estimate the cost of shrinking and expanding jobs. We then give the experimental setup and present a comparison of PARM scheduling with the baseline scheduling policy. Since the noMM version of PARM was inferior to both wSE and noSE, we concentrate on the wSE and noSE schemes in this section.
A. Modeling Cost of Shrinking and Expanding Jobs

Constriction and expansion of jobs has an overhead associated with it. These overheads come from the data communication done to balance the load across the new set of processors assigned to the job and from the boot time of nodes.

For demonstrating our system using real experiments (§ VI), we used the existing malleability support in Charm++ [23]. However, the approach in [23] is practical only for small clusters as it starts processes on as many nodes as the job can run on. Inter-job interference and security concerns make that approach impractical for large clusters, where many jobs run simultaneously. Charm++ researchers have recently proposed a new approach which eliminates the need of spawning processes on all nodes and does not leave any residual processes after a shrink. Hence, for more practical and accurate large-scale projections, we model an approach which would require dynamic process spawning when expanding. Hence, we consider boot times in our model.

A scheduler typically makes two decisions: 1) how many nodes to assign to each job, and 2) which nodes to assign to each job. We address the first decision in this paper and defer the second for future work. Let us say that job j, with a total memory of m_j MB, has to expand from n_f nodes to n_t nodes. For simplification of analysis, we assume that each job is initially allocated a cuboid of nodes (with dimensions \sqrt[3]{n_f} \times \sqrt[3]{n_f} \times \sqrt[3]{n_f}) interconnected through a 3D torus. After the expand operation, the size of the cuboid becomes \sqrt[3]{n_f} \times \sqrt[3]{n_f} \times \frac{n_t}{\sqrt[3]{n_f^2}}. For load balance, the data in memory (m_j MB) will be distributed equally amongst the n_t nodes. Hence, the communication cost for the data transfer can be expressed as (secs):

    t_c = \frac{\left( \frac{m_j}{n_f} - \frac{m_j}{n_t} \right) n_f}{2 \, b \, n_f^{2/3}}    (20)

where b is the per-link bandwidth in MB/sec. The numerator in Eq. 20 represents the total data to be transferred, whereas the denominator represents the bisection bandwidth of the cuboid. Similarly, the cost of shrinking a job is determined by computing the cost of distributing the data of n_f − n_t nodes equally across the final n_t nodes.

Boot times can be significant for some supercomputers. Since many supercomputers in the Top500 [39] belong to the Blue Gene family, we include their boot time when evaluating our scheme. We adopt the following simple linear model to calculate the boot time (t_b) for an expand operation based on Intrepid boot time data [40]:

    t_b \text{ (in seconds)} = (n_t - n_f) \times 0.01904 + 72.73    (21)

In an expand operation, the communication phase can start only after the additional nodes become available. These additional nodes might have to be booted. Therefore, the total cost of a shrink or expand operation is the sum of the boot time and the data transfer time, i.e., t_se = t_c + t_b. A job set for expansion might receive additional nodes from a job undergoing constriction in the same scheduling decision. Therefore, an expanding job has to wait until the shrinking job has released the additional resources. To simplify this analysis, we determine the maximum t_se from amongst the shrinking/expanding jobs (t_se^max) and add 2 t_se^max to the execution times of all the jobs shrinking or expanding during the current scheduling decision. To control the frequency of constriction or expansion of a job, and consequently its cost, we define a parameter f_se (in secs). f_se is the time after which a job can shrink or expand, i.e., if a job was shrunk or expanded at t secs, then it can be shrunk or expanded only after t + f_se secs. This condition is enforced using Eq. 6.
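The sketch below evaluates Eqs. 20 and 21 for an example expand operation; the job memory size and per-link bandwidth are assumptions chosen only to show how the two terms combine.

```python
# Cost of expanding job j from n_f to n_t nodes: data transfer (Eq. 20) plus boot time (Eq. 21).
def transfer_cost(m_j, n_f, n_t, b):
    """Eq. 20: seconds to rebalance m_j MB when growing from n_f to n_t nodes,
    limited by the bisection bandwidth 2*b*n_f^(2/3) of the 3D-torus cuboid."""
    data_mb = (m_j / n_f - m_j / n_t) * n_f          # total data leaving the original nodes
    return data_mb / (2 * b * n_f ** (2.0 / 3.0))

def boot_cost(n_f, n_t):
    """Eq. 21: linear boot-time model fitted to Intrepid data."""
    return (n_t - n_f) * 0.01904 + 72.73

# Example: a 2 TB job grows from 512 to 1024 nodes over 175 MB/s links (assumed values).
m_j, n_f, n_t, b = 2_000_000.0, 512, 1024, 175.0
t_se = transfer_cost(m_j, n_f, n_t, b) + boot_cost(n_f, n_t)
print(f"t_c + t_b = {t_se:.1f} s")
```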
B. Experimental Setup

1) Job Datasets: The results presented in this section are based on the job logs [41] of Intrepid [42]. Intrepid is an IBM BG/P supercomputer with a total of 40,960 nodes and is installed at Argonne National Lab. The job trace spans over 8 months and has 68,936 jobs. We extracted 3 subsets of 1000 successive jobs each from the trace file to conduct our experiments. These subsets will be referred to as Set1, Set2, and Set3, and the starting job ids for these subsets are 1, 10500, and 27000, respectively. To measure the performance of PARM in the wake of diverse job arrival rates, we generated several other datasets from each of these sets by multiplying the arrival times of each job by γ, where γ ∈ [0.2−0.8]. Multiplication of the arrival times by γ increases the job arrival rate without changing the distribution of job arrival times and tells us how the data center throughput would change in case the load is increased, i.e., a faster job arrival rate.

2) Application Characteristics: Since most data centers do not report the power characteristics of running jobs, the application characteristics (σ, T_1, A, f_l, f_h, β, a and b) of the jobs in the Intrepid logs are not known. Hence, we sampled parameter values from the ranges defined by the applications in § VI-C and randomly assigned them to jobs in our datasets. Since these parameters are chosen from a diverse set of applications that range from CPU intensive to memory intensive, we believe them to be representative of real applications.

3) Node Range for Moldable/Malleable Jobs: Intrepid does not allow moldable/malleable jobs, and therefore the logs only specify the fixed number of nodes requested by the jobs. For jobs submitted to the PARM scheduler, we consider this number as the maximum number of nodes on which the job can be run (max(N_j)), and set min(N_j) = θ · max(N_j), where θ is randomly selected from the range [0.2−0.6].

4) Power Budget and CPU Power Levels: Information about the power consumption of Intrepid nodes is not publicly available. Therefore, we use the power values from our testbed cluster (described in § VI-B). With 116W as the maximum power consumption per node, the maximum power consumption of 40,960 nodes equals 4751360W. The SLURM scheduler will schedule on 40,960 nodes with each node running at the maximum power level. As in § VI-E, PARM uses 6 CPU power levels, P_j = {30, 33, 36, 44, 50, 60}W.
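A sketch of the dataset transformations just described—scaling arrival times by γ and deriving a moldability range from the logged node request—is shown below; the job records are made up stand-ins for Intrepid log entries.

```python
# Sketch of the dataset preparation in Section VII-B: scale arrival times by gamma
# and derive a moldability range [min(N_j), max(N_j)] from the logged node request.
import random

def scale_arrivals(jobs, gamma):
    """Multiplying arrival times by gamma < 1 increases the arrival rate."""
    return [dict(job, arrival=job["arrival"] * gamma) for job in jobs]

def add_node_range(job, rng):
    theta = rng.uniform(0.2, 0.6)                    # theta drawn from [0.2, 0.6]
    job["max_nodes"] = job["requested_nodes"]
    job["min_nodes"] = max(1, int(theta * job["requested_nodes"]))
    return job

rng = random.Random(0)
trace = [{"arrival": 120.0, "requested_nodes": 2048},    # toy stand-ins for log entries
         {"arrival": 300.0, "requested_nodes": 512}]
jobs = [add_node_range(j, rng) for j in scale_arrivals(trace, gamma=0.5)]
print(jobs)
```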
In all the job datasets, the average otherdatasetsfromeachofthesesetsbymultiplyingthearrival overhead for shrinking and expanding the jobs was less than times of each job by γ, where γ ∈[0.2−0.8]. Multiplication 1% of the time taken to execute the dataset. We controlled ofthearrivaltimeswithγ increasesthejobarrivalratewithout thesecostsbysettingf =500secs,i.e.,theschedulerwaited se changing the distribution of job arrival times and tells us how foratleast500secsbetweentwosuccessiveshrinkandexpand the data center throughput would change in case the load is operations for a job. Cost of optimizing the ILP was also very increased, i.e., faster job arrival rate. small. In the worst case, it took 15 secs to optimize an ILP, 700 baseline 700 baseline 700 baseline noSE noSE noSE mins)600 wSE mins)600 wSE mins)600 wSE me (500 me (500 me (500 Average completion ti234000000 Average completion ti234000000 Average completion ti234000000 100 100 100 0 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0 0.8 0.7 0.6 0.5 0.4 0.3 0.2 Arrival time scaling factor (γ) Arrival time scaling factor (γ) Arrival time scaling factor (γ) (a) Set1 (b) Set2 (c) Set3 Fig. 5: Comparing average completion times of baseline, noSE, and wSE on several datasets. which is negligible as compared to the frequency at which the scheduler was triggered. impTroavbeledVavsehroawges trheastpobnosthe ntiomSeEs,anwditwhSwEShEavoeustipgenriffiocrmanitnlgy e (mins) 112400 AverMagaex 11240000 mins) ccnoooornmSesExpis.plteaEetnnixodtelniycnugtotiiumtohtnpeesetrijfmoionberm(ni§nsoSVnwEoISSI-EEaAni)idn.ncDlawuledlSseEdpsai,ttteahdeetshseciptsosi.tsoetBsveehortatfhveesrianhadgrvi,newpkraoiSngoEger Completion Tim 1 680000 6810000000 mpletion Time ( abveecraaugseeeoxfetchuetifoanctttihmaett(hTeayblceanVr)u,nasmcoormepjoabresdsitmouSlLtaUneRoMusliys vergare 2400 240000 Max Co by intelligently selecting the number of nodes for each job. A 0 0 Speedups in Table V are the improvement in average com- 0.28 0.98 1.28 1.48 pletion time compared to the baseline scheme. A speedup of α 5.2X is obtained in the best case. We make the following two Fig.6:Average(leftaxis)andmaximum(rightaxis)completiontimes observations in the obtained speedups: for Set 1 for different values of (α) • Higher speedups for smaller values of γ: This implies that as the job arrival rate increases relative increase in average completion time of wSE is much less than the relative tradeoffbetweenmaximumcompletiontimeofanyjob(i.e.job increase in the average completion time of the baseline fairness)andtheaveragecompletiontimeofthejobs(i.e.data scheme. center throughput). Figure 6 shows the results from several • Smaller speedups in Set3 as compared to Set2: This is experiments where we varied α and measured its impact on becauseSet3doesnothaveenoughjobstokeepthemachine average and maximum completion times. As the value of α fully utilized for first half of Set3. (TableI)increases,themaximumcompletiontimedecreasesat thecostofincreaseinaveragecompletiontime.Therefore,the D. Comparison with Naive Overprovisioning parameter α can be be tuned by the data center administrator to control fairness and throughput. To show the benefits of using a sophisticated optimiza- tion methodology for intelligent power allocation in PARM F. How Much Profile Data is Sufficient? 
(§ IV-A), we compare PARM with a naive strategy in which allnodesintheoverprovisionedsystemareallocatedthesame Number of binary variables in the PARM ILP formulation CPU power and the jobs are scheduled using the SLURM (xj,n,p) are proportional to the number of CPU power levels baseline scheduling policy. For instance, with CPU power of of the jobs (Pj). Larger number of such power levels increase 30W, the naive strategy can use up to (cid:98) 4751360 (cid:99) = 55248 the solution space of the ILP and hence the quality of the 30+18+38 nodes. Table VI gives the speedup of wSE over the naive solution. However, larger the number of variables, longer it strategy executed with different CPU power caps. Significant takes to solve the ILP. As mentioned earlier in § VII-C, the speedupsinTableVIclearlydemonstratethatintelligentpower cost of solving the ILP in all of our experiments was 15 secs allocation to the jobs is essential and benefits of PARM are in the worst case when |Pj|= 6 and the job queue J had not coming merely from overprovisioning of nodes. 200 jobs. In this section, we show the impact of |Pj| on the completion times. Figure 7 shows the average and maximum completion times of job dataset Set1(γ = 0.5) as the number E. Analyzing Tradeoff Between Fairness and Throughput of CPU power levels (P ) are increased from 2 to 8. For j We introduced the term w in the objective function of example, |P |=2 means that a job can execute either at 30W j j the scheduler’s ILP (Eq. 1) to prevent starvation of jobs with orat60WCPUpower.Theaverageandmaximumcompletion low power-aware speedup. In this section, we analyze the times decreases as |P | goes from 1 to 6 and the improvement j