ABSTRACT

AGRAWAL, ABHINAV RAJIV. Reducing Checkpoint/Restart Overhead using Near Data Processing for Exascale System. (Under the direction of James Tuck.)

With the increasing size and complexity of high-performance computing (HPC) systems built to achieve exascale performance, the system mean time to interrupt (system MTTI) is projected to decrease. To maintain the performance efficiency of the system, checkpoints need to be stored at a faster rate when using checkpoint/restart (C/R) for mitigation. This in turn requires lower checkpoint commit and restore times. The requirement for lower checkpoint commit and restore times is aggravated by the increasing ratio of checkpoint size to I/O bandwidth. To overcome this, prior work has proposed multilevel (hierarchical) checkpoint schemes that involve frequent checkpoint writes to faster node-local storage with occasional writes to slower global I/O-based storage (e.g., disk). However, due to the increasing cost of writing/reading checkpoints to/from global I/O-based storage, this technique may not scale well with systems approaching exaflops performance. While an I/O or storage hierarchy alleviates the performance cost by reducing I/O access times (including for checkpoint/restart), moving large data between storage in different levels of the hierarchy adds overhead. Near data processing (NDP) has been shown to be effective in reducing the amount of data movement in many applications by performing computations closer to data, thus reducing the overhead. In addition, offloading computations of some applications from the host processors to NDP has been shown to improve performance. In this work we show how NDP can be leveraged to improve C/R performance. We propose offloading the process of writing checkpoints to global I/O from the main compute cores to NDP. We also explore opportunities for additional optimizations using NDP to further reduce checkpoint overheads. Overall, our approach eliminates the performance cost of writing checkpoints to I/O, as these operations are performed by NDP.
We evaluate the performance of our novel application of NDP to reduce checkpoint/restart cost and compare it to existing checkpoint/restart optimizations. For two-level checkpoint schemes (i.e., checkpoints saved to local storage and remote I/O nodes), our evaluation for a projected exascale system shows that a baseline system (without NDP) spends nearly half its time writing checkpoints to I/O, restoring from a checkpoint, or re-executing lost work. With NDP for offloading checkpoint management and compression, the host processor is able to increase its progress rate from 51% to 78% (i.e., a >50% speedup in application performance).

We further explore how checkpoint compression can be combined with multilevel checkpointing. We perform a compression study and discuss the compression performance required to make it beneficial to add compression to all levels of multilevel checkpointing. We analyze the C/R performance and other benefits of this technique. Our data show that multilevel checkpointing combined with compression at all levels improves the efficiency of a system with C/R to 73%, compared to 35% for multilevel checkpointing without compression. The efficiency of multilevel checkpointing with compression is further improved to 89% when using NDP to offload certain C/R tasks.

Finally, we explore how the two approaches, compression at all levels of multilevel checkpointing and the use of NDP, can be combined. Adding compression to all levels of multilevel checkpointing results in compressed checkpoint data being available in local storage. We therefore evaluate the role and benefit of NDP for further compressing checkpoint data before writing it to global storage. In addition to evaluating the performance overhead, we also estimate the energy and hardware cost of the various C/R configurations we discussed. Our cost efficiency analysis shows that adding checkpoint compression to improve progress rate is a more efficient solution than increasing the bandwidth of node-local storage. We also show that a configuration that leverages NDP to offload the task of writing data to global I/O has higher cost efficiency than a configuration that performs checkpoint compression at each level of multilevel checkpointing.
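The progress-rate figures quoted in the abstract can be related to application speedup with a short calculation. The sketch below assumes that progress rate is the fraction of wall-clock time spent on useful work, so that for a fixed amount of work the speedup is the ratio of the two rates. The function name `speedup` and the case labels are illustrative and are not taken from the thesis.

```python
def speedup(rate_before: float, rate_after: float) -> float:
    """Speedup implied by a change in progress rate.

    Assumption: progress rate = useful time / wall-clock time, so the
    wall-clock time for a fixed amount of work scales as 1 / rate.
    """
    if not (0.0 < rate_before <= 1.0 and 0.0 < rate_after <= 1.0):
        raise ValueError("progress rates must lie in (0, 1]")
    return rate_after / rate_before


# Progress rates (efficiencies) quoted in the abstract; labels are illustrative.
cases = {
    "two-level baseline -> NDP offload + compression": (0.51, 0.78),
    "multilevel -> multilevel + compression": (0.35, 0.73),
    "multilevel + compression -> with NDP offload": (0.73, 0.89),
}

for label, (before, after) in cases.items():
    print(f"{label}: {speedup(before, after):.2f}x")
```

Under this assumption, the first case works out to roughly 1.53x, which is consistent with the ">50% speedup" stated in the abstract.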
© Copyright 2017 by Abhinav Rajiv Agrawal
All Rights Reserved

Reducing Checkpoint/Restart Overhead using Near Data Processing for Exascale System

by
Abhinav Rajiv Agrawal

A dissertation submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Computer Engineering

Raleigh, North Carolina
2017

APPROVED BY:
Gregory Byrd
Eric Rotenberg
Frank Mueller
James Tuck
Chair of Advisory Committee

DEDICATION

To my parents - Rajni and Rajiv Agrawal.

ACKNOWLEDGEMENTS

This research was made possible by the support and guidance of many people: my advisor, research group members, collaborators, family and friends.

Foremost, I would like to express my sincere gratitude to my advisor, Dr. James Tuck, for his constant support during my Ph.D. studies. I would like to thank him for his guidance and patience while mentoring me in my research work. I am grateful to Dr. Tuck for allowing me to work on my research with enough independence and flexibility.

I would like to thank my dissertation committee members, Dr. Gregory Byrd, Dr. Eric Rotenberg and Dr. Frank Mueller, for their service on my committee as well as for their insightful comments, feedback and advice.

I would also like to thank Gabriel Loh for collaborating with me on this work and for his advice during my internship.

My sincere thanks also go to Bagus Wibowo for helping with my research as well as for the many stimulating discussions and late nights before deadlines. Many thanks to my fellow labmates: Joonmoo Huh, Amro Awad, Hussein Elnawawy, Vinesh Srinivasan and Seunghee Shin. Thanks to Gayatri Powar for proofreading many paper and report drafts.

Lastly, I would like to thank my parents for instilling in me the importance of education from a young age and supporting me throughout my academic journey. This accomplishment is as much theirs as it is mine.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Chapter 1 INTRODUCTION
  1.1 Overview
  1.2 Existing C/R Optimization Techniques
  1.3 Adding Checkpoint Compression to Multilevel Checkpointing
  1.4 Leveraging NDP to Improve C/R Efficiency
  1.5 Contributions
  1.6 Organization of This Thesis
Chapter 2 BACKGROUND AND RELATED WORK
  2.1 Checkpoint/Restart
    2.1.1 Coordinated Checkpoint/Restart
  2.2 Checkpoint/Restart Overhead
    2.2.1 Failure Rate
    2.2.2 Checkpoint Size
    2.2.3 Progress Rate or C/R Efficiency
  2.3 Checkpoint/Restart Optimization Techniques
    2.3.1 Increase Checkpoint Commit Bandwidth
    2.3.2 Reduce Checkpoint Data Size
  2.4 Near Data Processing
Chapter 3 SCALING STUDY
  3.1 Overview
  3.2 Exascale System Projection
  3.3 MTTI Projection
  3.4 Checkpoint/Restart Overhead with no Optimization
Chapter 4 MULTILEVEL CHECKPOINTING WITH COMPRESSION
  4.1 Introduction
    4.1.1 Overview
    4.1.2 Multilevel Checkpointing
    4.1.3 Adding Checkpoint Compression to Multilevel C/R
  4.2 Compression Study
    4.2.1 Tools and Methodology
    4.2.2 Checkpoint Compression Speed And Factor
    4.2.3 Selecting Utility for Checkpoint Compression
  4.3 Evaluation
    4.3.1 Methodology
    4.3.2 Checkpoint/Restart Overhead Components
    4.3.3 Progress Rate Comparison
    4.3.4 C/R Overhead Breakdown (by Local and I/O Level)
  4.4 Summary
Chapter 5 LEVERAGING NDP FOR CHECKPOINT/RESTART
  5.1 Compute Node with NDP
    5.1.1 Operation of Multilevel Checkpointing with NDP
    5.1.2 NDP for Checkpoint Data Compression
  5.2 NDP Performance Requirements
    5.2.1 Configuring NDP for Compression
  5.3 Evaluation
    5.3.1 Methodology
    5.3.2 Checkpoint/Restart Overhead Components
    5.3.3 Progress Rate Comparison
    5.3.4 C/R Overhead - Breakdown (4% I/O Recovery)
    5.3.5 C/R Overhead - Sensitivity Study
  5.4 Summary
Chapter 6 PERFORMANCE, POWER AND COST ANALYSIS FOR COMBINATION OF CHECKPOINT/RESTART OPTIMIZATIONS
  6.1 Introduction
  6.2 Compression Study
    6.2.1 Tools and Methodology
    6.2.2 Data: Compression Speed and Factor
    6.2.3 Selecting Utility for Checkpoint Compression using NDP
  6.3 Performance Evaluation
    6.3.1 Methodology
    6.3.2 Progress Rate Comparison
    6.3.3 C/R Overhead - Breakdown (15% I/O Recovery)
  6.4 Methodology - Cost Analysis
    6.4.1 Energy Cost
    6.4.2 Hardware Cost
  6.5 Results - Cost Analysis
    6.5.1 Absolute Cost Breakdown
    6.5.2 Cost Performance Ratio
Chapter 7 CONCLUSION
BIBLIOGRAPHY

LIST OF TABLES

Table 3.1 Exascale system projection scaled from the Titan Cray XK7 supercomputer.
Table 4.1 Checkpoint Data Details. Second column shows the size of total checkpoint data collected for each mini-app in gigabytes. Further columns show compression speed for checkpoint data using different utilities and compression levels on the HDD and SSD system. Compression speed is for a single thread of each utility. Value inside () is the compression level.
Table 4.2 Checkpoint commit and restore time in seconds for all compression utilities. Checkpoint size for all mini-apps is set to 112 GB per compute node. The 'I/O' column contains checkpoint times when checkpoints are compressed and saved to global I/O storage. The 'L/S' and 'L/F' columns contain checkpoint times when checkpoints are compressed and saved to slow compute node local storage (5 GB/s) and fast compute node local storage (15 GB/s) respectively. Note that the checkpoint time values in the "Average" row are not the average values of the seven mini-apps, but the checkpoint time if the performance model is simulated using the average compression factor and compression speed from Figure 4.1. Note that the checkpoint commit/restore time in the absence of compression would be I/O: 1120 s, L/S: 22.4 s and L/F: 7.47 s.
Table 4.3 C/R parameters for evaluation using performance model.
Table 5.1 The required compression speed, required number of processor cores in NDP and the smallest possible checkpoint interval to I/O based on average compression factor and speed.
Table 5.2 C/R parameters for evaluation using performance model.
Table 6.1 Checkpoint compression data. lz4-compressed data of 7 mini-apps is compressed again using various compression utilities. The first column shows the size of lz4-compressed checkpoint data used to collect compression parameters. Columns with header 'F' contain compression factor and columns with header 'S' contain compression speed in MB/s. Compression speed is the speed at which lz4-compressed data is compressed using various utilities.
Table 6.2 Cumulative or equivalent checkpoint compression data for compression after lz4 compression. lz4-compressed data of 7 mini-apps is compressed again using various compression utilities. Compression factor in this table is a measure of the cumulative reduction in checkpoint size after compression using lz4 and the utility in the first row of the corresponding column. Compression speed is an equivalent compression speed, i.e., the speed if the uncompressed checkpoint data were being compressed in the same amount of time as the lz4-compressed checkpoint data is being compressed.
Table 6.3 Checkpoint commit time in seconds for all compression utilities for 2 scenarios. UnC: Uncompressed checkpoint data compressed by NDP (Scenario-1); uncompressed checkpoint size for all mini-apps is set to 112 GB per compute node. Comp: lz4-compressed checkpoint data compressed by NDP (Scenario-2); checkpoint size is the size if 112 GB of checkpoint data of the corresponding mini-app is compressed using lz4. Note that the checkpoint time values in the "Average" row are not the average values of the seven mini-apps, but the checkpoint time if the performance model is simulated using the average compression factor and compression speed from Figure 4.1.
Table 6.4 C/R parameters for performance, power and cost evaluation of multilevel checkpointing combined with compression and NDP.
Table 6.5 Power and cost parameters.
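The uncompressed commit/restore times quoted in the Table 4.2 caption (L/S: 22.4 s, L/F: 7.47 s) can be reproduced with a simple size-over-bandwidth calculation. The sketch below assumes that commit time is checkpoint size divided by sustained bandwidth, with no latency or contention terms. The function and constant names are illustrative, and the per-node global I/O bandwidth is derived here from the quoted 1120 s figure; it is not stated in the caption.

```python
def commit_time_s(checkpoint_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to write (or read) a checkpoint, assuming a pure bandwidth model."""
    if checkpoint_gb < 0 or bandwidth_gb_per_s <= 0:
        raise ValueError("size must be >= 0 and bandwidth must be > 0")
    return checkpoint_gb / bandwidth_gb_per_s


# Values quoted in the Table 4.2 caption.
CHECKPOINT_GB = 112.0      # checkpoint size per compute node
LOCAL_SLOW_GB_S = 5.0      # slow node-local storage (L/S)
LOCAL_FAST_GB_S = 15.0     # fast node-local storage (L/F)
IO_COMMIT_TIME_S = 1120.0  # uncompressed commit time to global I/O

print(f"L/S: {commit_time_s(CHECKPOINT_GB, LOCAL_SLOW_GB_S):.2f} s")
print(f"L/F: {commit_time_s(CHECKPOINT_GB, LOCAL_FAST_GB_S):.2f} s")

# Derived, not stated in the caption: per-node share of global I/O bandwidth
# implied by the 1120 s figure under the same pure bandwidth model.
implied_io_gb_s = CHECKPOINT_GB / IO_COMMIT_TIME_S
print(f"Implied per-node I/O bandwidth: {implied_io_gb_s:.2f} GB/s")
```

Under this model the local-storage times match the caption, and the global I/O figure corresponds to about 0.1 GB/s per node, which illustrates why commits to global I/O dominate the overhead.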
Description: