
Cache and Interconnect Architectures in Multiprocessors PDF

285 Pages·1990·14.447 MB·English

Preview Cache and Interconnect Architectures in Multiprocessors

CACHE AND INTERCONNECT ARCHITECTURES IN MULTIPROCESSORS

edited by
Michel Dubois, University of Southern California
and
Shreekant S. Thakkar, Sequent Computer Systems

KLUWER ACADEMIC PUBLISHERS
Boston/Dordrecht/London

Distributors for North America: Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell, Massachusetts 02061 USA

Distributors for all other countries: Kluwer Academic Publishers Group, Distribution Centre, Post Office Box 322, 3300 AH Dordrecht, THE NETHERLANDS

Library of Congress Cataloging-in-Publication Data
Cache and interconnect architectures in multiprocessors / [edited] by Michel Dubois and Shreekant S. Thakkar. p. cm.
Papers presented at a workshop titled Cache and Interconnect Architectures in Multiprocessors, held in Eilat, Israel, May 25-26, 1989.
Includes index.
ISBN-13: 978-1-4612-8824-4
e-ISBN-13: 978-1-4613-1537-7
DOI: 10.1007/978-1-4613-1537-7
1. Computer network protocols-Congresses. 2. Multiprocessors-Congresses. 3. Computer network architectures-Congresses. I. Dubois, Michel, 1953- . II. Thakkar, S. S. III. Title: Interconnect architectures in multiprocessors.
TK5105.5.C33 1990 004.S-dc20 90-37022 CIP

Copyright © 1990 by Kluwer Academic Publishers
Softcover reprint of the hardcover 1st edition
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell, Massachusetts 02061.

Contents

Preface

TLB CONSISTENCY AND VIRTUAL CACHES
  The Cost of TLB Consistency
    Patricia J. Teller
  Virtual-Address Caches in Multiprocessors
    Michel Cekleov, Michel Dubois, Jin-Chin Wang, and Faye A. Briggs

SIMULATION AND PERFORMANCE STUDIES: CACHE COHERENCE
  A Critique of Trace-Driven Simulation for Shared-Memory Multiprocessors
    Philip Bitar
  Performance of Symmetry Multiprocessor System
    Shreekant Thakkar
  Analysis of Cache Invalidation Patterns in Shared-Memory Multiprocessors
    Anoop Gupta and Wolf-Dietrich Weber
  Memory-Access Penalties in Write-Invalidate Cache Coherence Protocols
    Jin-Chin Wang and Michel Dubois
  Performance of Parallel Loops using Alternate Cache Consistency Protocols on a Non-Bus Multiprocessor
    Russell M. Clapp, Trevor Mudge, and James E. Smith
  Predicting the Performance of Shared Multiprocessor Caches
    Hendrik Goosen and David Cheriton

CACHE COHERENCE PROTOCOLS
  The Cache Coherence Protocol of the Data Diffusion Machine
    Erik Hagersten, Seif Haridi, and David H.D. Warren
  SCI (Scalable Coherent Interface) Cache Coherence
    David V. James

INTERCONNECT ARCHITECTURES
  Performance Evaluation of Wide Shared Bus Multiprocessors
    Andy Hopper, Alan Jones, and Dimitris Lioupis
  Crossbar-Multi-processor Architecture
    Vason Srini
  "CHESS" Multiprocessor - A Processor-Memory Grid for Parallel Programming
    Dimitris Lioupis and Nikos Kanellopoulos

SOFTWARE CACHE COHERENCE SCHEMES
  Software-directed Cache Management in Multiprocessors
    Hoichi Cheong and Alexander Veidenbaum

Index

Preface

Cache And Interconnect Architectures In Multiprocessors
Eilat, Israel
May 25-26 1989

Michel Dubois, University of Southern California
Shreekant S. Thakkar, Sequent Computer Systems

The aim of the workshop was to bring together researchers working on cache coherence protocols for shared-memory multiprocessors with various interconnect architectures. Shared-memory multiprocessors have become viable systems for many applications. Bus-based shared-memory systems (e.g., Sequent's Symmetry, Encore's Multimax) are currently limited to 32 processors.
The first goal of the workshop was to learn about the performance of applications on current cache-based systems. The second goal was to learn about new network architectures and protocols for future scalable systems. These protocols and interconnects would allow shared-memory architectures to scale beyond current limitations.

The workshop had 20 speakers who talked about their current research. The discussions were lively and cordial enough to keep the participants away from the wonderful sand and sun for two days. The participants got to know each other well and were able to share their thoughts in an informal manner. The workshop was organized into several sessions. The summary of each session is described below. This book presents revisions of some of the papers presented at the workshop.

Session 1: Cache and TLB Consistency Protocols

Michael Carlton talked on "Efficient Cache Coherency for Multiple Bus Multiprocessor Architectures." He described the work in progress at Berkeley on a scalable shared-memory architecture for Parallel Prolog. This proposed architecture is a multiple bus multiprocessor, an extension of current bus-based shared-memory multiprocessors. It is similar to, but more general than, the Wisconsin Multicube. The coherency protocol uses both snooping and directory schemes. Their architecture takes advantage of the locality of processor references on a single bus and supports broadcast messages over a bus using a snooping cache coherency protocol. A directory-style cache coherence scheme is used to ensure correctness among buses. Mike sparked a lively discussion when he reviewed some of the design decisions involved in the development of the protocol.

Gurindar Sohi was the next speaker and he talked on "Cache Coherence Mechanisms for Multiprocessors with Arbitrary Interconnects." The basic mechanism is a distributed cache directory that is maintained as a doubly-linked list across the system. The proposed coherence mechanism requires much less memory than an equivalent main-memory directory-based scheme.
The scheme obviates the need for multi-level inclusion in hierarchical multiprocessors; it works well in cluster-based systems where the individual clusters are bus-based multiprocessors.

Pat Teller was the final speaker in the session. Pat's talk was refreshingly different from the majority of the talks since she addressed the problem of "Consistency-Ensuring TLB Management and Its Scalability," a rarely discussed topic. She described several consistency-ensuring methods of managing TLBs in a shared-memory multiprocessor system. These methods differ not only in strategy but also in their generality, performance, and scalability. The performance of such a management scheme can be quantified by examining its effect on TLB miss rates, page fault rates, memory traffic, and execution time. She discussed the pros and cons of each of the described TLB management schemes and outlined a methodology for comparing them.

Session 2: System Architectures

Erik Hagersten gave an interesting talk on "The Data Diffusion Machine" which is another architecture to support Parallel Prolog. This is a hierarchically-organized architecture where the memory is physically distributed and globally addressed. A block of memory may reside in any processor memory and there may be multiple copies of the same block, just as in a cache-based multiprocessor. The processors and their memory are at the leaves of a tree-like hierarchy and the branches form the clusters of processors. The clusters interface through directory caches. Erik described the coherency protocol for this system.

Rae McLellan described the implementation of the ISM multiprocessor, supporting up to sixteen CRISP processors on a single backplane. Among the features of this system are multi-level caches accessed with virtual addresses. A new term, "snarfing", was coined to refer to a bus-watching mechanism which reduces contention for synchronization primitives.

Vason Srini talked about the "Xbar MultiProcessor (XMP) Architecture." Vason outlined the design of a massively parallel, cache-coherent shared-memory system.
It is based on bus-based shared-memory multiprocessors interconnected by a low-latency crossbar switch. His talk focused on the implementation of the crossbar switch.

Session 3: Bus/Network Architectures

Trevor Mudge was supposed to talk on "Cache Behavior in a Logical Shared-Bus Multiprocessor." However, he had to postpone his journey to Israel at the last moment. We missed him.

Alan Jones talked on "Multiprocessor for high-density Interconnects." Alan described simulation studies to evaluate the performance of multiple-bus and wide-bus multiprocessor architectures. The coherence protocol used in the study was based on the Berkeley model. The conclusion of the study was that multiple narrow buses perform better than wider buses.

Paul Sweazey gave a description of the "Directory-based Cache Coherence on SCI." This is the work of the IEEE Scalable Coherent Interface standards committee. The SCI project was started to overcome the scalability limits of bus-based shared-memory multiprocessors. The interconnect standard allows a system to connect an arbitrary number of nodes. The interconnect standard is topology independent. Paul described a linked-list based directory coherence protocol that is independent of the interconnect. This is similar to the scheme described earlier in the day by Gurindar Sohi.

To cap off the day, Dimitris Lioupis made a short impromptu presentation on the "Chess Multiprocessor", an architecture in which groups of processors share caches. Dimitris's presentation was mostly on the packaging of his machine. On his slide, the alternation of processors and caches looked like a checkerboard.

Session 4: Performance

Philip Bitar gave "A Critique of Trace-Driven Simulations for Shared-Memory Multiprocessors." Philip's contention was that it is difficult for trace-driven simulations to produce a valid representation of interacting processes in a multiprocessor system. Trace-driven simulations, like high-level modeling, must be verified by low-level simulation, or by actual execution.
His talk sparked a lively discussion of trace-driven simulation techniques used in several current studies. Some of the researchers behind these studies were in the audience and defended their approach.

Shreekant Thakkar described the "Performance of Cache Coherence Protocols." He talked on the performance of the Sequent Symmetry's write-through and copy-back protocols for several different (parallel, database and multi-user) applications. The performance study related bus utilization and cache coherence traffic to the application performance. These statistics were collected on a 30-processor Symmetry multiprocessor using an embedded hardware monitoring technique. The statistics revealed that the copy-back protocol allowed the system to be scaled to a large number of processors for many applications. The talk also described the performance of the current hardware synchronization mechanism and compared it with several software synchronization mechanisms.

Michel Dubois described his "Experience using analytical program models to predict cache overhead in parallel algorithms." An analytical model for the sharing behavior of parallel programs was derived and the model predictions were compared with execution-driven simulations of five concurrent programs for different numbers of processors and different block sizes.

Wen-Hann Wang talked on "Trace reductions and their applications to efficient trace-driven simulation for write-back caches." He approached the problem of the large time and space demands of cache simulations in two ways. First, the program traces are reduced to the extent that exact performance can still be obtained from these traces. Second, an algorithm is devised to produce performance results for many set-associative write-back caches in just one simulation run. The trace reduction and the efficient simulation techniques were extended to multiprocessor cache simulation. His simulation results show that this approach can significantly reduce the disk space needed to store the program traces. It can also dramatically speed up cache simulations and still produce the same results as non-reduced traces.
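As a rough illustration of how a reduced trace can still yield exact results, the sketch below filters a trace through a small direct-mapped cache and keeps only the references that miss in it. This is a simpler, well-known filtering idea and is not claimed to be the method presented in the talk: it counts misses only and ignores the write-back behavior the talk addressed. The function names, the modulo set indexing, and the sample trace of block addresses are all assumptions made for the example. Under LRU replacement, the miss count should be unchanged for any cache with the same block size whose number of sets is a multiple of the filter's, because every dropped reference touches the most recently used block of its set.

```python
from collections import OrderedDict


def strip_trace(trace, filter_sets):
    """Keep only references that miss in a direct-mapped filter cache.

    `trace` is a list of block addresses (hypothetical input format).
    """
    resident = {}  # set index -> block currently held by the filter
    reduced = []
    for block in trace:
        index = block % filter_sets
        if resident.get(index) != block:  # filter miss: keep the reference
            resident[index] = block
            reduced.append(block)
    return reduced


def count_misses(trace, num_sets, ways):
    """Count misses in a set-associative cache with LRU replacement."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in trace:
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)  # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)  # evict the least recently used
            lines[block] = True
    return misses


# Hypothetical trace of block addresses, for illustration only.
trace = [0, 0, 1, 1, 4, 0, 0, 5, 1, 8, 8, 0, 4, 4, 1]
reduced = strip_trace(trace, filter_sets=4)

# Compare miss counts on the full and reduced traces for several caches
# whose set counts are multiples of the filter's set count.
for num_sets in (4, 8):
    for ways in (1, 2, 4):
        full = count_misses(trace, num_sets, ways)
        short = count_misses(reduced, num_sets, ways)
        print(num_sets, ways, full, short)
```

The reduced trace is shorter than the original, and each printed row should show the same miss count in both columns. The guarantee does not extend to caches with fewer sets than the filter or with a different block size.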

Wolf-Dietrich Weber presented his study on "Cache Invalidation Patterns in Shared-memory Multiprocessors." This work was done to study the write invalidation behavior of parallel homogeneous applications. The results were extrapolated to see how they would affect a cluster-based shared-memory multiprocessor with a directory-based scheme. He observed that the write invalidation patterns were different for synchronization objects and data objects. This was a result of the coarse-grain process-based parallel programming model used for these applications. The study also showed that cache line size is an important factor in determining invalidation distributions.

Susan Eggers described her study of "The effect of Sharing on the Cache and Bus Performance of Parallel Programs." Susan's work is based on trace-driven simulations from traces taken on three parallel CAD applications. These applications are homogeneous applications using the coarse-grain process-based parallel programming model. Her studies showed that parallel programs incur significantly higher miss ratios and bus utilization than comparable uniprocessor programs. The sharing component of these metrics proportionally increases with both cache and block size. Some cache configurations determine both their magnitude and trend. The amount of overhead depends on the memory reference pattern to the shared data. Programs that exhibit good per-processor locality perform better than those with fine-grain sharing. This suggests that parallel software writers and better compiler technology can improve program performance through better memory organization of shared data.

Session 5: Synchronization, Virtual Address Caches and Hierarchy

James Goodman's talk was on "Synchronization, Serialization, and False Sharing". "False sharing" refers to the sharing of memory blocks by processes even in the absence of shared data in the block. It occurs when different words of a memory block are accessed by different processes. After demonstrating the effects of false sharing, James then presented a synchronization primitive called QOSB (Queue On Sync-Bit) which has been adopted in the Wisconsin Multicube multiprocessor.

Faye Briggs addressed the problem of "Virtual-Address Caches" in multiprocessors. Virtual-address caches have an advantage over physical-address caches in that no time is lost to translation in accessing the cached data. However, virtual caches cause problems due in part to synonyms, which are multiple virtual addresses pointing to the same physical address. In his talk, Faye compared several solutions based on their feasibility and their transparency to the software in both uniprocessor and multiprocessor systems. All these problems can be solved efficiently at the cost of more complex hardware and/or non-transparency to the software.

Hendrik Goosen talked on "The Role of A Shared 2nd Level Cache in a Scalable Shared Memory Multiprocessor." This work was done in the context of the VMP multiprocessor, a research project at Stanford. The original VMP design has been extended from a 2-level to a 3-level memory hierarchy of caches. This was done to allow a high degree of scalability by the addition of an intermediate shared second-level cache. The first-level per-processor cache caches code and data local to the current execution context within a program. The third-level cache is a virtual memory page cache, caching program files and data files between program executions. The talk outlined some possible roles of the second-level cache, the design implications and open issues.

Session 6: Compiler-Aided Cache Coherence

Alex Veidenbaum talked on "Compiler-assisted Cache Management in Multiprocessors." He discussed three different software-assisted cache coherence enforcement schemes for large shared-memory multiprocessor systems using interconnection networks.
All three schemes rely on a compiler to detect potential coherence problems and generate code to enforce coherence in a parallel program. The main goals are to maintain coherence without any interprocessor communication and to keep coherence enforcement overhead low. The former is achieved by using compile-time knowledge of the parallelism and data dependencies in a program. The latter is achieved by using special hardware to invalidate stale cache blocks in time independent of the number of such blocks. Cache words are allowed to become inconsistent with memory as long as the compiler decides it is safe to do so. This allows invalidations to be delayed beyond the time a new copy of a cache word has been generated until the time the word has to be invalidated. The three schemes differ in the complexity and power of the compiler detection algorithms, the complexity of the additional hardware, and the run-time support the hardware provides for deciding what to invalidate. Each scheme improves over the previous one in terms of the amount of unnecessary invalidations due to imprecision of compile-time detection, and achieves a higher hit ratio.

The last speaker of the workshop was Jean-Loup Baer who talked on "Self-invalidating cache coherence protocols." He reviewed briefly the cache coherence protocols that do not
