Request Combining in Multiprocessors with Arbitrary Interconnection Networks

Alvin R. Lebeck and Gurindar S. Sohi
Computer Sciences Department
University of Wisconsin-Madison
1210 W. Dayton Street
Madison, WI 53706
<alvy,sohi>@cs.wisc.edu

To appear in IEEE Transactions on Parallel and Distributed Systems

Abstract

Several techniques have been proposed to allow parallel access to a shared memory location by combining requests; they have one or more of the following attributes: requirement for a priori knowledge of the requests to combine, restrictions on the routing of messages in the network, or the use of sophisticated interconnection network nodes. We present a new method of combining requests that does not have the above requirements. We obtain this new method for request combining by developing a classification scheme for the existing methods of request combining. This classification scheme is facilitated by separating the request combining process into a two part operation: (i) determining the combining set, which is the set of requests that participate in a combined access, and (ii) distributing the results of the combined access to the members of the combining set. The classification of combining strategies is based upon which system component, processor elements or interconnection network, performs each of these tasks. Our approach, which uses the interconnection network to establish the combining set, and the processor elements to distribute the results, lies in an unexplored area of the design space. We also present simulation results to assess the benefits of the proposed approach.

1. Introduction

Arvind and Iannucci state that the design of a large-scale, shared memory multiprocessor must address two basic issues [2]:

(1) it must tolerate long latencies for memory requests,
(2) it must achieve unconstrained, yet synchronized, access to shared data.
While several techniques, for example caches and prefetching [24], and low level context switching [25], have been proposed to tolerate the latency of memory requests, heretofore the only known methods of allowing unconstrained, yet synchronized, access to shared data are implementations of request combining. The earliest published proposal for request combining was in the CHoPP system [28], where several read requests to a common memory location are combined in the interconnection network and are satisfied with only a single access of the memory location. When two read requests destined for the same memory location meet at a node of the network, the source of one of the requests is saved and only one read request is forwarded to memory. When the response of the read request arrives at the node where combining took place, two responses are sent back toward the processors.

The idea of combining read requests, or read combining, in CHoPP was extended in the NYU Ultracomputer to allow several types of requests to combine [6]. The Ultracomputer uses the Fetch&Φ primitive, where Φ is any associative and commutative operator. An enhanced interconnection network with the topology of an Omega network is proposed to perform combining on the Fetch&Φ primitive.

The Ultracomputer style of request combining is illustrated in Figure 1. When a Fetch&Φ(X, e) request meets a Fetch&Φ(X, j) request at a network node, combining takes place: e is saved in a wait buffer, an ALU in the node computes e Φ j, and the request Fetch&Φ(X, e Φ j) is forwarded to memory. When a response V is received from memory for the Fetch&Φ(X, e Φ j) request, decombining takes place: V is forwarded as a response to the Fetch&Φ(X, e) request, e Φ V is forwarded as a response to the Fetch&Φ(X, j) request, and e is removed from the wait buffer. Correct operation is guaranteed if the combining of requests satisfies the serialization principle: the final state of the system must be consistent with the servicing of all requests in some (unspecified) serial order [6].
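The combine/decombine steps at a single network node can be sketched as follows, using addition as the operator Φ. The class and method names are our own illustration, not from the paper, and all the buffering and routing machinery of a real node is elided:

```python
# Sketch of Ultracomputer-style combining at one network node,
# with Phi = + (i.e., Fetch&Add). Names here are illustrative only.

class CombiningNode:
    def __init__(self):
        self.wait_buffer = {}  # tag -> saved operand e, awaiting decombining

    def combine(self, addr, e, j):
        """Two Fetch&Add requests to `addr` meet: save e in the wait buffer
        and forward a single combined request carrying e + j toward memory."""
        tag = (addr, len(self.wait_buffer))      # unique tag for decombining
        self.wait_buffer[tag] = e
        return tag, ('Fetch&Add', addr, e + j)

    def decombine(self, tag, v):
        """Response V returns for the combined request: V answers the
        e-request, e + V answers the j-request; free the wait buffer slot."""
        e = self.wait_buffer.pop(tag)
        return v, e + v

node = CombiningNode()
tag, fwd = node.combine('X', 5, 7)      # Fetch&Add(X,5) meets Fetch&Add(X,7)
assert fwd == ('Fetch&Add', 'X', 12)    # one combined Fetch&Add(X,12) goes on
r1, r2 = node.decombine(tag, 100)       # memory returns V = 100 (old value of X)
assert (r1, r2) == (100, 105)           # X for one requester, X + 5 for the other
```

The final state (memory holds 112, the requesters see 100 and 105) is consistent with servicing Fetch&Add(X, 5) then Fetch&Add(X, 7) in serial order, as the serialization principle requires.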
There are three distinct features of the Ultracomputer style of request combining:

(1) Requests are combined on the forward trip through the network.
(2) State is saved in the network when requests are combined.
(3) Requests are decombined on the return trip through the network.

Combining in a network node first requires a comparison to determine if two requests are combinable: the two requests must be Fetch&Φ requests to the same memory location. This requires the use of comparators in the network nodes. When two requests are combined, each request is removed from the queue, the value of e is saved in the wait buffer, the operation e Φ j is carried out in the ALU, and the request Fetch&Φ(X, e Φ j) is placed in the queue to forward to memory. The wait buffer must be large enough to hold as many values as there are requests that can be combined at this node, else combining will not take place. On the return trip, each returning request searches the wait buffer and, if decombining must take place, appropriate actions are initiated. This implies that the return path must be identical to the forward path for decombining to take place, or the return path must have at least one node in common with the forward path: the node where the combining state is stored. Almasi and Gottlieb [1] give several examples of how such hardware combining can eliminate serial bottlenecks.

Several alternative proposals for request combining have appeared in the literature [8,11,19,21,29,32]. The primary focus of these efforts is on reducing the cost of the combining network. This is accomplished either by altering the topology of the combining network or by requiring the system software to reduce the amount of contention for shared data.

This paper has two purposes. The first is to develop a taxonomy that can be used to categorize combining methods proposed to date. This allows us to enumerate the issues involved, and to make a comparison of known techniques for combining requests. The second is to propose a new approach to request combining, one which can be used with arbitrary interconnection topologies.
We develop a taxonomy in section 2, and classify existing methods using this taxonomy. We observe that one area of the design space, which we call Interconnect-Processor combining, has not been explored for arbitrary interconnection networks. We explore this in section 3, where we present a new scheme for request combining. The potential of the new combining scheme is evaluated in section 4. Section 5 summarizes the paper and suggests directions for further research.

2. A Taxonomy for Request Combining

2.1. Parallel Prefix Computation and Request Combining

Kruskal, Rudolph, and Snir [14] observed that request combining is very similar to the problem of parallel prefix computation [15]. Given the elements x_1, x_2, ..., x_n, a prefix computation produces the results:

    r_1 = x_1
    r_n = r_{n-1} Φ x_n

where Φ is any associative operator. Computing the results in parallel is termed a parallel prefix computation [15].

To examine the similarity between request combining and parallel prefix, consider an example in which four processors add a constant, C, to a shared variable X and receive the previous value of X. Assume the processors simultaneously execute the atomic operation Fetch&Add(X, C). The values returned to the processors are X, X+C, X+2C, and X+3C. Regardless of the order that requests are serviced, memory has the final value of X+4C. This is simply the set of results {r_1, r_2, r_3, r_4, r_5} produced by a prefix computation on the set of elements {X, C, C, C, C} with the addition operator (+).

Based on this observation, we see that arbitrary request combining is a two part operation. The first part, or task of the combining method, is to determine the set of requests that are destined for the same memory location and need to be combined. We call this set of requests the combining set. The second task is to distribute the results of the combined access to the appropriate processors by performing a prefix computation on the combining set.
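The correspondence between Fetch&Add and the prefix computation in the example above can be checked with a short sketch (the function name is ours):

```python
# Prefix computation r_1 = x_1, r_i = r_{i-1} op x_i, computed sequentially
# here; a parallel prefix network produces the same results in parallel.

def prefix(elements, op):
    results, acc = [], None
    for x in elements:
        acc = x if acc is None else op(acc, x)
        results.append(acc)
    return results

# Four processors each perform Fetch&Add(X, C), with X = 10 and C = 3.
X, C = 10, 3
r = prefix([X, C, C, C, C], lambda a, b: a + b)
# r_1..r_4 are the values returned to the processors (X, X+C, X+2C, X+3C),
# and r_5 = X + 4C is the final value held by memory.
assert r == [10, 13, 16, 19, 22]
```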
A prefix computation network, such as the one proposed for scan primitives by Blelloch [3], can be used to distribute the results of the combined access. In such a network, state is saved on the forward trip through the prefix computation network, and the results are distributed on the return trip, very similar to the Ultracomputer approach towards combining. However, the use of a prefix computation network requires a priori knowledge of the combining set. Blelloch proposed the scan primitives for a Single Instruction Multiple Data (SIMD) paradigm where the elements on which to perform the prefix computation, in our case the combining set, are stored in an array. The array is distributed across the processors and participation in the prefix computation is based on the processors' activation status. Therefore, to use a prefix computation network to distribute the results of a combined access, the combining set must be established prior to insertion in the prefix computation network. Alternatively, in the Ultracomputer approach towards combining, the combining set is determined dynamically in the interconnect by comparing the addresses of requests on the forward trip through the network. The results are then distributed to the appropriate processors on the return trip through the network.

The two techniques use different system components to establish the combining set: the Ultracomputer uses the interconnect, whereas Blelloch uses a priori knowledge in the processor elements. We can therefore obtain a taxonomy for request combining by specifying which system component, processor elements or interconnect, performs each of the tasks involved in combining requests.

2.2. Classification of Existing Request Combining Strategies

Based upon how the combining set is determined, and how the results of the prefix computation are computed and distributed, the design space for request combining can be divided into four regions (Figure 2).
Interconnect-Interconnect Combining (IIC) covers schemes in which the interconnection network determines the combining set and distributes the results. In Processor-Interconnect Combining (PIC) the processors establish the combining set, and the interconnection network distributes the results. Schemes in which the processors perform both tasks are classified as Processor-Processor Combining (PPC). Finally, Interconnect-Processor Combining (IPC) indicates that the interconnection network determines the combining set, while the processors distribute the results. In the following subsections we discuss existing methods of request combining according to which region of the design space they belong to.

2.2.1. Interconnect-Interconnect Combining (IIC)

The CHoPP [28] and the NYU Ultracomputer [6] methods of request combining are instances of IIC: the interconnection network determines the combining set and distributes the results. The IBM RP3 [21] researchers proposed the basic ideas of the Ultracomputer method of combining for their implementation. However, the RP3 has two interconnection networks, one network which combines requests and one that services non-combining requests. A distinction is made between non-combinable and potentially combinable requests (typically synchronization requests), and the interconnect dynamically determines the combining set of the potentially combinable requests.

Two alternative techniques for IIC are presented by Tzeng [29] and Hsu and Yew [11]. Tzeng separates the interconnect into a routing section and a combining section. It is assumed that requests which may combine are distinguished from non-combining requests by examination of the opcode. Such requests are directed to the combining section of the network. The combining section of the network then determines the combining set. Hsu and Yew propose a single stage shuffle-exchange combining network in addition to a non-combining network.
These two proposals are similar to the IBM RP3 method of combining, but the difference in topology of the combining network reduces the hardware cost of the networks. However, the basic technique of request combining in all three schemes, the IBM RP3, Hsu and Yew, and Tzeng, is the same as the technique used in the Ultracomputer.

An interesting example of IIC read combining can be found in schemes with hierarchical cache/memory structures, such as Cache-Only Memory Architecture (COMA) machines [7,27], or Non-Uniform Memory Access (NUMA) machines with hierarchical caches [18,23,30]. Here read combining can be implemented by using a technique similar to the CHoPP method of read combining. A read miss of a cache block at one level of the hierarchy causes a request to be propagated to the next higher level in the hierarchy. Subsequent read misses of the same cache block at the same level in the hierarchy cause state to be saved; this state allows the data to be forwarded to the appropriate requesting processors when the response arrives from the higher level of the hierarchy.

2.2.2. Processor-Interconnect Combining (PIC)

An alternative approach to reducing the cost of the combining network is to determine the combining set prior to inserting the elements into the network. This is the approach taken by schemes that fall under the classification of PIC in the design space.

As with schemes that perform IIC, schemes performing PIC use the interconnection network to distribute the results of combined requests. However, the processor elements, and not the interconnection network, determine the combining set. Blelloch's prefix computation network [3], discussed in Section 2.1, and the control network of the Thinking Machines CM-5 [17] fall into this category.

Another form of PIC in a SIMD paradigm is proposed by Lipovski and Vaughan [19]. This implementation uses a modified carry-lookahead circuit to implement a prefix computation network which distributes the results. The combining set is determined by which processing elements are currently active. The prefix computation network may be extended for operation in a Multiple Instruction Multiple Data (MIMD) paradigm, though the authors do not explicitly state how this might be done.

An alternative technique for combining requests in a MIMD paradigm, proposed by Harrison [8], uses a synchronous prefix computation network. All requests at the same stage in the network combine. Therefore, the entire combining set must be inserted into the network in the same time slot. This is accomplished by broadcasting information about the combinable locations to the processors. Based on this information each processor determines the correct time slot for its request.

In addition to being restricted to use a priori knowledge of the combining set, the methods discussed in this section make use of a (parallel) prefix computation network to distribute results. Such networks are conceptually very similar to the Ultracomputer combining network: state must be saved on the forward trip through the network and requests are decombined on the return trip. Recall that this requires sophisticated interconnect nodes and restricts the routes of the return messages. These potentially undesirable features (associated with the result distribution) can be eliminated if we use the processor elements to perform the (parallel) prefix operation and distribute the results.

2.2.3. Processor-Processor Combining (PPC)

In schemes classified as IIC or PIC, the interconnect, as viewed from a shared memory location, forms a tree. The memory module is the root of the tree and the processors are the leaves of the tree. The nodes of a combining tree are realized by the implicit storage in the interconnect nodes, i.e., wait buffers. Another alternative is to use explicit storage in memory to construct the combining tree. This method of request combining, called software combining in the literature [5,31,32], is classified as PPC since the processors bear full responsibility for the combining of requests: the processors establish the combining set and distribute the results, and there are no demands on the network at all.
In software combining, one shared location is divided into L locations which constitute the storage for the nodes of the combining tree. Requests are combined as the processors traverse the combining tree. The result is that S/L processors access each of the L locations, rather than S processors accessing the single location. However, the L locations (nodes of the combining tree) must be distributed across the memory modules in order to alleviate excessive contention for a single memory module.

Yew, Tzeng, and Lawrie show how software combining can be used for barrier operations [32]. Goodman, Vernon and Woest [5] and Johnson [12] extend the work of Yew, Tzeng, and Lawrie to carry out arbitrary Fetch&Φ operations with a software combining tree. Tang and Yew also provide several algorithms for traversing a combining tree where the type of memory access determines which algorithm is chosen (e.g., barrier synchronization, semaphore, read combining) [31].

A consequence of implementing the combining tree with explicit memory locations is flexibility in the type of memory access. In addition to variable types of memory accesses, software combining permits the use of networks with arbitrary topologies and relatively unsophisticated nodes.

A requirement of implementing a software combining tree for access to a shared location is that the shared location must be known prior to program execution. Furthermore, if the latency of the combined access is to be minimized, the combining tree must be balanced; this requires a priori knowledge of the number of requests that may combine [31]. Moreover, since the combining tree is created based on the maximum number of requests that may combine, the latency to complete the combining operation is influenced by this maximum number: if only one request is accessing the shared location, it must traverse the entire combining tree.
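The fan-in of S requests over L combining-tree locations can be sketched sequentially as follows. This is a simplification: a real software combining tree needs per-node locks and waiting, and all names here are ours, not from the cited papers:

```python
# Sequential sketch of a two-level software combining tree for Fetch&Add:
# S requests fan in over L leaf locations, so roughly S/L requests contend
# for each leaf instead of S requests contending for one location.

def combine_at_leaves(requests, L):
    """Partition (pid, increment) requests over L leaf nodes; each leaf
    locally sums its group and forwards one combined request upward."""
    leaves = [[] for _ in range(L)]
    for pid, inc in requests:
        leaves[pid % L].append((pid, inc))
    return [(group, sum(inc for _, inc in group)) for group in leaves if group]

def serve(requests, L, shared):
    results = {}
    for group, total in combine_at_leaves(requests, L):
        base, shared = shared, shared + total   # root services one combined request
        for pid, inc in group:                  # leaf distributes results downward
            results[pid], base = base, base + inc
    return results, shared

results, final = serve([(p, 3) for p in range(8)], L=4, shared=10)
assert final == 10 + 8 * 3                      # memory sees every increment
assert sorted(results.values()) == [10 + 3 * k for k in range(8)]
```

The returned values form a valid serial order of the eight Fetch&Add operations, even though only four combined requests reach the root.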
For example, in a balanced (software combining) tree of height H, the single request must perform H memory accesses, each of which must traverse the interconnection network.

If the processors are responsible for establishing the combining set, they require a priori knowledge. If the burden of determining the combining set is placed back on the interconnect, the need for a priori knowledge is eliminated.

2.2.4. Interconnect-Processor Combining (IPC)

Heretofore, there are no proposed methods (with the exception of some special cases which we discuss in section 3) of request combining that use an arbitrary interconnection network to determine the combining set and the processor elements to distribute the results. In order to decide if such a scheme is worthy of investigation, the next section compares the issues of implementing combining under each of the classifications.

2.3. Issues in Combining Requests

There are several aspects to request combining that we have touched upon in the previous discussion that we reiterate to motivate IPC. They are:

- A priori knowledge of the combining set.
- Restrictions placed on the routes of messages.
- Sophistication of the interconnection network.
- Latency of the combining operation.

The need for a priori knowledge of the combining set requires the programmer to specify this information. Restrictions placed on the routing of messages limit the choices for the interconnection network topology. Sophistication of the interconnection network impacts both the cost and performance of the system: a high degree of sophistication might increase the design time and may either increase the latency for non-combining requests or require the addition of a second network. The latency of the combining operation is the time from when a processor generates a combinable request to the time when the result of the request is received.

In the next two sections we look at how each of these issues is affected by which system component establishes the combining set and which component distributes the results. Table 1 summarizes the following discussion.

2.3.1. Determining the Combining Set

The processor elements require a priori knowledge of the combinable locations in order to establish the combining set. For example, the nodes of a software combining tree [31] are defined during algorithm design. In contrast, the interconnect determines the combining set dynamically by comparing destination addresses of messages. The consequence of introducing comparators is a small increase in the sophistication of the nodes of the interconnect, which may result in a slight increase in the latency to complete the combining operation.

2.3.2. Result Computation and Distribution

Placing the responsibility of result distribution on the interconnect has two disadvantages. The first is an increase in the sophistication of the interconnect as a result of wait buffers and decombining logic in each interconnect node. The second, and perhaps more important disadvantage, is that the route which a return message may travel is restricted because it must visit the nodes where state was saved on the forward trip through the network.

The primary advantage of using the processor elements to distribute the results of the combining operation is that no restrictions are placed on the routes that messages may travel (no requirement to visit a particular node). Another advantage is that the sophistication of the interconnect nodes is not increased (no need for wait buffers). However, the sophistication of the processor elements (more accurately, the processor-network interface) may increase somewhat to handle the protocol needed to distribute results.

The latency of the combining operation, measured in the number of steps needed to carry out the operation, is a disadvantage of using the processors to distribute results. In Section 3 we show that the latency of distributing the results is logarithmic with respect to the number of requests in the combining set, assuming the system does not have broadcast capability.

Based on the above discussion we feel further investigation of IPC, the as yet unexplored area of the design space, is worthwhile.
Such schemes would use the interconnect to determine the combining set and use the processor elements to distribute the results. The above discussion also points out the following potential advantages of IPC combining: (i) no a priori knowledge of the combining set is required since the interconnect dynamically determines the combining set, (ii) no restrictions on the routes of messages since the processor elements distribute the results, and (iii) the nodes of the interconnection network require only a small amount of sophistication. The potential drawback of such a scheme, as compared to IIC, is the latency of the combining operation.

3. Interconnect-Processor Combining

We now consider implementations of request combining that fall into the unexplored region of the design space, Interconnect-Processor Combining (IPC). We initially consider two flavors of combinable operations: a restricted form of Fetch&Add (F&A), or Fetch&Increment [9,26], and the general F&A operation. In Fetch&Increment, or simply F&I, all participants add the same, constant value. In the general F&A, each participant could be adding a different value.

The rationale for the simpler F&I operation is the following: if, in the process of determining the combining set, it is also possible for a participant to determine its overall position in the combining set, i.e., its position in the serial order, then each participant can compute its value without further interactions with other participants. The following example illustrates this point.

Figure 3 shows a system with six processors (P0-P5) connected to a shared (synchronization) bus. Each processor is assigned one channel in the "bus" (the channel could be a wire in an electronic bus [26] or a specific frequency in an optical bus [9]). A given processor can read all channels, but can write to only its channel.

A processor generates a combinable request and broadcasts its intentions on the bus, by putting a "1" on its channel. All other processors that wish to participate in the access write a "1" on their respective channels.
At this point all processors can, by monitoring all the channels on the bus, determine the number and the identity of the processors which are going to participate in the combined access. The combining set is established by determining the participating processors; the ordering in the combining set is statically defined by priorities assigned to the channels.

Suppose four processors {P0, P1, P3, P5} would like to perform a F&I operation (increment by a constant C) on memory location X. At the point that the four processors have indicated their intentions to access X, the priority chain {S0, S1, S2, S3, S4, S5} is: {110101}. Each of the four processors then determines how many processors ahead of it in the priority chain are also participating in the combining operation. For example, P5 determines that there are three processors ahead of it in the priority chain. P3 sees two, P1 sees one, and P0 sees zero since it is the highest priority processor participating in the combined access. P0 takes responsibility for accessing X from memory. When P0 accesses X the other processors also read the value from the memory bus. Each processor is restricted to adding the same constant, C, to the shared location, and therefore each processor may compute its value locally based on the number of participants preceding it in the priority chain. P0 receives X, P1 computes X+C, P3 computes X+2C, P5 computes X+3C, and memory receives X+4C. One processor (or the memory controller
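The position computation in the F&I bus example above can be sketched as follows; the function name is ours, and the sketch only models the arithmetic, not the bus protocol itself:

```python
# Sketch of the F&I position computation: chain[i] == 1 iff processor Pi
# asserted its channel. Pi's position in the serial order is the number of
# higher-priority participants, so its value is X + position * C.

def fetch_and_increment_values(chain, X, C):
    values = {}
    for i, bit in enumerate(chain):
        if bit:
            ahead = sum(chain[:i])         # participants ahead in the chain
            values[f'P{i}'] = X + ahead * C
    memory = X + sum(chain) * C            # final value written back to X
    return values, memory

# Priority chain {110101}: P0, P1, P3, P5 participate; take X = 7, C = 2.
values, mem = fetch_and_increment_values([1, 1, 0, 1, 0, 1], X=7, C=2)
assert values == {'P0': 7, 'P1': 9, 'P3': 11, 'P5': 13}   # X, X+C, X+2C, X+3C
assert mem == 7 + 4 * 2                                   # memory receives X + 4C
```

Each participant computes its value locally from the chain it observed, so no further interaction is needed after the single memory access by P0.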