Ananta: Cloud Scale Load Balancing

Parveen Patel, Deepak Bansal, Lihua Yuan, Ashwin Murthy, Albert Greenberg, David A. Maltz, Randy Kern, Hemant Kumar, Marios Zikos, Hongyu Wu, Changhoon Kim, Naveen Karri

Microsoft

ABSTRACT

Layer-4 load balancing is fundamental to creating scale-out web services. We designed and implemented Ananta, a scale-out layer-4 load balancer that runs on commodity hardware and meets the performance, reliability and operational requirements of multi-tenant cloud computing environments. Ananta combines existing techniques in routing and distributed systems in a unique way and splits the components of a load balancer into a consensus-based reliable control plane and a decentralized scale-out data plane. A key component of Ananta is an agent in every host that can take over the packet modification function from the load balancer, thereby enabling the load balancer to naturally scale with the size of the data center. Due to its distributed architecture, Ananta provides direct server return (DSR) and network address translation (NAT) capabilities across layer-2 boundaries. Multiple instances of Ananta have been deployed in the Windows Azure public cloud with combined bandwidth capacity exceeding 1 Tbps. It is serving the traffic needs of a diverse set of tenants, including the blob, table and relational storage services. With its scale-out data plane we can easily achieve more than 100 Gbps throughput for a single public IP address. In this paper, we describe the requirements of a cloud-scale load balancer, the design of Ananta and lessons learnt from its implementation and operation in the Windows Azure public cloud.

Figure 1: Ananta Data Plane Tiers

Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems
General Terms: Design, Performance, Reliability
Keywords: Software Defined Networking; Distributed Systems; Server Load Balancing

1. INTRODUCTION

The rapid rise of cloud computing is driving demand for large scale multi-tenant clouds. A multi-tenant cloud environment hosts many different types of applications at a low cost while providing a high uptime SLA — 99.9% or higher [3, 4, 19, 10]. A multi-tenant load balancer service is a fundamental building block of such multi-tenant cloud environments. It is involved in almost all external and half of intra-DC traffic (§2) and hence its uptime requirements need to be at least as high as the applications' SLA, but often significantly higher to account for failures in other infrastructure services.

As a cloud provider, we have seen that cloud services put huge pressure on the load balancer's control plane and data plane. Inbound flows can be intense — greater than 100 Gbps for a single IP address — with every packet hitting the load balancer. The pay-as-you-go model and large tenant deployments put extremely high demands on real-time load balancer configuration — in a typical environment of 1000 hosts we see six configuration operations per minute on average, peaking at one operation per second. In our experience, the data plane and control plane demands drove our hardware load balancer solution into an untenable corner of the design space, with high cost, with SLA violations and with load balancing device failures accounting for 37% of all live site incidents.

The design proposed in this paper, which we call Ananta (meaning infinite in Sanskrit), resulted from examining the basic requirements and taking an altogether different approach. Ananta is a scalable software load balancer and NAT that is optimized for multi-tenant clouds. It achieves scale, reliability and any service anywhere (§2) via a novel division of the data plane functionality into three separate tiers. As shown in Figure 1, at the topmost tier routers provide load distribution at the network layer (layer-3) based on the Equal Cost MultiPath (ECMP) routing protocol [25]. At the second tier, a scalable set of dedicated servers for load balancing, called multiplexers (Mux), maintain connection flow state in memory and do layer-4 load distribution to application servers.
A third tier present in the virtual switch on every server provides stateful NAT functionality. Using this design, no outbound traffic has to pass through the Mux, thereby significantly reducing the packet processing requirement. Another key element of this design is the ability to offload multiplexer functionality down to the host. As discussed in §2, this design enables greater than 80% of the load balanced traffic to bypass the load balancer and go direct, thereby eliminating the throughput bottleneck and reducing latency. This division of the data plane scales naturally with the size of the network and introduces minimal bottlenecks along the path.

SIGCOMM'13, August 12–16, 2013, Hong Kong, China.
Copyright 2013 ACM 978-1-4503-2056-6/13/08 ... $15.00.

Ananta's approach is an example of Software Defined Networking (SDN) as it uses the same architectural principle of managing a flexible data plane via a centralized control plane. The controller maintains high availability via state replication based on the Paxos [14] distributed consensus protocol. The controller also implements real-time port allocation for outbound NAT, also known as Source NAT (SNAT).

Ananta has been implemented as a service in the Windows Azure cloud platform. We considered implementing Ananta functionality in hardware. However, with this initial Ananta version in software, we were able to rapidly explore various options in production and determine what functions should be built into hardware; e.g., we realized that keeping per-connection state is necessary to maintain application uptime due to the dynamic nature of the cloud. Similarly, a weighted random load balancing policy, which reduces the need for per-flow state synchronization among load balancer instances, is sufficient for typical cloud workloads. We consider the evaluation of these mechanisms, regardless of how they are implemented, to be a key contribution of this work.

More than 100 instances of Ananta have been deployed in Windows Azure since September 2011 with a combined capacity of 1 Tbps. It has been serving 100,000 VIPs with varying workloads. It has proven very effective against DoS attacks and minimized disruption due to abusive tenants. Compared to the previous solution, Ananta costs one order of magnitude less and provides a more scalable, flexible, reliable and secure solution overall.

There has been significant interest in moving middlebox functionality to software running on general-purpose hardware in both research [23, 24, 5] and industry [8, 2, 27, 21, 31]. Most of these architectures propose using either DNS or OpenFlow-enabled hardware switches for scaling. To the best of our knowledge, Ananta is the first middlebox architecture that refactors the middlebox functionality and moves parts of it to the host, thereby enabling use of network routing technologies — ECMP and BGP — for natural scaling with the size of the network. The main contributions of this paper to the research community are:

• Identifying the requirements and design space for a cloud-scale solution for layer-4 load balancing.

• Providing the design, implementation and evaluation of Ananta, which combines techniques in networking and distributed systems to refactor load balancer functionality in a novel way to meet scale, performance and reliability requirements.

• Providing measurements and insights from running Ananta in a large operational cloud.

2. BACKGROUND

In this section, we first consider our data center network architecture and the nature of traffic that is serviced by the load balancer. We then derive a set of requirements for the load balancer.

2.1 Data Center Network

Figure 2: Flat Data Center Network of the Cloud. All network devices run as Layer-3 devices, causing all traffic external to a rack to be routed. All inter-service traffic — intra-DC, inter-DC and Internet — goes via the load balancer.

Figure 2 shows the network of a typical data center in our cloud. A medium sized data center hosts 40,000 servers, each with one 10 Gbps NIC. This two-level Clos network architecture [11] typically has an over-subscription ratio of 1:4 at the spine layer. The border routers provide a combined capacity of 400 Gbps for connectivity to other data centers and the Internet. A cloud controller manages resources in the data center and hosts services. A service is a collection of virtual or native machines that is managed as one entity. We use the terms tenant and service interchangeably in this paper. Each machine is assigned a private Direct IP (DIP) address. Typically, a service is assigned one public Virtual IP (VIP) address, and all traffic crossing the service boundary, e.g., to the Internet or to back-end services within the same data center such as storage, uses the VIP address. A service exposes zero or more external endpoints that each receive inbound traffic on a specific protocol and port on the VIP. Traffic directed to an external endpoint is load-balanced to one or more machines of the service. All outbound traffic originating from a service is Source NAT'ed (SNAT) using the same VIP address as well. Using the same VIP for all inter-service traffic has two important benefits. First, it enables easy upgrade and disaster recovery of services since the VIP can be dynamically mapped to another instance of the service. Second, it makes ACL management easier since the ACLs can be expressed in terms of VIPs and hence do not change as services scale up or down or get redeployed.

2.2 Nature of VIP Traffic

Public cloud services [3, 4, 19] host a diverse set of workloads, such as storage, web and data analysis. In addition, there are an increasing number of third-party services available in the cloud. This trend leads us to believe that the amount of inter-service traffic in the cloud will continue to increase. We examined the total traffic in eight data centers for a period of one week and computed the ratio of Internet traffic and inter-service traffic to the total traffic. The result is shown in Figure 3. On average, about 44% (with a minimum of 18% and a maximum of 59%) of the total traffic is VIP traffic — it either needs load balancing or SNAT or both. Out of this, about 14% of traffic on average is to the Internet and the remaining 30% is intra-DC. The ratio of intra-DC VIP traffic to Internet VIP traffic is 2:1. Overall, we find that 70% of total VIP traffic is inter-service within the same data center. We further found that on average the ratio of inbound traffic to outbound traffic across our data centers is 1:1. The majority of this traffic is read-write traffic and cross-data-center replication traffic to our storage services. In summary, greater than 80% of VIP traffic is either outbound or contained within the data center. As we show in this paper, Ananta offloads all of this traffic to the host, thereby handling only 20% of the total VIP traffic.

Figure 3: Internet and inter-service traffic as a percentage of total traffic in eight data centers.

2.3 Requirements

Scale, Scale and Scale: The most stringent scale requirements can be derived by assuming that all traffic in the network is either load balanced or NAT'ed. For a 40,000 server network, built using the architecture shown in Figure 2, 400 Gbps of external traffic and 100 Tbps of intra-DC traffic will need load balancing or NAT. Based on the traffic ratios presented in §2.2, at 100% network utilization, 44 Tbps of traffic will be VIP traffic. A truly scalable load balancer architecture would support this requirement while maintaining low cost. While cost is a subjective metric, in our cloud, less than 1% of the total server cost would be considered low cost, so any solution that would cost more than 400 general-purpose servers is too expensive. At the current typical price of US$2500 per server, the total cost should be less than US$1,000,000. Traditional hardware load balancers do not meet this requirement, as their typical list price is US$80,000 for 20 Gbps capacity, without considering bulk discounts, support costs or redundancy requirements.

There are two more dimensions to the scale requirement. First, the bandwidth and number of connections served by a single VIP are highly variable and may go up to 100 Gbps and 1 million simultaneous connections. Second, the rate of change in VIP configuration tends to be very large and bursty, on average 12,000 configurations per day for a 1000 node cluster, with bursts of 100s of changes per minute as customer services get deployed, deleted or migrated.

Figure 4: Components of a traditional load balancer. Typically, the load balancer is deployed in an active-standby configuration. The route management component ensures that the currently active instance handles all the traffic.

3. DESIGN

3.1 Design Principles

Scale-out In-network Processing: Figure 4 illustrates the main components of a traditional load balancer. For each new flow, the load balancer selects a destination address (or source for SNAT) depending on the currently active flows and remembers that decision in a flow table. Subsequent packets for that flow use the state created by the first packet. Traditional NAT and load balancing algorithms (e.g., round-robin) require knowledge of all active flows, hence all traffic for a VIP must pass through the same load balancer. This forces the load balancer into a scale-up model. A scale-up or vertical scaling model is one where handling more bandwidth for a VIP requires a higher capacity box.

Network routers, on the other hand, follow a scale-out model. A scale-out or horizontal scaling model is one where more bandwidth can be handled by simply adding more devices of similar capacity. Routers scale out because they do not maintain any per-flow

Reliability: The load balancer is a critical component to meet the uptime SLA of applications. Services rely on the load balancer to monitor the health of their instances and maintain availability during planned and unplanned downtime.
Over many years of operation of our cloud we found that traditional 1+1 redundancy solutions, deployed as active/standby hardware load balancers, are unable to meet these high availability needs. The load balancer must support an N+1 redundancy model with auto-recovery, and the load balancing service must degrade gracefully in the face of failures.

Any Service Anywhere: In our cloud, applications are generally spread over multiple layer-2 domains and sometimes even span multiple data centers. The load balancer should be able to reach DIPs located anywhere on the network. Traditional load balancers provide some of their functionality, e.g., NAT, to DIPs only within a layer-2 domain. This fragments the load balancer capacity and makes them unusable in layer-3 based architectures.

Tenant Isolation: A multi-tenant load balancer is shared by thousands of services. It is critical that DoS attacks on one service do not affect the availability of other services. Similarly, an abusive service should not be able to affect the availability of NAT for other services by creating a large number of outbound connections. In our cloud environment, we often see services with a large number of outbound NAT connections due to bugs and poor application design. Furthermore, when the load balancer is under load, it is important that the load balancer provides each service its fair share of resources.

state that needs synchronization across routers, and therefore one can add or remove additional routers easily. Ananta's design reduces the in-network functionality needed for load balancing such that multiple network elements can simultaneously process packets for the same VIP without requiring per-flow state synchronization.

This design choice is enabled because we can make certain assumptions about our environment. One of the key assumptions is that load balancing policies that require global knowledge, e.g., weighted round robin (WRR), are not required for layer-4 load balancing. Instead, randomly distributing connections across servers based on their weights is a reasonable substitute for WRR. In fact, weighted random is the only load balancing policy used by our load balancer in production. The weights are derived based on the size of the VM or other capacity metrics.

Offload to End Systems: Hypervisors in end systems can already do highly scalable network processing, e.g., ACL enforcement, rate limiting and metering.
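The weighted-random policy described above lends itself to a very small sketch (this is our illustration, not Ananta's code; the DIP names and weights are invented, and Python's `random.choices` stands in for whatever generator the Mux actually uses):

```python
import random

def pick_dip(dips, weights):
    # Each new connection is assigned independently at random, with
    # probability proportional to the DIP's weight (e.g., VM size).
    # No state needs to be synchronized between load balancer instances.
    return random.choices(dips, weights=weights, k=1)[0]

# A DIP with twice the weight should receive roughly twice the connections.
counts = {"dip-a": 0, "dip-b": 0}
for _ in range(100_000):
    counts[pick_dip(["dip-a", "dip-b"], [2, 1])] += 1
```

Unlike strict weighted round robin, no instance needs to know what the others have already assigned, which is exactly the property the scale-out design relies on.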
Ananta leverages this distributed scalable platform and offloads significant data plane and control plane functionality down to the hypervisor in end systems. The hypervisor needs to handle state only for the VMs hosted on it. This design choice is another key differentiator from existing load balancers. While on the one hand it enables natural scaling with the size of the data center, on the other hand it presents significant challenges in managing distributed state across all hosts and maintaining availability during failures of centralized components.

Figure 5: The Ananta Architecture. Ananta consists of three components — Ananta Manager, Ananta Mux and Host Agent. Each component is independently scalable. Manager coordinates state across Agents and Muxes. Mux is responsible for packet forwarding for inbound packets. Agent implements NAT, which allows all outbound traffic to bypass Mux. Agents are co-located with destination servers.

Figure 6: JSON representation of a simple VIP Configuration.

Figure 7: Load Balancing for Inbound Connections.

using the IP-in-IP protocol [18], setting the selected DIP as the destination address in the outer header (step 2). It then sends it out using regular IP routing at the Mux (step 3). The Mux and the DIP do not need to be on the same VLAN; they just need to have IP (layer-3) connectivity between them. The HA, located on the same physical machine as the target DIP, intercepts this encapsulated packet, removes the outer header, rewrites the destination address and port (step 4) and remembers this NAT state. The HA then sends the rewritten packet to the VM (step 5).

3.2 Architecture

Ananta is a loosely coupled distributed system comprising three
main components (see Figure 5) — Ananta Manager (AM), Multiplexer (Mux) and Host Agent (HA). To better understand the details of these components, we first discuss the load balancer configuration and the overall packet flow. All packet flows are described using TCP connections, but the same logic applies for UDP and other protocols using the notion of pseudo connections.

3.2.1 VIP Configuration

The load balancer receives a VIP Configuration for every VIP that it is doing load balancing and NAT for. A simplified VIP configuration is shown in Figure 6. An Endpoint refers to a specific transport protocol and port on the VIP that is load balanced to a set of DIPs. Packets destined to an Endpoint are NAT'ed to the DIP address and port. SNAT specifies a list of IP addresses for which outbound connections need to be Source NAT'ed with the VIP and an ephemeral port.

When the VM sends a reply packet for this connection, it is intercepted by the HA (step 6). The HA does a reverse NAT based on the state from step 4 and rewrites the source address and port (step 7). It then sends the packet out to the router towards the source of this connection. The return packet does not go through the Mux at all, thereby saving packet processing resources and network delay. This technique of bypassing the load balancer on the return path is known as Direct Server Return (DSR). Not all packets of a connection would end up at the same Mux; however, all packets for a single connection must be delivered to the same DIP. Muxes achieve this via a combination of consistent hashing and state management, as explained later in this section.

3.2.3 Outbound Connections

A unique feature of Ananta is a distributed NAT for outbound connections. Even for outbound connections that need source NAT (SNAT), Ananta ensures that outgoing packets do not need to go through Mux. Figure 8 shows how packets for an outbound SNAT

3.2.2 Inbound Connections

Figure 7 shows how packets destined for a VIP are load balanced and delivered to the DIP of a VM.
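Figure 6 itself is not reproduced in this extraction, but based on the description in §3.2.1 a simple VIP configuration might look roughly like the following (the exact JSON shape and field names here are our guess, not the paper's actual format):

```python
import json

# Hypothetical VIP configuration: one load-balanced endpoint plus a SNAT
# address list, matching the Endpoint/SNAT description of section 3.2.1.
vip_config = json.loads("""
{
  "vip": "100.0.0.1",
  "endpoints": [
    {"protocol": "tcp", "port": 80,
     "dips": [{"ip": "10.0.0.5", "port": 8080},
              {"ip": "10.0.0.6", "port": 8080}]}
  ],
  "snat": {"addresses": ["10.0.0.5", "10.0.0.6"]}
}
""")

# Inbound packets to tcp/80 on the VIP are NAT'ed to one of the DIPs;
# outbound connections from the SNAT addresses are rewritten to use the
# VIP and an ephemeral port.
```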
When a VIP is configured on Ananta, each Mux advertises a route to its first-hop router announcing itself as the next hop for that VIP¹. This causes the routers to distribute packets destined for the VIP across all the Mux nodes based on the Equal Cost MultiPath (ECMP) routing protocol [25] (step 1). Upon receiving a packet, the Mux chooses a DIP for the connection based on its load balancing algorithm, described later in this section. It then encapsulates the received packet

¹In reality, routes are advertised for VIP subnets due to small routing tables in commodity routers, but the same logic applies.

Figure 8: Handling Outbound SNAT Connections.

connection are handled. A VM sends a packet containing its DIP as the source address, port_d as the port and an external address as the destination address (step 1). The HA intercepts this packet and recognizes that it needs SNAT. It then holds the packet in a queue and sends a message to AM requesting an externally routable VIP and a port for this connection (step 2). AM allocates a (VIP, port_s) from a pool of available ports and configures each Mux with this allocation (step 3). AM then sends this allocation to the HA (step 4). The HA uses this allocation to rewrite the packet so that its source address and port are now (VIP, port_s). The HA sends this rewritten packet directly to the router. The return packets from the external destination are handled similarly to inbound connections. The return packet is sent by the router to one of the Mux nodes (step 6). The Mux already knows that DIP2 should receive this packet (based on the mapping in step 3), so it encapsulates the packet with DIP2 as the destination and sends it out (step 7). The HA intercepts the return packet and performs a reverse translation so that the packet's destination address and port are now (DIP, port_d). The HA sends this packet to the VM (step 8).

Figure 9: Fastpath Control and Data Packets. Routers are not shown for brevity. Starting with step 8, packets flow directly between source and destination hosts.

3.2.4 Fastpath

In order to scale to the 100s of terabits of bandwidth required by intra-DC traffic, Ananta offloads most of the intra-DC traffic to end systems. This is done by a technique we call Fastpath. The key idea is that the load balancer makes its decision about which DIP a new connection should go to when the first packet of that connection arrives. Once this decision is made for a connection, it does not change. Therefore, this information can be sent to the HAs on the source and destination machines so that they can communicate directly. This results in the packets being delivered directly to the DIP, bypassing the Mux in both directions, thereby enabling communication at the full capacity supported by the underlying network. This change is transparent to both the source and destination VMs.

To illustrate how Fastpath works, consider two services 1 and 2 that have been assigned virtual addresses VIP1 and VIP2 respectively. These two services communicate with each other via VIP1 and VIP2 using the algorithms for load balancing and SNAT described above. Figure 9 shows a simplified version of the packet flow for a connection initiated by a VM DIP1 (belonging to service 1) to VIP2. The source host of DIP1 SNATs the TCP SYN packet using VIP1 and sends it to VIP2 (step 1). This packet is delivered to a Mux2, which forwards the packet towards destination DIP2 (step 2). When DIP2 replies to this packet, it is SNAT'ed by the destination host using VIP2 and sent to Mux1 (step 3). This Mux uses its SNAT state and sends this packet to DIP1 (step 4). Subsequent packets for this connection follow the same path.

For Fastpath, Ananta configures Mux with a set of source and destination subnets that are capable of Fastpath. Once a connection has been fully established (e.g., the TCP three-way handshake has completed) between VIP1 and VIP2, Mux2 sends a redirect message to VIP1, informing it that the connection is mapped to DIP2 (step 5). This redirect packet goes to a Mux handling VIP1, which looks up its table to know that this port is used by DIP1. Mux1 then sends a redirect message towards DIP1 and DIP2 (steps 6 and 7 respectively). The HA on the source host intercepts this redirect packet and remembers that this connection should be sent directly to DIP2. Similarly, the HA on the destination host intercepts the redirect message and remembers that this connection should be sent to DIP1. Once this exchange is complete, any future packets for this connection are exchanged directly between the source and destination hosts (step 8).

There is one security concern associated with Fastpath — a rogue host could send a redirect message impersonating the Mux and hijack traffic. The HA prevents this by validating that the source address of a redirect message belongs to one of the Ananta services in the data center. This works in our environment since the hypervisor prevents IP spoofing. If IP spoofing cannot be prevented, a more dynamic security protocol such as IPSEC can be employed.

3.3 Mux

The Multiplexer (Mux) handles all incoming traffic. It is responsible for receiving traffic for all the configured VIPs from the router and forwarding it to the appropriate DIPs. Each instance of Ananta has one or more sets of Muxes called a Mux Pool. All Muxes in a Mux Pool have uniform machine capabilities and identical configuration, i.e., they handle the same set of VIPs. Having the notion of a Mux Pool allows us to scale the number of Muxes (data plane) independent of the number of AM replicas (control plane).

3.3.1 Route Management

Each Mux is a BGP speaker [20]. When a VIP is configured on Ananta, each Mux starts advertising a route for that VIP to its first-hop router with itself as the next hop. All Muxes in a Mux Pool are an equal number of layer-3 network hops away from the entry point of the data center. This ensures that the routers distribute traffic for a given VIP equally across all Muxes via the Equal Cost MultiPath (ECMP) routing protocol [25]. Running the BGP protocol on the Mux provides automatic failure detection and recovery. If a Mux fails or shuts down unexpectedly, the router detects this failure via the BGP protocol and automatically stops sending traffic to that Mux. Similarly, when the Mux comes up and has received state from AM, it can start announcing the routes and the router will start forwarding traffic to it. Muxes use the TCP MD5 [13] protocol for authenticating their BGP sessions.

3.3.2 Packet Handling

The Mux maintains a mapping table, called the VIP map, that determines how incoming packets are handled. Each entry in the mapping table maps a VIP endpoint, i.e., the three-tuple (VIP, IP protocol, port), to a list of DIPs. The mapping table is computed by AM and sent to all the Muxes in a Mux Pool.
When Mux receives a packet from the router, it computes a hash using the five-tuple from the packet header fields — (source IP, destination IP, IP protocol, source port, destination port). It then uses this hash to look up a DIP from the list of DIPs in the associated map. Finally, it encapsulates [18] the packet with an outer IP header — with itself as the source IP and the DIP as the destination IP — and forwards this packet to the DIP.

The encapsulation at the Mux preserves the original IP header and IP payload of the packet, which is essential for achieving Direct Server Return (DSR). All Muxes in a Mux Pool use the exact same hash function and seed value. Since all Muxes have the same mapping table, it doesn't matter which Mux a given new connection goes to; it will be directed to the same DIP.

3.3.3 Flow State Management

Mux supports two types of mapping entries — stateful and stateless. Stateful entries are used for load balancing and stateless entries are used for SNAT. For stateful mapping entries, once a Mux has chosen a DIP for a connection, it remembers that decision in a flow table.
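The hash-then-flow-table path just described can be sketched as follows (a simplified illustration only; Ananta's actual hash function, seed handling and table layout are not specified here):

```python
import hashlib

def five_tuple_hash(pkt, seed=0):
    # All Muxes must use the same hash function and seed so that a new
    # connection maps to the same DIP regardless of which Mux sees it.
    key = f"{seed}|{pkt['src']}|{pkt['dst']}|{pkt['proto']}|{pkt['sport']}|{pkt['dport']}"
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

def select_dip(pkt, vip_map, flow_table):
    tup = (pkt["src"], pkt["dst"], pkt["proto"], pkt["sport"], pkt["dport"])
    if tup in flow_table:              # stateful entry: keep the first decision
        return flow_table[tup]
    dips = vip_map[(pkt["dst"], pkt["proto"], pkt["dport"])]
    dip = dips[five_tuple_hash(pkt) % len(dips)]
    flow_table[tup] = dip              # later packets stick to this DIP, even
    return dip                         # if the DIP list changes afterwards
```

Because every Mux computes the same hash over the same mapping table, a connection's first packet lands on the same DIP no matter which Mux ECMP picks; the flow table only becomes essential once the DIP list changes mid-connection.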
Every non-SYN TCP packet, and every packet for connection-less protocols, is matched against this flow table first, and if a match is found it is forwarded to the DIP from the flow table. This ensures that once a connection is directed to a DIP, it continues to go to that DIP despite changes in the list of DIPs in the mapping entry. If there is no match in the flow table, the packet is treated as a first packet of the connection.

Given that Muxes maintain per-flow state, it makes them vulnerable to state exhaustion attacks such as the SYN-flood attack. To counter this type of abuse, Mux classifies flows into trusted flows and untrusted flows. A trusted flow is one for which the Mux has seen more than one packet. These flows have a longer idle timeout. Untrusted flows are the ones for which the Mux has seen only one packet. These flows have a much shorter idle timeout. Trusted and untrusted flows are maintained in two separate queues and they have different memory quotas as well. Once a Mux has exhausted its memory quota, it stops creating new flow state and falls back to lookup in the mapping entry. This allows even an overloaded Mux to maintain VIP availability with a slightly degraded service.

3.4.2 Source NAT for Outbound Connections

For outbound connections, the Host Agent does the following. It holds the first packet of a flow in a queue and sends a message to Ananta Manager requesting a VIP and port for this connection. Ananta Manager responds with a VIP and port, and the Host Agent NATs all pending connections to different destinations using this VIP and port. Any new connections to different destinations (remote address, remote port) can also reuse the same port, as the TCP five-tuple will still be unique. We call this technique port reuse. AM may return multiple ports in response to a single request. The HA uses these ports for any subsequent connections. Any unused ports are returned back after a configurable idle timeout. If the HA keeps getting new connections, these ports are never returned back to AM; however, AM may force the HA to release them at any time. Based on our production workload, we have made a number of optimizations to minimize the number of SNAT requests a Host Agent needs to send, including preallocation of ports. These and other optimizations are discussed later in this paper.

3.4.3 DIP Health Monitoring

Ananta is responsible for monitoring the health of DIPs that are
behind each VIP endpoint and taking unhealthy DIPs out of rotation. DIP health check rules are specified as part of the VIP Configuration.

3.3.4 Handling Mux Pool Changes

When a Mux in a Mux Pool goes down, routers take it out of rotation once the BGP hold timer expires (we typically set the hold timer to 30 seconds). When any change to the number of Muxes takes place, ongoing connections will get redistributed among the currently live Muxes based on the router's ECMP implementation. When this happens, connections that relied on the flow state on another Mux may now get misdirected to a wrong DIP if there has been a change in the mapping entry since the connection started. We have designed a mechanism to deal with this by replicating flow state on two Muxes using a DHT. The description of that design is outside the scope of this paper, as we have chosen not to implement this mechanism yet in favor of reduced complexity and maintaining low latency. In addition, we have found that clients easily deal with occasional connectivity disruptions by retrying connections, which happen for various other reasons as well.

On first look, it would seem natural to run health monitoring on the Mux nodes so that health monitoring traffic would take the exact same network path as the actual data traffic. However, it would put additional load on each Mux, could result in a different health state on each Mux, and would incur additional monitoring load on the DIPs, as the number of Muxes can be large. Guided by our principle of offloading to end systems, we chose to implement health monitoring on the Host Agents. A Host Agent monitors the health of local VMs and communicates any changes in health to AM, which then relays these messages to all Muxes in the Mux Pool. Perhaps surprising to some readers, running health monitoring on the host makes it easy to protect monitoring endpoints against unwarranted traffic — an agent in the guest VM learns the host VM's IP address via DHCP and configures a firewall rule to allow monitoring traffic only from the host. Since a VM's host address does not change
(we don't do live VM migration), migration of a VIP from one instance of Ananta to another or scaling the number of Muxes does not require reconfiguration inside guest VMs. We believe that these benefits justify this design choice. Furthermore, in a fully managed cloud environment such as ours, out-of-band monitoring can detect network partitions where the HA considers a VM healthy but some Muxes are unable to communicate with its DIPs, raise an alert and even take corrective actions.

3.4 Host Agent

A differentiating component of the Ananta architecture is an agent, called the Host Agent, which is present on the host partition of every physical machine that is served by Ananta. The Host Agent is the key to achieving DSR and SNAT across layer-2 domains. Furthermore, the Host Agent enables data plane scale by implementing Fastpath and NAT, and control plane scale by implementing VM health monitoring.

3.4.1 NAT for Inbound Connections

For load balanced connections, the Host Agent performs stateful layer-4 NAT for all connections. As encapsulated packets arrive at the Host Agent, it decapsulates them and then performs a NAT as per the NAT rules configured by Ananta Manager. The NAT rules describe rewrite rules of the type (VIP, protocol_v, port_v) ⇒ (DIP, protocol_v, port_d). In this case, the Host Agent identifies packets that are destined to (VIP, protocol_v, port_v), rewrites the destination address and port to (DIP, port_d) and creates bi-directional flow state that is used by subsequent packets of this connection. When a return packet for this connection is received, it does a reverse NAT based on the flow state and sends the packet to the source directly through the router, bypassing the Muxes.

the number of VMs. Furthermore, there are limits on the number of ports allocated and the rate of allocations allowed for any given VM.

3.6.2 Packet Rate Fairness

Mux tries to ensure fairness among VIPs by allocating the available bandwidth among all active flows. If a flow attempts to steal more than its fair share of bandwidth, Mux starts to drop its packets with

3.5 Ananta Manager

The Ananta Manager (AM) implements the control plane of Ananta. It exposes an API to configure VIPs for load balancing and SNAT. Based on the VIP Configuration, it configures the Host Agents and Mux Pools and monitors for any changes in DIP health. Ananta Manager is also responsible for keeping track of the health of Muxes and Hosts and taking appropriate actions. AM achieves high availability using the Paxos [14] distributed consensus protocol. Each instance of Ananta runs five replicas that are placed to avoid correlated failures. Three replicas need to be available at any given time to make forward progress. The AM uses Paxos to elect a primary, which is responsible for performing all configuration and state management tasks.
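The replica numbers here are the standard majority-quorum arithmetic of Paxos (our illustration, not Ananta code):

```python
def majority_quorum(n_replicas: int) -> int:
    # The smallest group guaranteed to intersect any other majority;
    # Paxos makes forward progress as long as a majority is available.
    return n_replicas // 2 + 1

# With Ananta Manager's five replicas, three must be up to make forward
# progress, i.e. the control plane tolerates two replica failures.
assert majority_quorum(5) == 3
```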
WenowlookatsomekeyAMfunctions aprobabilitydirectlyproportionaltotheexcessbandwidthitisus- indetail. ing. WhilebandwidthfairnessworksforTCPflowsthataresend- inglarge-sizedpackets, itdoesnotworkforshortpacketsspread 3.5.1 SNATPortManagement acrossflows, orflowsthatarenon-TCP(e.g., UDP)orflowsthat aremalicious(e.g.,aDDoSattack). Akeycharacteristicofthese AMalsodoesportallocationforSNAT(§3.2.3). WhenanHA flows is that they do not back off in response to packet drops, in makesanewportrequestonbehalfofaDIP,AMallocatesafree factwesometimesseetheexactoppositereaction. Sincedropping portfortheVIP,replicatestheallocationtootherAMreplicas,cre- packetsattheMuxisnotgoingtohelpandincreasesthechances atesastatelessVIPmapentrymappingtheporttotherequesting ofoverload,ourprimaryapproachhasbeentobuildarobustdetec- DIP,configurestheentryontheMuxPoolandthensendstheallo- tionmechanismforoverloadduetopacketrate. EachMuxkeeps cationtotheHA.TherearetwomainchallengesinservingSNAT trackofitstop-talkers–VIPswiththehighestrateofpackets.Mux requests — latency and availability. Since SNAT request is done continuouslymonitorsitsownnetworkinterfacesandonceitde- onthefirstpacketofaconnection,adelayinservingSNATrequest tectsthatthereispacketdropduetooverload,itinformsAMabout directlytranslatesintolatencyseenbyapplications.Similarly,AM the overload and the top talkers. AM then identifies the topmost downtimewouldresultinfailureofoutboundconnectionsforappli- top-talkerasthevictimofoverloadandwithdrawsthatVIPfrom cationsresultingincompleteoutageforsomeapplications.Ananta allMuxes,therebycreatingablackholefortheVIP.Thisensures employs a number of techniques to reduce latency and increase thatthereisminimalcollateraldamageduetoMuxoverload. De- availability. First, it allocates a contiguous port range instead of pending on the policy for the VIP, we then route it through DoS allocatingoneportatatime. 
Byusingfixedsizedportranges,we protectionservices(thedetailsareoutsidethescopeofthispaper) optimizestorageandmemoryrequirementsonboththeAMandthe and enable it back on Ananta. We evaluate the effectiveness of Mux.OntheMuxdriver,onlythestartportofarangeisconfigured overloaddetectionandroutewithdrawalin§5. andbymakingtheportrangesizeapowerof2,wecanefficiently maparangeofportstoaspecificDIP.Second,itpreallocatesaset 3.7 DesignAlternatives of port rangesto DIPs when it firstreceives a VIP configuration. Third,ittriestododemandpredictionandallocatesmultipleport 3.7.1 DNS-basedScaleOut rangesinasinglerequest. Weevaluatetheeffectivenessofthese techniquesin§5. AnantausesBGPtoachievescaleoutamongmultipleactivein- stancesofMux. Atraditionalapproachtoscalingoutmiddlebox 3.6 TenantIsolation functionalityisviaDNS.Eachinstanceofthemiddleboxdevice, e.g., load balancer, is assigned a public IP address. The author- Ananta is a multi-tenant load balancer and hence tenant isola- itative DNS server is then used to distribute load among IP ad- tionisanessentialrequirement. Thegoaloftenantisolationisto dresses of the instances using an algorithm like weighted round- ensurethattheQualityofService(QoS)receivedbyonetenantis robin. Whenaninstancegoesdown,theDNSserverstopsgiving independent of other tenants in the system. However, Ananta is outitsIPaddress. Thisapproachhasseverallimitations. First,it an oversubscribed resource and we do not guarantee a minimum ishardertogetgooddistributionofloadbecauseitishardtopre- bandwidthorlatencyQoS.Assuch,weinterpretthisrequirement dicthowmuchloadcanbegeneratedviaasingleDNSresolution asfollows–thetotalCPU,memoryandbandwidthresourcesare request. Forexample,loadfromlargeclientssuchasamegaproxy dividedamongalltenantsbasedontheirweights. Theweightsare is always sent to a single server. Second, it takes longer to take directlyproportionaltothenumberofVMsallocatedtothetenant. 
unhealthy middlebox nodes out of rotation due to DNS caching Wemakeanothersimplifyingassumptionthattraffic forallVIPs –manylocalDNSresolversandclientsviolateDNSTTLs. And is distributed equally among all Muxes, therefore, each Mux can third,itcannotbeusedforscaleoutofstatefulmiddleboxes,such independentlyimplementtenantisolation.Wefindthisassumption asaNAT. toholdtrueinourenvironment.Ifthisassumptionweretonothold trueinthefuture,aglobalmonitorcoulddynamicallyinformeach 3.7.2 OpenFlow-basedLoadBalancing Muxtoassigndifferentweightstodifferenttenants. Memoryfair- An alternative to implementing Mux functionality in general- ness is implemented at Muxes and AM by keeping track of flow purposeserversistouseOpenFlow-capableswitches[16]. How- stateandenforcinglimits. ever, currently available OpenFlow devices have insufficient sup- portforgeneral-purposelayer-4loadbalancing. Forexample,ex- 3.6.1 SNATFairness istingOpenFlowswitchesonlysupport2000to4000flows,whereas AM is a critical resource as it handles all SNAT port requests. Mux needs to maintain state for millions of flows. Another key ExcessiverequestsfromonetenantshouldnotslowdowntheSNAT primitive lacking in existing OpenFlow hardware is tenant isola- responsetimeofanothertenant.AMensuresthisbyprocessingre- tion.Finally,inapureOpenFlow-basednetwork,wewillalsoneed questsinafirst-come-first-serve(FCFS)orderanditensuresthatat toreplaceBGPwithcentralizedroutinginAnantaManager,which anygiventimethereisatmostoneoutstandingrequestfromaDIP. hascertaindrawbacksasdiscussedin§6. Otherresearchershave If a new request isreceived while another request from the same alsoattemptedtobuildloadbalancersusingOpenFlowswitches[28], DIPispending, thenewrequestisdropped. Thissimplemecha- however,thesesolutionsdonotyetmeetalltherequirements,e.g., nismensuresthatVIPsgetSNATallocationtimeinproportionto tenantisolationandscalablesourceNATacrosslayer-2domains. 213 Figure 10: Staged event-driven (SEDA) Ananta Manager. 
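The power-of-two port-range trick from §3.5.1 can be sketched as follows. This is an illustrative Python sketch, not Ananta's actual code; the range size of 8 matches the eight-port allocations described later, while the port bounds and class names are invented for the example.

```python
# Sketch of the power-of-2 port-range idea: AM hands out aligned
# fixed-size ranges, and the Mux maps any port back to its DIP by
# masking off the low bits, so only the start port of each range
# needs to be stored in the VIP map.
RANGE_SIZE = 8  # must be a power of 2

class SnatPortAllocator:
    """Allocates aligned, fixed-size SNAT port ranges for one VIP."""
    def __init__(self, low=1024, high=65536):
        # Free list of range start ports, all aligned to RANGE_SIZE.
        self.free_starts = list(range(low, high, RANGE_SIZE))
        self.range_to_dip = {}  # range start port -> DIP address

    def allocate_range(self, dip):
        start = self.free_starts.pop(0)
        self.range_to_dip[start] = dip
        return start  # HA may use ports start .. start+RANGE_SIZE-1

    def dip_for_port(self, port):
        # Mux-side lookup: clear the low bits to find the range start.
        return self.range_to_dip.get(port & ~(RANGE_SIZE - 1))

alloc = SnatPortAllocator()
start = alloc.allocate_range("10.0.0.5")
# Every port in the range resolves to the DIP via a single mask+lookup.
assert all(alloc.dip_for_port(p) == "10.0.0.5"
           for p in range(start, start + RANGE_SIZE))
```

Because the range size is a power of two, the per-packet lookup is one bitwise mask plus one hash-table probe, and the stateless VIP map entries stay small regardless of how many ports a DIP holds.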
Ananta Manager shares the same thread pool across multiple stages and supports priority event queues to maintain responsiveness of VIP configuration operations under overload.

4. IMPLEMENTATION

We implemented all three Ananta components from Figure 5. Ananta Manager and Mux are deployed as a tenant of our cloud platform itself. The Host Agent is deployed on every host in our cloud and is updated whenever we update the Ananta tenant. In a typical deployment, five replicas of AM manage a single Mux Pool. Most Mux Pools have eight Muxes in them, but the number can be based on load.

Ananta Manager: AM performs various time-critical tasks: configuration of VIPs, allocation of ports, and coordination of DIP health across Muxes. Therefore, its responsiveness is very critical. To achieve a high degree of concurrency, we implemented AM using a lock-free architecture that is somewhat similar to SEDA [29]. As shown in Figure 10, AM is divided into the following stages: VIP validation, VIP configuration, Route Management, SNAT Management, Host Agent Management and Mux Pool Management. The Ananta implementation makes two key enhancements to SEDA. First, in Ananta, multiple stages share the same thread pool. This allows us to limit the total number of threads used by the system. Second, Ananta supports multiple priority queues for each stage. This is useful in maintaining responsiveness during overload conditions. For example, SNAT events take lower priority than VIP configuration events. This allows Ananta to finish VIP configuration tasks even when it is under heavy load due to SNAT requests.

Ananta maintains high availability using Paxos [14]. We took an existing implementation of Paxos and added discovery and health monitoring using the SDK of our cloud platform. The SDK notifies AM of addition, removal or migration of any replicas of the Paxos cluster. This allows us to automatically reconfigure the cluster as instances are added or removed dynamically. These features of the platform SDK result in significantly reduced operational overhead for Ananta. The platform also provides a guarantee that no more than one instance of the AM role is brought down for OS or application upgrade. This notion of instance-by-instance update maintains availability during upgrades. Ananta uses Paxos to elect a primary, and only the primary does all the work.

Mux: Mux has two main components: a kernel-mode driver and a user-mode BGP [20] speaker. The kernel-mode driver is implemented using the Windows Filtering Platform (WFP) [30] driver model. The driver intercepts packets at the IP layer, encapsulates them, and then sends them using the built-in forwarding function of the OS. Delegating routing to the OS stack has allowed Mux code to remain simple, especially when adding support for IPv6, as it does not need to deal with IP fragmentation and next-hop selection. The driver scales to multiple cores using receive side scaling (RSS) [22]. Since the Mux driver only encapsulates the packet with a new IP header and leaves the inner IP header and its payload intact, it does not need to recalculate the TCP checksum and hence does not need any sender-side NIC offloads. For IPv4, each Mux can hold 20,000 load balanced endpoints and 1.6 million SNAT ports in its VIP Map with 1GB of memory. Each Mux can maintain state for millions of connections and is only limited by available memory on the server. CPU performance characteristics of the Mux are discussed in §5.2.3.

Host Agent: The Host Agent also has a driver component that runs as an extension of the Windows Hyper-V hypervisor's virtual switch. Being in the hypervisor enables us to support unmodified VMs running different operating systems. For tenants running with native (non-VM) configuration, we run a stripped-down version of the virtual switch inside the host networking stack. This enables us to reuse the same code base for both native and VM scenarios.

Upgrading Ananta: Upgrading Ananta is a complex process that takes place in three phases in order to maintain backwards-compatibility between various components. First, we update instances of the Ananta Manager, one at a time. During this phase, AM also adapts its persistent state from the previous schema version to the new version. Schema rollback is currently not supported. Second, we upgrade the Muxes; and third, the Host Agents.

5. MEASUREMENTS

In this section we first present a few micro-benchmarks to evaluate the effectiveness and limitations of some of our design choices and implementation, and then some data from real world deployments.

5.1 Micro-benchmarks

5.1.1 Fastpath

To measure the effectiveness of Fastpath and its impact on host CPU, we conducted the following experiment. We deployed a 20 VM tenant as the server and two 10 VM tenants as clients. All the client VMs create up to ten connections to the server and upload 1MB of data per connection. We recorded the CPU usage at the host nodes and one Mux. The results are shown in Figure 11.

Figure 11: CPU usage at Mux and Hosts with and without Fastpath. Once Fastpath is turned on, the hosts take over the encapsulation function from Mux. This results in lower CPU at Mux and a CPU increase at every host doing encapsulation.

We found the host CPU usage to be uniform across all hosts, so we only show the median CPU observed at a representative host. As expected, as soon as Fastpath is turned on, no new data transfers happen through the Mux. It only handles the first two packets of any new connection. Once the Mux is out of the way, it also stops being a bottleneck for data transfer, and VMs can exchange data at the speed allowed by the underlying network.

5.1.2 Tenant Isolation

Tenant isolation is a fundamental requirement of any multi-tenant service. In the following we present two different experiments that show Ananta's ability to isolate inbound packet and outbound SNAT abuse. For these experiments, we deployed five different tenants, each with ten virtual machines, on Ananta.

Figure 12: SYN-flood Attack Mitigation. Duration of impact shows the time Ananta takes to detect and black-hole traffic to the victim VIP on all Muxes.

Figure 13: Impact of heavy SNAT user H on a normal user N. Heavy user H sees higher latency and higher SYN retransmits. Normal user N's performance remains unchanged.

SYN-flood Isolation: To measure how quickly Ananta can isolate a VIP under a SYN-flood attack, we ran the following experiment (other packet rate based attacks, such as a UDP-flood, would show similar results.)
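The two SEDA enhancements described for Ananta Manager, a single thread pool shared by all stages plus priority queues so VIP configuration outranks SNAT work, can be sketched as follows. This is a minimal Python analogue, not the AM's actual implementation; the stage names, priority values, and shutdown sentinel are invented for illustration.

```python
# Sketch of a shared-pool, priority-queued event loop: all stages
# feed one PriorityQueue, and worker threads drain it in priority
# order, so VIP-configuration events jump ahead of queued SNAT
# events under overload.
import queue
import threading

VIP_CONFIG, SNAT = 0, 1          # lower value = higher priority
events = queue.PriorityQueue()   # single queue shared by all stages
processed = []
_seq = 0

def submit(priority, stage, payload):
    """Enqueue an event; _seq keeps equal-priority events FIFO."""
    global _seq
    _seq += 1
    events.put((priority, _seq, stage, payload))

def worker():
    """One thread of the shared pool; a real AM would run several."""
    while True:
        priority, _, stage, payload = events.get()
        if stage == "stop":      # invented shutdown sentinel
            break
        processed.append((stage, payload))  # "run" the stage handler

# Overload scenario: several SNAT requests are queued before a
# VIP configuration change arrives.
for i in range(3):
    submit(SNAT, "snat-allocation", f"dip-{i}")
submit(VIP_CONFIG, "vip-configuration", "tenant-A")
submit(2, "stop", None)

t = threading.Thread(target=worker)
t.start()
t.join()
# The VIP configuration event is processed before the earlier
# SNAT events, despite being submitted last.
assert processed[0] == ("vip-configuration", "tenant-A")
```

Sharing one pool bounds total thread count across stages, and the per-event priority is what lets configuration tasks complete even when the system is saturated with SNAT requests.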
We load Ananta Muxes with a baseline load and launch a SYN-flood attack using spoofed source IP addresses on one of the VIPs. We then measure whether there is any connection loss observed by clients of the other tenants. Figure 12 shows the maximum duration of impact observed over ten trials. As seen in the chart, Mux can detect and isolate an abusive VIP within 120 seconds when it is running under no load, the minimum time being 20 seconds. However, under moderate to heavy load it takes longer to detect an attack, as it gets harder to distinguish between legitimate and attack traffic. We are working on improving our DoS detection algorithms to overcome this limitation.

SNAT Performance Isolation: SNAT port allocation at Ananta Manager could be a subject of abuse by some tenants. It is possible that a tenant makes a lot of SNAT connections, causing impact on other tenants. To measure the effectiveness of per-VM SNAT isolation at AM, we conducted the following experiment. A set of normal-use tenants (N) make outbound connections at a steady rate of 150 connections per minute, whereas a heavy SNAT user (H) keeps increasing its SNAT requests. We measure the rate of SYN retransmits and the SNAT response time of Ananta at the corresponding HAs. Figure 13 shows the aggregate results over multiple trials. As seen in the figure, the normal tenants' connections keep succeeding at a steady rate without any SYN loss, and their SNAT port requests are satisfied within 55ms. The heavy user, on the other hand, starts to see SYN retransmits because Ananta delays its SNAT requests in favor of N. This shows that Ananta rewards good behavior by providing faster SNAT response time.

5.1.3 SNAT Optimizations

A key design choice Ananta made is to allocate SNAT ports in AM and then replicate the port allocations to all Muxes and other AM replicas, to enable scale out and high availability respectively. Since port allocation takes place for the first packet of a connection, it can add significant latency to short connections. Ananta overcomes these limitations by implementing three optimizations as mentioned in §3.5.1. To measure the impact of these optimizations, we conducted the following experiment. A client continuously makes outbound TCP connections via SNAT to a remote service and records the connection establishment time. The resulting data is partitioned into buckets of 25ms. The minimum connection establishment time to the remote service (without SNAT) is 75ms. Figure 14 shows connection establishment times for the following two optimizations when there is no other load on the system.

Figure 14: Connection establishment time experienced by outbound connections with and without port demand prediction.

Single Port Range: In response to a port request, AM allocates eight contiguous ports instead of a single port and returns them to the requesting HA. The HA uses these ports for any pending and subsequent connections, and keeps any unused ports until a preconfigured timeout before returning them to AM. With this optimization, only one in every eight outbound connections ever results in a request to AM. This is evident from Figure 14: 88% of connections succeed in the minimum possible time of 75ms. The remaining 12% of connections take longer due to the round-trip to AM. Without the port range optimization, every new connection request that cannot be satisfied using already-allocated ports would make a round-trip to AM.

Demand Prediction: When this optimization is turned on, AM attempts to predict the port demand of a DIP based on its recent history. If a DIP requests new ports within a specified interval from its previous requests, AM allocates and returns multiple eight-port port ranges instead of just one. As shown in Figure 14, with this optimization 96% of connections are satisfied locally and don't need to make a round-trip to AM. Furthermore, since AM handles fewer requests, its response time is also better than when allocating a single port range for each request.

Figure 15: CDF of SNAT response latency for the 1% of requests handled by Ananta Manager.

Figure 17: Distribution of VIP configuration time over a 24-hr period.
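The interaction between the two optimizations above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not Ananta's code: the one-second "recent history" window, the four-range burst allocation, and the stub class names are all invented for the example; only the eight-port range size comes from the text.

```python
# Sketch of HA-side port caching plus AM-side demand prediction:
# the HA goes to AM only on a cache miss, and AM returns several
# 8-port ranges when a DIP's requests arrive in quick succession.
import time

RANGE_SIZE = 8
BURST_INTERVAL = 1.0   # seconds; "specified interval" (assumed value)

class AnantaManagerStub:
    """Hands out aligned port ranges; predicts per-DIP demand."""
    def __init__(self):
        self.next_start = 1024
        self.last_request = {}  # dip -> timestamp of previous request

    def request_ports(self, dip, now=None):
        now = time.monotonic() if now is None else now
        # Demand prediction: a quick repeat request gets 4 ranges
        # (the multiplier is invented for illustration).
        recent = now - self.last_request.get(dip, float("-inf")) < BURST_INTERVAL
        self.last_request[dip] = now
        n_ranges = 4 if recent else 1
        starts = [self.next_start + i * RANGE_SIZE for i in range(n_ranges)]
        self.next_start += n_ranges * RANGE_SIZE
        return [s + j for s in starts for j in range(RANGE_SIZE)]

class HostAgentStub:
    """Serves SNAT ports locally; contacts AM only on a cache miss."""
    def __init__(self, am, dip):
        self.am, self.dip, self.cache = am, dip, []
        self.am_round_trips = 0

    def get_port(self, now=None):
        if not self.cache:
            self.cache = self.am.request_ports(self.dip, now)
            self.am_round_trips += 1
        return self.cache.pop(0)

am = AnantaManagerStub()
ha = HostAgentStub(am, "dip-0")
ports = [ha.get_port(now=float(i) * 0.01) for i in range(40)]
# The first miss fetches one range (8 ports); the next miss is
# "recent", so it fetches 4 ranges (32 ports): 40 connections are
# served with only 2 round-trips to AM.
assert ha.am_round_trips == 2
```

This mirrors the measured effect in Figure 14: most connections are satisfied from the local cache in the minimum time, and only cache misses pay the AM round-trip.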
5.2 Real World Data

Several instances of Ananta have been deployed in a large public cloud with a combined capacity exceeding 1Tbps. For over two years, it has been serving the Internet and intra-datacenter traffic needs of a very diverse set of tenants, including the blob, table and queue storage services. Here we look at some data from these real world deployments.

5.2.1 SNAT Response Latency

Based on production data, Ananta serves 99% of the SNAT requests locally by leveraging port reuse and SNAT preallocation as described above. The remaining requests, for tenants initiating a lot of outbound requests to a few remote destinations, require SNAT port allocation to happen at the AM. Figure 15 shows the distribution of latency incurred for SNAT port allocation over a 24 hour window in a production data center for the requests that go to AM. 10% of the responses are within 50ms, 70% within 200ms and 99% within 2 seconds. This implies that in the worst case, one in every 10,000 connections suffers a SYN latency of 2 seconds or higher; however, very few scenarios require that number of connections to the same destination.

5.2.2 Availability

As part of ongoing monitoring for our cloud platform, we have multiple test tenants deployed in each data center. A monitoring service connects to the VIP of every test tenant from multiple geographic locations and fetches a web page once every five minutes. Figure 16 shows the average availability of test tenants in seven different data centers. If the availability was less than 100% for any five minute interval, it makes up a point in the graph; all the other intervals had 100% availability. The average availability over this period across all test tenants was 99.95%, with a minimum of 99.92% for one tenant and greater than 99.99% for two tenants. Five of the low availability conditions between Jan 21 and Jan 26 happened due to Mux overload. The Mux overload events were primarily caused by SYN-flood attacks on some tenants that are not protected by our DoS protection service. Two availability drops were due to wide-area network issues, while the rest were false positives due to updates of the test tenants.

Figure 16: Availability of test tenants in seven different data centers over one month.

5.2.3 Scale

For the control plane, Ananta ensures that VIP configuration tasks are not blocked behind other tasks. In a public cloud environment, VIP configuration tasks happen at a rapid rate as customers add, delete or scale their services. Therefore, VIP configuration time is an important metric. Since Ananta needs to configure HAs and Muxes during each VIP configuration change, its programming time could be delayed due to slow HAs or Muxes. Figure 17 shows the distribution of time taken by seven instances of Ananta to complete VIP configuration tasks over a 24-hr period. The median configuration time was 75ms, while the maximum was 200 seconds. These times vary based on the size of the tenant and the current health of Muxes. These times fit within our API SLA for VIP configuration tasks.

For data plane scale, Ananta relies on ECMP at the routers to spread load across Muxes and RSS at the NIC to spread load across multiple CPU cores. This design implies that the total upload throughput achieved by a single flow is limited by what the Mux can achieve using a single CPU core. On our current production hardware, the Mux can achieve 800Mbps throughput and 220Kpps using a single x64 2.4GHz core. However, Muxes can achieve much higher aggregate throughput across multiple flows. In our production deployment, we have been able to achieve more than 100Gbps sustained upload throughput for a single VIP. Figure 18 shows bandwidth and CPU usage seen over a typical 24-hr period by 14 Muxes deployed in one instance of Ananta. Here each Mux is running on a 12-core 2.4GHz Intel Xeon CPU. This instance of Ananta is serving 12 VIPs for the blob and table storage service. As seen in the figure, ECMP is load balancing flows quite evenly across Muxes. Each Mux is able to achieve 2.4Gbps throughput (for a total of 33.6Gbps) using about 25% of its total processing power.

6. OPERATIONAL EXPERIENCE

We have been offering layer-4 load balancing as part of our cloud platform for the last three years. The inability of hardware load balancers to deal with DoS attacks, network elasticity needs, the grow-