Actor-Critic Algorithms for Learning Nash Equilibria in N-player General-Sum Games

Prasad H.L.*1, Prashanth L.A.†2 and Shalabh Bhatnagar‡3

1 Astrome Technologies Pvt Ltd, Bangalore, INDIA
2 Institute for Systems Research, University of Maryland, College Park, US
3 Department of Computer Science and Automation, Indian Institute of Science, Bangalore, INDIA
*[email protected]  †[email protected]  ‡[email protected]

arXiv:1401.2086v2 [cs.GT] 2 Jul 2015

Abstract

We consider the problem of finding stationary Nash equilibria (NE) in a finite discounted general-sum stochastic game. We first generalize a non-linear optimization problem from Filar and Vrieze [2004] to an N-player setting and break down this problem into simpler sub-problems that ensure there is no Bellman error for a given state and agent. We then provide a characterization of solution points of these sub-problems that correspond to Nash equilibria of the underlying game and, for this purpose, we derive a set of necessary and sufficient SG-SP (Stochastic Game - Sub-Problem) conditions. Using these conditions, we develop two actor-critic algorithms: OFF-SGSP (model-based) and ON-SGSP (model-free). Both algorithms use a critic that estimates the value function for a fixed policy and an actor that performs descent in the policy space using a descent direction that avoids local minima. We establish that both algorithms converge, in self-play, to the equilibria of a certain ordinary differential equation (ODE), whose stable limit points coincide with stationary NE of the underlying general-sum stochastic game. On a single state non-generic game (see Hart and Mas-Colell [2005]) as well as on a synthetic two-player game setup with 810,000 states, we establish that ON-SGSP consistently outperforms the NashQ [Hu and Wellman, 2003] and FFQ [Littman, 2001] algorithms.

Keywords: General-sum discounted stochastic games, Nash equilibrium, multi-agent reinforcement learning, multi-timescale stochastic approximation.

1 Introduction

Traditional game theoretic developments have been for single-shot games where all agents participate, perform their preferred actions, receive rewards and the game is over. However, several emerging applications involve multiple stages of action or, most often, an explicit notion of time. One intermediate class of games that handles multiple stages of decisions is that of repeated games. However, repeated games do not provide for characterizing the influence of decisions made in one stage on future stages. Markov chains or Markov processes are a popular and widely applicable class of random processes that are used for modeling practical systems. Markov chains allow system designers to model the states of a given system and then model its time behavior by connecting those states via suitable transition probabilities from the current state to the next. A popular extension of Markov chains that is used for modeling optimal control scenarios is the class of Markov Decision Processes (MDPs). Here, in a given state, an action is allowed to be selected from a set of available actions. Based on the choice of the action, a suitable reward/cost is obtained/incurred. Also, the selected action influences the probabilities of transition from one state to another.

Shapley [1953] merged these concepts of MDPs (or basically Markov behavior) and games to come up with a new class of games called stochastic games. In a stochastic game, all participating agents select their actions, each of which influences the rewards received by all agents as well as the transition probabilities. Since the inception of stochastic games by Shapley [1953], they have been an important class of models for multi-agent systems.

[Figure 1: Multi-agent RL setting. The environment receives the joint action a = ⟨a^1, a^2, ..., a^N⟩ from agents 1, 2, ..., N and returns the reward vector r = ⟨r^1, r^2, ..., r^N⟩ and the next state y.]
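The loop in Figure 1 is the standard simultaneous-move interface. As a minimal illustrative sketch (not code from the paper), the snippet below plays one stage of a finite N-player stochastic game; the dictionaries P and R (keyed by state and joint action) and the list of per-agent policies are hypothetical placeholders for p(y|x,a), r(x,a) and the randomized strategies formalized later in Section 3.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_joint_action(policies, x):
        # policies[i][x] is agent i's probability vector over its own actions in state x;
        # every agent samples independently and simultaneously.
        return tuple(int(rng.choice(len(pi_i[x]), p=pi_i[x])) for pi_i in policies)

    def play_stage(P, R, x, a):
        # P[(x, a)] is the distribution p(.|x, a) over next states (a NumPy array),
        # R[(x, a)] is the reward vector <r^1(x, a), ..., r^N(x, a)>.
        y = int(rng.choice(len(P[(x, a)]), p=P[(x, a)]))
        return R[(x, a)], y

    # One stage: a = sample_joint_action(policies, x); r, y = play_stage(P, R, x, a).
    # Afterwards every agent observes the joint action a, the reward vector r and y.

After each stage, every agent sees a, r and y but not the other agents' policies, which is exactly the informational assumption made in the setting described next.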
A comprehensive treatment of stochastic games under various pay-off criteria is given by Filar and Vrieze [2004]. Many problems like fishery games, advertisement games and several oligopolistic situations can be modelled as stochastic games (cf. Breton [1986], Filar and Vrieze [2004], Olley [1992], Pakes [1998], Pakes [2001], Bergemann [1996]).

We consider a finite stochastic game (also referred to as a Markov game, cf. Littman [2001]) setting that evolves over discrete time instants. As illustrated in Fig. 1, at each stage and in any given state x ∈ X, all agents act simultaneously with an action vector a ∈ A(x), resulting in a transition to the next state y ∈ X according to the transition probability p(y|x, a), as well as a reward vector r(x, a). No agent gets to know what the other agents' actions are before selecting its own action, and the reward r^i(x, a) obtained by any agent i in each stage depends on both the system state x (common to all agents) and the aggregate action a (which includes the other agents' actions). Each individual agent's sole objective is maximization of his/her value function, i.e., the expected discounted sum of rewards. However, the transition dynamics as well as the rewards depend on the actions of all agents and hence the dynamics of the game is coupled and not independent. We assume that r(x, a) and the action vector a are made known to all agents after every agent i has picked his/her action a^i; this is the canonical model-free setting (while the ON-SGSP algorithm that we propose is for this setting, we also propose another algorithm, OFF-SGSP, that is model-based). However, we do not assume that each agent knows the other agents' policies, i.e., the distributions from which the actions are picked.

The central concept of stability in a stochastic game is that of a Nash equilibrium. At a Nash equilibrium point (with a corresponding Nash strategy), each agent plays a best-response strategy assuming all the other agents play their equilibrium strategies (see Definition 2 for a precise statement). This notion of equilibrium makes perfect sense in a game setting where agents do not have any incentive to unilaterally deviate from their Nash strategies.

Breton [1991] and Filar and Vrieze [2004] establish that finding the stationary NE of a two-player discounted stochastic game is equivalent to solving an optimization problem with a non-linear objective function and linear constraints. We extend this formulation to general N-player stochastic games and observe that this generalization causes the constraints to be non-linear as well. Previous approaches to solving the optimization problem have not been able to guarantee convergence to global minima, even for the case of N = 2. In this light, our contribution is significant as we develop an algorithm to find a global minimum for any N ≥ 2. (A preliminary version of this paper, without the proofs, was published in AAMAS 2015; see Prasad et al. [2015]. In comparison to the conference version, this paper includes a more detailed problem formulation, formal proofs of convergence of the two proposed algorithms, some additional experiments and a revised presentation.) We proceed via the following steps:

Step 1 (Sub-problems): We break down the main optimization problem into several sub-problems. Each sub-problem can be seen as ensuring no Bellman error for a particular state x ∈ X and agent i ∈ {1, ..., N}, where X is the state space and N is the number of agents of the stochastic game considered.

Step 2 (Solution points): We provide a characterization of solution points that correspond to Nash equilibria of the underlying game. As a result, we also derive a set of necessary and sufficient conditions, henceforth referred to as SG-SP (Stochastic Game - Sub-Problem) conditions.

Step 3 (Descent direction): Using the SG-SP conditions, we derive a descent direction that avoids local minima.
This is not a steepest descent direction, but a carefully chosen descent direction specific to this optimization problem, which ensures convergence only to points of global minima that correspond to SG-SP points (and hence Nash strategies).

Step 4 (Actor-critic algorithms): We propose algorithms that incorporate the aforementioned descent direction to ensure convergence to stationary NE of the underlying game.

The algorithms that we propose are as follows:

OFF-SGSP. This is an offline, centralized and model-based scheme, i.e., it assumes that the transition structure of the underlying game is known.

ON-SGSP. This is an online, model-free scheme that is decentralized, i.e., learning is localized to each agent, with one instance of ON-SGSP running on each agent. ON-SGSP only requires that other agents' actions and rewards are observed, and not their policies, i.e., maps from states to actions.

We make the assumption that, for all strategies, the resulting Markov chain is irreducible and positive recurrent. This assumption is common to the analysis of previous multi-agent RL algorithms as well (cf. Hu and Wellman [1999], Littman [2001]); for stochastic games with multiple communicating classes of states, or even transient states, a possible work-around is to re-start the game periodically in a random state. To the best of our knowledge, ON-SGSP is the first model-free online algorithm that converges in self-play to stationary NE for any finite discounted general-sum stochastic game where the aforementioned assumption holds.

As suggested by Bowling and Veloso [2001], two desirable properties of any multi-agent learning algorithm are as follows:

(a) Rationality (a term not to be confused with its common interpretation in economics parlance): learn to play optimally when other agents follow stationary strategies; and

(b) Self-play convergence: converge to a Nash equilibrium assuming all agents are using the same learning algorithm.

Our ON-SGSP algorithm can be seen to meet both of the properties mentioned above. However, unlike the repeated game setting of [Bowling and Veloso, 2001, Conitzer and Sandholm, 2007], ON-SGSP solves discounted general-sum stochastic games and possesses theoretical convergence guarantees as well.

As illustrated in Fig. 2, the basic idea in both OFF-SGSP and ON-SGSP is to employ two recursions, referred to as the actor and the critic, respectively. Conceptually, these can be seen as two nested loops that operate as follows:

Critic recursion: This performs policy evaluation, i.e., estimates the value function for a fixed policy. In the model-based setting (i.e., of OFF-SGSP), this corresponds to the well-known dynamic programming procedure of value iteration. On the other hand, in the model-free setting (i.e., of ON-SGSP), the critic is based on temporal difference (TD) learning [Sutton and Barto, 1998].

Actor recursion: This incrementally updates the policy using gradient descent. For this purpose, it uses a descent direction that ensures convergence to a global minimum (and hence NE) of the optimization problem we mentioned earlier.

[Figure 2: Operational flow of our algorithms: the critic (policy evaluation) maps the policy π^i to its value v_{π^i}, which the actor (policy improvement) uses to update the policy.]

Using multi-timescale stochastic approximation (see Chapter 6 of Borkar [2008]), both the recursions above are run simultaneously, albeit with varying step-size parameters, and this mimics the nested two-loop procedure outlined above.
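To make the two-timescale coupling concrete, here is a rough, generic sketch with a plain TD(0) critic and an unspecified descent_direction standing in for the SG-SP-based direction derived later; it is not the OFF-SGSP/ON-SGSP pseudocode, and the environment object env, its methods and the step-size exponents are illustrative assumptions only.

    import numpy as np

    def project_to_simplex(q):
        # Euclidean projection onto the probability simplex, so the policy stays valid.
        u = np.sort(q)[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u * np.arange(1, len(q) + 1) > (css - 1.0))[0][-1]
        theta = (css[rho] - 1.0) / (rho + 1.0)
        return np.maximum(q - theta, 0.0)

    def run_two_timescale(env, v, pi, beta, n_iters, descent_direction):
        # v[i][x]: agent i's value estimate; pi[i][x]: agent i's action distribution in state x.
        x = env.reset()
        for n in range(1, n_iters + 1):
            b_n = 1.0 / n ** 0.6      # faster step-size: critic (value) recursion
            c_n = 1.0 / n             # slower step-size: actor (policy) recursion, c_n/b_n -> 0
            a = env.sample_actions(pi, x)
            r, y = env.step(x, a)     # observe rewards of all agents and the next state
            for i in range(env.N):
                # critic: TD(0) update of the value estimate on the fast timescale
                v[i][x] += b_n * (r[i] + beta * v[i][y] - v[i][x])
                # actor: incremental policy update along a chosen descent direction, then projection
                pi[i][x] = project_to_simplex(pi[i][x] - c_n * descent_direction(i, x, v, pi))
            x = y
        return v, pi

Because b_n decays more slowly than c_n, the value recursion sees a quasi-static policy while the policy recursion sees an almost-converged critic, which is precisely the nested-loop view described above.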
The formal proof of convergence requires considerable sophistication, as we base our approach on the ordinary differential equations (ODE) method for stochastic approximation [Borkar, 2008]. While a few previous papers in the literature have adopted this approach (cf. Akchurina [2009], Weibull [1996]), their results do not usually start with an algorithm that is shown to track an ODE. Instead, an ODE is reached first via analysis and an approximate method is used to solve this ODE. On the other hand, we adopt the former approach and show that both OFF-SGSP and ON-SGSP converge using the following steps:

1. Using two-timescale stochastic approximation, we show that the value and policy updates, on the fast and slow timescales, converge respectively to the limit points of a system of ODEs.

2. Next, we provide a simplified representation for the limiting set of the policy ODE and use this to establish that the asymptotically stable limit points of the policy ODE correspond to SG-SP (and hence Nash) points.

While the first step above uses a well-known result (the Kushner-Clark lemma) for analysing stochastic approximation recursions, the techniques used in the second step are quite different from those used previously. The latter step is crucial in establishing overall convergence, as the strategy π corresponding to each stable limit gives a stationary NE of the underlying general-sum discounted stochastic game.

We demonstrate the practicality of our algorithms on two synthetic two-player setups. The first is a single state non-generic game adopted from Hart and Mas-Colell [2005] that contains two NEs (one pure, the other mixed), while the second is a stick-together game with 810,000 states (to the best of our knowledge, previous works on general-sum games have never considered state spaces of this size). On the first setup, we show that ON-SGSP always converges to NE, while NashQ [Hu and Wellman, 2003] and FFQ [Littman, 2001] do not in a significant number of experimental runs. On the second setup, we show that ON-SGSP outperforms NashQ and FFQ, while exhibiting a relatively quick convergence rate, requiring approximately 21 iterations per state.

Map of the paper. Section 2 reviews relevant previous works in the literature. Section 3 formalizes a general-sum stochastic game and sets the notation used throughout the paper. Section 4 formulates a general non-linear optimization problem for solving stochastic games and also forms sub-problems whose solutions correspond to Nash strategies. Section 5 presents necessary and sufficient SG-SP conditions for Nash equilibria of general-sum stochastic games. Section 6 presents the offline algorithm OFF-SGSP, while Section 7 provides its online counterpart ON-SGSP. Section 8 sketches the proof of convergence for both algorithms. Simulation results for the single state non-generic game and the stick-together game settings are presented in Section 9. Finally, concluding remarks are provided in Section 10.

2 Related Work

Various approaches have been proposed in the literature for computing Nash equilibria of general-sum discounted stochastic games and we discuss some of them below.

Multi-agent RL. Littman [1994] proposed a minimax Q-learning algorithm for two-player zero-sum stochastic games. Hu and Wellman [1999, 2003] extended the Q-learning approach to general-sum games, but their algorithms do not provide meaningful convergence guarantees. Friend-or-foe Q-learning (FFQ) [Littman, 2001] is a further improvement based on Q-learning and with guaranteed convergence. However, FFQ converges to Nash equilibria only in restricted settings (see conditions A and B in [Littman, 2001]). Moreover, the approaches in [Hu and Wellman, 1999, 2003] require computation of Nash equilibria of a bimatrix game, while the approach of Littman [2001] requires solving a linear program, in each round of their algorithms, and this is a computationally expensive operation. In contrast, ON-SGSP does not require any such equilibria computations. Zinkevich et al. [2006] show that the traditional Q-learning based approaches are not sufficient to compute Nash equilibria in general-sum games (we avoid this impossibility result by searching for both values and policies, instead of just values, in our proposed algorithms).
Policy hill climbing. This is a category of previous works that is closely related to the ON-SGSP algorithm that we propose. Important references here include Bowling and Veloso [2001], Bowling [2005], Conitzer and Sandholm [2007] as well as Zhang and Lesser [2010]. All these algorithms are gradient-based, model-free and are proven to converge to NE for stationary opponents in self-play. However, these convergence guarantees are for repeated games only, i.e., the setting is a single state stochastic game, where the objective is to learn the Nash strategy for a stage-game (see Definition 1 in [Conitzer and Sandholm, 2007]) that is repeatedly played. On the other hand, we consider general-sum stochastic games where the objective is to learn the best-response strategy against stationary opponents in order to maximize the value function (which is an infinite horizon discounted sum). Further, we work with a more general state space that is not restricted to be a singleton.

Homotopy. Herings and Peeters [2004] propose an algorithm where a homotopic path between equilibrium points of N independent MDPs and the N-player stochastic game in question is traced numerically. This, in turn, gives a Nash equilibrium point of the stochastic game of interest. This approach is extended by Herings and Peeters [2006] and Borkovsky et al. [2010]. OFF-SGSP shares similarities with the aforementioned homotopic algorithms: (i) both are offline and model-based, as they assume complete information (in particular, the transition dynamics) about the game; (ii) the computational complexity of each iteration of both algorithms grows exponentially with the number of agents N; and (iii) both algorithms are proven to converge to stationary NE, though the approaches adopted are vastly different: OFF-SGSP is a gradient descent algorithm designed to converge to the global minimum of a non-linear program, while the algorithm by Herings and Peeters [2004] involves a tracing procedure to find an equilibrium point.

Linear programming. MacDermed and Isbell [2009] solve stochastic games by formulating intermediate optimization problems, called Multi-Objective Linear Programs (MOLPs). However, the solution concept there is that of correlated equilibria, and Nash points are a strict subset of this class (and hence are harder to find). Also, the complexity of their algorithm scales exponentially with the problem size.

Both the homotopy and linear programming methods proposed by MacDermed and Isbell [2009] and Herings and Peeters [2004] are tractable only for small-sized problems. The computational complexity of these algorithms may render them infeasible on large state games. In contrast, ON-SGSP is a model-free algorithm with a per-iteration complexity that is linear in N, allowing for practical implementations on large state game settings (see Section 9 for one such example with a state space cardinality of 810,000). We however mention that per-iteration complexity alone is not sufficient to quantify the performance of an algorithm; see Remark 7.

Rational learning. A popular algorithm with guaranteed convergence to Nash equilibria in general-sum stochastic games is rational learning, proposed by Kalai and Lehrer [1993]. In their algorithm, each agent i maintains a prior on what it believes to be the other agents' strategy and updates it in a Bayesian manner. Combining this with certain assumptions of absolute continuity and grain of truth, the algorithm there is shown to converge to NE. ON-SGSP operates in a similar setting as that in [Kalai and Lehrer, 1993], except that we do not assume knowledge of the reward functions.
ON-SGSP is a model-free online algorithm and, unlike [Kalai and Lehrer, 1993], any agent's strategy in ON-SGSP does not depend upon Bayesian estimates of other agents' strategies; hence, their absolute continuity/grain of truth assumptions do not apply.

Evolutionary algorithm. Akchurina [2009] employs numerical methods in order to solve a system of ODEs and only establishes empirical convergence to NE for a group of randomly generated games. In contrast, ON-SGSP is a model-free algorithm that is provably convergent to NE in self-play. We also note that the system of ODEs given by Akchurina [2009] (also found in [Weibull, 1996, pp. 189]) turns out to be similar to a portion of the ODEs that are tracked by ON-SGSP.

Remark 1. Shoham et al. [2003] and Shoham et al. [2007] question whether Nash equilibrium is a useful solution concept for general-sum games. However, if we are willing to concede that the prescriptive, equilibrium agenda is indeed useful for stochastic games, then we believe our work is theoretically significant. Our ON-SGSP algorithm is a prescriptive, co-operative learning algorithm that observes a sample path from the underlying game and converges to stationary NE. To the best of our knowledge, this is the first algorithm to do so, with proven convergence.

3 Formal Definitions

A stochastic game can be seen to be an extension of the single-agent Markov decision process. A discounted reward stochastic game is described by a tuple <N, X, A, p, r, β>, where N represents the number of agents, X denotes the state space and A = ∪_{x∈X} A(x) is the aggregate action space, where A(x) = ∏_{i=1}^N A^i(x) is the Cartesian product of the action spaces A^i(x) of the individual agents when the state of the game is x ∈ X. We assume both state and action spaces to be finite. Let p(y|x, a) denote the probability of going from the current state x ∈ X to y ∈ X when the vector of actions a ∈ A(x) (of the N players) is chosen, and let r(x, a) = ⟨r^i(x, a) : i = 1, 2, ..., N⟩ denote the vector of reward functions of all agents when the state is x ∈ X and the vector of actions a ∈ A(x) is chosen. Also, 0 < β < 1 denotes the discount factor that controls the influence of rewards obtained in the future on the agents' strategies (see Definition 1 below).

Notation. ⟨···⟩ represents a column vector and 1_m is a vector of ones with m elements. The various constituents of the stochastic game considered are denoted as follows (we use the terms policy and strategy interchangeably in this paper):

Action: a = ⟨a^1, a^2, ..., a^N⟩ ∈ A(x) is the aggregate action, a^{-i} is the tuple of actions of all agents except i, and A^{-i}(x) := ∏_{j≠i} A^j(x) is the set of feasible actions in state x ∈ X of all agents except i.

Policy: π^i(x, a^i) is the probability of agent i picking action a^i ∈ A^i(x) in state x ∈ X, π^i(x) = ⟨π^i(x, a^i) : a^i ∈ A^i(x)⟩ is the randomized policy vector in state x ∈ X for agent i, π^i = ⟨π^i(x) : x ∈ X⟩, π = ⟨π^i : i = 1, 2, ..., N⟩ is the strategy-tuple of all agents and π^{-i} = ⟨π^j : j = 1, 2, ..., N, j ≠ i⟩ is the strategy-tuple of all agents except agent i. We focus only on stationary strategies in this paper, as suggested by Theorem 1.

Transition probability: Let π(x, a) = ∏_{i=1}^N π^i(x, a^i) and π^{-i}(x, a^{-i}) = ∏_{j=1, j≠i}^N π^j(x, a^j). Then, the (Markovian) transition probability from state x ∈ X to state y ∈ X, when each agent i plays according to its randomized strategy π^i, can be written as

    p(y|x, π) = Σ_{a∈A(x)} p(y|x, a) π(x, a).

Reward: r^i(x, a) is the single-stage reward obtained by agent i in state x ∈ X, where a ∈ A(x) is the aggregate action taken.

Definition 1. (Value function) The value function is the expected return for any agent i ∈ {1, 2, ..., N} and is defined as

    v^i_π(s_0) = E[ Σ_t β^t ( Σ_{a∈A(s_t)} r^i(s_t, a) π(s_t, a) ) ].    (1)
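For a fixed strategy-tuple π, the expectation in (1) can be computed exactly from the induced transition matrix p(y|x, π) above and the expected single-stage rewards, which is the job of the critic in the model-based setting. A minimal illustrative sketch (not the paper's code; P_pi and r_pi are assumed to be precomputed tabular quantities):

    import numpy as np

    def evaluate_strategy(P_pi, r_pi, beta):
        # P_pi[x, y] = p(y | x, pi) for the joint strategy pi,
        # r_pi[x]    = sum_a pi(x, a) * r^i(x, a) for one fixed agent i,
        # beta       = discount factor in (0, 1).
        # Definition 1 then gives v^i_pi as the solution of (I - beta * P_pi) v = r_pi.
        n = P_pi.shape[0]
        return np.linalg.solve(np.eye(n) - beta * P_pi, r_pi)

In the model-free setting the transition matrix is unavailable, and the critic instead estimates v^i_π from observed transitions via temporal-difference learning, as noted in the description of ON-SGSP above.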
Given the above notion of the value function, the goal of each agent is to find a strategy that achieves a Nash equilibrium. The latter is defined as follows:

Definition 2. (Nash equilibrium) A stationary Markov strategy π* = ⟨π^{1*}, π^{2*}, ..., π^{N*}⟩ is said to be Nash if

    v^i_{π*}(s) ≥ v^i_{⟨π^i, π^{-i*}⟩}(s),  ∀π^i, ∀i, ∀s ∈ X.

The corresponding equilibrium of the game is said to be a Nash equilibrium.

Since we consider a discounted stochastic game with a finite state space, we have the following well-known result that ensures the existence of a stationary equilibrium:

Theorem 1. Any finite discounted stochastic game has an equilibrium in stationary strategies.

We shall refer to such stationary randomized strategies as Nash strategies. The reader is referred to Fink [1964], Takahashi [1964], Sobel [1971] for a proof of Theorem 1.

4 A Generalized Optimization Problem

Basic idea. Using dynamic programming, the Nash equilibrium condition in Definition 2 can be re-written as: ∀x ∈ X, ∀i = 1, 2, ..., N,

    v^i_{π*}(x) = max_{π^i(x) ∈ Δ(A^i(x))} E_{π^i(x)} { Q^i_{π^{-i*}}(x, a^i) },    (2)

where

    Q^i_{π^{-i}}(x, a^i) = E_{π^{-i}(x)} [ r^i(x, a) + β Σ_{y∈U(x)} p(y|x, a) v^i(y) ]

represents the marginal value associated with picking action a^i ∈ A^i(x) in state x ∈ X for agent i, while the other agents act according to π^{-i}. Also, Δ(A^i(x)) denotes the set of all possible probability distributions over A^i(x). The basic idea behind the optimization problem that we formulate below is to model the objective such that the value function is correct w.r.t. the agents' strategies, while adding a constraint to ensure that a feasible solution to the problem corresponds to a Nash equilibrium.

Objective. A possible optimization objective for agent i would be

    f^i(v^i, π) = Σ_{x∈X} ( v^i(x) - E_{π^i} Q^i_{π^{-i}}(x, a^i) ).

The objective above has to be minimized over all policies π^i ∈ Δ(A^i(x)). But Q^i_{π^{-i}}(x, a^i), by definition, is dependent on the strategies of all other agents. So, an isolated minimization of f^i(v^i, π^i) would not be meaningful and hence we consider the aggregate objective f(v, π) = Σ_{i=1}^N f^i(v^i, π). This objective is minimized over all policies π^i ∈ Δ(A^i(x)) of all agents. Thus, we have an optimization problem with objective f(v, π), along with the natural constraints ensuring that the policy vectors π^i(x) remain probabilities over all feasible actions A^i(x), in all states x ∈ X and for agents i = 1, ..., N.

Constraints. Notice that an optimization problem with the objective discussed above has only a set of simple constraints ensuring that π remains a valid strategy. However, this is not sufficient to accurately represent Nash equilibria of the underlying game. Here, we look at a possible set of additional constraints which might make the optimization problem more useful. Note that the term being maximized in equation (2), i.e., E_{π^i} Q^i_{π^{-i}}(x, a^i), represents a convex combination of the values of Q^i_{π^{-i}}(x, a^i) over all possible actions a^i ∈ A^i(x) in a given state x ∈ X for a given agent i. Thus, it is implicitly implied that

    Q^i_{π^{-i}}(x, a^i) ≤ v^i_{π*}(x),  ∀a^i ∈ A^i(x), x ∈ X, i = 1, 2, ..., N.

Formally, the optimization problem for any N ≥ 2 is given below:

    min_{v,π} f(v, π) = Σ_{i=1}^N Σ_{x∈X} ( v^i(x) - E_{π^i} Q^i_{π^{-i}}(x, a^i) )    (3)
    s.t.
    (a) π^i(x, a^i) ≥ 0,  ∀a^i ∈ A^i(x), x ∈ X, i = 1, 2, ..., N,
    (b) Σ_{a^i∈A^i(x)} π^i(x, a^i) = 1,  ∀x ∈ X, i = 1, 2, ..., N,
    (c) Q^i_{π^{-i}}(x, a^i) ≤ v^i(x),  ∀a^i ∈ A^i(x), x ∈ X, i = 1, 2, ..., N.

In the above, 3(a)-3(b) ensure that π is a valid policy, while 3(c) is necessary for any valid policy to be a NE of the underlying game.

Theorem 2. A feasible point (v*, π*) of the optimization problem (3) provides a Nash equilibrium in stationary strategies to the corresponding general-sum discounted stochastic game if and only if f(v*, π*) = 0.

Proof. See [Filar and Vrieze, 2004, Theorem 3.8.2] for a proof in the case of N = 2. The proof works in a similar manner for general N.
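Theorem 2 suggests a simple numerical certificate: at a feasible point, π* is Nash exactly when f(v*, π*) = 0. The sketch below evaluates Q^i_{π^{-i}} and f(v, π) for a small tabular game; the layout of the hypothetical containers (P and R keyed by state and joint action, pi[i][x] a probability vector, v[i] an array over states, action_sets[i][x] the number of actions of agent i in state x) is an assumption for illustration, not the paper's implementation.

    import itertools
    import numpy as np

    def q_value(i, x, ai, v, pi, P, R, beta, action_sets):
        # Q^i_{pi^{-i}}(x, a^i): average, over the other agents' randomized actions, of
        # the one-stage reward plus the discounted value of the next state.
        others = [j for j in range(len(action_sets)) if j != i]
        q = 0.0
        for a_minus in itertools.product(*(range(action_sets[j][x]) for j in others)):
            a = list(a_minus)
            a.insert(i, ai)                      # rebuild the joint action a = (a^1, ..., a^N)
            a = tuple(a)
            w = float(np.prod([pi[j][x][aj] for j, aj in zip(others, a_minus)]))
            expected_next = sum(P[(x, a)][y] * v[i][y] for y in range(len(v[i])))
            q += w * (R[(x, a)][i] + beta * expected_next)
        return q

    def objective_f(v, pi, P, R, beta, action_sets, n_states):
        # f(v, pi) from (3): sum over agents and states of v^i(x) - E_{pi^i} Q^i_{pi^{-i}}(x, a^i).
        total = 0.0
        for i in range(len(action_sets)):
            for x in range(n_states):
                q = [q_value(i, x, ai, v, pi, P, R, beta, action_sets)
                     for ai in range(action_sets[i][x])]
                total += v[i][x] - float(np.dot(pi[i][x], q))
        return total

Feasibility additionally requires constraint 3(c), i.e., every q_value(i, x, ai, ...) should not exceed v[i][x], and each pi[i][x] must be a probability vector; with those checks in place, a zero value of objective_f certifies a Nash point by Theorem 2.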
A few remarks about the difficulties involved in solving (3) are in order.

Remark 2. (Non-linearity) For the case of N = 2, the objective f(v, π) in (3) is of order 3, while the toughest constraint 3(c) is quadratic. This is apparent from the fact that the second term inside the summation in f(v, π) has the following variables multiplied: π_1 in the first expectation, π_2 inside the expectation in the definition of the Q-function, and v inside the second term of the expectation in the Q-function. Along similar lines, the constraint 3(c) can be inferred to be quadratic. Thus, we have an optimization problem with a third-order objective and quadratic constraints, even for the case of N = 2, and the constituent functions (both objective and constraints) can easily be seen to be non-linear. For a general N > 2, the problem (3) gets more complicated, as more policy variables π_1, ..., π_N are thrown in.

Remark 3. (Beyond local minima) Filar and Vrieze [2004] have formulated a non-linear optimization problem for the case of two-player zero-sum stochastic games. An associated result (Theorem 3.9.4, page 141, in [Filar and Vrieze, 2004]) states that every local minimum of that optimization problem is also a global minimum. However, this result does not hold for a general-sum game even in the two-player (and also N ≥ 3) setting and hence the requirement is for a global optimization scheme that solves (3).

Remark 4. (No steepest descent) From the previous remark, it is apparent that a simple steepest descent scheme is not enough to solve (3), even for the two-player setting. This is because there can be local minima of (3) that do not correspond to the global minimum, and steepest descent schemes guarantee convergence to local minima only. Note that steepest descent schemes were sufficient to solve for Nash equilibrium strategies in two-player zero-sum stochastic games, while this is not the case with general-sum N-player games, with N ≥ 2. Sections 3 and 4 of [Prasad and Bhatnagar, 2015] contain a detailed discussion on the inadequacies of steepest descent for general-sum games.

Remark 5. (No Newton method) A natural question that arises is whether one can employ a Newton method in order to solve (3), and the answer is in the negative. Observe that the Hessian of the objective function f(v, π) in (3) has its diagonal elements equal to zero and this makes it indefinite. This makes Newton methods infeasible, as they require invertibility of the Hessian to work.

We overcome the above difficulties by deriving a descent direction (that is not necessarily steepest) in order to solve (3). Before we present the descent direction, we break down (3) into simpler sub-problems. Subsequently, we describe Stochastic Game - Sub-Problem (SG-SP) conditions that are both necessary and sufficient for the problem (3).

Sub-problems for each state and agent

We form sub-problems from the main optimization problem (3), along the lines of Prasad and Bhatnagar [2012], for each state x ∈ X and each agent i ∈ {1, 2, ..., N}. The sub-problems are formed with the objective of ensuring that there is no Bellman error (see g^i_{x,z}(θ) below). For any x ∈ X, z = 1, 2, ..., |A^i(x)| and i ∈ {1, 2, ..., N}, let θ := ⟨v^i, π^{-i}(x)⟩ denote the value-policy tuple and let

    g^i_{x,z}(θ) := Q^i_{π^{-i}}(x, a^i_z) - v^i(x)    (4)

denote the Bellman error. Further, let p_z := π^i(x, a^i_z) and p = ⟨p_z : z = 1, 2, ..., |A^i(x)|⟩. Then, the sub-problems are formulated as follows:

    min_{θ,p} h_x(θ, p) = Σ_{z=1}^{|A^i(x)|} p_z [ -g^i_{x,z}(θ) ]    (5)
    s.t. g^i_{x,z}(θ) ≤ 0,  -p_z ≤ 0,  for z = 1, 2, ..., |A^i(x)|,  and  Σ_z p_z = 1.
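In the same illustrative tabular layout (and reusing the hypothetical q_value helper sketched after Theorem 2), the Bellman error (4) and the sub-problem objective (5) for a given state x and agent i are direct to write down:

    import numpy as np

    def bellman_error(i, x, z, v, pi, P, R, beta, action_sets):
        # g^i_{x,z}(theta) = Q^i_{pi^{-i}}(x, a^i_z) - v^i(x), cf. (4).
        return q_value(i, x, z, v, pi, P, R, beta, action_sets) - v[i][x]

    def subproblem_objective(i, x, v, pi, P, R, beta, action_sets):
        # h_x(theta, p) = sum_z p_z * (-g^i_{x,z}(theta)), cf. (5); feasibility of the
        # sub-problem additionally requires g^i_{x,z}(theta) <= 0 and p on the simplex.
        g = np.array([bellman_error(i, x, z, v, pi, P, R, beta, action_sets)
                      for z in range(action_sets[i][x])])
        return float(np.dot(pi[i][x], -g))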
5 Stochastic Game - Sub-Problem (SG-SP) Conditions

In this section, we derive a set of necessary and sufficient conditions for solutions of (3) and establish their equivalence with Nash strategies.

Definition 3 (SG-SP point). A point (v*, π*) of the optimization problem (3) is said to be an SG-SP point if it is a feasible point of (3) and, for every sub-problem, i.e., for all x ∈ X and i ∈ {1, 2, ..., N},

    p*_z g^i_{x,z}(θ*) = 0,  ∀z = 1, 2, ..., |A^i(x)|.    (6)

The above conditions, which define a point to be an SG-SP point, are called SG-SP conditions.

5.1 Equivalence of SG-SP with Nash strategies

The connection between SG-SP points and Nash equilibria can be seen intuitively as follows:

(i) The objective function f(v*, π*) in (3) can be expressed as a summation of terms of the form p*_z [-g^i_{x,z}(θ*)] over z = 1, 2, ..., |A^i(x)| and over all sub-problems. Condition (6) implies that each of these terms is zero, and hence f(v*, π*) = 0.

(ii) The objective of the sub-problem is to ensure that there is no Bellman error, which in turn implies that the value estimates v* are correct with respect to the policy π* of all agents.

Combining (i) and (ii) with Theorem 3.8.2 of [Filar and Vrieze, 2004], we have the following result:

Theorem 3 (Nash ⇔ SG-SP). A strategy π* is Nash if and only if (v*, π*) for the corresponding optimization problem (3) is an SG-SP point.

The proof of the above theorem follows from a combination of Lemmas 1 and 2, presented below.

Lemma 1 (SG-SP ⇒ Nash). An SG-SP point (v*, π*) gives a Nash strategy-tuple for the underlying stochastic game.

Proof. The objective function value f(v*, π*) of the optimization problem (3) can be expressed as a summation of terms of the form p*_z [-g^i_{x,z}(θ*)] over z = 1, 2, ..., m and over all sub-problems. Condition (6) implies that each of these terms is zero, so that f(v*, π*) = 0. From Filar and Vrieze [2004, Theorem 3.8.2, page 132], since (v*, π*) is a feasible point of (3) and f(v*, π*) = 0, (v*, π*) corresponds to a Nash equilibrium of the underlying stochastic game.

Lemma 2 (Nash ⇒ SG-SP). If a strategy π* is Nash, then (v*, π*) for the corresponding optimization problem (3) is an SG-SP point.

Proof. From Filar and Vrieze [2004, Theorem 3.8.2, page 132], if a strategy π* is Nash, then a feasible point (v*, π*) exists for the corresponding optimization problem (3), where f(v*, π*) = 0. From the constraints of (3), it is clear that for a feasible point, p*_z [-g^i_{x,z}(θ*)] ≥ 0, for z = 1, 2, ..., m, for every sub-problem. Since the sum of all these terms, i.e., f(v*, π*), is zero, each of these terms is zero, i.e., (v*, π*) satisfies (6). Thus, (v*, π*) is an SG-SP point.

5.2 Kinship to Karush-Kuhn-Tucker - Sub-Problem (KKT-SP) conditions

Prasad and Bhatnagar [2012] consider a similar optimization problem to (3) for the case of two agents, i.e., N = 2, and derive a set of verifiable necessary and sufficient conditions that they call KKT-SP conditions. In the following, we first extend the KKT-SP conditions to N-player stochastic games, for any N ≥ 2, and later establish the equivalence of the KKT-SP conditions with the SG-SP conditions formulated above.

The Lagrangian corresponding to (5) can be written as

    k(θ, p, λ, δ, s, t) = h_x(θ, p) + Σ_{z=1}^{|A^i(x)|} λ_z ( g^i_{x,z}(θ) + s_z^2 ) + Σ_{z=1}^{|A^i(x)|} δ_z ( -p_z + t_z^2 ),    (7)

where λ_z and δ_z are the Lagrange multipliers and s_z and t_z, z = 1, 2, ..., |A^i(x)|, are the slack variables corresponding to the first and second constraints of the sub-problem (5), respectively.

Using the Lagrangian (7), the associated KKT conditions for the sub-problem (5) corresponding to a state x ∈ X and agent i ∈ {1, ..., N}, at a point ⟨θ*, p*⟩ and with m := |A^i(x)|, are the following:

    (a) ∇_θ h_x(θ*, p*) + Σ_{z=1}^m λ_z ∇_θ g^i_{x,z}(θ*) = 0,
    (b) ∂h_x(θ*, p*)/∂p_z - δ_z = 0,  z = 1, 2, ..., m,
    (c) λ_z g^i_{x,z}(θ*) = 0,  z = 1, 2, ..., m,
    (d) δ_z p*_z = 0,  z = 1, 2, ..., m,
    (e) λ_z ≥ 0,  z = 1, 2, ..., m,
    (f) δ_z ≥ 0,  z = 1, 2, ..., m.    (8)
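Like the complementary-slackness conditions (c)-(d) in (8), the SG-SP conditions (6) are straightforward to test numerically at a candidate point. A hedged sketch, again reusing the hypothetical bellman_error helper and tabular layout assumed above:

    def is_sg_sp_point(v, pi, P, R, beta, action_sets, n_states, tol=1e-8):
        # Checks, for every agent i, state x and action index z, that
        # g^i_{x,z}(theta*) <= 0 (feasibility) and p*_z * g^i_{x,z}(theta*) = 0, cf. (6).
        for i in range(len(action_sets)):
            for x in range(n_states):
                for z in range(action_sets[i][x]):
                    g = bellman_error(i, x, z, v, pi, P, R, beta, action_sets)
                    if g > tol or abs(pi[i][x][z] * g) > tol:
                        return False
        return True

By Theorem 3, a feasible point that passes this check corresponds to a stationary Nash equilibrium of the underlying game.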
The KKT-SP conditions are shown to be necessary and sufficient for (v*, π*) to represent a Nash equilibrium point of the underlying stochastic game and for π* to be a Nash strategy-tuple in the case of N = 2 (see Theorem 3.8 in [Prasad and Bhatnagar, 2012]). However, this requires an additional assumption that for each sub-problem,
