Maximizing the Conditional Expected Reward for Reaching the Goal⋆ (extended version)

Christel Baier, Joachim Klein, Sascha Klüppelholz, Sascha Wunderlich
Institute for Theoretical Computer Science, Technische Universität Dresden, Germany

arXiv:1701.05389v1 [cs.LO] 19 Jan 2017

Abstract. The paper addresses the problem of computing maximal conditional expected accumulated rewards until reaching a target state (briefly called maximal conditional expectations) in finite-state Markov decision processes where the condition is given as a reachability constraint. Conditional expectations of this type can, e.g., stand for the maximal expected termination time of probabilistic programs with non-determinism, under the condition that the program eventually terminates, or for the worst-case expected penalty to be paid, assuming that at least three deadlines are missed. The main results of the paper are (i) a polynomial-time algorithm to check the finiteness of maximal conditional expectations, (ii) PSPACE-completeness for the threshold problem in acyclic Markov decision processes where the task is to check whether the maximal conditional expectation exceeds a given threshold, (iii) a pseudo-polynomial-time algorithm for the threshold problem in the general (cyclic) case, and (iv) an exponential-time algorithm for computing the maximal conditional expectation and an optimal scheduler.

⋆ The authors are supported by the DFG through the collaborative research centre HAEC (SFB 912), the Excellence Initiative by the German Federal and State Governments (cluster of excellence cfAED), the Research Training Group QuantLA (GRK 1763), and the DFG-project BA-1679/11-1.

1 Introduction

Stochastic shortest (or longest) path problems are a prominent class of optimization problems where the task is to find a policy for traversing a probabilistic graph structure such that the expected value of the generated paths satisfying a certain objective is minimal (or maximal). In the classical setting (see e.g. [14,33,23,28]), the underlying graph structure is given by a finite-state Markov decision process (MDP), i.e., a state-transition graph with nondeterministic choices between several actions for each of its non-terminal states, probability distributions specifying the probabilities for the successor states for each state-action pair, and a reward function that assigns rational values to the state-action pairs. The stochastic shortest (longest) path problem asks to find a scheduler, i.e., a function that resolves the nondeterministic choices, possibly in a history-dependent way, which minimizes (maximizes) the expected accumulated reward until reaching a goal state. To ensure the existence of the expectation for given schedulers, one often assumes that the given MDP is contracting, i.e., the goal is reached almost surely under all schedulers, in which case the optimal expected accumulated reward is achieved by a memoryless deterministic scheduler that optimizes the expectation from each state and is computable using a linear program with one variable per state (see e.g. [28]). The contraction assumption can be relaxed by requiring the existence of at least one scheduler that reaches the goal almost surely and taking the extremum over all those schedulers [14,23,15]. These algorithms and corresponding value or policy iteration approaches have been implemented in various tools and used in many application areas.
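For illustration, the classical computation referred to above can be sketched in a few lines of Python. The encoding of the MDP (P[s][a] as a successor distribution, rew[(s, a)] as the reward of a state-action pair) and all identifiers are assumptions of this sketch, not notation from the paper; the routine presumes a contracting MDP, so the Bellman iteration converges towards the optimum that the linear-programming formulation computes exactly.

```python
# Value-iteration sketch for maximal expected accumulated rewards until
# reaching a goal state (the classical stochastic longest path setting).
# P[s][a]   : dict mapping successor states to probabilities
# rew[(s,a)]: non-negative reward of the state-action pair (s, a)
# goal      : set of absorbing goal states

def max_expected_reward(P, rew, goal, iterations=10_000):
    """Approximate E^max(accumulated reward until reaching goal) per state,
    assuming the MDP is contracting (goal is reached almost surely under
    every scheduler), so the iteration below converges."""
    value = {s: 0.0 for s in P}
    for _ in range(iterations):
        new_value = {}
        for s, acts in P.items():
            if s in goal or not acts:
                new_value[s] = 0.0   # goal states and traps accumulate nothing more
            else:
                new_value[s] = max(
                    rew[(s, a)] + sum(p * value[t] for t, p in dist.items())
                    for a, dist in acts.items()
                )
        value = new_value
    return value
```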
The restriction to schedulers that reach the goal almost surely, however, limits the applicability and significance of the results. First, the known algorithms for computing extremal expected accumulated rewards are not applicable for models where the probability for never visiting a goal state is positive under each scheduler. Second, statements about the expected rewards for schedulers that reach the goal with probability 1 are not sufficient to draw any conclusion about the best- or worst-case behavior if there exist schedulers that miss the goal with positive probability. This motivates the consideration of conditional stochastic path problems where the task is to compute the optimal expected accumulated reward until reaching a goal state, under the condition that a goal state will indeed be reached, and where the extrema are taken over all schedulers that reach the goal with positive probability. More precisely, we address here a slightly more general problem where we are given two sets F and G of states in an MDP M with non-negative integer rewards and ask for the maximal expected accumulated reward until reaching F, under the condition that G will be visited (denoted E^max_{M,s_init}(⊞F | ♦G), where s_init is the initial state of M). Computation schemes for conditional expectations of this type can, e.g., be used to answer the following questions (assuming the underlying model is a finite-state MDP):

(Q1) What is the maximal termination time of a probabilistic and nondeterministic program, under the condition that the program indeed terminates?
(Q2) What are the maximal expected costs of the repair mechanisms that are triggered in cases where a specific failure scenario occurs, under the condition that the failure scenario indeed occurs?
(Q3) What is the maximal energy consumption, under the condition that all jobs of a given list will be successfully executed within one hour?

The relevance of question (Q1) and related problems becomes clear from the work [24,27,29,13,19] on the semantics of probabilistic programs where no guarantees for almost-sure termination can be given. Question (Q2) is natural for a worst-case analysis of resilient systems or other types of systems where conditional probabilities serve to provide performance guarantees on the protocols triggered in exceptional cases that appear with positive, but low probability. Question (Q3) is typical when the task is to study the trade-off between cost and utility functions (see e.g. [9]). Given the work on anonymity and related notions for information leakage using conditional probabilities in MDP-like models [7,20] or the formalization of posterior vulnerability as an expectation [4], the concept of conditional accumulated expected rewards might also be useful to specify the degree of protection of secret data or to study the trade-off between privacy and utility, e.g., using gain functions [5,3]. Other areas where conditional expectations play a crucial role are risk management, where the conditional value-at-risk is used to formalize the expected loss under the assumption that very large losses occur [37,2], and regression analysis, where conditional expectations serve to predict the relation between random variables [35].

[Fig. 1. MDP M[r] for Example 1.1 over the states s0, s1, s2, goal and fail (edge labels include α,1/2 and β,1/2; rew(s1,γ) = r, rew(s2,β) = 1, rew(si,α) = 0).]

Example 1.1. To illustrate the challenges for designing algorithms to compute maximal conditional expectations, we regard the MDP M[r] shown in Figure 1.
The reward of the state-action pair (s1,γ) is given by a reward parameter r ∈ N. Let s_init = s0 be the initial state and F = G = {goal}. The only nondeterministic choice is in state s2, while states s0 and s1 behave purely probabilistically, and goal and fail are trap states. Given a scheduler S, we write CE^S for the conditional expectation E^S_{M[r],s0}(⊞goal | ♦goal). (See also Section 2 for our notations.) For the two memoryless schedulers that choose α resp. β in state s2 we have:

  CE^α = (1/2·r + 1/2·0) / (1/2 + 1/2) = r/2   and   CE^β = (1/2·r + 0) / (1/2 + 0) = r

We now regard the schedulers S_n for n = 1,2,... that choose β for the first n visits of s2 and action α for the (n+1)-st visit of s2. Then:

  CE^{S_n} = (1/2·r + 1/2·(1/2^n)·n) / (1/2 + 1/2·(1/2^n)) = r + (n−r)/(2^n + 1)

Thus, CE^{S_n} > CE^β iff n > r, and the maximum is achieved for n = r+2.

This example illustrates three phenomena that distinguish conditional and unconditional expected accumulated rewards and make reasoning about maximal conditional expectations harder than about unconditional ones. First, optimal schedulers for M[r] need a counter for the number of visits in state s2. Hence, memoryless schedulers are not powerful enough to maximize the conditional expectation. Second, while the maximal conditional expectation for M[r] with initial state s_init = s0 is finite, the maximal conditional expectation for M[r] with starting state s2 is infinite as:

  sup_{n∈N} E^{S_n}_{M[r],s2}(⊞goal | ♦goal) = sup_{n∈N} (n/2^n)/(1/2^n) = sup_{n∈N} n = ∞

Third, as S_2 maximizes the conditional expected accumulated reward for r = 0, while S_3 is optimal for r = 1, optimal decisions for paths ending in state s2 depend on the reward value r of the γ-transition from state s1, although state s1 is not reachable from s2. Thus, optimal decisions for a path π do not only depend on the past (given by π) and the possible future (given by the sub-MDP that is reachable from π's last state), but require global reasoning. □
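As a quick numerical sanity check of the computation in Example 1.1, the following Python sketch evaluates CE^{S_n} with exact rational arithmetic and confirms both the closed form r + (n−r)/(2^n + 1) and that the maximum is attained at n = r+2; the loop bounds and the chosen values of r are arbitrary.

```python
# Numeric check of Example 1.1: conditional expectations of the
# schedulers S_n in M[r] and the maximising index n.
from fractions import Fraction as F

def ce_Sn(r, n):
    """CE^{S_n} = (1/2*r + 1/2*(1/2^n)*n) / (1/2 + 1/2*(1/2^n))."""
    num = F(1, 2) * r + F(1, 2) * F(1, 2**n) * n
    den = F(1, 2) + F(1, 2) * F(1, 2**n)
    return num / den

for r in range(5):
    best_n = max(range(1, 40), key=lambda n: ce_Sn(r, n))
    assert ce_Sn(r, best_n) == r + F(best_n - r, 2**best_n + 1)  # closed form
    assert best_n == r + 2                                       # maximum at n = r+2
    print(f"r={r}: CE^alpha={F(r, 2)}, CE^beta={r}, best CE^S_n={ce_Sn(r, best_n)} at n={best_n}")
```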
The main results of this paper are the following theorems. We write CE^max for the maximal conditional expectation, i.e., the supremum of the conditional expectations E^S_{M,s_init}(⊞F | ♦G), when ranging over all schedulers S where Pr^S_{M,s_init}(♦G) is positive and Pr^S_{M,s_init}(♦F | ♦G) = 1. (See also Section 2 for our notations.)

Theorem 1 (Checking finiteness and upper bound). There is a polynomial-time algorithm that checks if CE^max is finite. If so, an upper bound CE^ub for CE^max is computable in pseudo-polynomial time for the general case and in polynomial time if F = G and Pr^min_{M,s}(♦G) > 0 for all states s with s ⊨ ∃♦G.

The threshold problem asks whether the maximal conditional expectation exceeds or misses a given rational threshold ϑ.

Theorem 2 (Threshold problem). The problem "does CE^max ▷◁ ϑ hold?" (where ▷◁ ∈ {>, ≥, <, ≤}) is PSPACE-hard and solvable in exponential (even pseudo-polynomial) time. It is PSPACE-complete for acyclic MDPs.

For the computation of an optimal scheduler, we suggest an iterative scheduler-improvement algorithm that interleaves calls of the threshold algorithm with linear programming techniques to handle zero-reward actions. This yields:

Theorem 3 (Computing optimal schedulers). The value CE^max and an optimal scheduler S are computable in exponential time.

Algorithms for checking finiteness and computing an upper bound (Theorem 1) will be sketched in Section 3. Section 4 presents a pseudo-polynomial threshold algorithm and a polynomially space-bounded algorithm for acyclic MDPs (Theorem 2) as well as an exponential-time computation scheme for the construction of an optimal scheduler (Theorem 3). Further details, soundness proofs and a proof of the PSPACE-hardness stated in Theorem 2 can be found in the appendix. The general feasibility of the algorithms will be shown by experimental studies with a prototypical implementation (for details, see Appendix K).

Related work. Although conditional expectations appear rather naturally in many applications, and despite the large number of publications on variants of stochastic path problems and other forms of expectations in MDPs (see e.g. [17,34]), we are not aware that they have been addressed in the context of MDPs. Computation schemes for extremal conditional probabilities Pr^max(ϕ | ψ) or Pr^min(ϕ | ψ), where both the objective ϕ and the assumption ψ are path properties specified in some temporal logic, have been studied in [8,6,11]. For reachability properties ϕ and ψ, the algorithm of [8,6] has exponential time complexity, while the algorithm of [11] runs in polynomial time. Although the approach of [11] is not applicable for calculating maximal conditional expectations (see Appendix B), it can be used to compute an upper bound for CE^max (see Section 3). Conditional expected rewards in Markov chains can be computed using the rescaling technique of [11] for finite Markov chains or the approximation techniques of [18,1] for certain classes of infinite-state Markov chains. The conditional weakest precondition operator of [29] yields a technique to compute conditional expected rewards for purely probabilistic programs (without non-determinism).

2 Preliminaries

We briefly summarize our notations used for Markov decision processes. Further details can be found in textbooks, see e.g. [33,28] or Chapter 10 in [10].

A Markov decision process (MDP) is a tuple M = (S, Act, P, s_init, rew) where S is a finite set of states, Act a finite set of actions, s_init ∈ S the initial state, P : S × Act × S → [0,1] ∩ Q the transition probability function and rew : S × Act → N the reward function. We require that ∑_{s'∈S} P(s,α,s') ∈ {0,1} for all (s,α) ∈ S × Act. We write Act(s) for the set of actions that are enabled in s, i.e., α ∈ Act(s) iff P(s,α,·) is not the null function. State s is called a trap if Act(s) = ∅. The paths of M are finite or infinite sequences s0 α0 s1 α1 s2 α2 ... where states and actions alternate such that P(si,αi,si+1) > 0 for all i ≥ 0. A path π is called maximal if it is either infinite, or finite and its last state is a trap. If π = s0 α0 s1 α1 s2 α2 ... α_{k−1} s_k is finite then rew(π) = rew(s0,α0) + rew(s1,α1) + ... + rew(s_{k−1},α_{k−1}) denotes the accumulated reward and first(π) = s0, last(π) = s_k its first resp. last state. The size of M, denoted size(M), is the sum of the number of states plus the total sum of the logarithmic lengths of the non-zero probability values P(s,α,s') and the reward values rew(s,α).¹

¹ The logarithmic length of an integer n is the number of bits required for a representation of n as a binary number. The logarithmic length of a rational number a/b is defined as the sum of the logarithmic lengths of its numerator a and its denominator b, assuming that a and b are coprime integers and b is positive.

An end component of M is a strongly connected sub-MDP. End components can be formalized as pairs E = (E, A) where E is a nonempty subset of S and A a function that assigns to each state s ∈ E a nonempty subset of Act(s) such that the graph induced by E is strongly connected.
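For concreteness, the following Python sketch encodes an MDP in the sense of the above definition, instantiated with M[r] from Fig. 1; the transition structure follows the description in Example 1.1, and the dictionary-based representation as well as all identifiers are merely illustrative.

```python
# Minimal encoding of a finite-state MDP M = (S, Act, P, s_init, rew):
# P[s][a] maps successor states to probabilities, rew maps state-action
# pairs to natural numbers; a state with no enabled action is a trap.
from fractions import Fraction as F

def make_M(r):
    """The MDP M[r] of Fig. 1 (cf. Example 1.1) with reward parameter r."""
    P = {
        "s0": {"alpha": {"s1": F(1, 2), "s2": F(1, 2)}},
        "s1": {"gamma": {"goal": F(1)}},
        "s2": {"alpha": {"goal": F(1)},
               "beta":  {"s2": F(1, 2), "fail": F(1, 2)}},
        "goal": {},   # trap
        "fail": {},   # trap
    }
    rew = {("s0", "alpha"): 0, ("s1", "gamma"): r,
           ("s2", "alpha"): 0, ("s2", "beta"): 1}
    return P, rew, "s0"

def enabled(P, s):       # Act(s)
    return set(P[s])

def is_trap(P, s):       # Act(s) is empty
    return not P[s]

P, rew, s_init = make_M(5)
assert all(sum(dist.values()) == 1 for acts in P.values() for dist in acts.values())
assert is_trap(P, "goal") and is_trap(P, "fail")
```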
A (randomized) scheduler for M, often also called policy or adversary, is a function S that assigns to each finite path π where last(π) is not a trap a probability distribution over Act(last(π)). S is called memoryless if S(π) = S(π') for all finite paths π, π' with last(π) = last(π'), in which case S can be viewed as a function that assigns to each non-trap state s a distribution over Act(s). S is called deterministic if S(π) is a Dirac distribution for each path π, in which case S can be viewed as a function that assigns an action to each finite path π where last(π) is not a trap. We write Pr^S_{M,s} or briefly Pr^S_s to denote the probability measure induced by S and s. Given a measurable set ψ of maximal paths, then Pr^min_{M,s}(ψ) = inf_S Pr^S_{M,s}(ψ) and Pr^max_{M,s}(ψ) = sup_S Pr^S_{M,s}(ψ). We will use LTL-like notations to specify measurable sets of maximal paths. For these it is well-known that optimal deterministic schedulers exist. If ψ is a reachability condition then even optimal deterministic memoryless schedulers exist.

Let ∅ ≠ F ⊆ S. For a comparison operator ▷◁ ∈ {=, >, ≥, <, ≤} and r ∈ N, ♦^{▷◁ r} F denotes the event "reaching F along some finite path π with rew(π) ▷◁ r". The notation ⊞F will be used for the random variable that assigns to each maximal path ς in M the reward rew(π) of the shortest prefix π of ς where last(π) ∈ F. If ς ⊭ ♦F then (⊞F)(ς) = ∞. If s ∈ S then E^S_{M,s}(⊞F) denotes the expectation of ⊞F in M with starting state s under S, which is infinite if Pr^S_{M,s}(♦F) < 1. E^max_{M,s}(⊞F) ∈ R ∪ {±∞} stands for sup_S E^S_{M,s}(⊞F), where the supremum is taken over all schedulers S with Pr^S_{M,s}(♦F) = 1. Let ψ be a measurable set of maximal paths. E^S_{M,s}(⊞F | ψ) stands for the expectation of ⊞F w.r.t. the conditional probability measure Pr^S_{M,s}(· | ψ) given by Pr^S_{M,s}(ϕ | ψ) = Pr^S_{M,s}(ϕ ∧ ψ) / Pr^S_{M,s}(ψ). E^max_{M,s}(⊞F | ψ) is the supremum of the values E^S_{M,s}(⊞F | ψ) for the schedulers S with Pr^S_{M,s}(ψ) > 0 and Pr^S_{M,s}(♦F | ψ) = 1, and Pr^max_{M,s}(ϕ | ψ) = sup_S Pr^S_{M,s}(ϕ | ψ) where S ranges over all schedulers with Pr^S_{M,s}(ψ) > 0 and sup ∅ = −∞.

For the remainder of this paper, we suppose that two nonempty subsets F and G of S are given such that Pr^max_{M,s}(♦F | ♦G) = 1. The task addressed in this paper is to compute the maximal conditional expectation given by:

  CE^max_{M,s} =def sup_S CE^S_{M,s} ∈ R ∪ {∞}   where   CE^S_{M,s} = E^S_{M,s}(⊞F | ♦G)

Here, S ranges over all schedulers S with Pr^S_{M,s}(♦G) > 0 and Pr^S_{M,s}(♦F | ♦G) = 1. If M and its initial state are clear from the context, we often simply write CE^max resp. CE^S. We assume that all states in M are reachable from s_init and that s_init ∉ F ∪ G (as CE^max = 0 if s_init ∈ F and CE^max = E^max_{M,s_init}(⊞F) if s_init ∈ G\F).
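To illustrate the definition of CE^S, the sketch below computes the conditional expectation of a deterministic memoryless scheduler as the quotient of a partial expectation and a reachability probability (the same quantities reappear as θ and y in Section 4). It assumes that the Markov chain induced by the scheduler reaches a trap almost surely, so that the fixed-point iteration converges; the encoding of M[5] mirrors the earlier sketch and all identifiers are illustrative.

```python
# CE^S = E^S(reward until goal | reach goal) for a deterministic
# memoryless scheduler S, computed as theta[s_init] / y[s_init].

def conditional_expectation(P, rew, s_init, scheduler, iterations=2000):
    y = {s: 0.0 for s in P}        # y_s: probability of reaching goal under S
    theta = {s: 0.0 for s in P}    # theta_s: partial expectation under S
    for _ in range(iterations):
        for s in P:
            if s == "goal":
                y[s], theta[s] = 1.0, 0.0
            elif not P[s]:                        # other traps, e.g. fail
                y[s], theta[s] = 0.0, 0.0
            else:
                a = scheduler[s]
                y[s] = sum(p * y[t] for t, p in P[s][a].items())
                theta[s] = rew[(s, a)] * y[s] + sum(
                    p * theta[t] for t, p in P[s][a].items())
    return theta[s_init] / y[s_init]              # defined whenever y[s_init] > 0

# M[5] from Fig. 1 and its two memoryless schedulers from Example 1.1:
P = {"s0": {"alpha": {"s1": 0.5, "s2": 0.5}},
     "s1": {"gamma": {"goal": 1.0}},
     "s2": {"alpha": {"goal": 1.0}, "beta": {"s2": 0.5, "fail": 0.5}},
     "goal": {}, "fail": {}}
rew = {("s0", "alpha"): 0, ("s1", "gamma"): 5, ("s2", "alpha"): 0, ("s2", "beta"): 1}
print(conditional_expectation(P, rew, "s0", {"s0": "alpha", "s1": "gamma", "s2": "alpha"}))  # r/2 = 2.5
print(conditional_expectation(P, rew, "s0", {"s0": "alpha", "s1": "gamma", "s2": "beta"}))   # r   = 5.0
```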
3 Finiteness and upper bound

Checking finiteness. We sketch a polynomially time-bounded algorithm that takes as input an MDP M = (S, Act, P, s_init, rew) with two distinguished subsets F and G of S such that Pr^max_{M,s_init}(♦F | ♦G) = 1. If CE^max = E^max_{M,s_init}(⊞F | ♦G) = ∞ then the output is "no". Otherwise, the output is an MDP M̂ = (Ŝ, Âct, P̂, ŝ_init, rêw) with two trap states goal and fail such that:

(1) E^max_{M,s_init}(⊞F | ♦G) = E^max_{M̂,ŝ_init}(⊞goal | ♦goal),
(2) ŝ_init ⊨ ∃♦goal and Pr^min_{M̂,ŝ}(♦(goal ∨ fail)) = 1 for all states ŝ ∈ Ŝ\{fail}, and
(3) M̂ does not have critical schedulers, where a scheduler U for M̂ is said to be critical iff Pr^U_{M̂,ŝ_init}(♦fail) = 1 and there is a reachable positive U-cycle.²

² The latter means a U-path π = s0 α0 s1 α1 ... α_{k−1} s_k where s0 = ŝ_init and s_i = s_k for some i ∈ {0,1,...,k−1} such that rêw(s_j,α_j) > 0 for some j ∈ {i,...,k−1}.

We provide here the main ideas of the algorithms and refer to Appendix C for the details. The algorithm first transforms M into an MDP M̃ that permits to assume F = G = {goal}. Intuitively, M̃ simulates M, while operating in four modes: "normal mode", "after G", "after F" and "goal". M̃ starts in normal mode, where it behaves as M as long as neither F nor G has been visited. If a G\F-state has been reached in normal mode then M̃ switches to the mode "after G". Likewise, as soon as an F\G-state has been reached in normal mode, M̃ switches to the mode "after F". M̃ enters the goal mode (consisting of a single trap state goal) as soon as a path fragment containing a state in F and a state in G has been generated. This is the case if M visits an F-state in mode "after G" or a G-state in mode "after F", or a state in F ∩ G in the normal mode. The rewards in the normal mode and in mode "after G" are precisely as in M, while the rewards are 0 in all other cases. We then remove all states s̃ in the "after G" mode with Pr^max_{M̃,s̃}(♦goal) < 1, collapse all states s̃ in M̃ with s̃ ⊭ ∃♦goal into a single trap state called fail, and add zero-reward transitions to fail from all states s̃ that are not in the "after G" mode and satisfy Pr^min_{M̃,s̃}(♦goal) = 0. Using techniques as in the unconditional case [23] we can check whether M̃ has positive end components, i.e., end components with at least one state-action pair (s,α) with rew(s,α) > 0. If so, then E^max_{M,s_init}(⊞F | ♦G) = ∞. Otherwise, we collapse each maximal end component of M̃ into a single state. Let M̂ denote the resulting MDP. It satisfies (1) and (2). Property (3) holds iff E^max_{M̂,ŝ_init}(⊞goal | ♦goal) < ∞. This condition can be checked in polynomial time using a graph analysis in the sub-MDP of M̂ consisting of the states ŝ with Pr^min_{M̂,ŝ}(♦goal) = 0 (see Prop. C.8 and Appendix C.3).
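The mode bookkeeping behind the transformation of M into M̃ can be phrased as a small state machine. The following Python sketch only illustrates the switching and reward rules described above; the mode names and the encoding are illustrative, and the actual construction (including the removal and collapsing steps) is carried out on the product of M with these modes.

```python
# Mode tracking for the transformation M -> M~ (finiteness check):
# M~ runs M and remembers whether F and/or G have already been visited.

def next_mode(mode, s, F, G):
    """Mode of M~ after entering state s of M."""
    if mode == "normal":
        if s in F and s in G:
            return "goal"        # a state in F and G completes the objective
        if s in G:
            return "after G"     # a G\F-state seen first
        if s in F:
            return "after F"     # an F\G-state seen first
        return "normal"
    if mode == "after G" and s in F:
        return "goal"            # F-state after G has been visited
    if mode == "after F" and s in G:
        return "goal"            # G-state after F has been visited
    return mode                  # the goal mode is absorbing

def mode_reward(mode, base_reward):
    """Rewards of M~: as in M in normal mode and in mode 'after G', else 0."""
    return base_reward if mode in ("normal", "after G") else 0
```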
Computing an upper bound. Due to the transformation used for checking finiteness of the maximal conditional expectation, we can now suppose that M = M̂, F = G = {goal} and that (2) and (3) hold. We now present a technique to compute an upper bound CE^ub for CE^max. The upper bound will be used later to determine a saturation point from which on optimal schedulers behave memorylessly (see Section 4). We consider the MDP M' simulating M, while operating in two modes. In its first mode, M' attaches the reward accumulated so far to the states. More precisely, the states of M' in its first mode have the form ⟨s,r⟩ ∈ S × N where 0 ≤ r ≤ R and R = ∑_{s∈S} max{rew(s,α) : α ∈ Act(s)}. The initial state of M' is s'_init = ⟨s_init, 0⟩. The reward of the state-action pairs (⟨s,r⟩,α) where r + rew(s,α) ≤ R is 0. If M' fires an action α in state ⟨s,r⟩ where r' =def r + rew(s,α) > R then it switches to the second mode, while earning reward r'. In its second mode, M' behaves as M without additional annotations of the states and earns the same rewards as M. From the states ⟨goal,r⟩, M' moves to goal with probability 1 and reward r. There is a one-to-one correspondence between the schedulers for M and M', and the switch from M to M' does not affect the probabilities and the accumulated rewards until reaching goal.

Let N denote the MDP resulting from M' by adding reset-transitions from fail (as a state of the second mode) and the copies ⟨fail,r⟩ in the first mode to the initial state s'_init. The reward of all reset-transitions is 0. The reset mechanism has been taken from [11], where it has been introduced as a technique to compute maximal conditional probabilities for reachability properties. Intuitively, N "discards" all paths of M' that eventually enter fail and "redistributes" their probabilities to the paths that eventually enter the goal state. In this way, N mimics the conditional probability measures Pr^S_{M',s'_init}(· | ♦goal) = Pr^S_{M,s_init}(· | ♦goal) for prefix-independent path properties. Paths π from s_init to goal in M are simulated in N by paths of the form ϱ = ξ1;...;ξk;π where ξi is a cycle in N with first(ξi) = s'_init and ξi's last transition is a reset-transition from some fail-state to s'_init. Thus, rew(π) ≤ rew_N(ϱ). The distinction between the first and second mode together with property (3) ensures that the new reset-transitions do not generate positive end components in N. By the results of [23], the maximal unconditional expected accumulated reward in N is finite and we have:

  E^max_{M,s_init}(⊞goal | ♦goal) = E^max_{M',s'_init}(⊞goal | ♦goal) ≤ E^max_{N,s'_init}(⊞goal)

Hence, we can deal with CE^ub = E^max_{N,s'_init}(⊞goal), which is computable in time polynomial in the size of N by the algorithm proposed in [23]. As size(N) = Θ(R·size(M)), we obtain a pseudo-polynomial time bound for the general case. If, however, Pr^min_{M,s}(♦goal) > 0 for all states s ∈ S\{fail}, then there is no need for the detour via M' and we can apply the reset-transformation M ⇝ N by adding a reset-transition from fail to s_init with reward 0, in which case the upper bound CE^ub = E^max_{N,s_init}(⊞goal) is obtained in time polynomial in the size of M. For details we refer to the proof of Prop. C.8 and Section C.4 in the appendix.
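In the simple case Pr^min_{M,s}(♦goal) > 0 for all s ≠ fail, the reset-transformation M ⇝ N amounts to a one-line change of the model. A minimal sketch, again using the illustrative dictionary encoding of the earlier snippets:

```python
# Reset transformation M ~> N for the simple case: the trap fail gets a
# single zero-reward "reset" action back to the initial state.
import copy

def add_reset(P, rew, s_init, fail="fail"):
    N_P, N_rew = copy.deepcopy(P), dict(rew)
    N_P[fail] = {"reset": {s_init: 1.0}}   # fail is no longer a trap in N
    N_rew[(fail, "reset")] = 0
    return N_P, N_rew
```

The upper bound CE^ub = E^max_{N,s_init}(⊞goal) can then be obtained with any standard algorithm for maximal expected accumulated rewards, e.g. along the lines of the value iteration sketched in the introduction.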
4 Threshold algorithm and computing optimal schedulers

In what follows, we suppose that M = (S, Act, P, s_init, rew) is an MDP with two trap states goal and fail such that s ⊨ ∃♦goal for all states s ∈ S\{fail}, Pr^min_{M,s}(♦(goal ∨ fail)) = 1 for all s ∈ S, and CE^max = E^max_{M,s_init}(⊞goal | ♦goal) < ∞.

A scheduler S is said to be reward-based if S(π) = S(π') for all finite paths π, π' with (last(π), rew(π)) = (last(π'), rew(π')). Thus, deterministic reward-based schedulers can be seen as functions S : S × N → Act. Prop. D.1 in the appendix shows that CE^max equals the supremum of the values CE^S, when ranging over all deterministic reward-based schedulers S with Pr^S_{M,s_init}(♦goal) > 0.

Our algorithms rest on the following two observations. First, there exists a saturation point ℘ ∈ N such that the optimal decision for all paths π with rew(π) ≥ ℘ is to maximize the probability for reaching the goal state (see Prop. 4.1 below). The second observation is a technical statement that will be used at several places. Let ρ, θ, ζ, r, x, y, z, p ∈ R with 0 ≤ p, x, y, z ≤ 1, p > 0, y > z and x + z > 0, and let

  A = (ρ + p(ry + θ)) / (x + py),   B = (ρ + p(rz + ζ)) / (x + pz)   and   C = max{A, B}.

Then:

  A ≥ B   iff   r + (θ − ζ)/(y − z) ≥ C   iff   θ − (C − r)·y ≥ ζ − (C − r)·z     (†)

and the analogous statement holds for > rather than ≥. This statement is a consequence of Lemma G.1 in the appendix. We will apply this observation in different nuances. To give an idea how to apply statement (†), suppose A = CE^T and B = CE^U where T and U are reward-based schedulers that agree for all paths ϱ that do not have a prefix π with rew(π) = r where last(π) is a non-trap state, in which case x denotes the probability for reaching goal from s_init along such a path ϱ and ρ stands for the corresponding partial expectation, while p denotes the probability of the paths π from s_init to some non-trap state with rew(π) = r. The crucial observation is that r + (θ − ζ)/(y − z) does not depend on x, ρ, p. Thus, if r + (θ − ζ)/(y − z) ≥ CE^ub for some upper bound CE^ub of CE^max, then (†) allows to conclude that T's decisions for the state-reward pairs (s,r) are better than U's, independent of x, ρ and p.
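Statement (†) can be spot-checked mechanically. The following sketch samples admissible rational parameter values and verifies that the three conditions coincide; it is purely illustrative and, of course, no substitute for the proof of Lemma G.1.

```python
# Random spot-check of statement (†): A >= B, r + (theta-zeta)/(y-z) >= C
# and theta - (C-r)*y >= zeta - (C-r)*z agree for admissible parameters.
import random
from fractions import Fraction as F

random.seed(0)

def rnd():
    return F(random.randint(0, 20), 20)        # random rational in [0, 1]

for _ in range(10_000):
    p, x, y, z = rnd(), rnd(), rnd(), rnd()
    rho, theta, zeta, r = 5 * rnd(), 5 * rnd(), 5 * rnd(), 5 * rnd()
    if p == 0 or y <= z or x + z == 0:         # constraints: p > 0, y > z, x+z > 0
        continue
    A = (rho + p * (r * y + theta)) / (x + p * y)
    B = (rho + p * (r * z + zeta)) / (x + p * z)
    C = max(A, B)
    c1 = A >= B
    c2 = r + (theta - zeta) / (y - z) >= C
    c3 = theta - (C - r) * y >= zeta - (C - r) * z
    assert c1 == c2 == c3
```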
Let R ∈ N and let S, T be reward-based schedulers. The residual scheduler S↑R is given by (S↑R)(s,r) = S(s, R+r). S ◁_R T denotes the unique scheduler that agrees with S for all state-reward pairs (s,r) with r < R and satisfies (S ◁_R T)↑R = T. We write E^S_{M,s} for the partial expectation

  E^S_{M,s} = ∑_{r=0}^{∞} Pr^S_{M,s}(♦^{=r} goal)·r

Thus, E^T_{M,s} = E^T_{M,s}(⊞goal) if Pr^T_{M,s}(♦goal) = 1, while E^T_{M,s} < ∞ = E^T_{M,s}(⊞goal) if Pr^T_{M,s}(♦goal) < 1.

Proposition 4.1. There exists a natural number ℘ (called saturation point of M) and a deterministic memoryless scheduler M such that:

(a) CE^T ≤ CE^{T ◁_℘ M} for each scheduler T with Pr^T_{M,s_init}(♦goal) > 0, and
(b) CE^S = CE^max for some deterministic reward-based scheduler S such that Pr^S_{M,s_init}(♦goal) > 0 and S↑℘ = M.

The proof of Prop. 4.1 (see Appendices E and F) is constructive and yields a polynomial-time algorithm for generating a scheduler M as in Prop. 4.1 and a pseudo-polynomial algorithm for the computation of a saturation point ℘. Scheduler M maximizes the probability to reach goal from each state. If there are two or more such schedulers, then M is one where the conditional expected accumulated reward until reaching goal is maximal among all schedulers U with Pr^U_{M,s}(♦goal) = Pr^max_{M,s}(♦goal) for all states s. Such a scheduler M is computable in polynomial time using linear programming techniques. (See Lemma E.14 in the appendix.)

The idea for the computation of the saturation point is to compute the threshold ℘ above which the scheduler M becomes optimal. For this we rely on statement (†), where θ/y stands for the conditional expectation under M, ζ/z for the conditional expectation under an arbitrary scheduler S, and C = CE^ub is an upper bound of CE^max (see Theorem 1), while r = ℘ is the wanted value. More precisely, for s ∈ S, let θ_s = E^M_{M,s} and y_s = Pr^M_{M,s}(♦goal) = Pr^max_{M,s}(♦goal). To compute a saturation point we determine the smallest value ℘ ∈ N such that

  θ_s − (CE^ub − ℘)·y_s = max_S ( E^S_{M,s} − (CE^ub − ℘)·Pr^S_{M,s}(♦goal) )

for all states s, where S ranges over all schedulers for M. In Appendix F we show that instead of the maximum over all schedulers S it suffices to take the local maximum over all "one-step variants" of M. That is, a saturation point is obtained by ℘ = max{⌈CE^ub − D⌉, 0} where

  D = min{ (θ_s − θ_{s,α}) / (y_s − y_{s,α}) : s ∈ S, α ∈ Act(s), y_{s,α} < y_s }

and y_{s,α} = ∑_{t∈S} P(s,α,t)·y_t and θ_{s,α} = rew(s,α)·y_{s,α} + ∑_{t∈S} P(s,α,t)·θ_t.

Example 4.2. The so obtained saturation point for the MDP M[r] in Figure 1 is ℘ = ⌈CE^ub + 1⌉. Note that only state s = s2 behaves nondeterministically, and M(s) = α, y_s = y_{s,α} = 1, θ_s = θ_{s,α} = 0, while y_{s,β} = θ_{s,β} = 1/2. This yields D = (0 − 1/2)/(1 − 1/2) = −1. Thus, ℘ ≥ r+2 as CE^ub ≥ CE^max > r. □

The logarithmic length of ℘ is polynomial in the size of M. Thus, the value (i.e., the length of a unary encoding) of ℘ can be exponential in size(M). This is unavoidable as there are families (M_k)_{k∈N} of MDPs where the size of M_k is in O(k), while 2^k is a lower bound for the smallest saturation point of M_k. This, for instance, applies to the MDPs M_k = M[2^k] where M[r] is as in Figure 1. Recall from Example 1.1 that the scheduler S_{r+2} that selects β for the first r+2 visits of s2 and α for the (r+3)-rd visit of s2 is optimal for M[r]. Hence, the smallest saturation point for M[2^k] is 2^k + 2.

Threshold algorithm. The input of the threshold algorithm is an MDP M as above and a non-negative rational number ϑ. The task is to generate a deterministic reward-based scheduler S with S↑℘ = M (where M and ℘ are as in Prop. 4.1) such that CE^S > ϑ if CE^max > ϑ, and CE^S = ϑ if CE^max = ϑ. If CE^max < ϑ then the output of the threshold algorithm is "no".³

The algorithm operates level-wise and determines feasible actions action(s,r) for all non-trap states s and r = ℘−1, ℘−2, ..., 0, using the decisions action(·,i) for the levels i ∈ {r+1,...,℘} that have been treated before and linear programming techniques to treat zero-reward loops. In this context, feasibility is understood with respect to the following condition: if CE^max ⊵ ϑ, where ⊵ ∈ {>, ≥}, then there exists a reward-based scheduler S with CE^S ⊵ ϑ and S(s,R) = action(s, min{℘,R}) for all R ≥ r.

The algorithm stores for each state-reward pair (s,r) the probability y_{s,r} to reach goal from s and the corresponding partial expectation θ_{s,r} for the scheduler given by the decisions in the action table. The values for r = ℘ are given by action(s,℘) = M(s), y_{s,℘} = Pr^M_{M,s}(♦goal) and θ_{s,℘} = E^M_{M,s}. The candidates for the decisions at level r < ℘ are given by the deterministic memoryless schedulers P for M. We write P₊ for the reward-based scheduler given by P₊(s,0) = P(s) and P₊(s,i) = action(s, min{℘, r+i}) for i ≥ 1. Let y_{s,r,P} = Pr^{P₊}_{M,s}(♦goal) and let θ_{s,r,P} = E^{P₊}_{M,s} be the corresponding partial expectation.

³ The threshold algorithm solves all four variants of the threshold problem. E.g., CE^max ≤ ϑ iff CE^S = ϑ, while CE^max < ϑ iff the threshold algorithm returns "no".
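The saturation-point formula above translates directly into code. The sketch below computes D and ℘ from given values y_s (maximal reachability probabilities) and θ_s (partial expectations of the scheduler M); these inputs, and the hypothetical upper bound CE^ub = 7 used here for M[5], are assumptions of the illustration and would be obtained elsewhere (e.g. via the linear-programming characterization mentioned above).

```python
# Saturation point p = max{ceil(CE^ub - D), 0} with
# D = min{(theta_s - theta_{s,a}) / (y_s - y_{s,a}) : y_{s,a} < y_s}.
from fractions import Fraction as F
from math import ceil, inf

def saturation_point(P, rew, y, theta, ce_ub):
    D = inf
    for s, acts in P.items():
        for a, dist in acts.items():
            y_sa = sum(p * y[t] for t, p in dist.items())
            theta_sa = rew[(s, a)] * y_sa + sum(p * theta[t] for t, p in dist.items())
            if y_sa < y[s]:
                D = min(D, (theta[s] - theta_sa) / (y[s] - y_sa))
    return 0 if D == inf else max(ceil(ce_ub - D), 0)   # empty minimum: 0 suffices

# M[5] from Fig. 1: under the maximising scheduler M (alpha in s2),
# y_s = 1 for all states except fail, and theta = 5/2, 5, 0 on s0, s1, s2.
P = {"s0": {"alpha": {"s1": F(1, 2), "s2": F(1, 2)}},
     "s1": {"gamma": {"goal": F(1)}},
     "s2": {"alpha": {"goal": F(1)}, "beta": {"s2": F(1, 2), "fail": F(1, 2)}},
     "goal": {}, "fail": {}}
rew = {("s0", "alpha"): 0, ("s1", "gamma"): 5, ("s2", "alpha"): 0, ("s2", "beta"): 1}
y = {"s0": F(1), "s1": F(1), "s2": F(1), "goal": F(1), "fail": F(0)}
theta = {"s0": F(5, 2), "s1": F(5), "s2": F(0), "goal": F(0), "fail": F(0)}
print(saturation_point(P, rew, y, theta, ce_ub=7))   # ceil(7 + 1) = 8, since D = -1 (cf. Example 4.2)
```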
