Channels with cost constraints: strong converse and dispersion

Victoria Kostina
California Institute of Technology
Pasadena, CA 91125
[email protected]

Sergio Verdú
Princeton University
Princeton, NJ 08544
[email protected]

Abstract—This paper shows the strong converse and the dispersion of memoryless channels with cost constraints and performs a refined analysis of the third-order term in the asymptotic expansion of the maximum achievable channel coding rate, showing that it is equal to (1/2) log n in most cases of interest. The analysis is based on a non-asymptotic converse bound expressed in terms of the distribution of a random variable termed the b-tilted information density, which plays a role similar to that of the d-tilted information in lossy source coding. We also analyze the fundamental limits of lossy joint source-channel coding over channels with cost constraints.

Index Terms—Converse, finite blocklength regime, channels with cost constraints, joint source-channel coding, strong converse, dispersion, memoryless sources, memoryless channels, Shannon theory.

(This work was supported in part by the National Science Foundation (NSF) under Grant CCF-1016625 and by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under Grant CCF-0939370.)

I. INTRODUCTION

This paper is concerned with the maximum channel coding rate achievable at average error probability ǫ > 0 where the cost of each codeword is constrained. The capacity-cost function C(β) of a channel specifies the maximum achievable channel coding rate compatible with vanishing error probability and with codeword cost not exceeding β in the limit of large blocklengths.

A channel is said to satisfy the strong converse if ǫ → 1 as n → ∞ for any code operating at a rate above the capacity. For memoryless channels without cost constraints, the strong converse was first shown by Wolfowitz: [1] treats the discrete memoryless channel (DMC), while [2] generalizes the result to memoryless channels whose input alphabet is finite while the output alphabet is the real line. Arimoto [3] showed a new converse bound stated in terms of Gallager's random coding exponent, which also leads to the strong converse for the DMC. Dueck and Körner [4] found the reliability function of the DMC for rates above capacity, a result which implies a strong converse. Kemperman [5] showed that the strong converse holds for a DMC with feedback. A simple proof of the strong converse for memoryless channels that does not invoke measure concentration inequalities was recently given in [6]. For a class of discrete channels with finite memory, the strong converse was shown by Wolfowitz [7] and independently by Feinstein [8], a result soon generalized to a more general class of stationary discrete channels with finite memory [9]. In a more general setting not requiring the assumption of stationarity or finite memory, Verdú and Han [10] showed a necessary and sufficient condition for a channel without cost constraints to satisfy the strong converse, while Han [11, Theorem 3.7.1] generalized that condition to the setting with cost constraints. In the special case of finite-input channels, that necessary and sufficient condition boils down to the capacity being equal to the limit of maximal normalized mutual informations. In turn, that condition is implied by the information stability of the channel [12], a condition which in general is not easy to verify. Using a novel notion of strong information stability, a general strong converse result was recently shown in [13, Theorem 3]. The strong converse for the DMC with separable cost was shown by Csiszár and Körner [14, Theorem 6.11] and by Han [11, Theorem 3.7.2]. Regarding continuous channels, in the most basic case of the memoryless additive white Gaussian noise (AWGN) channel with the cost function being the power of the channel input block, b_n(x^n) = (1/n)|x^n|^2, the strong converse was shown by Shannon [15] (contemporaneously with Wolfowitz's finite-alphabet strong converse). Yoshihara [16] proved the strong converse for the time-continuous channel with additive Gaussian noise having an arbitrary spectrum and also gave a simple proof of Shannon's strong converse result. Under the requirement that the power of each message converges stochastically to a given constant β, the strong converse for the AWGN channel with feedback was shown by Wolfowitz [17].
Note that in all those analyses of the power-constrained AWGN channel the cost constraint is imposed on a per-codeword basis. In fact, the strong converse ceases to hold if the cost constraint is averaged over the codebook [18, Section 4.3.3].

Channel dispersion quantifies the backoff from capacity, unescapable at finite blocklengths due to the random nature of the channel coming into play, as opposed to the asymptotic representation of the channel as a deterministic bit pipe of a given capacity. More specifically, for coding over the DMC, the maximum achievable code rate at blocklength n compatible with error probability ǫ is approximated by C − sqrt(V/n) Q^{-1}(ǫ) [19], [20], where C is the channel capacity, V is the channel dispersion, and Q^{-1}(·) is the inverse of the Gaussian complementary cdf. Polyanskiy et al. [20] found the dispersion of the DMC without cost constraints as well as that of the AWGN channel with a power constraint. In parallel, Hayashi [21, Theorem 3] gave the dispersion of the DMC with and without cost constraints (with the loose estimate of o(sqrt(n)) for the third-order term). Polyanskiy [18, Sec. 3.4.6] showed the dispersion of constant composition codes over the DMC, while Moulin [22] refined the third-order term in the expansion of the maximum achievable code rate, under regularity conditions. Wang et al. [23] gave a second-order analysis of joint source-channel coding over finite alphabets based on constant composition codebooks.

In this paper, we demonstrate that the non-asymptotic fundamental limit for coding over channels with cost constraints is closely approximated in terms of the cdf of a random variable we refer to as the b-tilted information density, which parallels the notion of d-tilted information for lossy compression [24]. We show a simple non-asymptotic converse bound for general channels with input cost constraints in terms of the b-tilted information density. Not only does this bound lead to a general strong converse result, but it is also tight enough to find the channel dispersion-cost function and the third-order term equal to (1/2) log n when coupled with the corresponding achievability bound. More specifically, we show that for the DMC, log M*(n, ǫ, β), the logarithm of the maximum achievable code size at blocklength n, error probability ǫ and cost β, is given by, under mild regularity assumptions,

  log M*(n, ǫ, β) = nC(β) − sqrt(n V(β)) Q^{-1}(ǫ) + (1/2) log n + O(1),   (1)

where V(β) is the dispersion-cost function, thereby refining Hayashi's result [21] and providing a matching converse to the result of Moulin [22]. We observe that the capacity-cost and the dispersion-cost functions are given by the mean and the variance of the b-tilted information density. This novel interpretation juxtaposes nicely with the corresponding results in [24] (d-tilted information in rate-distortion theory). Furthermore, we generalize (1) to lossy joint source-channel coding of general memoryless sources over channels with cost.

Section II introduces the b-tilted information density. Section III states the new non-asymptotic converse bound, which holds for a general channel with cost constraints, without making any assumptions on the channel (e.g. alphabets, stationarity, memorylessness). An asymptotic analysis of the converse and achievability bounds, including the proof of the strong converse and the expression for the channel dispersion-cost function, is presented in Section IV in the context of memoryless channels. Section V generalizes the results in Sections III and IV to the lossy joint source-channel coding setup.

II. b-TILTED INFORMATION DENSITY

In this section, we introduce the concept of b-tilted information density and several relevant properties in a general single-shot approach.

Fix the transition probability kernel P_{Y|X}: X → Y and the cost function b: X → [0, ∞]. In the application of this single-shot approach in Section IV, X, Y, P_{Y|X} and b will become A^n, B^n, P_{Y^n|X^n} and b_n, respectively. Denote

  C(β) = sup_{P_X: E[b(X)] ≤ β} I(X; Y),   (2)
  λ* = C'(β).   (3)

Since C(β) is a non-decreasing concave function of β [14, Theorem 6.11], λ* ≥ 0.
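To make (2)–(3) concrete, here is a minimal numerical sketch (not from the paper) that evaluates the capacity-cost function of a small binary-input channel by brute-force search over input distributions and estimates λ* = C'(β) by a finite difference; the channel matrix and the cost values b(0) = 0, b(1) = 1 are illustrative assumptions.

```python
import numpy as np

def mutual_information(p_x, W):
    """I(X;Y) in bits for input distribution p_x and channel matrix W[x, y]."""
    p_xy = p_x[:, None] * W                      # joint distribution
    p_y = p_xy.sum(axis=0)                       # output marginal
    mask = p_xy > 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (p_x[:, None] * p_y)[mask])).sum())

def capacity_cost(W, b, beta, grid=10001):
    """C(beta) = sup_{P_X: E[b(X)] <= beta} I(X;Y), by search over P_X(1)."""
    best = 0.0
    for p1 in np.linspace(0.0, 1.0, grid):
        p_x = np.array([1.0 - p1, p1])
        if p_x @ b <= beta:                      # cost constraint E[b(X)] <= beta
            best = max(best, mutual_information(p_x, W))
    return best

# Hypothetical example: asymmetric binary channel, costly input symbol '1'.
W = np.array([[0.95, 0.05],                      # P_{Y|X=0}
              [0.10, 0.90]])                     # P_{Y|X=1}
b = np.array([0.0, 1.0])

beta = 0.3
C = capacity_cost(W, b, beta)
eps = 1e-3                                       # lambda* = C'(beta) by finite difference
lam = (capacity_cost(W, b, beta + eps) - capacity_cost(W, b, beta - eps)) / (2 * eps)
print(f"C({beta}) ≈ {C:.4f} bits/use, lambda* ≈ {lam:.3f} bits per unit cost")
```

For cost levels below the unconstrained optimizer the constraint E[b(X)] ≤ β is active and λ* > 0; once β exceeds it, λ* drops to zero, consistent with the concavity of C(β).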
For random variables Y and Ȳ defined on the same space, denote

  ı_{Y‖Ȳ}(y) = log (dP_Y / dP_Ȳ)(y).   (4)

If Y is distributed according to P_{Y|X=x}, we abbreviate the notation as

  ı_{X;Ȳ}(x; y) = log (dP_{Y|X=x} / dP_Ȳ)(y)   (5)

in lieu of ı_{Y|X=x‖Ȳ}(y). The information density ı_{X;Y}(x; y) between realizations of two random variables with joint distribution P_X P_{Y|X} follows by particularizing (5) to {P_X, P_{Y|X}}, where P_X → P_{Y|X} → P_Y.¹ In general, however, the function in (5) does not require P_Ȳ to be induced by any input distribution.

¹ We write P_X → P_{Y|X} → P_Y to indicate that P_Y is the marginal of P_X P_{Y|X}, i.e. P_Y(y) = ∫_X dP_{Y|X}(y|x) dP_X(x).

Further, define the function

  ȷ_{X;Ȳ}(x; y, β) = ı_{X;Ȳ}(x; y) − λ*(b(x) − β).   (6)

The special case of (6) with P_Ȳ = P_{Y*}, where P_{Y*} is the unique output distribution that achieves the supremum in (2) [25], defines the b-tilted information density:

Definition 1 (b-tilted information density). The b-tilted information density between x ∈ X and y ∈ Y is ȷ_{X;Y*}(x; y, β).

Since P_{Y*} is unique even if there are several (or none) input distributions P_{X*} that achieve the supremum in (2), there is no ambiguity in Definition 1. If there are no cost constraints (i.e. b(x) = 0 ∀x ∈ X), then C'(β) = 0 regardless of β, and

  ȷ_{X;Ȳ}(x; y, β) = ı_{X;Ȳ}(x; y).   (7)

The counterpart of the b-tilted information density in rate-distortion theory is the d-tilted information [24].

Example 1. For n uses of a memoryless AWGN channel with unit noise power and maximal power not exceeding nP, C(P) = (n/2) log(1 + P), and the output distribution that achieves (2) is Y^{n*} ~ N(0, (1 + P) I). Therefore

  ȷ_{X^n;Y^{n*}}(x^n; y^n, P) = (n/2) log(1 + P) − (log e / 2) |y^n − x^n|^2 + (log e / (2(1 + P))) ( |y^n|^2 − |x^n|^2 + nP ),   (8)

where the Euclidean norm is denoted by |x^n|^2 = Σ_{i=1}^n x_i^2. It is easy to check that under P_{Y^n|X^n=x^n}, the distribution of ȷ_{X^n;Y^{n*}}(x^n; Y^n, P) is the same as that of (by '∼' we mean equality in distribution)

  (n/2) log(1 + P) − (P log e / (2(1 + P))) [ W^{(n)}_{|x^n|^2/P^2} − n − |x^n|^2/P^2 ],   (9)

where W^{(ℓ)}_λ denotes a non-central chi-square distributed random variable with ℓ degrees of freedom and non-centrality parameter λ. The mean of (9) is (n/2) log(1 + P), in accordance with (16), while its variance is ((nP^2 + 2|x^n|^2) / (2(1 + P)^2)) log^2 e, which becomes nV(P) (found in [20] and displayed in (46)) after averaging with respect to X^{n*} distributed according to P_{X^{n*}} ~ N(0, P I).
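Example 1 is easy to verify by simulation. The following sketch (not part of the paper) samples Y^n from P_{Y^n|X^n=x^n} for an arbitrary x^n on the power sphere and checks that the empirical mean and variance of ȷ_{X^n;Y^{n*}}(x^n; Y^n, P) computed from (8) match (n/2) log(1+P) and (nP² + 2|x^n|²)/(2(1+P)²) log²e as stated after (9); blocklength, SNR and sample size are arbitrary choices, and everything is in nats (log e = 1).

```python
import numpy as np

rng = np.random.default_rng(0)
n, P, trials = 200, 1.5, 20000            # arbitrary blocklength, SNR, sample size
x = rng.normal(size=n)
x *= np.sqrt(n * P) / np.linalg.norm(x)   # put x^n on the power sphere, |x|^2 = nP

def b_tilted_info(x, y, P):
    """j_{X^n;Y^n*}(x^n; y^n, P) of (8), in nats (log e = 1)."""
    n = len(x)
    return ((n / 2.0) * np.log(1.0 + P)
            - 0.5 * np.sum((y - x) ** 2)
            + (np.sum(y ** 2) - np.sum(x ** 2) + n * P) / (2.0 * (1.0 + P)))

samples = np.array([b_tilted_info(x, x + rng.normal(size=n), P) for _ in range(trials)])
print("empirical mean              :", samples.mean())
print("(n/2) log(1+P)              :", (n / 2.0) * np.log(1.0 + P))
print("empirical variance          :", samples.var())
print("(n P^2 + 2|x|^2)/(2(1+P)^2) :", (n * P ** 2 + 2 * n * P) / (2.0 * (1.0 + P) ** 2))
```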
Denote

  β_min = inf_{x ∈ X} b(x),   (10)
  β_max = sup { β ≥ 0: C(β) < C(∞) }.²   (11)

² We allow β_max = +∞.

Theorem 1 below highlights the importance of the b-tilted information density in the optimization problem (2). Of key significance in the asymptotic analysis in Section IV, Theorem 1 gives a nontrivial generalization of the well-known properties of information density to the setting with cost constraints.

Theorem 1. Fix β_min < β < β_max. Assume that P_{X*} achieving (2) is such that the constraint is achieved with equality:

  E[b(X*)] = β.   (12)

Then, the following equalities hold:

  C(β) = sup_{P_X} E[ ȷ_{X;Y}(X; Y, β) ]   (13)
       = sup_{P_X} E[ ȷ_{X;Y*}(X; Y, β) ]   (14)
       = E[ ȷ_{X;Y*}(X*; Y*, β) ]   (15)
       = E[ ȷ_{X;Y*}(X*; Y*, β) | X* ],   (16)

where (16) holds P_{X*}-a.s., and P_X → P_{Y|X} → P_Y, P_{X*} → P_{Y|X} → P_{Y*}.

Proof: Appendix A.

Throughout the paper, we assume that the assumptions of Theorem 1 hold.

For channels without cost, the inequality

  D(P_{Y|X=x} ‖ P_{Y*}) ≤ C   ∀x ∈ X   (17)

is key to proving strong converses. Theorem 1 generalizes this result to channels with cost, showing that

  E[ ȷ_{X;Y*}(x; Y, β) | X = x ] ≤ C(β)   ∀x ∈ X.   (18)

Note that (18) is crucial for showing both the strong converse and the refined asymptotic analysis.

Remark 1. The general strong converse result in [13, Theorem 3] includes channels with cost using the concept of 'quasi-caod', which is defined as any output distribution P_{Y^n} such that

  D(P_{Y^n|X^n=x^n} ‖ P_{Y^n}) ≤ I_n* + o(I_n*)   ∀x^n ∈ A^n: b_n(x^n) ≤ β,   (19)

where A is the single-letter channel input alphabet, and I_n* = max_{P_{X^n}: b_n(X^n) ≤ β a.s.} I(X^n; Y^n). Since C(β) = lim_{n→∞} (1/n) I_n*, (18) implies that P_{Y*} × … × P_{Y*} is a quasi-caod.

Corollary 2. For all P_X ≪ P_{X*},

  Var[ ȷ_{X;Y*}(X; Y, β) ] = E[ Var[ ȷ_{X;Y*}(X; Y, β) | X ] ]   (20)
                           = E[ Var[ ı_{X;Y*}(X; Y) | X ] ].   (21)

Proof: Appendix B.

III. NONASYMPTOTIC BOUNDS

Converse and achievability bounds give necessary and sufficient conditions, respectively, on (M, ǫ, β) in order for a code to exist with M codewords, average error probability not exceeding ǫ and cost not exceeding β. Such codes (allowing stochastic encoders and decoders) are rigorously defined next.

Definition 2 ((M, ǫ, β) code). An (M, ǫ, β) code for {P_{Y|X}, b} is a pair of random transformations P_{X|S} (encoder) and P_{Z|Y} (decoder) such that P[S ≠ Z] ≤ ǫ, where S is equiprobable on an alphabet of cardinality M, and the codewords satisfy the maximal cost constraint (a.s.)

  b(X) ≤ β.   (22)

The non-asymptotic quantity of principal interest is M*(ǫ, β), the maximum code size achievable at error probability ǫ and cost β.

Theorem 3 (Converse). The existence of an (M, ǫ, β) code for {P_{Y|X}, b} requires that

  ǫ ≥ max_{γ>0} { sup_{Ȳ} inf_{x: b(x) ≤ β} P[ ı_{X;Ȳ}(x; Y) ≤ log M − γ | X = x ] − exp(−γ) }   (23)
    ≥ max_{γ>0} { sup_{Ȳ} inf_{x ∈ X} P[ ȷ_{X;Ȳ}(x; Y, β) ≤ log M − γ | X = x ] − exp(−γ) }.   (24)

Proof: The bound in (23) is due to Wolfowitz [26]. The bound in (24) simply weakens (23) using b(x) ≤ β.
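To illustrate how bounds of this type can be computed (a sketch, not part of the paper): for the AWGN channel of Example 1, the conditional distribution of ȷ_{X^n;Y^{n*}}(x^n; Y^n, P) depends on x^n only through s = |x^n|², via the noncentral chi-square representation (9). The code below evaluates the converse implied by (23) with P_{Ȳ^n} = P_{Y^{n*}}: since λ* ≥ 0 and b_n(x^n) ≤ P on the constraint set, ȷ ≥ ı there, so replacing ı by ȷ in (23) only lowers the right side, and the resulting infimum is a one-dimensional minimization over s ∈ [0, nP]. Blocklength, SNR, rate and the γ grid are arbitrary choices.

```python
import numpy as np
from scipy.stats import ncx2

n, P = 200, 1.0
C = 0.5 * np.log(1.0 + P)                  # C(P) per channel use, nats
logM = n * (C + 0.05)                      # a rate slightly above the capacity-cost function
c = P / (2.0 * (1.0 + P))                  # scale factor in (9), nats

def prob_below(s, t):
    """P[ j_{X^n;Y^n*}(x^n; Y^n, P) <= t ] when |x^n|^2 = s, via the representation (9)."""
    nc = max(s / P ** 2, 1e-12)            # noncentrality parameter |x^n|^2 / P^2
    w_thresh = n + nc + (n * C - t) / c    # j <= t  is equivalent to  W >= w_thresh
    return ncx2.sf(w_thresh, df=n, nc=nc)

eps_lower = 0.0
for gamma in np.linspace(1.0, 50.0, 200):                  # maximize over gamma > 0
    inner = min(prob_below(s, logM - gamma)                 # infimum over |x^n|^2 in [0, nP]
                for s in np.linspace(0.0, n * P, 101))
    eps_lower = max(eps_lower, inner - np.exp(-gamma))
print(f"log M = {logM:.1f} nats over {n} uses: any such code has eps >= {eps_lower:.3f}")
```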
By restricting the channel input space appropriately, converse bounds for channels with cost constraints can be obtained from the converse bounds in [20], [27]. Their analysis becomes tractable by the introduction of the b-tilted information density in (24) and an application of (18).

Achievability bounds for channels with cost constraints can be obtained from the random coding bounds in [20], [27] by restricting the distribution from which the codewords are drawn to satisfy b(X) ≤ β a.s. In particular, for the DMC, we may choose P_{X^n} to be equiprobable on the set of codewords of the type closest (among types satisfying the cost constraint) to the input distribution P_{X*} that achieves the capacity-cost function. As shown in [21], such constant composition codes achieve the dispersion of channel coding under input cost constraints. Unfortunately, the computation of such bounds may become challenging in high dimension, particularly with continuous alphabets.

IV. ASYMPTOTIC ANALYSIS

To introduce the blocklength into the non-asymptotic converse of Section III, we consider (M, ǫ, β) codes for {P_{Y^n|X^n}, b_n}, where P_{Y^n|X^n}: A^n → B^n and b_n: A^n → [0, ∞]. We call such codes (n, M, ǫ, β) codes, and denote the corresponding non-asymptotically achievable maximum code size by M*(n, ǫ, β).

A. Assumptions

The following basic assumptions hold throughout Section IV.
(i) The channel is stationary and memoryless, P_{Y^n|X^n} = P_{Y|X} × … × P_{Y|X}.
(ii) The cost function is separable, b_n(x^n) = (1/n) Σ_{i=1}^n b(x_i), where b: A → [0, ∞].
(iii) Each codeword is constrained to satisfy the maximal cost constraint, b_n(x^n) ≤ β.
(iv) sup_{x ∈ A} Var[ ȷ_{X;Y*}(x; Y, β) | X = x ] = V_max < ∞.

Under these assumptions, the capacity-cost function is given by

  C(β) = sup_{P_X: E[b(X)] ≤ β} I(X; Y).   (25)

Observe that in view of assumptions (i) and (ii), as long as P_{Ȳ^n} is a product distribution, P_{Ȳ^n} = P_Ȳ × … × P_Ȳ,

  ȷ_{X^n;Ȳ^n}(x^n; y^n, β) = Σ_{i=1}^n ȷ_{X;Ȳ}(x_i; y_i, β).   (26)

B. Strong converse

Although the tools developed in Sections II and III are able to result in a strong converse for channels that exhibit ergodic behavior (see also Remark 1), for the sake of concreteness and length, we only deal here with the memoryless setup described in Section IV-A.

We show that if transmission occurs at a rate greater than the capacity-cost function, the error probability must converge to 1, regardless of the specifics of the code. Towards this end, we fix some α > 0, we choose log M ≥ nC(β) + 2nα, and we weaken the bound (24) in Theorem 3 by fixing γ = nα and P_{Ȳ^n} = P_{Y*} × … × P_{Y*}, where Y* is the output distribution that achieves C(β), to obtain
  ǫ ≥ inf_{x^n ∈ A^n} P[ Σ_{i=1}^n ȷ_{X;Y*}(x_i; Y_i, β) ≤ nC(β) + nα ] − exp(−nα)   (27)
    ≥ inf_{x^n ∈ A^n} P[ Σ_{i=1}^n ȷ_{X;Y*}(x_i; Y_i, β) ≤ Σ_{i=1}^n c(x_i) + nα ] − exp(−nα),   (28)

where for notational convenience we have abbreviated

  c(x) = E[ ȷ_{X;Y*}(x; Y, β) | X = x ],   (29)

and (28) employs (14).

To show that the right side of (28) converges to 1, we invoke the following law of large numbers for non-identically distributed random variables.

Lemma 1 (e.g. [28]). Suppose that the W_i are uncorrelated and Σ_{i=1}^∞ Var[W_i / c_i] < ∞ for some strictly positive sequence (c_n) increasing to +∞. Then,

  (1/c_n) ( Σ_{i=1}^n W_i − E[ Σ_{i=1}^n W_i ] ) → 0 in L².   (30)

Let W_i = ȷ_{X;Y*}(x_i; Y_i, β) and c_i = i. Since (recall (iv))

  Σ_{i=1}^∞ Var[ (1/i) ȷ_{X;Y*}(x_i; Y_i, β) | X_i = x_i ] ≤ V_max Σ_{i=1}^∞ 1/i²   (31)
    < ∞,   (32)

by virtue of Lemma 1, the right side of (28) converges to 1, so any channel satisfying (i)–(iv) also satisfies the strong converse.

As noted in [18, Theorem 77] in the context of the AWGN channel, the strong converse does not hold if the cost constraint is averaged over the codebook, i.e. if, in lieu of (22), the cost requirement is

  (1/M) Σ_{m=1}^M E[ b(X) | S = m ] ≤ β.   (33)

To see why the strong converse does not hold in general, fix a code of rate C(β) < R < C(2β) none of whose codewords cost more than 2β and whose error probability satisfies ǫ_n → 0. Since R < C(2β), such a code exists. Now, replace half of the codewords with the all-zero codeword (assuming b(0) = 0) while leaving the decision regions of the remaining codewords untouched. The average cost of the new code satisfies (33), its rate is greater than the capacity-cost function, R > C(β), yet its average error probability does not exceed ǫ_n + 1/2 → 1/2.

C. Dispersion

First, we give the operational definition of the dispersion-cost function of any channel.
We note bounded by ǫ for the choice of M in (35). that the AWGN channel satisfies the conditions of Remark 3 Achievability. Full details are given in Appendix E, which with PXn uniform on the power sphere and fn =O(1) [20]. provides an asymptotic analysis of the Dependence Testing Remark 5. As we show in Appendix G, a stationary memo- bound of [20] in which the random codewords are of type ryless channelwith b(x)=x which takes a nonnegativeinput closesttoPX⋆,ratherthandrawnfromtheproductdistribution andaddsanexponentialnoiseofunitmeantoit[32],satisfies PX×...×PX, as in achievability proofs for channel coding the conditions of Remark 3 with fn =O(1), and without cost constraints. We use Corollary 2 to establish that β such constant composition codes achieve the dispersion-cost (x;y,β)=log(1+β)+ (x y+1)loge, (47) X;Y⋆ 1+β − function. C(β)=log(1+β), (48) Remark 2. According to a recent result of Moulin [22], β2 the achievability bound on the remainder term in (38) can V(β)= log2e. (49) be tightened to match the converse bound in (39), thereby (1+β)2 establishing that Remark 6. As should be clear from the proof of Theorem 4, if the capacity-achieving distribution is not unique, then 1 θ(n)= logn+O(1), (41) 2 minVar[ (X⋆;Y⋆,β)] 0<ǫ 1 V(β)= X;Y⋆ ≤ 2 (50) provided that the following regularity assumptions hold: (maxVar[X;Y⋆(X⋆;Y⋆,β)] 21 <ǫ<1 The random variable ı (X⋆;Y⋆) is of nonlattice type; • supp(P )= ; X;Y⋆ wherethe optimizationis performedoverall PX⋆ thatachieve X⋆ • Cov ı (X⋆A;Y⋆),ı (X¯⋆;Y⋆) < C(β). Thisparallelsthedispersionresultforchannelswithout • X;Y⋆ X;Y⋆ cost [20]. Var[ı (X⋆;Y⋆)] where X;Y⋆ (cid:2) (cid:3) PX¯⋆X⋆Y⋆(¯x,x,y)= PY⋆1(y)PX⋆(¯x)PY|X(y|¯x)PY|X(y|x)PX⋆(x). 3Fortheconverse result,assumptions (a)–(c)suffice. 6 V. JOINT SOURCE-CHANNEL CODING Theorem 6 (Gaussian approximation). Assume the channel In this section we state the counterpartsof Theorems3 and has finite input and output alphabets. For stationary memo- 4 in the lossy joint source-channel coding setting. Proofs of ryless sources satisfying the regularity assumptions (i)–(iv) of the results in this section are obtained by fusing the proofs in [27] and channels satisfying assumptions (ii)–(iv) of Section Sections III and IV and those in [27]. IV-A, the parameters of the optimal (k,n,d,ǫ) code satisfy In the joint source-channel coding setup the source is no nC(β) kR(d)= nV(β)+k (d)Q−1(ǫ)+θ(n), (55) longer equiprobable on an alphabet of cardinality M, as in − V Definition 2, rather it is arbitrarily distributed on an abstract where (d) = Var[pS(S,d)], V(β) is given in (37), and the V alphabet .Further,insteadofreproducingthetransmittedS remainder θ(n) satisfies, if V(β)>0, M under a probability of error criterion, we might be interested 1 logn+O logn θ(n) (56) in approximating S within a certain distortion, so that a − 2 ≤ decoding failure occurs if the distortion between the source 1 (cid:16)p (cid:17) θ¯(n)+ supp(P ) 1 logn, (57) and its reproduction exceeds a given distortion level d, i.e. ≤ 2| X⋆ |− (cid:18) (cid:19) if d(S,Z) > d, where Z ∈ M is the representation of S, where θ¯(n) = O(logn) denotes the upper bound on the re- is a reproduction alphabet, and d: R is M M × M 7→ + maindertermgivenin[27,Theorem10].IfV(β)= (d)=0, the distortion measure. A (d,ǫ,cβ) code is a code for a fixed V the upper bound on θ(n) stays the same, and the lower one scource-channel pair such that the probability cof exceeding distortiondisnolargerthanǫ andnochannelcodewordcosts becomes O n31 . more than β. 
V. JOINT SOURCE-CHANNEL CODING

In this section we state the counterparts of Theorems 3 and 4 in the lossy joint source-channel coding setting. Proofs of the results in this section are obtained by fusing the proofs in Sections III and IV and those in [27].

In the joint source-channel coding setup the source is no longer equiprobable on an alphabet of cardinality M, as in Definition 2; rather, it is arbitrarily distributed on an abstract alphabet M. Further, instead of reproducing the transmitted S under a probability of error criterion, we might be interested in approximating S within a certain distortion, so that a decoding failure occurs if the distortion between the source and its reproduction exceeds a given distortion level d, i.e. if d(S, Z) > d, where Z ∈ M̂ is the representation of S, M̂ is a reproduction alphabet, and d: M × M̂ → R₊ is the distortion measure. A (d, ǫ, β) code is a code for a fixed source-channel pair such that the probability of exceeding distortion d is no larger than ǫ and no channel codeword costs more than β. A (d, ǫ, β) code in a block coding setting, when a source block of length k is mapped to a channel block of length n, is called a (k, n, d, ǫ, β) code.

The counterpart of the b-tilted information density in lossy compression is the d-tilted information ȷ_S(s, d), which can be computed using the equality

  ȷ_S(s, d) = ı_{Z*;S}(z; s) + λ_S d(s, z) − λ_S d,   (51)

where Z* is the random variable that achieves the infimum on the right side of

  R_S(d) = min_{P_{Z|S}: E[d(S,Z)] ≤ d} I(S; Z),   (52)

λ_S = −R_S'(d) > 0, and equality in (51) holds for P_{Z*}-a.e. z [24]. In a certain sense, the d-tilted information quantifies the number of bits required to reproduce the source outcome s ∈ M within distortion d. For rigorous definitions and further details we refer the reader to [27].

Theorem 5 (Converse). The existence of a (d, ǫ, β) code for S and P_{Y|X} requires that

  ǫ ≥ inf_{P_{X|S}} max_{γ>0} { sup_{Ȳ} P[ ȷ_S(S, d) − ȷ_{X;Ȳ}(X; Y, β) ≥ γ ] − exp(−γ) }   (53)
    ≥ max_{γ>0} { sup_{Ȳ} E[ inf_{x ∈ X} P[ ȷ_S(S, d) − ȷ_{X;Ȳ}(x; Y, β) ≥ γ | S ] ] − exp(−γ) },   (54)

where the probabilities in (53) and (54) are with respect to P_S P_{X|S} P_{Y|X} and P_{Y|X=x}, respectively.

Proof: The bound is obtained by weakening [27, Theorem 1] in the same way that (24) weakens (23), using b(x) ≤ β.

Under the usual memorylessness assumptions, applying Theorem 1 to the bound in (54), it is easy to show that the strong converse holds for lossy joint source-channel coding over channels with input cost constraints. A more refined analysis leads to the following result.

Theorem 6 (Gaussian approximation). Assume the channel has finite input and output alphabets. For stationary memoryless sources satisfying the regularity assumptions (i)–(iv) of [27] and channels satisfying assumptions (ii)–(iv) of Section IV-A, the parameters of the optimal (k, n, d, ǫ, β) code satisfy

  nC(β) − kR(d) = sqrt( n V(β) + k V(d) ) Q^{-1}(ǫ) + θ(n),   (55)

where V(d) = Var[ ȷ_S(S, d) ], V(β) is given in (37), and the remainder θ(n) satisfies, if V(β) > 0,

  −(1/2) log n + O( sqrt(log n) ) ≤ θ(n)   (56)
    ≤ θ̄(n) + (1/2) ( |supp(P_{X*})| − 1 ) log n,   (57)

where θ̄(n) = O(log n) denotes the upper bound on the remainder term given in [27, Theorem 10]. If V(β) = V(d) = 0, the upper bound on θ(n) stays the same, and the lower one becomes O(n^{1/3}).

Proof outline: The achievability part is proven by joining the asymptotic analyses of [27, Theorem 8] and of Theorem 9, shown in Appendix E. For the converse part, P_{Ȳ} is chosen as in (146), and, similarly to the proof of the converse part of [27, Theorem 10], a typical set of source outcomes is identified, and it is shown using Theorem 7.2 that for every source outcome in that set the inner infimum in (54) is approximately achieved by the capacity-achieving channel input type.

VI. CONCLUSION

We introduced the concept of b-tilted information density (Definition 1), a random variable whose distribution governs the analysis of optimal channel coding under input cost constraints. The properties of the b-tilted information density listed in Theorem 1 play a key role in the asymptotic analysis of the converse bound of Theorem 3 carried out in Section IV, which not only leads to the strong converse and the dispersion-cost function when coupled with the corresponding achievability bound, but also proves that the third-order term in the asymptotic expansion (1) is upper bounded (in the most common case of V(β) > 0) by (1/2) log n + O(1). In addition, we showed in Section V that the results of [27] generalize to coding over channels with cost constraints and also tightened the estimate of the third-order term in [27]. As propounded in [29], [30], the gateway to the refined analysis of the third-order term is an apt choice of a non-product distribution P_{Ȳ^n} in the bounds in Theorems 3 and 5.

VII. ACKNOWLEDGEMENT

We thank the referees for their unusually thorough reviews, which are reflected in the final version.

APPENDIX A
PROOF OF THEOREM 1

We note first two auxiliary results.

Lemma 2 ([33]). Let 0 ≤ α ≤ 1, and let P ≪ Q be distributions on the same probability space. Then,

  lim_{α→0} (1/α) D( αP + (1−α)Q ‖ Q ) = 0.   (58)
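Lemma 2 is easy to check numerically; the sketch below (not part of the paper, with arbitrary P ≪ Q on a three-letter alphabet) shows D(αP + (1−α)Q ‖ Q)/α shrinking as α → 0.

```python
import numpy as np

def kl(p, q):
    """D(p||q) in nats for discrete distributions with p << q."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.2, 0.3, 0.5])            # full support, so P << Q as required by Lemma 2

for alpha in (1e-1, 1e-2, 1e-3, 1e-4):
    mix = alpha * P + (1 - alpha) * Q
    print(f"alpha = {alpha:7.0e}:  D(alpha P + (1-alpha) Q || Q) / alpha = {kl(mix, Q) / alpha:.6f}")
```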
Lemma 3 (Donsker–Varadhan, [34]). Let X and X̄ be random variables on X, and let g: X → [−∞, +∞] be such that E[exp(g(X̄))] < ∞. Then,

  E[g(X)] − D(X ‖ X̄) ≤ log E[ exp(g(X̄)) ],   (59)

with equality if and only if X has distribution P_{X*} such that

  ı_{X*‖X̄}(x) = g(x) − log E[ exp(g(X̄)) ].   (60)

Proof: If the left side of (59) is not −∞, we can write

  E[g(X)] − D(X ‖ X̄) = E[ g(X) − ı_{X‖X*}(X) − ı_{X*‖X̄}(X) ]   (61)
                       = log E[ exp(g(X̄)) ] − D(X ‖ X*),   (62)

which is maximized by letting P_X = P_{X*}.

We proceed to prove Theorem 1 by generalizing [35, Theorem 6.1]. Equality (13) is a standard result in convex optimization. By the assumption, the supremum in the right side of (13) is attained by P_{X*}, therefore C(β) is equal to the right side of (15).

To show (14), fix 0 ≤ α ≤ 1. Denote

  P_{X̄} → P_{Y|X} → P_{Ȳ},   (63)
  P_{X̂} = α P_{X̄} + (1−α) P_{X*},   (64)
  P_{X̂} → P_{Y|X} → P_{Ŷ} = α P_{Ȳ} + (1−α) P_{Y*},   (65)

and write

  α ( E[ ȷ_{X;Y*}(X*; Y*, β) ] − E[ ȷ_{X;Y*}(X̄; Ȳ, β) ] ) + D(Ŷ ‖ Y*)
   = α D(P_{Y|X} ‖ P_{Y*} | P_{X*}) − α D(P_{Y|X} ‖ P_{Y*} | P_{X̄}) + D(Ŷ ‖ Y*) − λ* α E[b(X*)] + λ* α E[b(X̄)]   (66)
   = D(P_{Y|X} ‖ P_{Y*} | P_{X*}) − D(P_{Y|X} ‖ P_{Y*} | P_{X̂}) + D(Ŷ ‖ Y*) − λ* E[b(X*)] + λ* E[b(X̂)]   (67)
   = D(P_{Y|X} ‖ P_{Y*} | P_{X*}) − D(P_{Y|X} ‖ P_{Ŷ} | P_{X̂}) − λ* E[b(X*)] + λ* E[b(X̂)]   (68)
   = E[ ȷ_{X;Y*}(X*; Y*, β) ] − E[ ȷ_{X;Ŷ}(X̂; Ŷ, β) ]   (69)
   ≥ 0,   (70)

where (70) holds because X* achieves the supremum in the right side of (13). Assume for the moment that P_{Ȳ} ≪ P_{Y*}. Lemma 2 implies that D(Ŷ ‖ Y*) = o(α). Thus, supposing that E[ ȷ_{X;Y*}(X̄; Ȳ, β) ] > E[ ȷ_{X;Y*}(X*; Y*, β) ] would lead to a contradiction, since then the left side of (66) would be negative for a sufficiently small α.

To complete the proof of (14), it remains to show that P_{Y*} dominates all P_{Ȳ} with P_{X̄} → P_{Y|X} → P_{Ȳ}. By contradiction, assume that P_{X̄} and F ⊆ Y are such that P_{Ȳ}(F) > P_{Y*}(F) = 0, and define the mixture P_{X̂} as in (64). Note that P_{Ŷ}(F) = α P_{Ȳ}(F). Then,

  D(P_{Y|X} ‖ P_{Ŷ} | P_{X̄}) ≥ D(Ȳ ‖ Ŷ)   (71)
    ≥ D( 1{Ȳ ∈ F} ‖ 1{Ŷ ∈ F} )   (72)
    ≥ P_{Ȳ}(F) log ( P_{Ȳ}(F) / P_{Ŷ}(F) )   (73)
    = P_{Ȳ}(F) log (1/α).   (74)

Furthermore, we have

  E[ ȷ_{X;Ŷ}(X̂; Ŷ, β) ] − E[ ȷ_{X;Y*}(X*; Y*, β) ]
   = α E[ ȷ_{X;Ŷ}(X̄; Ȳ, β) ] + (1−α) E[ ȷ_{X;Ŷ}(X*; Y*, β) ] − E[ ȷ_{X;Y*}(X*; Y*, β) ]   (75)
   ≥ α E[ ȷ_{X;Ŷ}(X̄; Ȳ, β) ] − α E[ ȷ_{X;Y*}(X*; Y*, β) ]   (76)
   ≥ α ( P_{Ȳ}(F) log(1/α) − λ* E[b(X̄)] + λ* β − E[ ȷ_{X;Y*}(X*; Y*, β) ] )   (77)
   > 0,   (78)

where (76) is due to D(Y* ‖ Ŷ) ≥ 0, (77) invokes (74), and (78) holds for sufficiently small α, thereby contradicting (13). We conclude that indeed P_{Ȳ} ≪ P_{Y*}.

To show (16), define the following function of a pair of probability distributions on X:

  F(P_X, P_{X̄}) = E[ ȷ_{X;Ȳ}(X; Y, β) ] − D(X ‖ X̄)   (79)
               = E[ ȷ_{X;Y}(X; Y, β) ] − D(X ‖ X̄) + D(Y ‖ Ȳ)   (80)
               ≤ E[ ȷ_{X;Y}(X; Y, β) ],   (81)

where (81) holds by the data processing inequality for relative entropy. Since equality in (81) is achieved by P_X = P_{X̄}, C(β) can be expressed as the double maximization

  C(β) = max_{P_{X̄}} max_{P_X} F(P_X, P_{X̄}).   (82)

To solve the inner maximization in (82), we invoke Lemma 3 with

  g(x) = E[ ȷ_{X;Ȳ}(x; Y, β) | X = x ]   (83)

to conclude that

  max_{P_X} F(P_X, P_{X̄}) = log E[ exp( E[ ȷ_{X;Ȳ}(X̄; Ȳ, β) | X̄ ] ) ],   (84)

which in the special case P_{X̄} = P_{X*} yields, using representation (82),

  C(β) ≥ log E[ exp( E[ ȷ_{X;Y*}(X*; Y, β) | X* ] ) ]   (85)
       ≥ E[ ȷ_{X;Y*}(X*; Y*, β) ]   (86)
       = C(β),   (87)

where (86) applies Jensen's inequality to the strictly convex function exp(·), and (87) holds by the assumption. We conclude that, in fact, (86) holds with equality, which implies that E[ ȷ_{X;Y*}(X*; Y, β) | X* ] is almost surely constant, thereby showing (16).
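The variational identity of Lemma 3, which drives the double maximization (82)–(84) above, can be checked numerically for finite alphabets. In the sketch below (not part of the paper) the reference distribution and the function g are arbitrary; the tilted distribution of (60) attains the right side of (59), and any other input distribution falls short.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 5
p_ref = rng.dirichlet(np.ones(K))        # reference distribution P_{X-bar}
g = rng.normal(size=K)                   # arbitrary (finite) function g

rhs = np.log(np.sum(p_ref * np.exp(g)))  # right side of (59)

# Tilted distribution (60): P(x) proportional to P_{X-bar}(x) exp(g(x))
p_tilt = p_ref * np.exp(g)
p_tilt /= p_tilt.sum()

def objective(p):
    """E[g(X)] - D(X || X-bar) for P_X = p."""
    return float(np.sum(p * g) - np.sum(p * np.log(p / p_ref)))

p_other = rng.dirichlet(np.ones(K))      # any other input distribution
print("log E[exp g(Xbar)]       :", rhs)
print("objective at tilted (60) :", objective(p_tilt))   # equals the right side
print("objective at another P_X :", objective(p_other))  # strictly smaller
```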
APPENDIX B
PROOF OF COROLLARY 2

To show (21), we invoke (6) to write, for any x ∈ X,

  Var[ ȷ_{X;Y*}(X; Y, β) | X = x ] = Var[ ı_{X;Y*}(X; Y) − λ*(b(X) − β) | X = x ]   (88)
                                    = Var[ ı_{X;Y*}(X; Y) | X = x ].   (89)

To show (20), we invoke (16) to write

  E[ Var[ ȷ_{X;Y*}(X; Y, β) | X ] ]
   = E[ ( ȷ_{X;Y*}(X; Y, β) )² ] − E[ ( E[ ȷ_{X;Y*}(X; Y, β) | X ] )² ]   (90)
   = E[ ( ȷ_{X;Y*}(X; Y, β) )² ] − C²(β)   (91)
   = Var[ ȷ_{X;Y*}(X; Y, β) ].   (92)

APPENDIX C
AUXILIARY RESULT ON THE MINIMIZATION OF THE CDF OF A SUM OF INDEPENDENT RANDOM VARIABLES

Let D be a metric space with metric d: D² → R₊. Let W_i(z), i = 1, …, n, be independent random variables parameterized by z ∈ D. Denote

  D_n(z) = (1/n) Σ_{i=1}^n E[ W_i(z) ],   (93)
  V_n(z) = (1/n) Σ_{i=1}^n Var[ W_i(z) ],   (94)
  T_n(z) = (1/n) Σ_{i=1}^n E[ | W_i(z) − E[W_i(z)] |³ ].   (95)

Let ℓ₁, ℓ₂, ℓ₃, L₁, L₂, F₁, F₂, V_min and T_max be positive constants. We assume that there exist z* ∈ D and sequences D_n*, V_n* such that for all z ∈ D,

  D_n* − D_n(z) ≥ ℓ₁ d²(z, z*) − (ℓ₂/sqrt(n)) d(z, z*) − ℓ₃/n,   (96)
  D_n* − D_n(z*) ≤ L₁/n,   (97)
  | V_n(z) − V_n* | ≤ F₁ d(z, z*) + F₂/sqrt(n),   (98)
  V_min ≤ V_n(z),   (99)
  T_n(z) ≤ T_max.   (100)

Theorem 7. In the setup described above, under assumptions (96)–(100), for any A > 0 there exists a K ≥ 0 such that, for all |∆| ≤ δ_n (where δ_n is specified below) and all sufficiently large n:
1. If δ_n = A/sqrt(n),

  min_{z ∈ D} P[ Σ_{i=1}^n W_i(z) ≤ n(D_n* − ∆) ] ≥ Q( ∆ sqrt(n/V_n*) ) − K/sqrt(n).   (101)

2. For δ_n = A sqrt(log n / n),

  min_{z ∈ D} P[ Σ_{i=1}^n W_i(z) ≤ n(D_n* − ∆) ] ≥ Q( ∆ sqrt(n/V_n*) ) − K sqrt(log n / n).   (102)

3. Fix 0 ≤ β ≤ 1/6. If in (98) V_n* = 0 (which implies that V_min = 0 in (99), i.e. we drop the requirement in Theorems 7.1 and 7.2 that V_min be positive), then there exists K ≥ 0 such that for all ∆ > A/n^{1/2+β}, where A > 0 is arbitrary,

  min_{z ∈ D} P[ Σ_{i=1}^n W_i(z) ≤ n(D_n* + ∆) ] ≥ 1 − K / ( A^{3/2} n^{1/4 − (3/2)β} ).   (103)

Theorem 7 gives a general result on the minimization of a cdf of a sum of independent random variables parameterized by elements of a metric space: it says that the minimum is approximately achieved by the sum with the largest mean, under regularity conditions. The metric nature of the parameter space is essential in making sure that the means and the variances of W_i(·) behave like continuous functions: assumptions (97) and (98) essentially ensure that the functions D_n(·) and V_n(·) are well-behaved in the neighborhood of the optimum, while assumption (96) guarantees that D_n(·) decays fast enough near its maximum.

Before we proceed to prove Theorem 7, we recall the Berry–Esseen refinement of the central limit theorem.

Theorem 8 (Berry–Esseen CLT, e.g. [36, Ch. XVI.5 Theorem 2]). Fix a positive integer n. Let W_i, i = 1, …, n, be independent. Then, for any real t,

  | P[ Σ_{i=1}^n W_i > n ( D_n + t sqrt(V_n/n) ) ] − Q(t) | ≤ B_n/sqrt(n),   (104)

where

  D_n = (1/n) Σ_{i=1}^n E[W_i],   (105)
  V_n = (1/n) Σ_{i=1}^n Var[W_i],   (106)
  T_n = (1/n) Σ_{i=1}^n E[ |W_i − E[W_i]|³ ],   (107)
  B_n = c₀ T_n / V_n^{3/2},   (108)

and 0.4097 ≤ c₀ ≤ 0.5600 (c₀ ≤ 0.4784 for identically distributed W_i).

We also make note of the following lemma, which deals with the behavior of the Q-function.

Lemma 4 ([27, Lemma 4]). Fix b ≥ 0. Then, there exists q ≥ 0 such that for all z ≥ −1/(2b) and all n ≥ 1,

  Q( sqrt(n) z ) − Q( sqrt(n) z (1 + bz) ) ≤ q/sqrt(n).   (109)
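A quick numerical illustration of Theorem 8 (not part of the paper): for a sum of i.i.d. Bernoulli variables the exact tail, the Gaussian approximation Q(t), and the Berry–Esseen slack B_n/sqrt(n) of (104)–(108) can be compared directly; the parameters are arbitrary, and the bound is evaluated with the worst-case constant c₀ = 0.5600 quoted above.

```python
import numpy as np
from scipy.stats import binom, norm

n, p, t = 400, 0.3, 1.0
Dn = p                                         # (105): mean of a single Bernoulli(p)
Vn = p * (1 - p)                               # (106)
Tn = p * (1 - p) * ((1 - p) ** 2 + p ** 2)     # (107): E|W - p|^3 for Bernoulli(p)
Bn = 0.5600 * Tn / Vn ** 1.5                   # (108), worst-case c0

threshold = n * Dn + t * np.sqrt(n * Vn)
exact = binom.sf(np.floor(threshold), n, p)    # P[ sum of the W_i > threshold ]
print("exact tail probability :", exact)
print("Q(t)                   :", norm.sf(t))
print("|difference|           :", abs(exact - norm.sf(t)))
print("Berry-Esseen slack     :", Bn / np.sqrt(n))
```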
We are now equipped to prove Theorem 7.

Proof of Theorem 7: To show (103), denote for brevity ζ = d(z, z*) and write

  P[ Σ_{i=1}^n W_i(z) > n(D_n* + ∆) ]
   ≤ P[ Σ_{i=1}^n W_i(z) > n ( D_n(z) + ℓ₁ζ² − (ℓ₂/sqrt(n))ζ − ℓ₃/n + A/n^{1/2+β} ) ]   (110)
   ≤ ( F₁ζ + F₂/sqrt(n) ) / ( n ( ℓ₁ζ² − (ℓ₂/sqrt(n))ζ − ℓ₃/n + A/n^{1/2+β} )² )   (111)
   ≤ K / ( A^{3/2} n^{1/4 − (3/2)β} ),   (112)

where
• (110) uses (96) and the assumption on the range of ∆;
• (111) is due to Chebyshev's inequality and V_n* = 0;
• (112) is by a straightforward algebraic exercise revealing that the ζ that maximizes the left side of (112) is proportional to A^{1/2}/n^{1/4 + β/2}.

We proceed to show (101) and (102). Denote

  g_n(z) = P[ Σ_{i=1}^n W_i(z) ≤ n(D_n* − ∆) ].   (113)

Using (99) and (100), observe

  B_n = c₀ T_n(z) / V_n^{3/2}(z) ≤ c₀ T_max / V_min^{3/2} = B < ∞.   (114)

Therefore the Berry–Esseen bound yields

  | g_n(z) − Q( sqrt(n) ν_n(z) ) | ≤ B/sqrt(n),   (115)

where

  ν_n(z) ≜ ( D_n(z) − D_n* + ∆ ) / sqrt(V_n(z)).   (116)

Denote

  ν_n* ≜ ∆ / sqrt(V_n*).   (117)

Since

  g_n(z) = Q( sqrt(n) ν_n* ) + [ g_n(z) − Q( sqrt(n) ν_n(z) ) ] + [ Q( sqrt(n) ν_n(z) ) − Q( sqrt(n) ν_n* ) ]   (118)
         ≥ Q( sqrt(n) ν_n* ) − B/sqrt(n) + [ Q( sqrt(n) ν_n(z) ) − Q( sqrt(n) ν_n* ) ],   (119)

to show (101) it suffices to show that

  Q( sqrt(n) ν_n* ) − min_{z ∈ D} Q( sqrt(n) ν_n(z) ) ≤ q/sqrt(n)   (120)

for some q ≥ 0, and to show (102), replacing q with q sqrt(log n) in the right side of (120) would suffice.

Since Q(·) is monotonically decreasing, to achieve the minimum in (120) we need to maximize sqrt(n) ν_n(z). As will be proven shortly, for appropriately chosen a, b, c > 0 we can write

  max_{z ∈ D} ν_n(z) ≤ ν_n* + b ν_n*² + c δ_n/sqrt(n)   (121)

for n large enough. If

  ∆ ≥ −sqrt(V_min)/(2b),   (122)

then ν_n* ≥ −1/(2b), and Lemma 4 applies to ν_n*. So, using (121), the fact that Q(·) is monotonically decreasing, and Lemma 4, we conclude that there exists q > 0 such that

  Q( sqrt(n) ν_n* ) − min_{z ∈ D} Q( sqrt(n) ν_n(z) )
   ≤ Q( sqrt(n) ν_n* ) − Q( sqrt(n) ν_n* + sqrt(n) b ν_n*² + c δ_n )   (123)
   ≤ Q( sqrt(n) ν_n* ) − Q( sqrt(n) ν_n* + sqrt(n) b ν_n*² ) + c δ_n/sqrt(2π)   (124)
   ≤ q/sqrt(n) + c δ_n/sqrt(2π),   (125)

where
• (124) is due to

  Q(z + ξ) ≥ Q(z) − ξ/sqrt(2π),   (126)

which holds for arbitrary z and ξ ≥ 0;
• (125) holds by Lemma 4 as long as ν_n* ≥ −1/(2b).

Thus, (125) establishes (101) and (102). It remains to prove (121). To upper-bound max_{z ∈ D} ν_n(z), denote for convenience

  f_n(z) = ( D_n(z) − D_n* ) / sqrt(V_n(z)),   (127)
  g_n(z) = 1 / sqrt(V_n(z)),   (128)

and note, using (96), (97), (99), (100) and (by Hölder's inequality)

  V_n(z) ≤ T_max^{2/3},   (129)

that

  f_n(z*) − f_n(z) = ( D_n(z*) − D_n* ) / sqrt(V_n(z*)) − ( D_n(z) − D_n* ) / sqrt(V_n(z))   (130)
                   ≥ ℓ₁' d²(z, z*) − (ℓ₂'/sqrt(n)) d(z, z*) − ℓ₃'/n,   (131)

where

  ℓ₁' = T_max^{−1/3} ℓ₁,   (132)
  ℓ₂' = V_min^{−1/2} ℓ₂,   (133)
  ℓ₃' = V_min^{−1/2} (L₁ + ℓ₃).   (134)

Observe that for a, b > 0,

  | 1/sqrt(a) − 1/sqrt(b) | ≤ |a − b| / ( 2 min{a, b}^{3/2} ),   (135)

so, using (98) and (99), we conclude

  | 1/sqrt(V_n(z)) − 1/sqrt(V_n*) | ≤ F₁' d(z, z*) + F₂'/sqrt(n),   (136)

where

  F₁' = (1/2) V_min^{−3/2} F₁,   (137)
  F₂' = (1/2) V_min^{−3/2} F₂.   (138)

Let z₀ achieve the maximum max_{z ∈ D} ν_n(z), i.e.

  max_{z ∈ D} ν_n(z) = f_n(z₀) + ∆ g_n(z₀).   (139)

Using (136) and (131), we have

  ν_n(z₀) − ν_n(z*) = ( f_n(z₀) − f_n(z*) ) + ∆ ( g_n(z₀) − g_n(z*) )   (140)
   ≤ −ℓ₁' d²(z₀, z*) + ( ℓ₂'/sqrt(n) + |∆| F₁' ) d(z₀, z*) + 2 F₂' |∆|/sqrt(n) + ℓ₃'/n   (141)
   ≤ (1/(4ℓ₁')) ( ℓ₂'/sqrt(n) + |∆| F₁' )² + 2 F₂' |∆|/sqrt(n) + ℓ₃'/n,   (142)

where (142) follows because the maximum of its left side is achieved at d(z₀, z*) = (1/(2ℓ₁')) ( ℓ₂'/sqrt(n) + |∆| F₁' ). Using (96), (99), (136), we upper-bound

  ν_n(z*) ≤ ν_n* + F₂' |∆|/sqrt(n) + ℓ₃/(n sqrt(V_min)) + ℓ₃ F₂'/n^{3/2}.   (143)

Applying (142) and (143) to upper-bound max_{z ∈ D} ν_n(z), we have established (121), in which

  b = F₁'² T_max^{2/3} / (4ℓ₁'),   (144)

where we used (98) and (129) to upper-bound ∆² = ν_n*² V_n*, thereby completing the proof.

APPENDIX D
PROOF OF THE CONVERSE PART OF THEOREM 4

Given a finite set A, let P be the set of all distributions on A that satisfy the cost constraint

  E[b(X)] ≤ β,   (145)

which is a convex set in R^{|A|}.

Leveraging an idea of Tomamichel and Tan [30], we will weaken (24) by choosing P_{Ȳ^n} to be a convex combination of non-product distributions, with weights chosen to favor those distributions that are close to P_{Y^{n*}}. Specifically (cf. [30]),

  P_{Ȳ^n}(y^n) = (1/A) Σ_{k ∈ K} exp(−|k|²) Π_{i=1}^n P_{Y|K=k}(y_i),   (146)

where P_{Y|K=k}, k ∈ K, are defined as follows, for some c > 0:

  P_{Y|K=k}(y) = P_{Y*}(y) + k_y/(sqrt(n) c),   (147)
  K = { k ∈ Z^{|B|}: Σ_{y ∈ B} k_y = 0, −P_{Y*}(y) ≤ k_y/(sqrt(n) c) ≤ 1 − P_{Y*}(y) },   (148)
  A = Σ_{k ∈ K} exp(−|k|²) < ∞.   (149)

Denote by P_{Π(Y)} the minimum Euclidean distance approximation of an arbitrary P_Y ∈ Q, where Q is the set of distributions on the channel output alphabet B, in the set { P_{Y|K=k}: k ∈ K }:

  P_{Π(Y)} = P_{Y|K=k*}, where k* = argmin_{k ∈ K} | P_Y − P_{Y|K=k} |.   (150)

The quality of approximation (150) is governed by [30]

  | P_{Π(Y)} − P_Y | ≤ sqrt( |B| (|B| − 1) ) / ( sqrt(n) c ).   (151)

We say that x^n ∈ A^n has type P_X̂ if the number of times each letter a ∈ A is encountered in x^n is n P_X̂(a). An n-type is a distribution whose masses are multiples of 1/n. Denote by P_X̂ the minimum Euclidean distance approximation of P_X in the set of n-types, that is,

  P_X̂ = argmin { | P_X − P | : P ∈ P is an n-type }.   (152)

The accuracy of the approximation in (152) is controlled by the following inequality:

  | P_X − P_X̂ | ≤ sqrt( |A| (|A| − 1) ) / n.   (153)

For each P_X ∈ P, let x^n ∈ A^n be an arbitrary sequence of type P_X̂, and lower-bound the sum in (146) by the term containing P_{Π(Y)} to obtain

  ȷ_{X^n;Ȳ^n}(x^n; y^n, β) ≤ Σ_{i=1}^n ȷ_{X;Π(Y)}(x_i, y_i, β) + n c² | P_{Π(Y)} − P_{Y*} |² + log A.   (154)

Applying (146) and (154) to loosen (24), we conclude by Theorem 3 that, as long as an (n, M, ǫ') code exists, for an arbitrary γ > 0,

  ǫ' ≥ min_{P_X ∈ P} P[ Σ_{i=1}^n W_i(P_X) ≤ log M − γ − log A ] − exp(−γ),   (155)

where

  W_i(P_X) = ȷ_{X;Π(Y)}(x_i, Y_i, β) + c² | P_{Π(Y)} − P_{Y*} |²,   (156)

and Y_i is distributed according to P_{Y|X=x_i}.⁴ To evaluate the minimization in the right side of (155), we will apply Theorem 7.

⁴ Strictly speaking, the order of W_i(P_X), i = 1, …, n, depends on the particular choice of the sequence x^n of type P_X̂. However, since the distribution of the sum Σ_{i=1}^n W_i(P_X) does not depend on their relative order, we may choose the ordering arbitrarily.