Catching Up Faster by Switching Sooner: A Prequential Solution to the AIC-BIC Dilemma*

Tim van Erven, Peter Grünwald, Steven de Rooij
Centrum voor Wiskunde en Informatica (CWI)
Kruislaan 413, P.O. Box 94079, 1090 GB Amsterdam, The Netherlands
{Tim.van.Erven, Peter.Grunwald, Steven.de.Rooij}@cwi.nl

July 7, 2008

* A preliminary version of a part of this paper appeared as [van Erven et al., 2007].

Abstract

Bayesian model averaging, model selection and its approximations such as BIC are generally statistically consistent, but sometimes achieve slower rates of convergence than other methods such as AIC and leave-one-out cross-validation. On the other hand, these other methods can be inconsistent. We identify the catch-up phenomenon as a novel explanation for the slow convergence of Bayesian methods. Based on this analysis we define the switch distribution, a modification of the Bayesian marginal distribution. We show that, under broad conditions, model selection and prediction based on the switch distribution is both consistent and achieves optimal convergence rates, thereby resolving the AIC-BIC dilemma. The method is practical; we give an efficient implementation. The switch distribution has a data compression interpretation, and can thus be viewed as a "prequential" or MDL method; yet it is different from the MDL methods that are usually considered in the literature. We compare the switch distribution to Bayes factor model selection and leave-one-out cross-validation.

1 Introduction: The Catch-Up Phenomenon

We consider inference based on a countable set of models (sets of probability distributions), focusing on two tasks: model selection and model averaging. In model selection tasks, the goal is to select the model that best explains the given data. In model averaging, the goal is to find the weighted combination of models that leads to the best prediction of future data from the same source.

An attractive property of some criteria for model selection is that they are consistent under weak conditions, i.e. if the true distribution P* is in one of the models, then the P*-probability that this model is selected goes to one as the sample size increases. BIC [Schwarz, 1978], Bayes factor model selection [Kass and Raftery, 1995], Minimum Description Length (MDL) model selection [Barron et al., 1998] and prequential model validation [Dawid, 1984] are examples of widely used model selection criteria that are usually consistent. However, other model selection criteria such as AIC [Akaike, 1974] and leave-one-out cross-validation (LOO) [Stone, 1977], while often inconsistent, do typically yield better predictions. This is especially the case in nonparametric settings of the following type: P* can be arbitrarily well approximated by a sequence of distributions in the (parametric) models under consideration, but is not itself contained in any of these. In many such cases, the predictive distribution converges to the true distribution at the optimal rate for AIC and LOO [Shibata, 1983, Li, 1987], whereas in general MDL, BIC, the Bayes factor method and prequential validation only achieve the optimal rate to within an O(log n) factor [Rissanen et al., 1992, Foster and George, 1994, Yang, 1999, Grünwald, 2007]. In this paper we reconcile these seemingly conflicting approaches [Yang, 2005a] by improving the rate of convergence achieved in Bayesian model selection without losing its consistency properties. First we provide an example to show why Bayes sometimes converges too slowly.

1.1 The Catch-Up Phenomenon

Given priors on parametric models M_1, M_2, ... and parameters therein, Bayesian inference associates each model M_k with the marginal distribution p̄_k, given by

\bar{p}_k(x^n) = \int_{\theta \in \Theta_k} p_\theta(x^n)\, w(\theta)\, d\theta,

obtained by averaging over the parameters according to the prior. In Bayes factor model selection the preferred model is the one with maximum a posteriori probability. By Bayes' rule this is argmax_k p̄_k(x^n) w(k), where w(k) denotes the prior probability of M_k. We can further average over model indices, a process called Bayesian Model Averaging (BMA). The resulting distribution p_bma(x^n) = Σ_k p̄_k(x^n) w(k) can be used for prediction. In a sequential setting, the probability of a data sequence x^n := x_1, ..., x_n under a distribution p typically decreases exponentially fast in n. It is therefore common to consider −log p(x^n), which we call the code length of x^n achieved by p. We take all logarithms to base 2, allowing us to measure code length in bits. The name code length refers to the correspondence between code length functions and probability distributions based on the Kraft inequality, but one may also think of the code length as the accumulated log loss that is incurred if we sequentially predict the x_i by conditioning on the past, i.e. using p(· | x^{i−1}) [Barron et al., 1998, Grünwald, 2007, Dawid, 1984, Rissanen, 1984]. For BMA, we have

-\log p_{\text{bma}}(x^n) = -\log \prod_{i=1}^{n} p_{\text{bma}}(x_i \mid x^{i-1}) = \sum_{i=1}^{n} \left[ -\log p_{\text{bma}}(x_i \mid x^{i-1}) \right].

Here the i-th term represents the loss incurred when predicting x_i given x^{i−1} using p_bma(· | x^{i−1}), which turns out to be equal to the posterior average: p_bma(x_i | x^{i−1}) = Σ_k p̄_k(x_i | x^{i−1}) w(k | x^{i−1}).
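To make the last two displays concrete, here is a minimal Python sketch (ours, purely illustrative): it combines two toy predictors by BMA, updates the model posterior w(k | x^{i−1}) sequentially, and accumulates −log₂ p_bma(x^n) in bits. The two predictors and the prior weights are arbitrary stand-ins, not the Bayesian marginals used in the paper.

```python
import math

# Two toy "marginal" predictors pbar_1, pbar_2 for binary outcomes. In the
# paper these would be Bayesian marginal distributions of parametric models;
# here they are fixed Bernoulli predictors chosen only for illustration.
def p1(x_next, past):
    """Toy predictor 1: always predicts P(X = 1) = 0.5."""
    return 0.5

def p2(x_next, past):
    """Toy predictor 2: always predicts P(X = 1) = 0.8."""
    return 0.8 if x_next == 1 else 0.2

predictors = [p1, p2]
prior = [0.5, 0.5]              # prior w(k) on model indices (arbitrary choice)

def bma_code_length(xs):
    """Return -log2 p_bma(x^n), accumulated as per-outcome log losses.

    The model posterior w(k | x^{i-1}) is updated sequentially, so each term
    is the posterior-average predictive probability, as in the text."""
    post = list(prior)
    total_bits = 0.0
    for i, x in enumerate(xs):
        past = xs[:i]
        preds = [p(x, past) for p in predictors]
        p_mix = sum(w * q for w, q in zip(post, preds))      # p_bma(x_i | x^{i-1})
        total_bits += -math.log2(p_mix)
        post = [w * q / p_mix for w, q in zip(post, preds)]  # Bayes update of w(k | x^i)
    return total_bits

data = [1, 1, 0, 1, 1, 1, 0, 1]
print(bma_code_length(data))
```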
Prediction using p_bma has the advantage that the code length it achieves on x^n is close to the code length of p̄_k̂, where k̂ is the best of the marginals p̄_1, p̄_2, ..., i.e. k̂ achieves min_k −log p̄_k(x^n). More precisely, given a prior w on model indices, the difference between −log p_bma(x^n) = −log(Σ_k p̄_k(x^n) w(k)) and −log p̄_k̂(x^n) must be in the range [0, −log w(k̂)], whatever data x^n are observed. Thus, using BMA for prediction is sensible if we are satisfied with doing essentially as well as the best model under consideration. However, it is often possible to combine p̄_1, p̄_2, ... into a distribution that achieves smaller code length than p̄_k̂! This is possible if the index k̂ of the best distribution changes with the sample size in a predictable way. This is common in model selection, for example with nested models, say M_1 ⊂ M_2. In this case p̄_1 typically predicts better at small sample sizes (roughly, because M_2 has more parameters that need to be learned than M_1), while p̄_2 predicts better eventually. Figure 1 illustrates this phenomenon. It shows the accumulated code length difference −log p̄_2(x^n) − (−log p̄_1(x^n)) on "The Picture of Dorian Gray" by Oscar Wilde, where p̄_1 and p̄_2 are the Bayesian marginal distributions for the first-order and second-order Markov chains, respectively, and each character in the book is an outcome. We used uniform (Dirichlet(1, 1, ..., 1)) priors on the model parameters (i.e., the "transition probabilities"), but the same phenomenon occurs with other common priors, such as Jeffreys'. Clearly p̄_1 is better for about the first 100,000 outcomes, gaining a head start of approximately 40,000 bits. Ideally we should predict the initial 100,000 outcomes using p̄_1 and the rest using p̄_2. However, p_bma only starts to behave like p̄_2 when it catches up with p̄_1 at a sample size of about 310,000, when the code length of p̄_2 drops below that of p̄_1. Thus, in the shaded area p_bma behaves like p̄_1 while p̄_2 is making better predictions of those outcomes: since at n = 100,000, p̄_2 is 40,000 bits behind, and at n = 310,000, it has caught up, in between it must have outperformed p̄_1 by 40,000 bits!

[Figure 1: The Catch-up Phenomenon. Code length difference with Markov order 1 (bits) plotted against sample size, for the Markov order 2 model, Bayesian Model Averaging, and the Switch-Distribution.]
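A curve like the one in Figure 1 can be reproduced in a few lines. The sketch below (ours, not the authors' implementation) uses the fact that under independent Dirichlet(1, ..., 1) priors the Bayesian marginal of a Markov model factorizes into Laplace-rule conditionals per context; as a simplification, contexts shorter than the model order at the very start of the sequence are handled by using whatever history is available, and the file name is a placeholder.

```python
import math
from collections import defaultdict

def markov_code_length_difference(text):
    """Accumulated code length difference (bits) between second- and
    first-order Markov models with Dirichlet(1, ..., 1) priors.

    With uniform Dirichlet priors the Bayesian marginal factorizes into
    Laplace-rule conditionals per context: (n(context, x) + 1) / (n(context) + |A|).
    Returns a list whose entry for each n is -log2 pbar_2(x^n) - (-log2 pbar_1(x^n))."""
    A = len(set(text))                                 # alphabet size
    counts1 = defaultdict(lambda: defaultdict(int))    # counts for length-1 contexts
    counts2 = defaultdict(lambda: defaultdict(int))    # counts for length-2 contexts
    totals1 = defaultdict(int)
    totals2 = defaultdict(int)
    diff, diffs = 0.0, []
    for i, x in enumerate(text):
        c1 = text[i - 1:i]                # empty string before the first symbol
        c2 = text[max(0, i - 2):i]
        p1 = (counts1[c1][x] + 1) / (totals1[c1] + A)
        p2 = (counts2[c2][x] + 1) / (totals2[c2] + A)
        diff += -math.log2(p2) - (-math.log2(p1))
        diffs.append(diff)
        counts1[c1][x] += 1; totals1[c1] += 1
        counts2[c2][x] += 1; totals2[c2] += 1
    return diffs

# Example usage with a placeholder file; any long text shows the same pattern.
# text = open("dorian_gray.txt", encoding="utf-8").read()
# print(markov_code_length_difference(text)[-1])
```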
Note that the example models M_1 and M_2 are very crude; for this particular application much better models are available. Thus M_1 and M_2 serve as a simple illustration only (see the discussion in Section 8.1). However, our theorems, as well as experiments with nonparametric density estimation on which we will report elsewhere, indicate that the same phenomenon also occurs with more realistic models. In fact, the general pattern that first one model is better and then another occurs widely, both on real-world data and in theoretical settings. We argue that failure to take this effect into account leads to the suboptimal rate of convergence achieved by Bayes factor model selection and related methods. We have developed an alternative method to combine distributions p̄_1 and p̄_2 into a single distribution p_sw, which we call the switch distribution, defined in Section 2. Figure 1 shows that p_sw behaves like p̄_1 initially, but in contrast to p_bma it starts to mimic p̄_2 almost immediately after p̄_2 starts making better predictions; it essentially does this no matter what sequence x^n is actually observed. p_sw differs from p_bma in that it is based on a prior distribution on sequences of models rather than simply a prior distribution on models. This allows us to avoid the implicit assumption that there is one model which is best at all sample sizes. After conditioning on past observations, the posterior we obtain gives a better indication of which model performs best at the current sample size, thereby achieving a faster rate of convergence. Indeed, the switch distribution is very closely related to earlier algorithms for tracking the best expert developed in the universal prediction literature; see also Section 7 [Herbster and Warmuth, 1998, Vovk, 1999, Volf and Willems, 1998, Monteleoni and Jaakkola, 2004]; however, the applications we have in mind and the theorems we prove are completely different.
1.2 Organization

The remainder of the paper is organized as follows (for the reader's convenience, we have attached a table of contents at the end of the paper). In Section 2 we introduce our basic concepts and notation, and we then define the switch distribution. While in the example above we switched between just two models, the general definition allows switching between elements of any finite or countably infinite set of models. In Section 3 we show that model selection based on the switch distribution is consistent (Theorem 1). Then in Section 4 we show that the switch distribution achieves a rate of convergence that is never significantly worse than that of Bayesian model averaging, and we show that, in contrast to Bayesian model averaging, the switch distribution achieves the worst-case optimal rate of convergence when it is applied to histogram density estimation. In the more technical Section 5, we develop a number of tools that can be used to bound the rate of convergence in Cesàro mean in more general parametric and nonparametric settings, which include histogram density estimation as a special case. In Section 5.3 and Section 5.4 we apply these tools to show that the switch distribution achieves minimax convergence rates in density estimation based on exponential families and in some nonparametric linear regression problems. In Section 6 we give a practical algorithm that computes the switch distribution; Theorem 14 of that section shows that the run-time for k predictors is Θ(n·k). In Sections 7 and 8 we put our work in a broader context and explain how our results fit into the existing literature. Specifically, Section 7.1 explains how our result can be reconciled with a seemingly contradictory recent result of Yang [2005a], and Section 8.1 describes a strange implication of the catch-up phenomenon for Bayes factor model selection. The proofs of all theorems are in Appendix A (except the central results of Section 5, which are proved in the main text).

2 The Switch Distribution for Model Selection and Prediction

2.1 Preliminaries

Suppose X^∞ = (X_1, X_2, ...) is a sequence of random variables that take values in sample space X ⊆ R^d for some d ∈ Z^+ = {1, 2, ...}. For n ∈ N = {0, 1, 2, ...}, let x^n = (x_1, ..., x_n) denote the first n outcomes of X^∞, such that x^n takes values in the product space X^n = X × ··· × X. (We let x^0 denote the empty sequence.) For m > n, we write X_{n+1}^m for (X_{n+1}, ..., X_m), where m = ∞ is allowed. We omit the subscript when n = 0, writing X^m rather than X_1^m.

Any distribution P(X^∞) may be defined in terms of a sequential prediction strategy p that predicts the next outcome at any time n ∈ N. To be precise: given the previous outcomes x^n at time n, a prediction strategy should issue a conditional density p(X_{n+1} | x^n) with corresponding distribution P(X_{n+1} | x^n) for the next outcome X_{n+1}. Such sequential prediction strategies are sometimes called prequential forecasting systems [Dawid, 1984]. An instance is given in Example 1 below. Whenever the existence of a 'true' distribution P* is assumed — in other words, X^∞ are distributed according to P* — we may think of any prediction strategy p as a procedure for estimating P*, and in such cases we will often refer to p as an estimator. For simplicity, we assume throughout that the density p(X_{n+1} | x^n) is taken relative to either the usual Lebesgue measure (if X is continuous) or the counting measure (if X is countable). In the latter case p(X_{n+1} | x^n) is a probability mass function. It is natural to define the joint density p(x_{n+1}^m | x^n) = p(x_{n+1} | x^n) ··· p(x_m | x^{m−1}) and let P(X_{n+1}^∞ | x^n) be the unique distribution on X^∞ such that, for all m > n, p(X_{n+1}^m | x^n) is the density of its marginal distribution for X_{n+1}^m. To ensure that P(X_{n+1}^∞ | x^n) is well defined even if X is continuous, we will only allow prediction strategies satisfying the natural requirement that for any k ∈ Z^+ and any fixed measurable event A_{k+1} ⊆ X the probability P(A_{k+1} | x^k) is a measurable function of x^k. This requirement holds automatically if X is countable.
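The correspondence between a sequential prediction strategy and the joint density it induces, p(x_{n+1}^m | x^n) = p(x_{n+1} | x^n) ··· p(x_m | x^{m−1}), is easy to state in code. The sketch below (ours) uses a Laplace-rule Bernoulli strategy as a concrete instance; the function names are our own.

```python
def laplace_strategy(x_next, past):
    """A prediction strategy for binary outcomes: p(X_{n+1} = x_next | past),
    using the smoothed ML (Laplace) estimator (sum(past) + 1) / (n + 2)."""
    theta = (sum(past) + 1) / (len(past) + 2)
    return theta if x_next == 1 else 1 - theta

def joint_density(strategy, xs, past=()):
    """p(x_{n+1}^m | x^n) = prod_i p(x_i | x^{i-1}): the joint density that a
    sequential prediction strategy induces on a block of outcomes."""
    prob = 1.0
    history = list(past)
    for x in xs:
        prob *= strategy(x, history)
        history.append(x)
    return prob

print(joint_density(laplace_strategy, [1, 0, 1, 1]))   # p(x^4) under the strategy
```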
2.2 Model Selection and Prediction

In model selection the goal is to choose an explanation for observed data x^n from a potentially infinite list of candidate models M_1, M_2, ... We consider parametric models, which we define as sets {p_θ : θ ∈ Θ} of prediction strategies p_θ that are indexed by elements of Θ ⊆ R^d, for some smallest possible d ∈ N, the number of degrees of freedom. A model is more commonly viewed as a set of distributions, but since distributions can be viewed as prediction strategies as explained above, we may think of a model as a set of prediction strategies as well. Examples of model selection are histogram density estimation [Rissanen et al., 1992] (d is the number of bins minus 1), regression based on a set of basis functions such as polynomials (d is the number of coefficients of the polynomial), and the variable selection problem in regression [Shibata, 1983, Li, 1987, Yang, 1999] (d is the number of variables). A model selection criterion is a function δ : ∪_{n=0}^∞ X^n → Z^+ that, given any data sequence x^n ∈ X^n of arbitrary length n, selects the model M_k with index k = δ(x^n).

With each model M_k we associate a single prediction strategy p̄_k. The bar emphasizes that p̄_k is a meta-strategy based on the prediction strategies in M_k. In many approaches to model selection, for example AIC and LOO, p̄_k is defined using some parameter estimator θ̂_k, which maps a sequence x^n of previous observations to an estimated parameter value that represents a "best guess" of the true/best distribution in the model. Prediction is then based on this estimator: p̄_k(X_{n+1} | x^n) = p_{θ̂_k(x^n)}(X_{n+1} | x^n), which also defines a joint density p̄_k(x^n) = p̄_k(x_1) ··· p̄_k(x_n | x^{n−1}). The Bayesian approach to model selection or model averaging goes the other way around. It starts out with a prior w on Θ_k, and then defines the Bayesian marginal density

\bar{p}_k(x^n) = \int_{\theta \in \Theta_k} p_\theta(x^n)\, w(\theta)\, d\theta.    (1)

When p̄_k(x^n) is non-zero this joint density induces a unique conditional density

\bar{p}_k(X_{n+1} \mid x^n) = \frac{\bar{p}_k(X_{n+1}, x^n)}{\bar{p}_k(x^n)},

which is equal to the mixture of p_θ according to the posterior, w(θ | x^n) = p_θ(x^n) w(θ) / ∫ p_θ(x^n) w(θ) dθ, based on x^n. Thus the Bayesian approach also defines a prediction strategy p̄_k(X_{n+1} | x^n).

Associating a prediction strategy p̄_k with each model M_k is known as the prequential approach to statistics [Dawid, 1984] or predictive MDL [Rissanen, 1984]. Regardless of whether p̄_k is based on parameter estimation or on Bayesian predictions, we may usually think of it as a universal code relative to M_k [Grünwald, 2007].

Example 1. Suppose X = {0, 1}. Then a prediction strategy p̄ may be based on the Bernoulli model M = {p_θ | θ ∈ [0, 1]} that regards X_1, X_2, ... as a sequence of independent, identically distributed Bernoulli random variables with P_θ(X_{n+1} = 1) = θ. We may predict X_{n+1} using the maximum likelihood (ML) estimator based on the past, i.e. using θ̂(x^n) = n^{−1} Σ_{i=1}^n x_i. The prediction for x_1 is then undefined. If we use a smoothed ML estimator such as the Laplace estimator, θ̂′(x^n) = (n + 2)^{−1}(Σ_{i=1}^n x_i + 1), then all predictions are well defined. It is well known that the predictor p̄′ defined by p̄′(X_{n+1} | x^n) = p_{θ̂′(x^n)}(X_{n+1}) equals the Bayesian predictive distribution based on a uniform prior. Thus in this case a Bayesian predictor and an estimation-based predictor coincide!

In general, for a parametric model M_k, we can define p̄_k(X_{n+1} | x^n) = p_{θ̂′_k(x^n)}(X_{n+1}) for some smoothed ML estimator θ̂′_k. The joint distribution with density p̄_k(x^n) will then resemble, but in general not be precisely equal to, the Bayes marginal distribution with density p̄_k(x^n) under some prior on M_k [Grünwald, 2007, Chapter 9].
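The coincidence claimed in Example 1 can be checked mechanically: under a uniform (Beta(1,1)) prior the Bayes predictive probability of a one is a ratio of two Beta integrals, which equals the Laplace estimator (Σ x_i + 1)/(n + 2). A small exact-arithmetic sketch (ours):

```python
from fractions import Fraction
from math import factorial

def beta_integral(a, b):
    """Exact value of the integral of theta^a (1 - theta)^b over [0, 1]."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def bayes_predictive(xs):
    """P(X_{n+1} = 1 | x^n) under the Bernoulli model with a uniform prior:
    the marginal likelihood of x^n followed by a 1, divided by that of x^n."""
    s, n = sum(xs), len(xs)
    return beta_integral(s + 1, n - s) / beta_integral(s, n - s)

def laplace_predictive(xs):
    """Smoothed ML (Laplace) estimator: (sum x_i + 1) / (n + 2)."""
    return Fraction(sum(xs) + 1, len(xs) + 2)

xs = [1, 0, 1, 1, 0]
assert bayes_predictive(xs) == laplace_predictive(xs)   # they coincide exactly
print(bayes_predictive(xs))                             # 4/7
```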
2.3 The Switch Distribution

Suppose p_1, p_2, ... is a list of prediction strategies for X^∞. (Although here the list is infinitely long, the developments below can with little modification be adjusted to the case where the list is finite.) We first define a family Q = {q_s : s ∈ S} of combinator prediction strategies that switch between the original prediction strategies. Here the parameter space S is defined as

S = \{ ((t_1, k_1), \ldots, (t_m, k_m)) \in (\mathbb{N} \times \mathbb{Z}^+)^m \mid m \in \mathbb{Z}^+,\ 0 = t_1 < \ldots < t_m \}.    (2)

The parameter s ∈ S specifies the identities of m constituent prediction strategies and the sample sizes, called switch-points, at which to switch between them. For s = ((t′_1, k′_1), ..., (t′_{m′}, k′_{m′})), let t_i(s) = t′_i, k_i(s) = k′_i and m(s) = m′. We omit the argument when the parameter s is clear from context; e.g. we write t_3 for t_3(s). For each s ∈ S the corresponding q_s ∈ Q is defined as:

q_s(X_{n+1} \mid x^n) =
\begin{cases}
p_{k_1}(X_{n+1} \mid x^n) & \text{if } n < t_2, \\
p_{k_2}(X_{n+1} \mid x^n) & \text{if } t_2 \le n < t_3, \\
\quad\vdots & \quad\vdots \\
p_{k_{m-1}}(X_{n+1} \mid x^n) & \text{if } t_{m-1} \le n < t_m, \\
p_{k_m}(X_{n+1} \mid x^n) & \text{if } t_m \le n.
\end{cases}    (3)

Switching to the same predictor multiple times (consecutively or not) is allowed. The extra switch-point t_1 is included to simplify notation; we always take t_1 = 0, so that k_1 represents the strategy that is used in the beginning, before any actual switch takes place.

Given a list of prediction strategies p_1, p_2, ..., we define the switch distribution as a Bayesian mixture of the elements of Q according to a prior π on S:

Definition 1 (switch distribution). Suppose π is a probability mass function on S. Then the switch distribution P_sw with prior π is the distribution for (X^∞, s) that is defined by the density

p_{\text{sw}}(x^n, s) = q_s(x^n) \cdot \pi(s)    (4)

for any n ∈ Z^+, x^n ∈ X^n, and s ∈ S. Hence the marginal likelihood of the switch distribution has density

p_{\text{sw}}(x^n) = \sum_{s \in S} q_s(x^n) \cdot \pi(s).    (5)

Although the switch distribution provides a general way to combine prediction strategies (see Section 7.3), in this paper it will only be applied to combine prediction strategies p̄_1, p̄_2, ... that correspond to parametric models. In this case we may define a corresponding model selection criterion δ_sw. To this end, let K_{n+1} : S → Z^+ be a random variable that denotes the strategy/model that is used to predict X_{n+1} given past observations x^n. Formally, let i_0 be the unique i such that t_i(s) ≤ n and either t_{i+1}(s) > n (i.e. the current sample size n is between the i-th and (i+1)-st switch-point), or i = m(s) (i.e. the current sample size n is beyond the last switch-point). Then K_{n+1}(s) = k_{i_0}(s). Now note that by Bayes' theorem, the prior π, together with the data x^n, induces a posterior π(s | x^n) ∝ q_s(x^n) π(s) on switching strategies s. This posterior on switching strategies further induces a posterior on the model K_{n+1} that is used to predict X_{n+1}:

\pi(K_{n+1} = k \mid x^n) = \frac{\sum_{s : K_{n+1}(s) = k} q_s(x^n)\,\pi(s)}{p_{\text{sw}}(x^n)},    (6)

which is defined whenever p_sw(x^n) is non-zero, and can be efficiently computed using Algorithm 1 (see Section 6). We turn this posterior distribution into the model selection criterion

\delta_{\text{sw}}(x^n) = \arg\max_k \pi(K_{n+1} = k \mid x^n),    (7)

which selects the model with maximum posterior probability.
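Definition 1 and the selection criterion (7) can be made concrete by brute-force enumeration over a small, finite set of switch parameters. The sketch below (ours, purely illustrative) is not the efficient Algorithm 1 of Section 6; the two base predictors, the restricted parameter set S and its prior are arbitrary toy choices.

```python
# Two toy base prediction strategies for binary data (stand-ins for the
# Bayesian marginals pbar_1, pbar_2 of two models).
def p1(x, past):
    """Always predicts P(X = 1) = 0.5."""
    return 0.5

def p2(x, past):
    """Laplace-rule Bernoulli predictor."""
    theta = (sum(past) + 1) / (len(past) + 2)
    return theta if x == 1 else 1 - theta

predictors = {1: p1, 2: p2}

def active_model(s, n):
    """K_{n+1}(s): the model that s uses to predict X_{n+1} after n outcomes."""
    k_active = s[0][1]
    for (t, k) in s:            # switch points t are increasing within s
        if t <= n:
            k_active = k
    return k_active

def q_s(s, xs):
    """q_s(x^n) as in equation (3): each outcome is predicted by the model
    that is active at the current sample size."""
    prob = 1.0
    for i, x in enumerate(xs):
        k = active_model(s, i)                 # i outcomes observed so far
        prob *= predictors[k](x, xs[:i])
    return prob

# A small finite parameter set with a made-up prior pi over it: use model 1
# forever, use model 2 forever, or switch from model 1 to model 2 at t = 3.
S = {((0, 1),): 0.4, ((0, 2),): 0.4, ((0, 1), (3, 2)): 0.2}

def posterior_K(xs):
    """pi(K_{n+1} = k | x^n) from equation (6), by brute-force enumeration."""
    n = len(xs)
    joint = {s: q_s(s, xs) * prior for s, prior in S.items()}
    evidence = sum(joint.values())             # p_sw(x^n)
    post = {}
    for s, val in joint.items():
        k = active_model(s, n)
        post[k] = post.get(k, 0.0) + val / evidence
    return post

xs = [1, 1, 1, 0, 1, 1, 1, 1]
post = posterior_K(xs)
print(post)                                    # posterior over the model used next
print(max(post, key=post.get))                 # delta_sw(x^n), as in equation (7)
```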
3 Consistency

If one of the models, say with index k*, is actually true, then it is natural to ask whether δ_sw is consistent, in the sense that it asymptotically selects k* with probability 1. Theorem 1 below states that, if the prediction strategies p̄_k associated with the models are Bayesian predictive distributions, then δ_sw is consistent under certain conditions which are only slightly stronger than those required for standard Bayes factor model selection consistency. It is followed by Theorem 2, which extends the result to the situation where the p̄_k are not necessarily Bayesian.

Bayes factor model selection is consistent if for all k, k′ ≠ k, P̄_k(X^∞) and P̄_{k′}(X^∞) are mutually singular, that is, if there exists a measurable set A ⊆ X^∞ such that P̄_k(A) = 1 and P̄_{k′}(A) = 0 [Barron et al., 1998]. For example, this can usually be shown to hold if (a) the models are nested and (b) for each k, Θ_k is a subset of Θ_{k+1} of w_{k+1}-measure 0. In most interesting applications in which (a) holds, (b) also holds [Grünwald, 2007]. For consistency of δ_sw, we need to strengthen the mutual singularity condition to a "conditional" mutual singularity condition: we require that, for all k′ ≠ k, all n and all x^n ∈ X^n, the distributions P̄_k(X_{n+1}^∞ | x^n) and P̄_{k′}(X_{n+1}^∞ | x^n) are mutually singular. For example, if X_1, X_2, ... are independent and identically distributed (i.i.d.) according to each P_θ in all models, but also if X is countable and p̄_k(x_{n+1} | x^n) > 0 for all k and all x^{n+1} ∈ X^{n+1}, then this conditional mutual singularity is automatically implied by ordinary mutual singularity of P̄_k(X^∞) and P̄_{k′}(X^∞).

Let E_s = {s′ ∈ S | m(s′) > m(s), (t_i(s′), k_i(s′)) = (t_i(s), k_i(s)) for i = 1, ..., m(s)} denote the set of all possible extensions of s to more switch-points. Let p̄_1, p̄_2, ... be Bayesian prediction strategies with respective parameter spaces Θ_1, Θ_2, ... and priors w_1, w_2, ..., and let π be the prior of the corresponding switch distribution.

Theorem 1 (Consistency of the switch distribution). Suppose π is positive everywhere on {s ∈ S | m(s) = 1} and such that for some positive constant c, for every s ∈ S, c · π(s) ≥ π(E_s). Suppose further that P̄_k(X_{n+1}^∞ | x^n) and P̄_{k′}(X_{n+1}^∞ | x^n) are mutually singular for all k, k′ ∈ Z^+, k ≠ k′, all n, all x^n ∈ X^n. Then, for all k* ∈ Z^+, for all θ* ∈ Θ_{k*} except for a subset of Θ_{k*} of w_{k*}-measure 0, the posterior distribution on K_{n+1} satisfies

\pi(K_{n+1} = k^* \mid X^n) \xrightarrow{\ n \to \infty\ } 1 \quad \text{with } P_{\theta^*}\text{-probability } 1.    (8)

The requirement that c · π(s) ≥ π(E_s) is automatically satisfied if π is of the form

\pi(s) = \pi_M(m)\,\pi_K(k_1) \prod_{i=2}^{m} \pi_T(t_i \mid t_i > t_{i-1})\,\pi_K(k_i),    (9)

where π_M, π_K and π_T are priors on Z^+ with full support, and π_M is geometric: π_M(m) = θ^{m−1}(1 − θ) for some 0 ≤ θ < 1. In this case c = θ/(1 − θ).
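A prior of the form (9) is straightforward to evaluate. In the sketch below (ours), π_M is geometric as required, while the particular choices of π_K and π_T are placeholders with full support on Z^+; any other full-support priors would do.

```python
import math

def log_pi_K(k):
    """Prior on model indices with full support on Z+ (placeholder choice):
    pi_K(k) = 2^(-k)."""
    return -k * math.log(2)

def log_pi_T(t, t_prev):
    """Prior on the next switch point, conditioned on t > t_prev (placeholder:
    a geometric tail over t = t_prev + 1, t_prev + 2, ...)."""
    return -(t - t_prev) * math.log(2)

def log_prior_switch(s, theta=0.5):
    """log pi(s) for a prior of the form (9):
    pi(s) = pi_M(m) pi_K(k_1) prod_{i>=2} pi_T(t_i | t_i > t_{i-1}) pi_K(k_i),
    with geometric pi_M(m) = theta^(m - 1) (1 - theta)."""
    m = len(s)
    logp = (m - 1) * math.log(theta) + math.log(1 - theta) + log_pi_K(s[0][1])
    for (t_prev, _), (t, k) in zip(s, s[1:]):
        logp += log_pi_T(t, t_prev) + log_pi_K(k)
    return logp

s = ((0, 1), (3, 2))               # switch from model 1 to model 2 at t = 3
print(log_prior_switch(s))
```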
We now extend the theorem to the case where the universal distributions p̄_1, p̄_2, ... are not necessarily Bayesian, i.e. they are not necessarily of the form (1). It turns out that the "meta-Bayesian" universal distribution P_sw is still consistent, as long as the following condition holds. The condition essentially expresses that, for each k, p̄_k must not be too different from a Bayesian predictive distribution based on (1). This can be verified if all models M_k are exponential families, and the p̄_k represent ML or smoothed ML estimators (see Theorems 2.1 and 2.2 of [Li and Yu, 2000]). We suspect that it holds as well for more general parametric models and universal codes, but we do not know of any proof.

Condition. There exist Bayesian prediction strategies p̄_1^B, p̄_2^B, ... of form (1), with continuous and strictly positive priors w_1, w_2, ..., such that

1. The conditions of Theorem 1 hold for p̄_1^B, p̄_2^B, ... and the chosen switch distribution prior π.

2. For all k ∈ Z^+, for each compact subset Θ′ of the interior of Θ_k, there exists a K such that for all θ ∈ Θ′, with P_θ-probability 1, for all n,

-\log \bar{p}_k(X^n) + \log \bar{p}_k^B(X^n) \le K.

3. For all k, k′ ∈ Z^+ with k ≠ k′ and all x^n ∈ X^*, the distributions P̄_k^B(X_{n+1}^∞ | x^n) and P̄_{k′}(X_{n+1}^∞ | x^n) are mutually singular.

Theorem 2 (Consistency of the switch distribution, Part 2). Let p̄_1, p̄_2, ... be prediction strategies and let π be the prior of the corresponding switch distribution. Suppose that the condition above holds relative to p̄_1, p̄_2, ... and π. Then, for all k* ∈ Z^+, for all θ* ∈ Θ_{k*} except for a subset of Θ_{k*} of Lebesgue measure 0, the posterior distribution on K_{n+1} satisfies

\pi(K_{n+1} = k^* \mid X^n) \xrightarrow{\ n \to \infty\ } 1 \quad \text{with } P_{\theta^*}\text{-probability } 1.    (10)

4 Risk Convergence Rates

In this section and the next we investigate how well the switch distribution is able to predict future data in terms of expected logarithmic loss or, equivalently, how fast estimates based on the switch distribution converge to the true distribution in terms of Kullback-Leibler risk. In Section 4.1, we define the central notions of model classes, risk, convergence in Cesàro mean, and minimax convergence rates, and we give the conditions on the prior distribution π under which our further results hold. We then (Section 4.2) show that the switch distribution cannot converge any slower than standard Bayesian model averaging. As a proof of concept, in Section 4.3 we present Theorem 4, which establishes that, in contrast to Bayesian model averaging, the switch distribution converges at the minimax optimal rate in a nonparametric histogram density estimation setting.

In the more technical Section 5, we develop a number of general tools for establishing optimal convergence rates for the switch distribution, and we show that optimal rates are achieved in, for example, nonparametric density estimation with exponential families and (basic) nonparametric linear regression, and also in standard parametric situations.

4.1 Preliminaries

4.1.1 Model Classes

The setup is as follows. Suppose M_1, M_2, ... is a sequence of parametric models with associated estimators P̄_1, P̄_2, ... as before. Let us write M = ∪_{k=1}^∞ M_k for the union of the models. Although formally M is a set of prediction strategies, it will often be useful to consider the corresponding set of distributions for X^∞ = (X_1, X_2, ...). With minor abuse of notation we will denote this set by M as well.

To test the predictions of the switch distribution, we will want to assume that X^∞ is distributed according to a distribution P* that satisfies certain restrictions. These restrictions will always be formulated by assuming that P* ∈ M*, where M* is some restricted set of distributions for X^∞. For simplicity, we will also assume throughout that, for any n, the conditional distribution P*(X_n | X^{n−1}) has a density (relative to the Lebesgue or counting measure) with probability one under P*. For example, if X = [0, 1], then M* might be the set of all i.i.d. distributions that have uniformly bounded densities with uniformly bounded first derivatives, as will be considered in Section 4.3. In general, however, the sequence X^∞ need not be i.i.d. (under the elements of M*).

We will refer to any set of distributions for X^∞ as a model class. Thus both M and M* are model classes. In Section 5.5 it will be assumed that M* ⊆ M, which we will call the parametric setting. Most of our results, however, deal with various nonparametric situations, in which M* \ M is non-empty. It will then be useful to emphasize that M* is (much) larger than M by calling M* a nonparametric model class.

4.1.2 Risk

Given X^{n−1} = x^{n−1}, we will measure how well any estimator P̄ predicts X_n in terms of the Kullback-Leibler (KL) divergence D(P*(X_n = · | x^{n−1}) ‖ P̄(X_n = · | x^{n−1})) [Barron, 1998]. Suppose that P and Q are distributions for some random variable Y, with densities p and q respectively. Then the KL divergence from P to Q is defined as

D(P \,\|\, Q) = E_P\!\left[ \log \frac{p(Y)}{q(Y)} \right].

KL divergence is never negative, and reaches zero if and only if P equals Q. Taking an expectation over X^{n−1} leads to the standard definition of the risk of estimator P̄ at sample size n relative to KL divergence:

r_n(P^*, \bar{P}) = E_{X^{n-1} \sim P^*}\!\left[ D\big( P^*(X_n = \cdot \mid X^{n-1}) \,\|\, \bar{P}(X_n = \cdot \mid X^{n-1}) \big) \right].    (11)

Instead of the standard KL risk, we will study the cumulative risk

R_n(P^*, \bar{P}) := \sum_{i=1}^{n} r_i(P^*, \bar{P}),    (12)

because of its connection to information-theoretic redundancy (see e.g. [Barron, 1998] or [Grünwald, 2007, Chapter 15]): for all n it holds that

\sum_{i=1}^{n} r_i(P^*, \bar{P}) = \sum_{i=1}^{n} E\!\left[ \log \frac{p^*(X_i \mid X^{i-1})}{\bar{p}(X_i \mid X^{i-1})} \right] = E\!\left[ \log \prod_{i=1}^{n} \frac{p^*(X_i \mid X^{i-1})}{\bar{p}(X_i \mid X^{i-1})} \right] = D\big( P^{*(n)} \,\|\, \bar{P}^{(n)} \big),    (13)

where the superscript (n) denotes taking the marginal of the distribution on the first n outcomes. We will show convergence of the predictions of the switch distribution in terms of the cumulative rather than the individual risk. This notion of convergence, defined below, is equivalent to the well-studied notion of convergence in Cesàro mean. It has been considered by, among others, Rissanen et al. [1992], Barron [1998], Poland and Hutter [2005], and its connections to ordinary convergence of the risk were investigated in detail by Grünwald [2007].
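As a concrete instance of (11) and (12), the following sketch (ours) computes the KL risk and cumulative risk of the Laplace estimator from Example 1 in the i.i.d. Bernoulli setting, where the expectation over X^{i−1} reduces to a sum over the binomial distribution of the number of ones.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence D(Ber(p) || Ber(q)) in nats, for p, q strictly in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def binom_pmf(n, k, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def risk_laplace(theta_star, i):
    """r_i(P*, Pbar): KL risk of the Laplace predictor at sample size i, taking
    the expectation over X^{i-1} exactly via the binomial law of the count of ones."""
    n_prev = i - 1
    total = 0.0
    for s in range(n_prev + 1):
        theta_hat = (s + 1) / (n_prev + 2)
        total += binom_pmf(n_prev, s, theta_star) * kl_bernoulli(theta_star, theta_hat)
    return total

def cumulative_risk(theta_star, n):
    """R_n(P*, Pbar) = sum_{i=1}^n r_i(P*, Pbar), as in (12)."""
    return sum(risk_laplace(theta_star, i) for i in range(1, n + 1))

print(cumulative_risk(0.3, 50))    # grows logarithmically in n, the parametric rate
```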
Asymptotic properties like 'convergence' and 'convergence in Cesàro mean' will be expressed conveniently using the following notation, which extends notation from [Yang and Barron, 1999]:

Definition 2 (Asymptotic Ordering of Functions). For any two nonnegative functions g, h : Z^+ → R and any c ≥ 0 we write g ≼_c h if for all ε > 0 there exists an n_0 such that for all n ≥ n_0 it holds that g(n) ≤ (1 + ε) · c · h(n). The less precise statement that there exists some c > 0 such that g ≼_c h will be denoted by g ≼ h. (Note the absence of the subscript.) For c > 0, we define h ≽_c g to mean g ≼_{1/c} h, and h ≽ g means that for some c > 0, h ≽_c g. Finally, we say that g ≍ h if both g ≼ h and h ≼ g.

Note that g ≼ h is equivalent to g(n) = O(h(n)). One may think of g(n) ≼_c h(n) as another way of writing limsup_{n→∞} g(n)/h(n) ≤ c. The two statements are equivalent if h(n) is never zero.

We can now succinctly state that the risk of an estimator P̄ converges to 0 at rate f(n) if r_n(P*, P̄) ≼_1 f(n), where f : Z^+ → R is a nonnegative function such that f(n) goes to 0 as n increases. We say that P̄ converges to 0 at rate at least f(n) in Cesàro mean if (1/n) Σ_{i=1}^n r_i(P*, P̄) ≼_1 (1/n) Σ_{i=1}^n f(i). As ≼_1-ordering is invariant under multiplication by a positive constant, convergence in Cesàro mean is equivalent to asymptotically bounding the cumulative risk of P̄ as

\sum_{i=1}^{n} r_i(P^*, \bar{P}) \preceq_1 \sum_{i=1}^{n} f(i).    (14)

We will always express convergence in Cesàro mean in terms of cumulative risks as in (14). The reader may verify that if the risk of P̄ is always finite and converges to 0 at rate f(n) and lim_{n→∞} Σ_{i=1}^n f(i) = ∞, then the risk of P̄ also converges in Cesàro mean at rate f(n). Conversely, suppose that the risk of P̄ converges in Cesàro mean at rate f(n). Does this also imply that the risk of P̄ converges to 0 at rate f(n) in the ordinary sense? The answer is "almost", as shown in [Grünwald, 2007]: the risk of P̄ may be strictly larger than f(n) for some n, but the gap between any two n and n′ > n at which the risk of P̄ exceeds f must become infinitely large with increasing n. This indicates that, although convergence in Cesàro mean is a weaker notion than standard convergence, obtaining fast Cesàro mean convergence rates is still a worthy goal in prediction and estimation. We explore the connection between Cesàro and ordinary convergence in more detail in Section 5.2.

4.1.3 Minimax Convergence Rates

The worst-case cumulative risk of the switch distribution is given by

G_{\text{sw}}(n) = \sup_{P^* \in \mathcal{M}^*} \sum_{i=1}^{n} r_i(P^*, P_{\text{sw}}).    (15)

We will compare it to the minimax cumulative risk, defined as

G_{\text{mm-fix}}(n) := \inf_{\bar{P}} \sup_{P^* \in \mathcal{M}^*} \sum_{i=1}^{n} r_i(P^*, \bar{P}).    (16)
