
Generating Probabilities From Numerical Weather Forecasts by Logistic Regression

Jochen Bröcker*

January 28, 2009

arXiv:0901.4460v1 [physics.ao-ph] 28 Jan 2009

* Max-Planck-Institut für Physik komplexer Systeme, Nöthnitzer Strasse 34, 01187 Dresden, Germany, email: [email protected]

Abstract

Logistic models are studied as a tool to convert output from numerical weather forecasting systems (deterministic and ensemble) into probability forecasts for binary events. A logistic model obtains by putting the logarithmic odds ratio equal to a linear combination of the inputs. As any statistical model, logistic models will suffer from over-fitting if the number of inputs is comparable to the number of forecast instances. Computational approaches to avoid over-fitting by regularisation are discussed, and efficient approaches for model assessment and selection are presented. A logit version of the so called lasso, which is originally a linear tool, is discussed. In lasso models, less important inputs are identified and discarded, thereby providing an efficient and automatic model reduction procedure. For this reason, lasso models are particularly appealing for diagnostic purposes.

1 Introduction

Providing forecasts of future events in terms of probabilities has a long and successful history in the environmental sciences. The inherently unstable dynamics of the atmosphere in conjunction with incomplete information on its current state prohibit unequivocal forecasts. Probabilities allow one to quantify uncertainty (or the absence thereof, i.e. information) in a well defined and consistent manner. So called "subjective probability forecasts", compiled by experienced meteorologists, were issued by several meteorological offices from the 1950's onwards. These forecasts were based on synoptic weather data collected from a large number of stations. On a scientific (non-operational) level, probabilistic weather forecasts were discussed much earlier, based on synoptic information as well as local station data (see e.g. Besson, 1905; Murphy, 1998, provides an excellent account of the history of probability forecasting of weather, along with many references). There is evidence that subjective probabilistic weather forecasts were compiled from non-synoptic information as early as the 1790's (Murphy, 1998). Desirable properties of probabilistic forecasts as well as methods to quantify their success have been investigated in various papers, see for example von Myrbach (1913); Brier (1950); Good (1952); Winkler and Murphy (1968); Murphy and Winkler (1977, 1987); Murphy (1993, 1996), a list which is by no means complete.

The advent of the electronic computer opened the possibility to calculate "numerical subjective probabilities", that is, to calculate probabilities from information data using tuned algorithms. More specifically, an automated statistical learning procedure can be employed to find a relationship (also called model) between the information data (also called covariates, features, or inputs) and the statistical properties of the variable to be forecast (also called target, or verification). Possible inputs might for example be provided by output from a numerical weather prediction system, in which case the problem is also referred to as ensemble calibration or probabilistic down-scaling. A model (or more specifically model class) which has gained some attention in the meteorological community is the logistic model, see for example Tippet et al. (2007); Wilks (2006); Hamill et al. (2001) and also references therein for various alternatives.
Logistic models, often also referred to as logistic regression, will be the subject of this paper. We will exclusively be concerned with dichotomic problems, that is, we are only interested in forecasting whether a certain event happens or not. In this case, the logistic model obtains by taking the logarithmic odds ratio log(ρ/(1−ρ)) of the forecast probability ρ of the event as a linear function of the inputs. In other words,

$$\rho = \frac{\exp(x\beta^t)}{1+\exp(x\beta^t)}, \qquad (1)$$

where x are the inputs and β some coefficients. The maximum likelihood principle provides a convenient way to find the coefficients, but alternatives will be considered in this paper, too. In any case, the coefficients are found by optimising the performance over some training data. As with other regression models though, this approach runs into problems if the number of inputs is of the same order of magnitude as the number of instances in the training data, in which case the inputs are typically also highly correlated. This is a common situation in weather forecasting, owing to the large number of available forecast sources. One way to avoid over-fitting in this situation is to manually restrict the number of inputs to the few that seem to be most relevant. This was carried out for example by Besson (1905), but often such a study would require assessing an astronomically large number of different combinations.

Another way is to apply regularisation, which means to reduce the effective degrees of freedom of the model. Efficient regularisation techniques exist for linear models. Owing to the great similarity between linear and logistic models, these techniques can be modified and applied to logistic models, as will be demonstrated in this paper. Logistic models will be defined in Section 2. Section 3 discusses how to regularise logistic models along with further computational and implementational aspects. Sections 2 and 3 therefore form the core of the paper. Before getting there, some notation and concepts need to be introduced (Section 2). Numerical studies employing precipitation forecasts are presented in Section 4. Apart from the predictive power of logistic models, they also permit investigation of the relative importance of the inputs. The material of this last section thus also demonstrates the capabilities of logistic models as a diagnostic tool.

2 Problem Statement and Concepts

The primary aim of this section is to settle notational conventions and introduce some concepts. The general setup we have in mind is as follows. As in the introduction, let the target Y be a random variable taking the values 0 and 1 only, with Y = 1 indicating that the event under concern happened and Y = 0 otherwise. The inputs X are random variables too, taking values in R^d. By x, we will denote any generic point in R^d, while y is always either zero or one. The underlying probability measure will be denoted by P. The probabilistic relationship between X and Y is described through the following objects. Let

$$f_0(x) := P(X \in x + dx \mid Y = 0), \qquad f_1(x) := P(X \in x + dx \mid Y = 1), \qquad (2)$$

that is, f_0 and f_1, respectively, are the densities of X given Y = 0 and Y = 1, respectively. By

$$\pi(x) := P(Y = 1 \mid X = x) \qquad (3)$$

we denote the conditional probability of the event "Y = 1" given X, and

$$\bar\pi := P(Y = 1) \qquad (4)$$

denotes the base rate or grand probability of the event "Y = 1". Finally,

$$f(x) := P(X \in x + dx) \qquad (5)$$

denotes the unconditional density of the inputs X. The Bayes rule entails various relations between these objects, for example f(x) = f_1(x)π̄ + f_0(x)(1−π̄). A model is a function ρ : R^d → [0,1] so that ρ(X) is the forecast probability of the event "Y = 1".
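As a concrete illustration of such a model, the short Python sketch below evaluates the logistic model of Eq. (1) for a batch of input vectors. The helper names and the numbers are purely illustrative and are not taken from the paper.

```python
import numpy as np

def logistic_link(z):
    """Link function g(z) = exp(z) / (1 + exp(z)).  For very large |z| a
    numerically stable variant (e.g. scipy.special.expit) would be preferable."""
    return 1.0 / (1.0 + np.exp(-z))

def forecast_probability(x, beta):
    """Evaluate rho(x) = g(x beta^t) as in Eq. (1).

    x    : array of shape (n_samples, d+1); by the paper's convention the
           first column is a constant 1, so beta[0] acts as the intercept.
    beta : coefficient vector of length d+1.
    """
    eta = x @ beta              # linear response, i.e. the log-odds ratio
    return logistic_link(eta)

# Hypothetical example: two forecast instances with two inputs plus the constant.
x = np.array([[1.0, 0.3, -1.2],
              [1.0, 2.1,  0.4]])
beta = np.array([-0.5, 1.0, 0.8])   # illustrative coefficients, not fitted values
print(forecast_probability(x, beta))
```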
Generally speaking, the problem discussed in this paper is to find "good" models, where it remains to be defined what "good" means. Intuitively, it is clear that ρ(x) := π(x) is a good model. Unfortunately, π(x) is not an empirically measurable quantity, and therefore "interpolating" or "fitting" π(x) is not a possible approach to determine ρ(x). A general operational criterion for fitting (deterministic or probabilistic) relationships between measured quantities is to optimise the estimated performance, according to suitable criteria of performance. Such criteria are the subject of the next subsection.

2.1 Scoring Rules

A scoring rule (Good, 1952; Kelly, 1956; Brown, 1970; Savage, 1971) is a function S(p,y), where p ∈ [0,1] and y is either zero or one. If ρ(X) is the forecast probability and Y is the corresponding target, then S(ρ(X),Y) quantifies how well ρ(X) succeeded in forecasting Y. Two important examples are the Ignorance score (Good, 1952), given by the scoring rule

$$S(p,y) := -\log(p)\cdot y - \log(1-p)\cdot(1-y), \qquad (6)$$

and the Brier score (Brier, 1950), given by the scoring rule

$$S(p,y) := (y-p)^2 = (1-p)^2\cdot y + p^2\cdot(1-y). \qquad (7)$$

These definitions imply the convention that a smaller score indicates a better forecast.

A score or scoring rule quantifies the success of individual forecast instances by comparing the random variables ρ(X) and Y point-wise. The general quality of a forecasting system is commonly measured by the mathematical expectation E[S(ρ(X),Y)] of the score, which can be estimated by the empirical mean

$$E[S(\rho(X),Y)] \cong \frac{1}{N}\sum_{i=1}^{N} S(\rho(x_i), y_i) \qquad (8)$$

over a sufficiently large set {(x_i, y_i); i = 1...N} of input-target pairs.
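The scoring rules of Eqs. (6) and (7) and the empirical mean of Eq. (8) translate directly into code; the following sketch uses hypothetical forecast-target pairs for illustration.

```python
import numpy as np

def ignorance_score(p, y):
    """Ignorance score, Eq. (6): -log(p)*y - log(1-p)*(1-y)."""
    return -np.log(p) * y - np.log(1.0 - p) * (1.0 - y)

def brier_score(p, y):
    """Brier score, Eq. (7): (y - p)^2."""
    return (y - p) ** 2

def empirical_score(score, p, y):
    """Empirical mean score over forecast-target pairs, Eq. (8)."""
    return np.mean(score(np.asarray(p), np.asarray(y)))

# Hypothetical forecasts and binary outcomes (illustrative numbers only).
p = np.array([0.8, 0.2, 0.6, 0.1])
y = np.array([1, 0, 0, 0])
print(empirical_score(ignorance_score, p, y))
print(empirical_score(brier_score, p, y))
```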
Reassuringly, for the two mentioned scoring rules the score becomes better (i.e. decreases) with increasing ρ if the event happened, while if it does not, the score becomes worse (i.e. increases) with increasing ρ. Furthermore, both scores are proper. To define this notion, consider the scoring function

$$s(q,p) := S(q,1)\cdot p + S(q,0)\cdot(1-p) \qquad (9)$$

where q, p are two arbitrary probabilities, that is, numbers in the unit interval. The scoring function is the mathematical expectation of the score in a situation where the forecast is q but in fact p is the true probability of the event "Y = 1". A score is strictly proper (Brown, 1970; Bröcker and Smith, 2007) if the divergence function (or loss function)

$$d(q,p) := s(q,p) - s(p,p) \qquad (10)$$

is positive definite, that is, never negative and zero only if p = q. The divergence function of the Brier score, for example, is d(q,p) = (q−p)², demonstrating that this score is strictly proper. The Ignorance is proper as well, since (10) is just the Kullback-Leibler divergence, which is well known to be positive definite.

The mathematical expectation of a strictly proper score allows for a very interesting decomposition (see Bröcker, 2007, for a proof). For any strictly proper scoring rule, define the entropy e(p) := s(p,p). Furthermore, let π_ρ(r) := P(Y = 1 | ρ(X) = r) be the conditional probability of Y = 1 given that ρ(X) = r. This quantity is a function of ρ, but it is a fully calibrated probability forecast. With these definitions, it can be shown that

$$E\,S(\rho,Y) = e(\bar\pi) - E\,d(\bar\pi,\pi) + E\,d(\pi_\rho,\pi) + E\,d(\rho,\pi_\rho). \qquad (11)$$

These terms can be interpreted as follows: The entropy e(π̄) is the ability of the base rate π̄ to forecast draws from itself, and hence quantifies the fundamental uncertainty inherent in Y. The term E d(π̄,π) is positive definite and quantifies the average divergence of π from its mean. It can hence be considered a generalised variance of π. If the Brier score is used, this term is in fact the ordinary variance of π. The term E d(π_ρ,π) is also positive definite and quantifies how much information is lost when going over from X to ρ(X). The term E d(ρ,π_ρ) is again positive definite and quantifies the imperfect calibration of ρ. The reader might want to convince himself that if the Brier score is used and furthermore X = ρ(X) (i.e. the inputs already comprise a probability forecast), then relation (11) agrees with the well known decomposition of the Brier score. In particular, if ρ(x) = π(x), then also π_ρ = π(x), and hence the third and fourth terms in Equation (11) vanish. We can conclude that the forecast π(X) in fact yields an optimum expected score among all models which can be written as a function of X. To achieve yet better scores, more information about Y is needed than what is contained in X.

2.2 Logistic Regression

Comprehensive discussions of logistic regression can be found in McCullagh and Nelder (1989); Hastie et al. (2001). As mentioned in the Introduction, logistic regression assumes a model of the form ρ(x) = g(xβ^t), where β are the coefficients and

$$g(z) := \frac{\exp(z)}{1+\exp(z)} \qquad (12)$$

is the so-called link function. The quantity g^{-1}(z) = log(z/(1−z)) is referred to as the log-odds ratio. We will call η := xβ^t the linear response. From now on and throughout the paper, we will assume that the inputs x carry an entry 1 in the first position and that β = (β_0 ... β_d), where β_0 is referred to as the intercept. Since g^{-1}(ρ) = η, in logistic models the log-odds ratio equals the linear response. The coefficients β are determined by minimising the empirical score. Locally around the optimum, this minimisation turns out to be equivalent to weighted linear regression, as will be seen in the next subsection. Thereby, logistic models inherit various useful properties from linear models, as long as strictly proper scores are used in the empirical score. This fact will be exploited in the next section.

3 Computational Topics and Regularisation of Logistic Models

Consider the empirical score of a logistic model

$$R(\beta) := \frac{1}{N}\sum_{k=1}^{N} S(g(x_k\beta^t), y_k) \qquad (13)$$

where as before S is a scoring rule, g is the link function, and {(x_k, y_k), k = 1...N} is a set of input-target pairs, henceforth called the training set. We let R̂ denote the minimum of R with respect to β, and β̂ a corresponding stationary point. To find a stationary point of R, the Newton-Raphson algorithm can be used. The update step for this iterative algorithm can be written as

$$\beta^t_{\mathrm{new}} = \beta^t - \left(X^t W X\right)^{-1} X^t d \qquad (14)$$

with the abbreviations

$$X_{kl} := x_k^{(l)} \qquad (15)$$
$$d_k := \frac{\partial}{\partial\eta} S(g(\eta_k), y_k) \qquad (16)$$
$$w_k := \frac{\partial^2}{\partial\eta^2} S(g(\eta_k), y_k) \qquad (17)$$
$$W_{kl} := \delta_{kl}\, w_k, \qquad (18)$$

where η_k := x_k β^t is the log-odds ratio for sample k. In the case of the Ignorance, it is easy to see that

$$d_k = g(\eta_k) - y_k, \qquad w_k = g(\eta_k)(1 - g(\eta_k)). \qquad (19)$$

Equation (14) is in fact very similar to weighted least squares regression with linear models (Hastie et al., 2001).
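Eqs. (14)-(19) translate almost line by line into code. The sketch below implements the Newton-Raphson iteration for the unpenalised empirical Ignorance; the iteration cap and convergence tolerance are arbitrary choices, not values from the paper.

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Newton-Raphson fit of a logistic model under the Ignorance score,
    following Eqs. (14)-(19).

    X : design matrix of shape (N, d+1) with a leading column of ones.
    y : binary targets of length N.
    """
    N, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta                      # linear response eta_k
        g = 1.0 / (1.0 + np.exp(-eta))      # g(eta_k)
        d = g - y                           # first derivative, Eq. (19)
        w = g * (1.0 - g)                   # second derivative, Eq. (19)
        H = X.T @ (w[:, None] * X)          # X^t W X
        step = np.linalg.solve(H, X.T @ d)  # (X^t W X)^{-1} X^t d
        beta = beta - step                  # update step, Eq. (14)
        if np.max(np.abs(step)) < tol:
            break
    return beta
```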
3.1 L_2-Type Regularisation

Whichever score or minimisation algorithm we employ, the coefficients β̂ so determined will show poor out-of-sample performance if the number of degrees of freedom is of the same order of magnitude as the number of instances in the training set. Small changes in the training set will entail large changes in the coefficients, or in other words, the coefficients will exhibit large variance. Apart from poor performance, the results become difficult to interpret (Hastie et al., 2001). To give a heuristic argument as to why this happens, suppose we want to fit a linear model to real valued data, and there are two highly correlated inputs. Any large value for β_1 (the coefficient corresponding to the first input) can be compensated by a large value (with opposing sign) for β_2 (the coefficient corresponding to the second input). The algorithm will use this freedom to "model" the random fluctuations in the inputs.

To avoid this behaviour, the degrees of freedom of the model have to be reduced (or, what amounts to the same, the range of variation of β), preferably in an adaptive manner. A straightforward approach is to search for a minimum of R only among all β with |β|² ≤ µ, where µ is a regularisation parameter. Here |β|² = Σ_{k=1}^d β_k². The reader is reminded of the convention that β has d+1 entries in total, but we decided not to put any constraint on the intercept β_0. Furthermore, we assume that all inputs are centred and scaled so that they have mean zero and unit variance.

To see why regularisation has the desired effect, and to obtain criteria for choosing an appropriate µ, the problem has to be brought into another form. For β̂ to be a stationary point of R under the constraint |β|² ≤ µ, it is necessary that there is a λ̂ so that the Lagrangian L(β,λ) := R(β) + λ(|β|² − µ) has a saddle at (β̂, λ̂), which obviously entails that β̂ is a stationary point of

$$R_{\hat\lambda}(\beta) := R(\beta) + \hat\lambda|\beta|^2. \qquad (20)$$

We arrive at the following conclusion: If β̂ minimises R(β) under the constraint |β|² ≤ µ and if λ̂ is the corresponding Lagrange multiplier, then β̂ is a stationary point of R_λ̂(β). Conversely, if we fix a λ̂ > 0 and let β̂ be a stationary point of R_λ̂(β), then β̂ minimises R(β) under the constraint |β|² ≤ µ := |β̂|², and the corresponding Lagrange multiplier is λ̂.

It is probably more intuitive to optimise R(β) under a size constraint on β rather than to add a penalty term to the empirical score, as the former criterion makes the constraint on the coefficients more explicit. Augmenting the empirical score by a penalty term (as in Equation 20) has computational advantages though, in particular to establish a criterion for choosing the penalty λ, as will be seen soon. Whichever option is taken, it is understood from now on that, having fixed either λ or µ, the coefficients β̂ depend on λ.
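A minimal sketch of the penalised fit of Eq. (20) is given below, again for the Ignorance and with the intercept left unpenalised. The exact bookkeeping of the penalty term in gradient and Hessian (in particular the factor of 2) is one possible convention and should be read as an assumption rather than the paper's prescription.

```python
import numpy as np

def fit_logistic_ridge(X, y, lam, n_iter=50, tol=1e-8):
    """Newton-Raphson minimisation of the penalised empirical Ignorance,
    R_lambda(beta) = (1/N) sum_k S(g(x_k beta^t), y_k) + lam * |beta|^2,
    cf. Eq. (20), with no penalty on the intercept beta_0."""
    N, p = X.shape
    beta = np.zeros(p)
    pen = np.ones(p)
    pen[0] = 0.0                      # leave the intercept unconstrained
    for _ in range(n_iter):
        eta = X @ beta
        g = 1.0 / (1.0 + np.exp(-eta))
        grad = X.T @ (g - y) / N + 2.0 * lam * pen * beta
        hess = X.T @ ((g * (1.0 - g))[:, None] * X) / N + 2.0 * lam * np.diag(pen)
        step = np.linalg.solve(hess, grad)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

Scanning lam over a grid and comparing the resulting models then requires the out-of-sample criteria discussed next.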
Clearly, assessing the suitability of a particular λ by looking at either R(β̂) or R_λ(β̂) is impossible, since both measures are optimal for λ = 0. A less optimistic measure of performance is the leave-one-out score, which is defined as follows: Let β_î be the stationary point of (1/(N−1)) Σ_{k≠i} S(g(x_k β^t), y_k) + (λ/2)|β|², that is, we remove the i-th point from the training set. Having computed β_î for every i = 1...N, we form the leave-one-out output g_î := g(x_i β_î^t) and finally the leave-one-out score

$$R_{\mathrm{loo}}(\lambda) := \frac{1}{N}\sum_i S(g_{\hat\imath}, y_i), \qquad (21)$$

which is then investigated as a function of λ (or equivalently µ). The leave-one-out score evaluates every β_î on precisely that sample point which was removed from the training set before finding β_î. The prospect of having to determine the coefficients N times in order to compute the leave-one-out score for a single λ seems horrible at first glance, but a few calculations will help to simplify this problem drastically. In Appendix A, it will be shown that approximately

$$\eta_{\hat\imath} := x_i\beta_{\hat\imath}^t \cong \hat\eta_i + \frac{1}{1-\hat w_i\, x_i H^{-1} x_i^t}\; x_i H^{-1}\left(x_i^t \hat d_i + 2\Lambda\hat\beta\right) \qquad (22)$$

with

$$H = X^t\hat W X + (N-1)\Lambda, \qquad \Lambda = \mathrm{diag}(0, 1, \dots, 1). \qquad (23)$$

All quantities that carry a hat are evaluated at β̂. The right hand side of Equation (22) is a function of β̂ and λ and therefore can be calculated without having to determine all N leave-one-out coefficients explicitly. Furthermore, to compute η_î, only very few operations are required repeatedly for every i. The matrix inversion H^{-1} needs to be performed only once. In fact, if the Newton-Raphson method is used, all quantities can be recycled.

The leave-one-out error will in general be larger than R(β̂). The difference allows for a very interesting interpretation in terms of effective degrees of freedom of the model. We now proceed assuming that the Ignorance has been used as a score. Define δ by

$$R_{\mathrm{loo}}(\lambda) = R(\hat\beta) + \frac{\delta}{N}. \qquad (24)$$

It is possible to show (Stone, 1977) that for a model with free parameters, δ asymptotically equals the dimension of the parameter space. Using similar calculations in the present case, it is possible to show that δ ≅ d + 1 − O(λ) for small λ and δ → 1 for large λ, where d is the number of parameters in the model (not counting the intercept). If we interpret δ as the number of effective degrees of freedom of the model, we obtain the reassuring conclusion that for vanishing penalty the model has d + 1 effective degrees of freedom, while for increasing penalty the number of effective degrees of freedom reduces to 1, owing to the fact that no penalty was imposed on the intercept β_0. Equation (24) is a version of Akaike's information criterion (AIC, see e.g. Hastie et al., 2001). Akaike recommends that if models are indexed by a parameter λ, say, the model with minimum

$$\mathrm{AIC} := 2R(\hat\beta) + 2\frac{\delta}{N} \qquad (25)$$

should be selected, which we see is asymptotically equivalent to minimising R_loo. Although the AIC and the leave-one-out error are asymptotically the same, the two quantities can differ somewhat for very small sample sizes. If the degrees of freedom of a model are known for some reason, it is possible to use the AIC directly as a criterion for determining the regularisation parameter. We will however go the other way and, knowing R_loo and R(β̂), determine δ for diagnostic purposes.

There is a corresponding relation for the Brier score, relating R_loo, R(β̂) and the degrees of freedom. This statistic, known as the C_p statistic (Hastie et al., 2001), is given by

$$C_p := R(\hat\beta) + 2\frac{\delta}{N}\, E[g(1-g)]. \qquad (26)$$

The derivation of C_p assumes that g(1−g) is approximately constant. This might be justified in many linear regression situations or if g has a sharply concentrated distribution, but in general this seems to be a quite idealistic assumption. A direct calculation of the leave-one-out error of Equation (21) with the approximation (22), both valid for any score, does not suffer from these problems and was found here to give much better results.

As said, the AIC or the C_p statistic might still be a last resort if calculating the leave-one-out coefficients β_î is difficult or impossible. It is then necessary to obtain δ, the number of effective degrees of freedom, by other means. This can require tricky analysis. The next subsection discusses an interesting modification of the current setup for which the number of degrees of freedom is fortunately known.
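To make the relation between R_loo, R(β̂), δ and the AIC concrete, here is a brute-force sketch that refits the model N times instead of using the approximation of Eq. (22); it reuses the hypothetical fit_logistic_ridge helper from the sketch above and assumes the Ignorance as the score.

```python
import numpy as np

def ignorance(p, y):
    """Empirical Ignorance of probability forecasts p for binary targets y."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def loo_edf_aic(X, y, lam):
    """Leave-one-out score (Eq. 21), effective degrees of freedom delta
    (Eq. 24) and AIC (Eq. 25) for a given penalty lam."""
    N = len(y)
    p_loo = np.empty(N)
    for i in range(N):
        keep = np.arange(N) != i                    # remove the i-th point
        beta_i = fit_logistic_ridge(X[keep], y[keep], lam)
        p_loo[i] = 1.0 / (1.0 + np.exp(-(X[i] @ beta_i)))
    r_loo = ignorance(p_loo, y)                     # R_loo(lambda), Eq. (21)
    beta_hat = fit_logistic_ridge(X, y, lam)
    r_hat = ignorance(1.0 / (1.0 + np.exp(-(X @ beta_hat))), y)
    delta = N * (r_loo - r_hat)                     # effective dof, Eq. (24)
    aic = 2.0 * r_hat + 2.0 * delta / N             # Eq. (25)
    return r_loo, delta, aic
```

For realistic sample sizes one would of course use the single-fit approximation of Eqs. (22)-(23) rather than the explicit loop.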
3.2 L_1-Type Regularisation, or the Lasso

In the previous subsection, we regularised our estimates by constraining the size of β, measured in the L_2 sense. For linear regression, Tibshirani (1996) suggested using the L_1 norm as an alternative, that is, the score is minimised under the constraint |β| ≤ µ, where |β| := Σ_{k=1}^d |β_k|. As before, no constraint is placed on the intercept. The resulting regression technique has become known as the lasso. An interesting feature of the lasso is that with increasingly tight constraining, some coefficients become exactly zero. Interpreting the corresponding inputs as "less important", the lasso technique is appealing also from a diagnostic point of view. Recently, it has been shown by Zou et al. (2007) that, consistent with intuition, the number of degrees of freedom of the lasso is given by the number of nonzero coefficients.

The main features of the lasso persist when logistic models are used with an L_1 penalty on the coefficients, that is, with increasingly tight constraining, some of the coefficients vanish exactly. The name "lasso" will be kept also for the logistic case, even though it was originally used for the linear case only. Calculating the coefficients for the lasso is more involved than for standard regression or L_2-regularised regression and requires quadratic optimisation techniques. We have not rigorously proven that the number of degrees of freedom in the logistic case is still given by the number of nonzero coefficients, although this appears to be quite plausible. Hence we suggest determining the regularisation parameter by minimising

$$\mathrm{AIC} = 2R(\hat\beta) + 2\frac{\delta}{N}, \qquad (27)$$

where δ is the number of nonzero coefficients, R is the empirical Ignorance score, and β̂ is the coefficient vector which minimises R(β) under the constraint |β| ≤ µ, or equivalently minimises R(β) + λ|β|.
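For the logit lasso, a sketch of the AIC-based selection of the regularisation parameter (Eq. 27) is given below. It leans on scikit-learn's L1-penalised logistic regression rather than the quadratic optimisation route mentioned above; the mapping between λ and the library's parameter C, and whether the intercept is counted in δ, are treated loosely here. Inputs are assumed centred and scaled, and without a constant column, since the library fits the intercept itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lasso_logistic_aic(X, y, lambdas):
    """Fit L1-penalised logistic models over a grid of penalties and select
    the one minimising AIC = 2 R(beta_hat) + 2 delta / N, Eq. (27), where
    delta is the number of nonzero (penalised) coefficients and R is the
    empirical Ignorance."""
    N = len(y)
    best = None
    for lam in lambdas:
        model = LogisticRegression(penalty="l1", C=1.0 / lam,
                                   solver="liblinear", max_iter=1000)
        model.fit(X, y)
        p = np.clip(model.predict_proba(X)[:, 1], 1e-12, 1.0 - 1e-12)
        r_hat = -np.mean(y * np.log(p) + (1 - y) * np.log(1.0 - p))
        delta = np.count_nonzero(model.coef_)       # nonzero coefficients
        aic = 2.0 * r_hat + 2.0 * delta / N
        if best is None or aic < best[0]:
            best = (aic, lam, model)
    return best
```

Inspecting which coefficients survive as λ grows then gives the kind of diagnostic ranking of inputs used in Section 4.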
4 Numerical Experiments

In this section, regularised logistic regression is applied to the occurrence of precipitation. Several weather stations in Western Europe were investigated, with similar findings. As a representative example, results for Heligoland, Germany (WMO 10015) are presented here. As inputs, high resolution deterministic and ensemble forecasts were used. The ensemble forecasts consist of the 50 (perturbed) member ensemble produced by the then-operational ECMWF global ensemble prediction system. Station data of precipitation was kindly provided by ECMWF as well. Forecasts were available for the years 2001–2005, featuring lead times from one to ten days. All data verified at noon. To form the inputs, high resolution, control and ensemble forecasts for mean sea level pressure, two metre temperature, and precipitation itself were used.

Different combinations of inputs were tested. To describe the input combinations, we will use the following abbreviations: We write prcp, mslp, and t2m for precipitation, mean sea level pressure, and two metre temperature, respectively. The high resolution forecast is indicated with a suffix h, while the control and the ensemble carry the suffixes c and e, respectively. For example, the high resolution mean sea level pressure forecast is denoted by mslp h, and the ensemble two metre temperature forecast by t2m e. The input season is simply the phase of the year, that is,

$$\mathrm{season} = \left(\cos\Big(\tfrac{2\pi}{365.2425}\,n\Big),\ \sin\Big(\tfrac{2\pi}{365.2425}\,n\Big)\right), \qquad n = \text{number of the day}. \qquad (28)$$

In total, four different combinations were investigated (see Table 1). All combinations include season. The first combination adds prcp h, resulting in 3 inputs. The second combination uses all available high resolution forecasts twice, plain and squared (8 inputs), thereby modelling potential nonlinear connections between precipitation events and the inputs. The third combination uses all available precipitation forecasts: prcp h, prcp c, and prcp e (54 inputs). The fourth combination uses all available forecasts: high resolution, control and ensemble for precipitation, pressure and temperature. Again, each forecast is included both plain and squared. This combination comprises 314 inputs. I should say that, to the best of my knowledge (although I cannot claim to have combed the literature very thoroughly), so far only the ensemble mean and spread have been used as inputs to logistic regression. This might have been done either to avoid over-fitting or in the belief that the ensemble does not contain relevant information beyond mean and spread.

The results for the four combinations are displayed in Figures 1–4, which show the empirical score (top panels) and the number of effective degrees of freedom (bottom panels), as defined in Equation (24). The performance is presented in an incremental fashion: Figure 1, top panel, shows the performance of combination I relative to climatology, Figure 2, top panel, shows the performance of combination II relative to combination I, and so forth. The confidence bars for all plots were obtained using 10-fold cross validation. Figure 1, top panel, demonstrates the interesting (albeit maybe not surprising) fact that the high resolution forecast for precipitation contains a fair amount of probabilistic information, if processed correctly. The forecast even seems to have skill out to day 9. Since this model is trained on around 1600 samples but has only four parameters, it is not surprising that the number of effective degrees of freedom δ is between 4 and 5. The somewhat odd finding that δ is even larger than the total number of parameters has two reasons. Firstly, we are dealing with a finite data set, while the result is true only in the limit of an infinite amount of data. Secondly, due to the penalisation, our parameters are not asymptotically unbiased. Strictly speaking, this entails a further correction to Equation (24), which we found to be less than 0.1 in all considered examples.

Adding high resolution forecasts for mean sea level pressure and temperature to the mix generally adds skill, as Figure 2, top panel, demonstrates. There seems to be no effect though at high lead times, like 9 days. We will see later that there is strong indication that the squared mean sea level pressure adds additional information, more specifically the high resolution mslp h² at short lead times and the ensemble mslp e² at longer lead times. What we can see already from Figure 2, bottom panel, though, is that the number of effective degrees of freedom increases, in particular for lead times 48h and 96h, where this combination features the largest increase in skill over combination I. The number of effective degrees of freedom δ is however always significantly smaller than 9, the number of coefficients in this model. This indicates that not all of the additional inputs add independent information.

Somewhat surprisingly, the precipitation ensemble (along with prcp h and prcp c) shows close to no improvement in skill over combination I in this context, apart from lead time 48h maybe (Fig. 3, top panel). The number of effective degrees of freedom δ is always far less than the number of coefficients in this model, and from lead times 96h onwards, δ is even comparable to what was found for combination II.

A significant increase in skill is obtained using combination IV (Fig. 4, top panel). In addition to combination III, the present combination uses pressure and temperature as well as all variables once plain and once squared. It would of course be interesting to know which of the inputs makes the biggest difference. We have not investigated this in full, but a partial answer will be obtained later using the lasso technique. What is important here is that despite the large number of coefficients (314), there appear to be no signs of over-fitting. Figure 4, bottom panel, demonstrates that the number of effective degrees of freedom δ, albeit always far less than the number of coefficients in this model, is significantly larger than for any other combination tested here. We can conclude that the additional inputs in fact do add additional information.
An interesting aside is that δ always seems to have a maximum around 96h lead time. We have not investigated the reason, but speculate that this is related to the ensemble generation. The ensembles are free runs of slightly perturbed initial conditions. The spread of the ensemble typically grows initially, due to the local instabilities. The growth of uncertainty is of course the reason why ensembles are used in the first place. Eventually, the spread will saturate due to nonlinear effects. The middle ground is where each ensemble member adds the most information to the whole.

We are now going to discuss some sample results for the lasso. The goal is to get some idea as to the relative importance of the inputs. As discussed, an interesting feature of the lasso is that with an increasingly tight bound on the coefficients (i.e. decreasing µ), some coefficients become zero. The numerical experiments discussed here readily demonstrate this effect. As an example, forecasts for lead time 48h were considered. As inputs, the variables prcp, mslp, and t2m were used, along with season. To get the big picture, we dispensed with
