
Model selection for dynamical systems via sparse regression and information criteria

Niall M. Mangan‡∗, J. Nathan Kutz∗, Steven L. Brunton†, and Joshua L. Proctor‡

arXiv:1701.07773v1 [physics.data-an] 6 Jan 2017

Abstract. We develop an algorithm for model selection which allows for the consideration of a combinatorially large number of candidate models governing a dynamical system. The innovation circumvents a disadvantage of standard model selection, which typically limits the number of candidate models considered due to the intractability of computing information criteria. Using a recently developed sparse identification of nonlinear dynamics algorithm, the sub-selection of candidate models near the Pareto frontier allows for a tractable computation of AIC (Akaike information criteria) or BIC (Bayes information criteria) scores for the remaining candidate models. The information criteria hierarchically rank the most informative models, enabling the automatic and principled selection of the model with the strongest support in relation to the time series data. Specifically, we show that AIC scores place each candidate model in the strong support, weak support, or no support category. The method correctly identifies several canonical dynamical systems, including an SEIR (susceptible-exposed-infectious-recovered) disease model and the Lorenz equations, giving the correct dynamical system as the only candidate model with strong support.

1. Introduction. Nonlinear dynamical systems theory has provided a fundamental characterization and understanding of phenomena across the physical, engineering, and biological sciences. Traditionally, simplified models are posited by domain experts, and simulations and analysis are used to explore the underlying dynamical behavior, which may include chaotic dynamics (e.g. Lorenz equations), nonlinear oscillations (e.g. van der Pol, Duffing), and/or bifurcations.
The emergence of data-driven modeling methods provides an alternative framework for the discovery and/or inference of governing nonlinear dynamical equations. From this perspective, governing models are posited from time-series measurement data alone. The recent sparse identification of nonlinear dynamics (SINDy) method [1] uses sparse regression and a Pareto analysis to correctly discover parsimonious governing equations from a combinatorially large set of potential dynamical models. This methodology can be generalized to spatio-temporal systems [2, 3] and dynamical systems characterized by rational function nonlinearities, which often occur in biological networks [4]. Although previously suggested [4], no explicit connection between the sparse selection process and information-theoretic criteria has been established. Information criteria are the standard statistical method established for the model selection process. In this manuscript, we demonstrate that the Akaike information criteria (AIC) and/or Bayesian information criteria (BIC) can be connected with the SINDy architecture to hierarchically rank models on the Pareto front for automatic selection of the most informative model. As outlined in Fig. 1.1, the AIC/BIC metrics can be used to correctly infer dynamical systems for a given time-series data set from a combinatorially large set of models. To our knowledge, this is the first explicit demonstration of how information theory can be exploited for the identification of dynamical systems.

Model selection is a well-established statistical framework for selecting a model from a set of candidate models given time series data, with information theory providing a rigorous criterion for such a selection process. As early as the 1950s, a measure of information loss between empirically collected data and model-generated data was proposed, computed using the Kullback-Leibler (KL) divergence [5, 6]. Akaike built upon this notion to establish a relative estimate of information loss across models that balances model complexity and goodness-of-fit [7, 8]. This allowed for a principled model selection criterion through the Akaike information criteria (AIC). The AIC was later modified by G. Schwarz to define the more commonly used Bayes information criteria (BIC) [9]. Both AIC and BIC compute the maximum log likelihood of the model and impose a penalty: AIC adds the number of free parameters k of the posited model, while BIC adds half of k multiplied by the log of the number of data points m. Much of the popularity of BIC stems from the fact that it can be rigorously proved to be a consistent metric [9]. Thus if q models are proposed, with one of them being the true model, then as m → ∞ the true model is selected as the correct model with probability approaching unity. Regardless of the selection criterion, AIC or BIC, both provide a relative estimate of information loss across the candidate models, balancing model complexity and goodness-of-fit [10].

Figure 1.1. Schematic of the model selection process, with a) data generation, b) generation of a set of potential models, and c) comparison of the models as a function of the number of terms in the model and relative Akaike information criteria (AIC_c). Section c) shows how models are down-selected from a combinatorially large model space using sparse identification of nonlinear dynamics (SINDy) and then further sub-selected and ranked using information criteria.

∗Department of Applied Mathematics, University of Washington, Seattle, WA 98195. [email protected].
†Department of Mechanical Engineering, University of Washington, Seattle, WA 98195.
‡Institute for Disease Modeling, Bellevue, WA 98005, USA
Although successful and statistically rigorous, model selection in its standard implementation is typically performed on q predetermined candidate models, where q is often ten or less [10, 11, 12, 13, 14, 15]. For modern applications to dynamical systems where rich, high-fidelity time series data can be acquired, the restriction on the number of models limits the potential impact of AIC/BIC metrics for discovering the correct nonlinear dynamics. Instead, it is desirable to consider a combinatorially large set of potential dynamical models as candidates, thus enforcing that q ≫ 1. This is computationally intractable with standard model selection, as each of the models from the combinatorially large set would have to be simulated and then evaluated for a BIC/AIC score.

As an alternative, sparse regression techniques, embodied by the lasso (least absolute shrinkage and selection operator) method of Tibshirani [16], have enabled variable selection algorithms capable of optimally choosing among a combinatorially large set of potential predictors. Specifically, a lasso regression analysis, or one of its many generalizations and variants, performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces. Such a mathematical tool provides a critically enabling framework for model selection, in particular for identifying dynamical systems.

In this work we demonstrate a new mathematical framework that leverages information criteria for model selection with sparse regression for evaluating a combinatorially large set of candidate models. Specifically, we circumvent a direct computation of information criteria for the combinatorially large set of models by first sub-selecting the candidate functional forms which are most consistent with the time series data.
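The variable-selection behavior of the lasso can be illustrated with a minimal cyclic coordinate-descent implementation (our own numpy sketch, not the solver used in the paper): coefficients of inactive predictors are soft-thresholded to exactly zero, while active predictors survive with slight shrinkage.

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=200):
    """Lasso via cyclic coordinate descent with soft-thresholding.
    Minimizes (1/(2n))*||y - X b||^2 + alpha*||b||_1 (a toy sketch)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X**2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual
            rho = X[:, j] @ r_j
            # Soft-threshold: inactive coordinates land exactly at zero.
            beta[j] = np.sign(rho) * max(abs(rho) - n * alpha, 0.0) / col_sq[j]
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # 5 candidate predictors
y = 3.0 * X[:, 0] - 2.0 * X[:, 2]      # only predictors 0 and 2 are active
beta = lasso_cd(X, y, alpha=0.1)
print(np.nonzero(beta)[0])             # indices of the selected predictors
```

On this noiseless toy problem the two active predictors are selected and the remaining coefficients are driven identically to zero, which is the mechanism that makes combinatorially large candidate sets searchable.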
Thus we integrate two maturing fields of statistical analysis: (i) sparse regression for nonlinear system identification via SINDy and (ii) model selection via information criteria. Our algorithm is demonstrated to produce a robust procedure for discovering parsimonious, nonlinear dynamical systems from time series measurement data alone. We demonstrate the methodology on a number of important examples, including the SEIR (susceptible-exposed-infectious-recovered) disease model and the Lorenz equations, and demonstrate its efficacy as a function of noise, length of time series, and other key regression factors. Our sparse selection of dynamical models via information-theoretic criteria ranks the candidate models and further shows that the correct model is strongly supported by the AIC/BIC metrics. Ultimately, the method provides a cross-validated and ranked set of candidate nonlinear dynamical models for a given time series of measurement data, thus enabling data-driven discovery of the underlying governing equations.

2. Background.

2.1. Model selection via information criteria. The process of model selection fundamentally enables the connection of observations or data to a mathematical model. Further, a well-selected model, which describes a governing law or physical principle underlying the system process, can be utilized for prediction outside of the sampled data and parameter configuration [10]. The substantial challenge facing the selection process is discovering the best predictive model from a combinatorially large space of available models. To emphasize the enormity of this task, consider the number of possible polynomial models up to degree 4 with 5 state variables. Approximately 10^38 models would need to be constructed, fit to the data, and compared according to a goodness-of-fit metric [4]. Thus, model selection quickly becomes computationally intractable for a modest number of variables and polynomial degree.
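The ~10^38 figure can be sanity-checked with a short sketch (assuming the library includes the constant term and every nonempty subset of monomials counts as a distinct candidate model structure):

```python
from math import comb

# Monomials of degree <= 4 in 5 state variables (including the constant):
n_vars, max_deg = 5, 4
n_monomials = comb(n_vars + max_deg, max_deg)   # C(9, 4) = 126

# Every nonempty subset of monomials is a candidate model structure.
n_models = 2**n_monomials - 1
print(n_monomials)         # 126
print(len(str(n_models)))  # 38 digits, i.e. on the order of 10^38
```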
Typically, a sub-selection of models occurs based on prior scientific knowledge of the process to produce a subset, O(10), of heuristically defined candidate models [10, 11, 12, 13, 14, 15]. Recent research has focused on automatically expanding the number of candidate models [17, 18, 19]. Once a subset of models is chosen, the model selection procedure balances the goodness-of-fit with model complexity, i.e. the number of free parameters. A wide variety of rigorous statistical metrics have been developed to balance model parsimony and predictive power, including popular methods such as the Akaike information criterion (AIC) [7, 8], Bayesian information criterion (BIC) [9], cross-validation (CV) [20], deviance information criterion (DIC) [21], and minimum description length (MDL) [22]. Methods such as AIC explicitly balance parsimony and relative information loss across models, penalizing the number of parameters in the model to avoid overfitting.

In this manuscript, we utilize the ubiquitous and well-known AIC as the statistical criterion for comparing candidate models. The AIC value for each candidate model j is:

(2.1)    AIC_j = 2k − 2 ln(L(x, μ̂)),

where L(x, μ) = P(x|μ) is the likelihood function (conditional probability) of the observations x given the parameters μ of a candidate model, k is the number of free parameters to be estimated, and μ̂ is the set of best-fit parameter values for the data [7, 8]. In practice, the AIC requires a correction for finite sample sizes, given by

(2.2)    AIC_c = AIC + 2(k+1)(k+2)/(m−k−2),

where m is the number of observations. A common likelihood function uses the residual sum of squares (RSS), given by RSS = Σ_{i=1}^m (y_i − g(x_i; μ))², where y_i are the observed outcomes, x_i are the observed independent variables, and g is the candidate model. The RSS is a well-known objective function for least-squares fitting. In this case AIC can be expressed as AIC = m ln(RSS/m) + 2k [10].
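In the RSS formulation, the corrected score of Eq. (2.2) takes only a few lines (a hedged sketch; the helper name `aic_c` is ours):

```python
import numpy as np

def aic_c(y_obs, y_model, k):
    """AIC from the RSS likelihood, AIC = m ln(RSS/m) + 2k, with the
    finite-sample correction of Eq. (2.2) added on."""
    m = len(y_obs)
    rss = np.sum((np.asarray(y_obs) - np.asarray(y_model))**2)
    aic = m * np.log(rss / m) + 2 * k
    return aic + 2 * (k + 1) * (k + 2) / (m - k - 2)
```

For example, six observations with a uniform residual of 0.1 and one free parameter give AIC_c = 6 ln(0.01) + 2 + 4.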
Note that (2.1) penalizes, by increasing the AIC score, the models that have a large number of free parameters and which are unable to capture the characteristics of the observed data.

2.2. SINDy and sparse model selection. Identifying dynamical systems models from data is increasingly possible with access to high-fidelity data from simulations and experiments. With traditional methods, only a small handful of model structures may be posited and fit to data via regression. Indeed, simultaneous identification of both the structure and parameters of a model generally requires an intractable search through combinatorially many candidate models. Genetic programming has recently been used to determine the structure and parameters of dynamical systems [23, 17, 24] and control laws [25], enabling the efficient search of complex function spaces. Sparsity-promoting techniques have also been employed to simultaneously identify the structure and parameters of a dynamical system model. Compressed sensing was first used to determine the active terms in the dynamics [26], although it does not work well with the overdetermined systems that arise when measurements are abundant. In contrast, the sparse identification of nonlinear dynamics (SINDy) algorithm [1] uses sparsity-promoting regression, such as the lasso [16] or the sequential thresholded least-squares algorithm [1], to identify nonlinear dynamical systems from data in overdetermined situations.

Here we review the SINDy architecture for identifying nonlinear dynamics from data. The general observation underlying SINDy is that most dynamical systems of a state x ∈ R^n,

(2.3)    (d/dt) x(t) = f(x(t)),

have only a few active terms in the dynamics, making them sparse in a suitable function space. To identify the structure and parameters of the model, a set of candidate symbolic functions are first concatenated into a library Θ(x) = [θ_1(x) ··· θ_2(x) ··· θ_p(x)].
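For a two-state system with candidate terms up to quadratic order, one possible library Θ(X) might be assembled as follows (a minimal sketch; the choice of candidate functions is up to the practitioner):

```python
import numpy as np

def poly_library_2d(X):
    """Candidate library Theta(X) for a 2-state system with columns
    [1, x, y, x^2, x*y, y^2], evaluated at each measurement row of X."""
    x, y = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(poly_library_2d(X))
# [[ 1.  1.  2.  1.  2.  4.]
#  [ 1.  3.  4.  9. 12. 16.]]
```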
With time-series data X ∈ R^{m×n}, where each row is a measurement of the state x^T(t_k) in time, it is possible to evaluate the candidate function library Θ(X) ∈ R^{m×p} at the m time points. Finally, with derivative data Ẋ ∈ R^{m×n}, either measured or obtained by numerical differentiation, it is possible to pose a regression problem:

(2.4)    (d/dt) X = Θ(X)Ξ.

The few active terms in the dynamics, given by the nonzero entries in the columns of Ξ, may be identified using sparse regression. In particular, the sparsest matrix of coefficients Ξ is determined that also provides a good model fit, so that ‖Ẋ − Θ(X)Ξ‖₂ is small. Sparse regression has the added benefit of avoiding overfitting, promoting stability and robustness to noise. Since the original SINDy method, there have been numerous innovations and extensions to handle rational function nonlinearities [4], partial differential equations [2], and highly corrupted data [27], and to build Galerkin regression models in fluids [28]. The method is also connected to the dynamic mode decomposition (DMD) [29] if only linear functions are used in Θ.

In most sparse regression algorithms, there is a parameter that determines how aggressively sparsity is promoted. The successful identification of the model in Eq. (2.3) hinges on finding a suitable value of this sparsity-promoting parameter. Generally, the parameter value is swept through a range, and a Pareto front is used to select the most parsimonious model. However, the Pareto frontier may not have a sharp elbow or may instead have a cluster of models near the elbow, thus compromising the automatic nature of model selection using SINDy alone.

3. Methods. Our algorithm integrates sparse regression for nonlinear system identification with model selection via information criteria. This approach enables the automatic identification of a single best-fit model from a combinatorially large model space.
In the first step of the algorithm, the SINDy method provides an initial sub-selection of models from a combinatorially large number of candidates. The sub-selection of candidate models near the Pareto frontier is critically enabling, as it is computationally intractable to simulate the full combinatorial set and compare against the time series data. Importantly, the sub-selection can take the number of candidate models from 10^9 (for our 2-D cubic example) down to a manageable 10^2. This then allows for a tractable computation of AIC or BIC scores for the remaining candidate models. The information criteria hierarchically rank the most informative models, enabling the automatic selection of the model with the strongest support. This is in contrast to a standard Pareto front analysis, which looks for a parsimonious model at the elbow of the error-versus-complexity curve; however, the Pareto frontier may not have a sharp elbow or may instead have a cluster of models near the elbow. Algorithm 1, using AIC as our information criterion, is executed for model selection. Figure 1.1 illustrates the algorithm in practice.

Algorithm 1 SINDy-AIC
Input: Data matrix and time derivative: X, Ẋ ∈ R^{m×n}
 1: procedure SINDy-AIC(X, Ẋ)
 2:   Θ ∈ R^{m×p} ← library(X)           ▷ Generate library that contains p candidate terms.
 3:   for λ(j) ∈ {λ_0, λ_1, ..., λ_q} do  ▷ Search over sparsification parameter λ.
 4:     Model(j) ← SINDy(Ẋ, Θ, λ(j))     ▷ Identify sparse terms in Θ for model.
 5:     X′ ← simulate(Model(j))           ▷ Numerically integrate dynamical system (expensive).
 6:     IC(j) ← AIC(X, X′)                ▷ Compute Akaike information criteria.
 7:   end for
 8:   [inds, vals] ← sort(IC)             ▷ Rank models by AIC score.
 9:   return Model(inds(1))               ▷ Return best-fit model.
10: end procedure
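Algorithm 1 can be sketched end to end for the 1-D polynomial example of Section 4.1 (a hedged, self-contained illustration: we use forward-Euler integration, a least-squares SINDy step, and the RSS form of the AIC; the paper's implementation details may differ):

```python
import numpy as np

def library(x, deg=5):
    # Polynomial library [1, x, x^2, ..., x^deg] for a 1-D state.
    return np.column_stack([x**i for i in range(deg + 1)])

def stls(Theta, dx, lam, n_iter=10):
    # Sequential thresholded least squares (the SINDy regression step).
    xi = np.linalg.lstsq(Theta, dx, rcond=None)[0]
    for _ in range(n_iter):
        xi[np.abs(xi) < lam] = 0.0
        big = np.abs(xi) > 0
        if big.any():
            xi[big] = np.linalg.lstsq(Theta[:, big], dx, rcond=None)[0]
    return xi

def simulate(xi, x0, dt, n_steps):
    # Forward-Euler integration of x' = Theta(x) @ xi (line 5, "expensive").
    x = np.empty(n_steps)
    x[0] = x0
    for t in range(n_steps - 1):
        x[t + 1] = x[t] + dt * (library(np.array([x[t]]))[0] @ xi)
    return x

# Synthetic data from the true system x' = x - 0.2 x^3 - 0.1 x^4.
dt, n = 0.01, 500
x = np.empty(n)
x[0] = 0.5
for t in range(n - 1):
    x[t + 1] = x[t] + dt * (x[t] - 0.2 * x[t]**3 - 0.1 * x[t]**4)
dx = np.gradient(x, dt)

scores = []
for lam in [1e-4, 1e-2, 0.05]:           # line 3: sweep lambda
    xi = stls(library(x), dx, lam)       # line 4: sparse regression
    x_sim = simulate(xi, x[0], dt, n)    # line 5: integrate the model
    rss = np.sum((x - x_sim)**2)
    k = int(np.count_nonzero(xi))
    aic = n * np.log(rss / n) + 2 * k    # line 6: AIC = m ln(RSS/m) + 2k
    scores.append((aic, k, xi))
aic_best, k_best, xi_best = min(scores, key=lambda s: s[0])  # lines 8-9
```

The loop mirrors the pseudocode: each sparsification level yields a candidate model, each candidate is simulated, and the lowest AIC score selects the returned model.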
When evaluating dynamical systems models, there is some ambiguity about what constitutes an "observation." We take the time series data for a given set of initial conditions to be an observation, rather than taking each measurement at each time point. To obtain a representative error for the ith time-series observation, we calculate the average absolute error over the entire time series: E_avg = Σ_τ |y_i^τ − g(x_i^τ; μ)|. We then substitute this representative error in for (y_i − g(x_i; μ)) and take the sum of the squares, so that AIC = m ln((Σ_{i=1}^m E_avg(x_i, y_i))/m) + 2k. We use this value for AIC in (2.2). The candidate models with the lowest scores are ranked as the most likely.

To be more precise, the AIC scores for each candidate model can have a wide range of values, which requires a rescaling by the minimum AIC value (AIC_min) [10, 30]. The rescaled AIC values Δ_j = AIC_j − AIC_min can be directly interpreted as a strength-of-evidence comparison across models. Models with Δ_j ≤ 2 have strong support, those with 4 ≤ Δ_j ≤ 7 have weak support, and those with Δ_j ≥ 10 have no support [30]. These rankings allow for a principled procedure for retaining or rejecting models within the candidate pool. For time series data with enough data samples, a large enough signal-to-noise ratio, and/or a sufficiently large set of candidate models, only one model is typically strongly supported by AIC metrics. In the examples used to demonstrate the method, the only strongly supported candidate model is, in fact, the correct model.

Figure 4.1. Selection of model for the single-variable, x, polynomial system. a) 3 computationally generated time series with additive noise ε = 0.001. b) Combinatorial model possibilities, with those selected by SINDy highlighted in blue. c) Relative AIC_c criteria for all possible models (black dots), and those found by SINDy (blue circles). The lower plot magnifies the strong and weakly supported AIC_c range, containing only the correct model (magenta circle).

4. Results: Model selection.

4.1. 1-D polynomial model.
We demonstrate the relationship between SINDy model selection and the AIC ranking procedure, as illustrated in Fig. 1.1 d, with a simple single-state-variable model with three polynomial terms,

(4.1)    ẋ = x − 0.2x³ − 0.1x⁴.

We compute three time series for this model, as shown in Fig. 4.1 a. Assuming a feature library with up to 5th-order polynomials, or d = 6, there are Σ_{i=1}^6 C(6, i) = 63 possible models. A representation of the combinatorial set of models is shown in Fig. 4.1 b, and we perform least-squares fitting for the coefficients of each model. All models are cross-validated using 100 initial conditions, and relative AIC_c scores are calculated. This combinatorially complete set of models (black dots) populates the AIC_c Pareto front in Fig. 4.1 c. The relative AIC_c metric successfully characterizes the true, 3-term model as the only model with "strong support." All other models within the full combinatorial space fall within the "no support" range.

SINDy enables us to sub-select a set of models from the feature library and, for this simple system, we can compare against the combinatorial set. The models selected by SINDy are highlighted in blue in Fig. 4.1 c and d. SINDy finds the correct 3-term model, as well as models with 1, 2, 4, 5, and 6 terms each. These models and coefficients exactly match a subset of those found by combinatorial least squares, which is expected given that this implementation of SINDy uses least squares to fit the coefficients. Notably, the 1- and 2-term models selected by SINDy are the most meaningful reductions of the true underlying system (with smaller

Figure 4.2. Evaluation of SINDy selected models for the 2-D cubic system. a) Computationally generated time series from a single set of initial conditions with additive noise ε = 0.001. b) Relative AIC_c criteria for models found by SINDy (blue circles). Magnification in the lower plot shows the strong and weakly supported AIC_c range contains only the correct model (magenta circle).
coefficients set to zero), rather than the 1- and 2-term models with the lowest error in the full feature space.

4.2. 2-D cubic model. For larger systems, enumerating all possible models represented in a given feature library would be computationally infeasible, but by using SINDy a sub-selection of the most relevant models is generated and selected. For example, consider the relatively simple example of a 2-state-variable system (n = 2) with a 5th-order polynomial library (d = 6). In this case there are N_m = C(n+d, d) = 28 possible monomials, and N_p = Σ_{i=1}^{N_m} C(N_m, i) = 268,435,455 potential models. Figure 4.2 demonstrates the results of performing SINDy and AIC evaluation on a cubic model. With only a single time series for each state variable (x and y) as input (Fig. 4.2 a), the SINDy-selected models with varying numbers of terms (blue circles) include the correct model (magenta circle). Using 100 randomly selected initial conditions for cross-validation, relative AIC_c ranks the correct model as strongly supported, and all other models as having no support.

4.3. 3-D disease transmission model. Next we apply our method to the susceptible-exposed-infectious-recovered (SEIR) disease transmission model. Models of this type are often used to determine disease transmission rates, detect outbreaks, and develop intervention strategies. However, generating the appropriate model for interactions between different populations is currently done heuristically and then evaluated using information criteria [15]. Using SINDy with AIC evaluation would provide data-driven model generation and selection

Figure 4.3. Evaluation of SINDy selected models for the 3-state-variable SEIR model. a) Computationally generated time series with additive noise ε = 2.5×10⁻⁴. b) Relative AIC_c criteria for models found by SINDy. Magnification in the lower plot shows the strong and weakly supported AIC_c range contains only the correct model (magenta circle).
from a wider library of possible interaction terms. As a first step, we apply SINDy with AIC to a discrete, deterministic SEIR model, as shown in Fig. 4.3. We input a single time series representing an outbreak for the S, E, and I state variables (n = 3), and provide a library of polynomial terms up to 2nd order (d = 3). For this example, the total number of models represented in the library is N_p = Σ_{i=1}^{N_m} C(N_m, i) = 1023, with N_m = 10 possible monomials. A complication of the SEIR system is that R is a redundant state variable; S, E, and I have no dependence on R, and R depends only on a term already represented in the I equation: R_{k+1} = R_k + 0.04 I_k. As a result of this redundancy, SINDy cannot find the correct equations with R included in the library. Without R, SINDy selects a set of models from the 1023 possible in the library (blue circles), and with 100 cross-validation measurements the relative-AIC_c evaluation ranks only the correct 6-term model as having any support (magenta circle) in Fig. 4.3 b.

4.4. Lorenz model. As a final example, we demonstrate SINDy with AIC on the chaotic Lorenz system [31]. Using a library of polynomials up to 2nd order (d = 3) for the 3-state-variable system (n = 3), there are once again N_p = 1023 models represented in the function library. Providing one time series for each state variable, as shown in Fig. 4.4 a, SINDy recovers a subset of these models (circles in Fig. 4.4 b). Using 100 cross-validation measurements on each model, the relative-AIC_c criteria rank the correct 7-term model as having strong support (magenta circle). Unlike in previous examples, one other model is ranked as having "weak support" (blue circle). This model has an additional small constant term in the equation for

Figure 4.4. Evaluation of SINDy selected models for the 3-state-variable Lorenz model. a) Computationally generated time series with additive noise ε = 0.001.
b) Relative AIC_c criteria for models found by SINDy. Magnification in the lower plot shows the strong and weakly supported AIC_c range. The correct model lies in the strong support range and a model with an additional small term in the weak support range.

x: ẋ = 8.5×10⁻⁶ + 10(y − x). A possible explanation for the level of support for this model is the chaotic nature of the Lorenz system. Even when the recovered model is correct, small variations in the recovered coefficients (≈ 10⁻⁶ for this case) will cause the calculated time series for the recovered model to diverge from the "true" model after some length of time (> 1 unit time for these parameters). For the example in Fig. 4.4 b, the cross-validation uses time series of length t = 5 (arb. units). In a true model-selection situation, we would not know this characteristic length scale ahead of time, and a sensitivity analysis would need to be performed. We discuss this and other challenges to practical implementation in the next section.

5. Practical implementation: noise and number of measurements. SINDy with AIC ranking can successfully select the correct model for a variety of known systems, given low enough measurement noise and a large amount of data for cross-validation. Under practical conditions, the signal-to-noise ratio may be lower than desired and the amount of data available for cross-validation may be restricted. In Fig. 5.1 we show the effects of increasing noise and number of cross-validation experiments on the selection of the correct model for the Lorenz system. Reading from the top of Fig. 5.1 a, for low noise, ε = 0.1, only the correct model (magenta) falls within the supported (strong or weak) range. Increasing the noise to ε = 0.2 and 0.5 causes other models to descend into the weak support regime and eventually into the strong support range. Around this level of noise, the relative AIC_c scores for the incorrect models are very sensitive to the random additive noise. Repeating the computation
