Least-action filtering

L. C. G. Rogers
University of Cambridge

First version: September 2008
January 23, 2013

Abstract

This paper presents an approach to estimating a hidden process in a continuous-time setting, where the hidden process is a diffusion. The approach is simply to minimize the negative log-likelihood of the hidden path, where the likelihood is expressed relative to Wiener measure. This negative log-likelihood is the action integral of the path, which we minimize by calculus of variations. We then perform an asymptotic maximum-likelihood analysis to understand better how the actual path is distributed around the least-action path; it turns out that the actual path can be expressed (approximately) as the sum of the least-action path and a zero-mean Gaussian process which can be specified quite explicitly. Numerical solution of the ODEs which arise from the calculus of variations is often feasible, but is complicated by the shooting nature of the problem, and the possibility that we have found a local but not global minimum. We analyze the situations when this happens, and provide effective numerical methods for studying this. We also show how the methodology works in a situation where the hidden positive diffusion acts as the random intensity of a point process which is observed; here too it is possible to estimate the hidden process.

1 Introduction.

The basic problem tackled in this paper is to try to estimate the hidden part (X_t)_{0≤t≤T} of a vector¹ diffusion process Z_t ≡ [X_t; Y_t] given the observations (Y_t)_{0≤t≤T}. Of course, this is an idealized question, because in practice we would only observe the signal at discrete instants of time, which for simplicity we will assume are equally spaced with a spacing of h > 0. It would then in principle be possible to write down the likelihood² of (Z_{nh})_{0≤n≤N}, and maximize this over the unobserved variables (X_{nh})_{0≤n≤N}, with the Y-values known and fixed.

¹ We use the Scilab/Matlab notation, where z = [x; y] denotes the column vector z formed by stacking the column vector x above the column vector y.
² We suppose that T = Nh.

There are two quite substantial obstacles on this path. The first is that (except in a few special examples) there is no closed-form expression for the transition density of a general diffusion, so evaluating the likelihood is already problematic. The second is that we are faced with a maximization over a space of dimension proportional to N, a number which will be large if h is small, as we envisage. The approach of this paper avoids these difficulties completely by passing to a limiting form of the problem in which the (negative of the) log-likelihood expression to be maximized converges to an integral, the action integral. This can be minimized by calculus of variations, so that instead of having to do some numerical non-linear optimization, we find ourselves having to solve a second-order non-linear ordinary differential equation (ODE), which is a far easier task. The only issue to be dealt with is that the ODE is of shooting type, with boundary conditions at t = 0 and t = T, so that some iterative solution scheme is needed: the analysis is explained in Section 3.

The action integral is only the negative log-likelihood of the path in some formal sense, since it involves derivatives of the paths of the diffusion, and diffusion paths are not differentiable.
Nevertheless, it is a useful heuristic³ which leads to a rigorous approximation result for the limiting form of the maximum-likelihood path as h ↓ 0: see Theorem 1.

³ This heuristic leads in a few lines to (non-rigorous) derivations of the Cameron-Martin-Girsanov change of measure, and to large-deviation rate functions for various diffusion asymptotics.

This gives us a way to identify effectively the 'most likely' path x* of X given the path of Y, but we would also like to have some idea of how the random path of X is distributed around x*. This involves a study of the log-likelihood in a neighbourhood of x*, which we find is (to leading order) quadratic in the perturbation ξ = x − x*. Thus the perturbation ξ is approximately a Gaussian process, whose mean is identically zero, and whose law can be precisely characterized by expressing ξ as the solution of a linear stochastic differential equation: see Section 4. Inference on a hidden diffusion intensity for a point process is discussed in Section 5, and the relationship with SMC methodologies is discussed briefly in Section 6. Section 7 concludes.

The idea of studying the action of a diffusion path is quite ancient already, and even as a tool for estimation there is a literature: see, for example, [4], [1], [2]. Markussen [4] arrives at essentially the same identification of the maximum-likelihood path as we do; the expression derived here for the Gaussian law of the perturbation is considerably more explicit. The approach of Archambeau et al. [1] has superficial similarities, but is different in major respects; in particular, they use a minimum relative entropy criterion to identify a 'best' Gaussian approximation to the hidden process, whose structure appears to require the calculation of expectations of non-linear functionals of the path. In contrast, the approach followed here requires only the solution of ordinary differential equations.⁴ This is particularly advantageous in higher dimensions, as the numerical solution of ordinary differential equations still works quite effectively in moderately large dimension, whereas methods such as particle filtering struggle in dimension much above about seven, as we shall discuss in Section 6.

⁴ Archambeau et al. work with a more restricted diffusion dynamic (which they assert can be extended), and a somewhat different structure for the relation between observations and the hidden process.

2 The setting.

Suppose that we have some stochastic differential equation (SDE) in R^d

    dZ_t = σ(t, Z_t) dW_t + μ(t, Z_t) dt,    (1)

where Z is partitioned Z = [X; Y], and where X is n-dimensional, Y is s-dimensional, n + s = d. We shall suppose also that σ, σ^{-1} and μ are in C¹_b. The problem we address is to estimate the hidden part (X_t)_{0≤t≤T} of Z given observations of (Y_t)_{0≤t≤T}; the possibility that Y is non-informative about X is included in the analysis, so what follows applies to the distribution of an unobserved diffusion also.

Closely related to (1) is the first-order Euler-Maruyama difference scheme

    dz_t^{(n)} = σ(t_n, z_{t_n}^{(n)}) dW_t + μ(t_n, z_{t_n}^{(n)}) dt    (2)

where t_n ≡ 2^{-n} [2^n t]. Despite appearances, this can be viewed as a discrete scheme; the increments of z have conditionally Gaussian distributions. It is known (see, for example, Mao [3]) that for some constant C

    E[ sup_{0≤t≤T} |Z_t − z_t^{(n)}|² ] ≤ C 2^{-n},    (3)

so by the first Borel-Cantelli Lemma we conclude that there is almost-sure uniform convergence of the processes z^{(n)} to the solution Z to the SDE.
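To make the discrete scheme (2) concrete, the following minimal Python sketch (not code from the paper) simulates one path of a one-dimensional SDE on the dyadic grid t_j = j 2^{-n}; the drift and diffusion callables are user-supplied, and the parameter values in the usage line are taken from the X-dynamics of Example 1 below purely for illustration.

    import numpy as np

    def euler_maruyama(mu, sigma, z0, T, n, rng=None):
        """Simulate one path of dZ = sigma(t, Z) dW + mu(t, Z) dt with the
        first-order Euler-Maruyama scheme (2) on the grid t_j = j * 2**-n."""
        rng = rng or np.random.default_rng()
        h = 2.0 ** (-n)                   # step size h = 2^{-n}
        N = int(round(T / h))             # number of steps, T = N h
        t = np.arange(N + 1) * h
        z = np.empty(N + 1)
        z[0] = z0
        for j in range(N):
            dW = rng.normal(scale=np.sqrt(h))   # conditionally Gaussian increment
            z[j + 1] = z[j] + mu(t[j], z[j]) * h + sigma(t[j], z[j]) * dW
        return t, z

    # Illustrative use with the X-dynamics of Example 1 (sigma_X = 1.053, beta_X = 0.1054):
    t, x = euler_maruyama(mu=lambda t, x: -0.1054 * x,
                          sigma=lambda t, x: 1.053,
                          z0=0.0, T=100.0, n=6)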
In the discrete-time approximation (2) to the SDE, the conditional density of (x_{j2^{-n}})_{0≤j≤2^n T} given (y_{j2^{-n}})_{0≤j≤2^n T} can be written down immediately. The log-likelihood is

    λ_n(x|y) = −(1/2) Σ_{j=0}^{N−1} (1/h) | σ(jh, z_{jh})^{-1} ( z_{jh+h} − z_{jh} − h μ(jh, z_{jh}) ) |² − ϕ(x_0)
             = −(1/2) Σ_{j=0}^{N−1} h | σ(jh, z_{jh})^{-1} ( (z_{jh+h} − z_{jh})/h − μ(jh, z_{jh}) ) |² − ϕ(x_0),    (4)

where the prior density of x_0 is exp(−ϕ(x)), and h = 2^{-n}. Inspecting (4), we see a difference quotient of z which it is tempting to replace with a derivative as n → ∞, leading to the formal limit

    Λ(x|y) = −(1/2) ∫_0^T | σ(s, z_s)^{-1} ( ż_s − μ(s, z_s) ) |² ds − ϕ(x_0)    (5)
           ≡ −∫_0^T ψ(s, x_s, p_s) ds − ϕ(x_0)    (6)

where the Riemann sum becomes an integral, and we write p_s ≡ ẋ_s. This is a perfectly sensible functional of a C¹ path z, even though for a diffusion process the path will not be differentiable.

Our viewpoint is that the problem which concerns us is to estimate (x_{j2^{-ν}})_{0≤j≤2^ν T} in the discretely-sampled model (2), for some fixed (quite large) integer ν, and that the SDE version of the problem is to be used to help us in this. In practice, we will only ever have the observation data (y_{j2^{-ν}})_{0≤j≤2^ν T} in discretely-sampled form (how would it be stored if not?!), so this is the realistic question. When we minimize the log-likelihood (4) over x, we only need the values of y at the multiples of h = 2^{-ν}; we therefore lose no generality in supposing that y has been interpolated (by cubic splines, say) to be C² in all of [0, T]. This does not of course change (4), but it does mean that the functional Λ is meaningfully defined for any C¹ path (x_t)_{0≤t≤T}. Given this, we have the following result, where to emphasize again, we are considering what happens as n ≥ ν tends to infinity, with ν fixed.

Theorem 1 Suppose that σ, σ^{-1} and μ are C¹_b, and that (y_t)_{0≤t≤T} is also C¹. Suppose further that

    lim_{|x|→∞} ϕ(x) = ∞.    (7)

Then

    lim_{n→∞} inf_x { −λ_n(x|y) } = inf_x { −Λ(x|y) }.    (8)

Assuming that Λ(·|y) has a unique maximizer x*, and that x_n is a maximizer of λ_n(·|y), then x_n → x* uniformly.

Proof. See Appendix A.

The usefulness of Theorem 1 is that it shows that a minimizer of −λ_n(·|y) is very close to the (assumed unique) minimizer of −Λ(·|y); but this minimizer can be found by calculus of variations without resort to some high-dimensional non-linear optimizer. Notice that although we make a discretization error when we replace the SDE (1) with the Euler-Maruyama scheme (2), in most examples of practical interest the magnitude of this discretization error will be insignificant compared to the observation error that we are attempting to see past.
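To fix ideas, here is a minimal Python sketch (not code from the paper) of how the discrete negative log-likelihood −λ_n(x|y) of (4) might be evaluated for a candidate hidden path; the array layout and the callables sigma, mu and phi are assumptions of the sketch.

    import numpy as np

    def neg_log_likelihood(x, y, h, sigma, mu, phi):
        """Evaluate -lambda_n(x|y) from (4), the quantity minimized over the
        hidden path.  x has shape (N+1, n), y has shape (N+1, s); sigma(t, z)
        returns a d x d matrix, mu(t, z) a d-vector, phi(x0) = -log prior of x_0."""
        z = np.concatenate([x, y], axis=1)           # z_jh = [x_jh; y_jh]
        total = phi(x[0])
        N = z.shape[0] - 1
        for j in range(N):
            t = j * h
            dq = (z[j + 1] - z[j]) / h - mu(t, z[j])   # difference quotient minus drift
            w = np.linalg.solve(sigma(t, z[j]), dq)    # sigma^{-1}( ... )
            total += 0.5 * h * np.dot(w, w)
        return total

Minimizing this expression directly means optimizing over roughly N·n free values, which is exactly the high-dimensional problem that the passage to the continuous functional Λ in (5)-(6) and Theorem 1 is designed to avoid.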
3 Finding the least-action path.

Theorem 1 shows that asymptotically as the step size h = 2^{-n} tends to zero, the maximum-likelihood estimate of the unobserved path converges to the maximizer of the continuous-time analogue Λ. Resorting now to calculus-of-variations techniques leads us to the following central result.

Theorem 2 The path (x, p) which minimizes the action functional

    −Λ(x|y) = ∫_0^T ψ(s, x_s, p_s) ds + ϕ(x_0)    (9)

satisfies the following ODE with boundary conditions:

    0 = D_{p_j} ψ(0, x_0, p_0) − D_{x_j} ϕ(x_0)    (10)
    0 = D_{x_j} ψ − D_t D_{p_j} ψ − ẋ^k D_{p_j} D_{x_k} ψ − ṗ^k D_{p_j} D_{p_k} ψ    (11)
    0 = D_{p_j} ψ(T, x_T, p_T)    (12)

Proof. The negative log-likelihood to be minimized is

    ϕ(x_0) + ∫_0^T ψ(t, x_t, ẋ_t) dt ≡ ϕ(x_0) + ∫_0^T ψ(t, x_t, p_t) dt,    (13)

where we write p ≡ ẋ. This we attack by calculus of variations; if we have found the optimal x, then any perturbation to x + ξ must to leading order make zero change to the objective. Writing down the first-order change and integrating by parts gives

    0 = ξ_0 · Dϕ(x_0) + ∫_0^T { ξ_t · D_x ψ + ξ̇_t · D_p ψ } dt    (14)
      = ξ_0 · Dϕ(x_0) + [ ξ_t^j D_{p_j} ψ ]_0^T + ∫_0^T ξ_t^j { D_{x_j} ψ − D_t D_{p_j} ψ − ẋ^k D_{p_j} D_{x_k} ψ − ṗ^k D_{p_j} D_{p_k} ψ } dt.

Since ξ is arbitrary, we deduce the conditions (10), (11), (12) for optimality. □

Remarks. (i) We have found a (generally non-linear) first-order ODE (11) for (x, p), with initial condition (10) and terminal condition (12). This will generally have a unique solution, though the 'shooting' nature of the ODE is rather clumsy in practice.

(ii) We expect that this methodology will be advantageous in higher-dimensional problems; algorithms for solving ODEs in high dimension tend to degrade less rapidly than (for example) gradient optimization methods, or particle filtering.

(iii) If the time horizon T is too large, it may be that the shooting ODE is not able to identify the initial and terminal conditions with sufficient accuracy.

To illustrate the methodology in action, we present here some simple examples.

Example 1. Here we take independent one-dimensional Ornstein-Uhlenbeck processes X, y:

    dX = σ_X dW − β_X X dt
    dy = σ_y dW′ − β_y y dt

where σ_X = 1.053, σ_y = 1.0127, β_X = 0.1054 and β_y = 0.0253. We observe Y = X + y. Then Z = [X; Y] solves

    dZ = [σ_X, 0; σ_X, σ_y] [dW; dW′] + [−β_X, 0; β_y − β_X, −β_y] [X; Y] dt ≡ σ [dW; dW′] + A [X; Y] dt,

which is a simple linear SDE. For this example,

    ψ(t, x, p) = (1/2) ( [p; Ẏ_t] − A [x; Y_t] )^T q ( [p; Ẏ_t] − A [x; Y_t] ),

where q = (σσ^T)^{-1}. A sample path of the SDE was simulated, and the results are displayed in Figure 1. The true path is dashed in green, the noisy observation is in blue, and the least-action path is in red.

Example 2. This time we have a two-dimensional linear example, with X satisfying

    dX_t = dW_t + [0, −1; 1, 0] X_t dt ≡ dW_t + A_0 X_t dt,

and Y = X + y, where y is an independent OU process

    dy_t = σ dw_t − λ y_t dt,

with λ = 0.005, σ = 20. The results of the least-action analysis are displayed in Figure 2. Once again, the estimates are quite impressive, but since the underlying dynamics are a perturbation of uniform motion in a circle, this is perhaps not so surprising.

Example 3. This is another one-dimensional example, but this time non-linear. We have dynamics

    dX_t = σ_X dW_t − b sin(a X_t) dt

for X, where σ_X = 3, b = 12 and a = 2π/5. Thus the drift tries to keep the diffusion X near to multiples of 5. The observation Y is as before of the form Y = X + y, where y is an independent OU process

    dy_t = dW′_t − λ y_t dt

with λ = 0.05. The results of the analysis are shown in Figure 3. As can be seen, the hidden Markov process stays close to multiples of 5, occasionally moving from one value to a neighbouring value, and as the observational error gets larger, the filtering performs less well.
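As a purely illustrative sketch of how the two-point boundary-value structure (10)-(12) might be handled numerically, the Python fragment below (not from the paper, whose computations were done in Scilab) treats a simplified scalar action of the assumed form ψ(t, x, p) = ½ a (p + βx)² + ½ c (x − y(t))², with a flat prior ϕ taken purely for convenience; for this toy ψ the Euler-Lagrange equation (11) reads ẍ = β²x + (c/a)(x − y(t)), and (10), (12) reduce to ẋ + βx = 0 at t = 0 and t = T. The parameter values and the synthetic observation path are assumptions of the sketch and do not reproduce any of the paper's examples.

    import numpy as np
    from scipy.integrate import solve_bvp
    from scipy.interpolate import CubicSpline

    # Hypothetical parameters and a synthetic, spline-interpolated observation path.
    beta, a, c, T = 0.5, 1.0, 4.0, 20.0
    t_obs = np.linspace(0.0, T, 401)
    rng = np.random.default_rng(1)
    y_spline = CubicSpline(t_obs, np.sin(0.3 * t_obs) + 0.5 * rng.standard_normal(t_obs.size))

    def rhs(t, u):
        # u[0] = x, u[1] = p = x';  Euler-Lagrange:  x'' = beta^2 x + (c/a)(x - y(t))
        x, p = u
        return np.vstack([p, beta ** 2 * x + (c / a) * (x - y_spline(t))])

    def bc(u0, uT):
        # Natural boundary conditions (10) and (12) with a flat prior:
        # D_p psi = a (p + beta x) vanishes at both ends.
        return np.array([u0[1] + beta * u0[0], uT[1] + beta * uT[0]])

    t_mesh = np.linspace(0.0, T, 201)
    sol = solve_bvp(rhs, bc, t_mesh, np.zeros((2, t_mesh.size)))
    x_star = sol.sol(t_obs)[0]      # the least-action path on the observation grid

In the paper's actual examples ψ comes from (6) via the joint dynamics of [X; Y], so the right-hand side and the boundary conditions are correspondingly more involved, and for non-linear drifts the resulting boundary-value problem may admit several solutions, which is the shooting difficulty noted in the remarks above.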
Example 4. The final example is a more demanding test of the methodology. Here, we take a two-dimensional non-linear dynamic for X,

    dX_t = Σ dW_t − b sin(a X_t) dt

where

    Σ² = [0.9, 0.27; 0.27, 0.9]    (15)

and a, b are as for the previous example. The observation is univariate, and takes the form Y = v·X + y, where y is an independent OU process

    dy_t = σ dw_t − λ y_t dt

where v = (1, 2), λ = 0.05, and σ = 1. It would be indeed remarkable if the methodology were able to unscramble the two hidden components of the diffusion from observation of just one component, and the results are shown in Figure 4; sometimes the estimate is close to the true values, sometimes less so. But it should be understood that this is not a limitation of the methodology, which is after all discovering the maximum-likelihood estimator of the hidden process; the relatively poor recovery of the hidden path is because we have a small signal-to-noise ratio for this example.

4 Second-order analysis.

The first problem we focused on is maximizing the log-likelihood as given for the discretized problem by (4). The unknown in this situation is the sequence X ≡ (x_{j2^{-ν}})_{0≤j≤2^ν T}, and as in classical asymptotic ML theory, having identified the ML estimator X̂ of the unknown X, we can enquire about the distribution of X about X̂. This amounts to understanding the second derivative of the log-likelihood at the MLE. Given suitable regularity of the likelihood, a Taylor expansion shows that the distribution of X around X̂ will be approximately Gaussian, with covariance given by the inverse of the negative second derivative of the log-likelihood. However, we have seen that there are considerable methodological advantages in passing from the discrete problem (4) to the continuous analogue (6), and these advantages apply also at the level of the second-order terms in the expansion of the log-likelihood, as we shall now see.

Theorem 3 Suppose that x* is a local minimizer of the functional −Λ(·|y), and define the matrix-valued n × n functions of time

    A_t^{ij} = D_{x_i} D_{x_j} ψ,   B_t^{ij} = D_{x_i} D_{p_j} ψ,   q_t^{ij} = D_{p_i} D_{p_j} ψ,    (16)

evaluated along the path x*. Then:

(i) the ODE

    A_t + θ̇_t = K_t^T q_t K_t = (B_t + θ_t) q_t^{-1} (B_t^T + θ_t),   θ_T = 0    (17)

has a unique solution;

(ii) writing x = x* + ξ, we have that

    −Λ(x|y) + Λ(x*|y) = Q_0(ξ) + o( ∫_0^T ( |ξ_s|² + |ξ̇_s|² ) ds ),    (18)

where

    Q_0(ξ) = (1/2) ξ_0 · ( D²ϕ(x_0*) + θ_0 ) ξ_0 + ∫_0^T (1/2) ( ξ̇_t + K_t ξ_t ) · q_t ( ξ̇_t + K_t ξ_t ) dt,    (19)

and where

    K_t = q_t^{-1} ( B_t^T + θ_t ).    (20)

Remarks. (i) If we write ẇ_t = q_t^{1/2} ( ξ̇_t + K_t ξ_t ), then the integral term in (19) takes the form ∫_0^T (1/2) |ẇ_t|² dt, which is the action integral for standard Brownian motion. Thus informally what we learn from Theorem 3 is that the perturbation ξ 'looks like' a continuous zero-mean Gaussian process obtained by solving the linear SDE

    dξ_t = −K_t ξ_t dt + q_t^{-1/2} dW_t,    (21)

and started with initial precision D²ϕ(x_0*) + θ_0. The covariance of ξ can thus be obtained from an Itô expansion of ξξ^T:

    d(ξ_t ξ_t^T) ≐ ( −K_t ξ_t ξ_t^T − ξ_t ξ_t^T K_t^T + q_t^{-1} ) dt,    (22)

where ≐ signifies that the two sides differ by a (local) martingale. Hence

    V̇_t = −K_t V_t − V_t K_t^T + q_t^{-1}    (23)

and this allows us to calculate the covariance at time t, again by solving an ODE.

(ii) The ODE (17) for θ is quadratic, so there is no general result which guarantees that it has a solution; the solution may explode. However, in this situation we can be sure that it will not, because −Λ(·|y) is minimized at x*, as we shall see in the proof.

(iii) It has to be admitted that the relationship between Theorem 3 and the original discretized problem is somewhat tenuous. With effort it would no doubt be possible to establish an analogue of Theorem 1 for the second-order effect, but it is harder to see what the use of such a result would be. The way we envisage using Theorem 3 would be to generate an approximate distribution for the posterior of X given y, which could be checked numerically against a full posterior⁵, and which could be used to give approximate moments of the posterior law of X.

⁵ It would be rare that this could be found in closed form, so we would typically need to resort to particle filtering to make a numerical approximation.
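As a rough indication of how the second-order quantities might be computed in practice, the Python sketch below (not the paper's Scilab code) integrates the Riccati equation (17) backwards from θ_T = 0 and then the covariance ODE (23) forwards. It assumes the matrices A_t, B_t, q_t of (16) are supplied as callables evaluated along the least-action path, and that the initial covariance is the inverse of the initial precision D²ϕ(x_0*) + θ_0. Remark (ii)'s warning applies: the backward Riccati integration is exactly the step that can fail numerically, as discussed under 'Numerical aspects' below.

    import numpy as np
    from scipy.integrate import solve_ivp

    def second_order_analysis(A, B, q, d2phi0, T, n):
        """Sketch: theta from the Riccati ODE (17), integrated backwards from
        theta_T = 0; the gain K_t of (20); and the covariance V_t from (23).
        A, B, q are callables t -> (n x n) matrices from (16) along x*;
        d2phi0 is D^2 phi(x*_0)."""

        def theta_ode(t, th_flat):
            # (17):  dtheta/dt = (B + theta) q^{-1} (B^T + theta) - A
            th = th_flat.reshape(n, n)
            return ((B(t) + th) @ np.linalg.solve(q(t), B(t).T + th) - A(t)).ravel()

        th_sol = solve_ivp(theta_ode, (T, 0.0), np.zeros(n * n), dense_output=True)
        theta = lambda t: th_sol.sol(t).reshape(n, n)
        K = lambda t: np.linalg.solve(q(t), B(t).T + theta(t))       # (20)

        def v_ode(t, v_flat):
            # (23):  dV/dt = -K V - V K^T + q^{-1}
            V = v_flat.reshape(n, n)
            return (-K(t) @ V - V @ K(t).T + np.linalg.inv(q(t))).ravel()

        V0 = np.linalg.inv(d2phi0 + theta(0.0))    # covariance = inverse precision at t = 0
        v_sol = solve_ivp(v_ode, (0.0, T), V0.ravel(), dense_output=True)
        return theta, K, (lambda t: v_sol.sol(t).reshape(n, n))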
Proof of Theorem 3. Let us begin by assuming the first statement (i) of the theorem and showing how this leads to the more interesting statement (ii). We will return to prove (i) subsequently.

If we go back to the action functional (9) and consider expanding around the local minimizer x* by perturbing to x* + ξ for small ξ, the second-order contribution we get is

    Q(ξ) ≡ (1/2) D_{x_i} D_{x_j} ϕ(x_0*) ξ_0^i ξ_0^j + ∫_0^T { (1/2) ξ^i ξ^j D_{x_i} D_{x_j} ψ + ξ^i ξ̇^j D_{p_j} D_{x_i} ψ + (1/2) ξ̇^i ξ̇^j D_{p_i} D_{p_j} ψ } dt
         ≡ (1/2) D_{x_i} D_{x_j} ϕ(x_0*) ξ_0^i ξ_0^j + ∫_0^T { (1/2) ξ_t^i A_t^{ij} ξ_t^j + ξ_t^i B_t^{ij} ξ̇_t^j + (1/2) ξ̇_t^i q_t^{ij} ξ̇_t^j } dt    (24)

where the matrix-valued functions of time A, B and q are defined by (16). This quadratic functional of ξ characterizes the (approximate) Gaussian distribution of the perturbation. Let us notice that because x* is assumed to be a minimizer of the action functional, this quadratic functional of ξ must be non-negative.

Now suppose that we have some C¹ symmetric-matrix-valued function of time, θ, such that θ_T = 0. Then we may write

    Q(ξ) = Q(ξ) + (1/2) ξ_0 · θ_0 ξ_0 + (1/2) [ ξ_t · θ_t ξ_t ]_0^T
         = (1/2) ξ_0 · ( D²ϕ(x_0*) + θ_0 ) ξ_0
           + ∫_0^T { (1/2) ξ_t · A_t ξ_t + ξ_t · B_t ξ̇_t + (1/2) ξ̇_t · q_t ξ̇_t + (1/2) ξ_t · θ̇_t ξ_t + ξ_t · θ_t ξ̇_t } dt.    (25)

The quadratic form inside the integral can be expressed as

    (1/2) ξ̇_t · q_t ξ̇_t + ξ̇_t · ( B_t^T + θ_t ) ξ_t + (1/2) ξ_t · ( A_t + θ̇_t ) ξ_t = (1/2) ( ξ̇_t + K_t ξ_t ) · q_t ( ξ̇_t + K_t ξ_t ),    (26)

where K_t ≡ q_t^{-1} ( B_t^T + θ_t ), provided

    A_t + θ̇_t = K_t^T q_t K_t = ( B_t + θ_t ) q_t^{-1} ( B_t^T + θ_t ),    (27)

that is, provided θ solves the ODE (17). Therefore we have shown that the second statement of the theorem follows once we have established the first.

We now turn to the issue of the existence⁶ of a solution to (17). The proof that the quadratic ODE (17) has a solution exploits the only piece of information we have about the quadratic form Q, namely that it is non-negative, since x* maximized Λ(x|y). We may therefore consider the optimal control problem

    G(t, x) ≡ inf { ∫_t^T { (1/2) ξ_s · A_s ξ_s + ξ_s · B_s ξ̇_s + (1/2) ξ̇_s · q_s ξ̇_s } ds : ξ_t = x }    (28)

which is well posed, and the solution G has a quadratic form:

    G(t, x) = (1/2) x · H_t x,    (29)

which is centred, since clearly G(t, 0) = 0 for all t. Since G is the value function, we shall have that

    h(t) ≡ ∫_0^t { (1/2) ξ_s · A_s ξ_s + ξ_s · B_s ξ̇_s + (1/2) ξ̇_s · q_s ξ̇_s } ds + G(t, ξ_t)    (30)

is non-decreasing whatever ξ we choose, and will be constant if we choose the optimal ξ. Differentiating (30) gives us

    ḣ_t = (1/2) ξ_t · A_t ξ_t + ξ_t · B_t ξ̇_t + (1/2) ξ̇_t · q_t ξ̇_t + (1/2) ξ_t · Ḣ_t ξ_t + ξ_t · H_t ξ̇_t.    (31)

⁶ Uniqueness of the solution is not a problem, because the right-hand side of the ODE (17) is locally Lipschitz in θ, and once we know that there exists a solution this is enough to guarantee uniqueness.
Minimizing over ξ̇_t and equating to zero to find the optimal ξ̇_t gives us

    ξ̇_t = −q_t^{-1} ( B_t^T ξ_t + H_t ξ_t ),    (32)

so that the minimized value of ḣ_t becomes

    (1/2) ξ_t · A_t ξ_t + (1/2) ξ_t · Ḣ_t ξ_t − (1/2) ( B_t^T ξ_t + H_t ξ_t ) · q_t^{-1} ( B_t^T ξ_t + H_t ξ_t ).

At optimality, this must be zero; so

    0 = A_t + Ḣ_t − ( B_t + H_t ) q_t^{-1} ( B_t^T + H_t ).    (33)

Evidently from the definition we have G(T, ·) ≡ 0, so that H solves (17) with the zero boundary condition at t = T. In other words, we have found θ, and θ = H. Notice in particular that the equation (32) becomes ξ̇_t = −K_t ξ_t. □

Numerical aspects. Theorem 3 characterizes the behaviour of the path x around the local minimizer x*, but in any given example, this is not the end of the story. The point is that when we seek x* we resort to numerical means⁷ to solve the ODE (11) subject to the boundary conditions (10), (12), and it may be that the solution found is not the global minimizer. It could be that the solution obtained is only a local minimizer; it could be that the solution found is only a stationary point of Λ(·|y); or it could be that the numerical method has got into difficulties and what it returns is not a solution. An obvious approach to this problem is simply to try numerical solution of (17) back in time from initial value θ_T = 0; however, if the ODE solver fails, usually the entire calculation stops, and no attempt to continue is possible. We can avoid this issue if we transform the first-order non-linear Riccati ODE into a second-order linear ODE. The following result gives all we need.

⁷ All the calculations for this paper were performed in Scilab (www.scilab.org), which is a free and very versatile package able to perform a wide range of numerical and graphical tasks.
