Bayesian Computation for Log-Gaussian Cox Processes: A Comparative Analysis of Methods

Ming Teng¹, Timothy D. Johnson¹, Farouk S. Nathoo²,*

Abstract

The Log-Gaussian Cox Process is a commonly used model for the analysis of spatial point patterns. Fitting this model is difficult because of its doubly stochastic property, i.e., it is a hierarchical combination of a Poisson process at the first level and a Gaussian process at the second level. Different methods have been proposed to estimate such a process, including traditional likelihood-based approaches as well as Bayesian methods. We focus here on Bayesian methods and several approaches that have been considered for model fitting within this framework, including Hamiltonian Monte Carlo, the integrated nested Laplace approximation, and Variational Bayes. We consider these approaches and make comparisons with respect to statistical and computational efficiency. These comparisons are made through several simulation studies as well as through applications examining both ecological data and neuroimaging data.

Keywords: Hamiltonian Monte Carlo, Laplace Approximation, Log-Gaussian Cox Process, Variational Bayes

1. Introduction

Spatial point process models for point pattern data have many applications, including those involving ecology, geology, seismology, and neuroimaging. The theory of point processes has its roots in the work of Poisson in the 19th century, while more modern treatments with a statistical focus include Daley and Vere-Jones (1988), Møller et al. (1998) and Illian et al. (2008). Among models for a spatial point process, the homogeneous Poisson process is the most fundamental, but its use is limited in many applications due to its simplistic nature. A related but more flexible process is the Log-Gaussian Cox Process (LGCP), a process that is obtained by assuming a hierarchical structure, where at the first level the process is assumed Poisson conditional on the intensity function, and at the second level the log of the intensity function is assumed to be drawn from a Gaussian process. The flexibility of the model arises from the Gaussian process prior specified over the log-intensity function. Given this hierarchical structure with a Gaussian process at the second level, fitting this model to observed spatial point pattern data is a computational challenge (Murray et al., 2012).

* Corresponding author. Tel: +1 250 472 4693. Email: [email protected]
¹ University of Michigan, Department of Biostatistics, 1415 Washington Heights, Ann Arbor, MI 48109
² University of Victoria, Department of Mathematics and Statistics, Victoria, BC, Canada, V8W 3P4

Preprint submitted to arXiv, January 5, 2017

A number of approaches have been developed to estimate the LGCP model in both the classical and Bayesian frameworks. Diggle (1985) proposes an adaptation of Rosenblatt's kernel method (Rosenblatt, 1956) for the purpose of non-parametric estimation and then derives an expression for the mean squared error based on a stationarity assumption. A Bayesian framework is considered in Møller et al. (1998), where the authors propose the use of the Metropolis-Adjusted Langevin Algorithm (MALA) (Besag, 1994) for Monte Carlo sampling of the posterior distribution. These authors also introduce a discretization of the spatial domain in order to attain computational tractability. Adams et al.
(2009) propose an exact estimation method to deal with a modification of such a point process, which they term the Sigmoidal Gaussian Cox process. MALA is a special case of the potentially more efficient Hamiltonian Monte Carlo (HMC) algorithm (Neal, 1995), which uses the notion of Hamiltonian dynamics to construct proposals for the Metropolis-Hastings algorithm and has recently been adopted for an increasing number of applications (Neal, 2011). Extending further, the Riemann manifold Hamiltonian Monte Carlo algorithm proposed by Girolami and Calderhead (2011) is a generalization of both MALA and HMC that can lead to more efficient posterior sampling in some cases. One drawback of Riemann manifold HMC is that it requires the inversion of a potentially large matrix (the expected Fisher information matrix plus the negative Hessian of the log prior) at each iteration, and this is not computationally feasible for very high-dimensional problems such as those typically involving the LGCP model.

An advantage associated with the use of MCMC algorithms for Bayesian computation is the underlying theory, which guarantees simulation-consistent estimation of various important characteristics of the posterior distribution. Thus the practitioner is assured of an accurate Monte Carlo representation of the posterior distribution given a sufficient amount of sampling effort. A drawback is that MCMC can be computationally intense, and this has motivated several alternative deterministic approaches for approximate Bayesian inference.

One such approach is Variational Bayes (VB) (MacKay, 1997), where the approximation to the posterior distribution is assumed to lie within some convenient family of distributions, and an optimization is then carried out to minimize the Kullback-Leibler divergence measuring the discrepancy between the true posterior and the approximation. Often the family of distributions within which the approximation is assumed to lie is based on the notion of a mean field approximation, which corresponds to assuming posterior independence between certain model parameters. The idea in this case is to replace stochastic posterior dependence between parameters with deterministic dependence between the posterior moments in a manner that minimizes the Kullback-Leibler divergence. Variational Bayes approximations have been applied successfully to the analysis of hidden Markov models in MacKay (1997) and to other mixture models (Humphreys and Titterington, 2000). Zammit-Mangion et al. (2012) used Variational Bayes for models of spatiotemporal systems represented by linear stochastic differential equations and demonstrated quick and efficient approximate inference both for continuous observations and point process data.

Mean field Variational Bayes is well suited for dealing with models within the conjugate exponential family, where closed form solutions for the iterative steps of the optimization algorithm are available. In general such closed form solutions may not be available, and additional approximations are then required. This is the case with the LGCP model. Mean field Variational Bayes approximations for non-conjugate models can be obtained by incorporating further approximations based on the delta method or the Laplace method (Wang and Blei, 2013). These approximations are successfully applied to a correlated topic model and a Bayesian logistic regression model in Wang and Blei (2013).
For the LGCP model, tractable variational approximations can be obtained following this approach, where a mean field approximation with further approximations based on the Laplace method is used to handle the non-conjugate structure of the model.

Variational Bayes approximations can work well in some settings, and the corresponding approximations can be computed relatively quickly. A drawback is that there is no underlying theory guaranteeing the accuracy of the approximation or characterizing its error; these approximations therefore need to be evaluated on a case-by-case basis, and they may or may not achieve reasonable accuracy depending on the requirements of the practitioner. One contribution of this paper is the derivation of a mean field VB approximation which incorporates the Laplace method for the LGCP model. As far as we are aware, such approximations have not been considered previously for this model. Another contribution of this paper is to compare this VB approximation with HMC in terms of both statistical and computational efficiency.

An alternative approach for approximate Bayesian inference that has gained tremendous popularity in the statistical literature is the integrated nested Laplace approximation (INLA) (Rue et al., 2009). INLA is less generally applicable than MCMC or VB, as it assumes the model has a latent Gaussian structure with only a moderate number of hyperparameters. For spatial modeling the approach makes use of the Gaussian Markov random field (Rue and Held, 2005) and corresponding approximations, which are known to be computationally efficient. The basis of INLA is the use of the Laplace approximation and numerical integration with latent Gaussian models to derive approximate posterior marginal distributions. INLA does not produce an approximation to the joint posterior, which is a drawback of the approach in settings where the joint posterior (as opposed to the marginals) is of interest.

For spatial models incorporating a Gaussian random field (GRF) with a Matérn correlation structure, Lindgren et al. (2011) develop an approximate approach based on stochastic partial differential equations (SPDEs), and these approximations have been combined for use with INLA. The essence of the approach is to specify an SPDE that has the GRF as its solution; the SPDE representation is then used in conjunction with basis representations to approximate the process over the vertices of a two-dimensional mesh covering the spatial domain. The value of the process at any location is then obtained by interpolation of the values at the mesh vertices. In recent work, Simpson et al. (2012) evaluated this approximation applied to the LGCP model with spatially varying covariates and demonstrated adequate performance for the settings and data considered there.

A comparison between INLA and MALA for models incorporating GMRF approximations is considered in Taylor and Diggle (2013). In our work, we compare for the LGCP model Bayesian computation based on HMC, VB with a Laplace approximation, INLA, and INLA with the SPDE approximation. The comparisons we make are with respect to computational time, properties of estimators, posterior variability, and goodness-of-fit checking based on posterior predictive methods. Our objective is to provide practical guidance for users of the LGCP model. In addition to these comparisons, there are two novel aspects to the work presented here.
First, we develop a mean field variational Bayes approximation that incorporates the Laplace method to deal with the non-conjugacy of the LGCP model. To the best of our knowledge this is the first time such an approximation has been developed for approximate Bayesian inference with the LGCP model. Second, we apply HMC for fully Bayesian inference, and a novel aspect of our implementation is that HMC is used to update the decay (correlation) parameter associated with the latent Gaussian process. As a result the sampling algorithm mixes very well, and to our knowledge the development of the HMC algorithm in this context is the first of its kind.

The remainder of the paper proceeds as follows. Section 2 discusses various approaches for conducting Bayesian inference for the LGCP model. Section 3 presents a comparison of approaches through simulation studies, while Section 4 makes comparisons using two real point pattern datasets, the first arising from an ecological application and the second arising from a neuroimaging study. The paper concludes with a discussion in Section 5.

2. Bayesian inference for log-Gaussian Cox processes

2.1. Model specification

Consider an inhomogeneous Poisson process $\mathcal{Z}(s)$ with intensity function $\lambda(s)$, $s \in S \subseteq \mathbb{R}^2$. Without loss of generality we shall assume that $S$ is the unit square. The density of a Poisson process does not exist with respect to Lebesgue measure, but the Radon-Nikodym derivative does exist with respect to a unit-rate Poisson process (Møller et al., 1998). We will call this derivative the density of the Poisson process. Given a set of $K$ points $\{s_k\} = \{s_1, \ldots, s_K\} \subset S$, where both the number $K$ and the locations $s_k$ are random, the density is given by

$$\pi[\{s_k\} \mid \lambda(s)] = \exp\left\{ \int_S [1 - \lambda(s)]\, ds \right\} \prod_{k=1}^{K} \lambda(s_k), \qquad (1)$$

where $\lambda(s)$ is the intensity function. If we further assume that the log of the intensity function arises from a Gaussian random field (GRF) $\mathcal{Y}(s)$, so that $\lambda(s) = \exp(\mathcal{Y}(s))$, then this hierarchical process is called a log-Gaussian Cox process (LGCP) (Møller et al., 1998). The LGCP, assumed to be stationary and isotropic, is uniquely determined by the mean function $\mu(s)$ and the covariance function $\mathrm{Cov}(s, s') = \sigma^2 r(\|s - s'\|)$ of the Gaussian process $\mathcal{Y}(s)$, where $\sigma^2$ is the marginal variance and $r(\|s - s'\|)$ denotes correlation as a function of the Euclidean distance $\|s - s'\|$. Two commonly used correlation functions are the power exponential function (Møller and Waagepetersen, 2003),

$$r_p(\|s - s'\|) = \exp\left(-\rho \|s - s'\|^{\delta}\right),$$

where $\rho > 0$ is the decay parameter and $\delta \in (0, 2]$ is the power exponential term (which we will take as a known constant throughout this manuscript); and the Matérn correlation function (Matérn, 1960),

$$r_m(\|s - s'\|) = \left(\Gamma(\nu)\, 2^{\nu-1}\right)^{-1} \left(\|s - s'\|/\phi\right)^{\nu} K_{\nu}\left(\|s - s'\|/\phi\right),$$

where $\phi > 0$ is the range parameter, $\nu > 0$ is the shape parameter, and $K_{\nu}$ is the modified Bessel function of the second kind.
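Both correlation functions are straightforward to evaluate over a set of locations. The following is a minimal sketch, assuming Python with numpy/scipy; the helper names are ours for illustration, not from the paper.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.special import gamma, kv

def grid_centroids(n):
    # Centroids of an n x n grid of equally spaced cells on the unit square.
    ticks = (np.arange(n) + 0.5) / n
    xx, yy = np.meshgrid(ticks, ticks)
    return np.column_stack([xx.ravel(), yy.ravel()])  # shape (n^2, 2)

def power_exponential_corr(coords, rho, delta=1.0):
    # r_p(d) = exp(-rho * d^delta), with delta in (0, 2] held fixed.
    d = cdist(coords, coords)
    return np.exp(-rho * d ** delta)

def matern_corr(coords, phi, nu):
    # r_m(d) = (Gamma(nu) 2^(nu-1))^(-1) (d/phi)^nu K_nu(d/phi), with r_m(0) = 1.
    d = cdist(coords, coords) / phi
    out = np.ones_like(d)
    nz = d > 0
    out[nz] = d[nz] ** nu * kv(nu, d[nz]) / (gamma(nu) * 2 ** (nu - 1))
    return out

C = power_exponential_corr(grid_centroids(8), rho=10.0)  # small grid example
```

This direct construction is useful for small illustrative grids; for the grid sizes used later in the paper, the FFT-based machinery of Section 2.2 is what makes computation practical.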
To fit the model in a tractable way, a common approach is to divide the spatial domain into an $n \times n$ uniform grid of equally spaced cells (Møller et al., 1998) and to make the simplifying assumption that the log-intensity is constant over each grid cell, so that the log-intensity $\mathcal{Y}(s)$ within a given cell, say the $i$th cell, is constant and characterized by its value at the corresponding centroid $c_i$ of cell $i$, $i \in \{1, \ldots, n^2\}$. The unique log-intensity values then comprise a vector $Y = (\mathcal{Y}(c_1), \mathcal{Y}(c_2), \ldots, \mathcal{Y}(c_{n^2}))^T$. To simplify notation we let $Y_i = \mathcal{Y}(c_i)$, and $y_i$ denote a realized value of $Y_i$. From the defining property of a GRF, $Y$ follows a multivariate normal distribution $Y \sim N(\mu 1_{n^2}, \sigma^2 C)$, where $C$ is the $n^2 \times n^2$ correlation matrix with elements $r(\|c_i - c_j\|)$. Let $\theta$ be the set of parameters determining the mean and covariance of the GRF (e.g., $\theta = (\mu, \sigma^2, \rho)$ for the power exponential correlation and $\theta = (\mu, \sigma^2, \phi)$ for the Matérn), and let $A$ denote the area of each cell in the uniform grid. Under this discretization, the log density (see Equation (1)) is

$$\log \pi(\{s_k\} \mid \theta, y) = \text{constant} + \sum_i \left[ n_i y_i - A \exp(y_i) \right],$$

where $n_i$ is the number of points in $\{s_k\}$ occurring in the $i$th grid cell. The log posterior can then be expressed as

$$\log \pi(\theta, y \mid \{s_k\}) = \text{constant} + \sum_i \left[ n_i y_i - A \exp(y_i) \right] - 0.5 (y - \mu 1_{n^2})^T \sigma^{-2} C^{-1} (y - \mu 1_{n^2}) - 0.5\, n^2 \log(\sigma^2) - 0.5 \log(|C|) + \log \pi(\theta), \qquad (2)$$

where $\pi(\theta)$ is the prior density of the parameter vector $\theta$. The computational problem for Bayesian inference is the calculation of $\pi(\theta, y \mid \{s_k\})$ and its associated marginals, or properties of these distributions. This computation is nontrivial because the calculation of the normalizing constant is nontrivial, particularly when the dimension of the parameter space is high.
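The unnormalized log posterior in Equation (2) is easy to evaluate once $C^{-1}$ and $\log|C|$ are available. The sketch below is a hypothetical numpy implementation with these quantities assumed precomputed; in practice the FFT-based techniques of Section 2.2 below avoid forming $C^{-1}$ explicitly.

```python
import numpy as np

def log_posterior(y, mu, sigma2, C_inv, logdet_C, counts, A, log_prior=0.0):
    # Unnormalized log posterior of Equation (2). 'counts' holds the n_i;
    # C_inv and logdet_C are assumed precomputed for the current theta.
    resid = y - mu
    loglik = np.sum(counts * y - A * np.exp(y))    # sum_i [n_i y_i - A exp(y_i)]
    quad = -0.5 * (resid @ C_inv @ resid) / sigma2  # Gaussian prior on y
    return (loglik + quad - 0.5 * y.size * np.log(sigma2)
            - 0.5 * logdet_C + log_prior)
```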
We now discuss three approaches for approximating the posterior distribution and/or its marginals.

2.2. Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) has its origins in the work of Alder and Wainwright (1959) and Duane et al. (1987) and was first introduced into the statistical literature by Neal (1995). It is a Metropolis-Hastings algorithm that can be used for sampling a high-dimensional target distribution more efficiently than an algorithm based on random-walk proposals, especially when the parameters are highly correlated. The algorithm uses the notion of the (separable) Hamiltonian $H(q, p)$ from physics, defined as the sum of potential energy $U(q)$ and kinetic energy $K(p)$, where $q$ and $p$ are random vectors that refer to position and momentum. The connection to Bayesian computation lies in relating $U(q)$ to the posterior distribution, and hence $q$ to the model parameters, and in introducing auxiliary Gaussian random variables to represent the momentum $p$, a vector having the same length as $q$. The evolution of this system is then described by the Hamilton equations from statistical mechanics:

$$\frac{dq_i}{dt} = \frac{\partial H}{\partial p_i}, \qquad \frac{dp_i}{dt} = -\frac{\partial H}{\partial q_i}, \qquad (3)$$

which, if an analytic solution exists, produces a draw from the posterior distribution. In practice this system is solved using numerical integration techniques (Neal, 2011), and the resulting approximate solution is accepted or rejected using a Metropolis-Hastings step.

To carry out the computations required for the LGCP model we use a combination of a re-parametrization of the random field and numerical techniques based on the two-dimensional fast Fourier transform (FFT), as in Møller et al. (1998). Note that although Girolami and Calderhead (2011) have applied Riemann manifold HMC to the LGCP model, their computation is slow due to the inversion of the Fisher information matrix and a Cholesky factorization of the correlation matrix. Here, in contrast, we can greatly speed up matrix multiplications involving the correlation matrix $C$ (we note here that we use the power exponential correlation function) by using the FFT. A second reason to use the re-parametrization is that it avoids inversion of the correlation matrix at each iteration, which in our simulations and data analyses below is a $4096 \times 4096$ matrix. Although a matrix of this size can be inverted on a computer, it is computationally expensive. Furthermore, the FFT trick can handle much larger matrices that would be too large to invert on most computers.

To use this trick we require the matrix to be block-circulant, as there is a direct relationship between the eigenvalue-eigenvector decomposition of a block-circulant matrix and the 2D discrete Fourier transform. However, the correlation matrix has a block-Toeplitz structure. A block-Toeplitz matrix can always be extended to a block-circulant matrix (Wood and Chan, 1994). To do so, we extend the original $n \times n$ grid to an $m \times m$ grid and wrap it on a torus, where $m = 2^g$ and $g$ is an integer such that $m \geq 2(n-1)$. The metric of this toroidal space is then defined by the minimum distance between two points. It is easy to show that the new correlation matrix $E$ (of which $C$ is a submatrix), whose elements are based on the metric defined on the torus, is a block-circulant matrix (Møller et al., 1998). In extending the space we must also expand the vector of latent variables $Y$ in a corresponding manner, and we refer to this new vector as $Y_{\text{ext}}$ (of which $Y$ is a subvector). Also, we set the number of points in cell $i$, $m_i$, to be $n_i$ if $i$ is on the original grid and equal to 0 otherwise.

The block-circulant extended correlation matrix can be decomposed as $E = F \Lambda F^H$, where $F$ is the matrix of eigenvectors, $\Lambda$ is the diagonal matrix containing the corresponding eigenvalues of $E$, and $H$ denotes the complex conjugate transpose. Given a random vector $v$ of length $m^2$, the product $Ev$ can be obtained by calculating $F^H v$, then $\Lambda F^H v$, and then $F \Lambda F^H v$, in that order. Note that the first and last calculations amount to a discrete inverse Fourier transform and a discrete Fourier transform (DFT), respectively. The middle calculation is simply element-wise multiplication of the diagonal of $\Lambda$ and the vector $F^H v$ (Rue and Held, 2005). As a result, the complexity of the required matrix operations can be reduced to $O(m^2 \log(m^2))$ using the FFT.
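In code, the circulant structure means that products with $E$ (and with $E^{1/2}$, which appears in the re-parametrization introduced next) never require forming the $m^2 \times m^2$ matrix. A minimal sketch, assuming numpy and that the generating base of $E$ (its first column, reshaped to the $m \times m$ torus) is stored as a 2D array:

```python
import numpy as np

def circulant_matvec(base, v):
    # E v via the 2D FFT: 'base' is the m x m array generating the
    # block-circulant matrix E, and v is an m x m array holding the vector.
    lam = np.fft.fft2(base)          # eigenvalues of E via the 2D DFT
    return np.real(np.fft.ifft2(lam * np.fft.fft2(v)))

def circulant_sqrt_matvec(base, v):
    # E^{1/2} v, assuming E is positive definite so that its eigenvalues
    # are positive; used for the re-parametrization described below.
    lam = np.fft.fft2(base)
    return np.real(np.fft.ifft2(np.sqrt(lam) * np.fft.fft2(v)))
```

Both products cost $O(m^2 \log(m^2))$ rather than the $O(m^4)$ of a dense matrix-vector multiplication.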
After extension of the grid we re-parametrize the latent variables $Y_{\text{ext}}$ as $Y_{\text{ext}} = \mu 1_{m^2} + \sigma E^{\frac{1}{2}} \gamma$, where $1_{m^2}$ denotes the $m^2$-dimensional vector of ones and $\gamma = (\gamma_1, \ldots, \gamma_{m^2})^T$ with $\gamma_i \overset{iid}{\sim} N(0,1)$, $i = 1, \ldots, m^2$. The gradients used in the HMC algorithm are straightforward to derive for all the parameters except $\rho$; we give the expression for $\rho$ here and refer the reader to Appendix A for more details:

$$\frac{\partial \log \pi(\rho \mid \cdot)}{\partial \rho} = -\frac{\sigma}{2} \left[ m - A \exp\left( \mu 1_{m^2} + \sigma E^{\frac{1}{2}} \gamma \right) \right]^T E^{-\frac{1}{2}} E^{*} \gamma, \qquad (4)$$

where $\pi(\cdot \mid \cdot)$ denotes the full conditional given the data and other parameters, and $m$ and $E^{*}$ are defined in the appendix. As $E^{\frac{1}{2}}$, $E^{-\frac{1}{2}}$, and $E^{*}$ are all block-circulant matrices, the FFT can be used.

With the stochastic representation as in Equation (3), the HMC algorithm is based on setting $U(q) = -\log[\pi(\{s_k\} \mid \theta, \gamma)\, \pi(\theta, \gamma)]$, where $q = (\gamma^T, \theta^T)^T$, and the kinetic energy term is $K(p) = p^T M^{-1} p / 2$, where $M$ is a symmetric, positive-definite 'mass matrix' for the auxiliary momentum variables $p$ (a vector of length $m^2 + 3$). In our work we set $M$ to be a diagonal matrix with distinct diagonal components $m_\gamma$, $m_\mu$, $m_{\sigma^{-2}}$, and $m_\rho$ corresponding to $\gamma$ and $\theta$. Each iteration of the HMC algorithm involves a block update of $\gamma$ and $\theta$ based on the Hamiltonian Monte Carlo scheme. Each such update requires $L + 1$ evaluations of the gradient vector $\nabla_{\theta,\gamma} \log \pi(\theta, \gamma \mid \{s_k\})$ for some $L \geq 1$. If $L = 1$, the HMC algorithm reduces to MALA, which typically mixes faster than the random walk Metropolis-Hastings algorithm (Roberts et al., 2001) but not as fast as the more general HMC algorithm.

Letting $\gamma^{*}, \theta^{*}$ be the current values in the Markov chain for $\gamma, \theta$, the HMC update, based on a step size $\varepsilon > 0$, proceeds as follows:

Algorithm 1 HMC algorithm
1. Simulate the latent vector $p^{*} \sim N_{m^2+3}(0, M)$. Set
$$\left( \gamma^{(0)}, \theta^{(0)} \right) = (\gamma^{*}, \theta^{*}), \qquad p^{(0)} = p^{*} + \frac{\varepsilon}{2} \nabla_{\theta^{*},\gamma^{*}} \left[ \log\left\{ \pi(\gamma^{*}, \theta^{*} \mid \{s_k\})\, \pi(\theta^{*}, \gamma^{*}) \right\} \right].$$
2. For $l = 1, \ldots, L$,
$$\left( \gamma^{(l)}, \theta^{(l)} \right)^T = \left( \gamma^{(l-1)}, \theta^{(l-1)} \right)^T + \varepsilon M^{-1} p^{(l-1)},$$
$$p^{(l)} = p^{(l-1)} + \varepsilon_l \nabla_{\theta^{(l)},\gamma^{(l)}} \left[ \log\left\{ \pi\left( \gamma^{(l)}, \theta^{(l)} \mid \{s_k\} \right) \pi\left( \theta^{(l)}, \gamma^{(l)} \right) \right\} \right],$$
where $\varepsilon_l = \varepsilon$ for $l < L$ and $\varepsilon_L = \varepsilon/2$.
3. Accept $(\gamma^{(L)}, \theta^{(L)})$ as the new state for $(\gamma, \theta)$ with probability
$$\alpha = \min\left\{ 1, \exp\left( -H\left( q^{(L)}, p^{(L)} \right) + H(q^{*}, p^{*}) \right) \right\},$$
else remain in the current state $(\gamma^{*}, \theta^{*})$ with probability $1 - \alpha$.
Repeat steps 1-3 for a sufficiently long time.

At each iteration, the number of steps in the numerical integration, $L$, is drawn from a Poisson distribution with mean 100, while the step size, $\varepsilon$, is initially chosen to be 0.005 and adjusted adaptively during burn-in so that the acceptance rate is approximately 0.65 (Beskos et al., 2013). Trace plots of the parameters are examined, and based on these we adjust the values of $M$ to improve the mixing.
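Abstracting away the FFT-based gradient evaluations, one transition in the style of Algorithm 1 can be sketched generically. The following is a minimal, hypothetical numpy implementation with a diagonal mass matrix and user-supplied log-density and gradient functions, not the paper's exact code:

```python
import numpy as np

def hmc_update(q, log_post, grad_log_post, eps, L, M_diag, rng):
    # One HMC transition: q stacks (gamma, theta); M_diag holds the diagonal
    # of the mass matrix M; log_post and grad_log_post evaluate the log
    # target log[pi({s_k} | theta, gamma) pi(theta, gamma)] and its gradient.
    p = rng.normal(scale=np.sqrt(M_diag))          # momentum ~ N(0, M)
    q_new = q.copy()
    p_new = p + 0.5 * eps * grad_log_post(q_new)   # initial half-step (step 1)
    for l in range(L):                             # leapfrog loop (step 2)
        q_new += eps * p_new / M_diag
        step = eps if l < L - 1 else 0.5 * eps     # final momentum half-step
        p_new += step * grad_log_post(q_new)
    # H(q, p) = U(q) + K(p) with U(q) = -log target, K(p) = p' M^{-1} p / 2
    H_old = -log_post(q) + 0.5 * np.sum(p ** 2 / M_diag)
    H_new = -log_post(q_new) + 0.5 * np.sum(p_new ** 2 / M_diag)
    if rng.uniform() < np.exp(min(0.0, H_old - H_new)):   # step 3
        return q_new                               # accept proposed state
    return q                                       # reject: keep current state
```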
To derive the update step 8 for µ we have 1 1 E logπ(y θ)π(µ) = c (µ µ1)TE(σ−2)(C−1)(µ µ1) (µ µ )2/σ2 −q(µ) | − 2 y − y − − 2 − µ µ (6) = c 1 (E(σ−2)1(cid:2)T(C−1)1+1/σ2)µ2 2E(σ−2)µ T(cid:3)(C−1)1µ+µ /σ2 − 2 µ − y µ µ (cid:2) (cid:3) where µ = E[y] andc denotes a constant not depending onµ. As(6) is a quadratic function y of µ it follows that q(µ) is the density of a normal distribution N(µ∗,σ∗2) where after some µ µ algebra we have µ∗ = (E(σ−2)µ T(C−1)1+µ /σ2)(σ∗2)−1, σ∗2 = (E(σ−2)1T(C−1)1+1/σ2)−1. µ y µ µ µ µ µ To derive the update step for σ2 we have n2 1 E−q(σ2)logπ(y θ)π(σ2) = c logσ2 σ−2Eµ[(µy µ1)TC−1(µy µ1) | − 2 − 2 − − +Tr(C−1Σ )] (α+1)logσ2 β/σ2 y − − where Σ is the covariance matrix of y under q(y) and c denotes a constant not depending y on σ2. Simplifying this expression yields n2 1 E−q(σ2)logπ(y | θ)π(σ2) =c−(α+1+ 2 )logσ2 −(β + 2[(µy −µ∗µ1)T(C−1)(µy −µ∗µ1) +σ∗21T(C−1)1+Tr(C−1Σ )])/σ2 µ y and thus q(σ2) is the density of an Inverse-Gamma distribution G−1(α+ n2,β∗) where β∗ = 2 β + 1 (µ µ∗1)T(C−1)(µ µ∗1)+σ∗21T(C−1)1+Tr(C−1Σ ) . 2 y − µ y − µ µ y As the Gaussian prior for y is not conditionally conjugate for the LGCP model, the i (cid:2) (cid:3) variational Bayes update for q(y ) is not a standard distribution and is therefore not easy i to compute without some further approximation. We derive an update step for q(y ) by i applying the Laplace method (Wang and Blei, 2013) within the variational Bayes update. We have E logπ( s y)π(y θ) = c+(y n eyiA) −q(yi) { k} | | i i − 1 (y˜ E(µ)1)TE(σ−2)(C−1)(y˜ E(µ)1)+Var(µ)E(σ−2)Tr((C−1)11T) −2 − − (cid:2) = c+(y n eyiA) 1 (y˜ E(µ)1)TE(σ−2)(C−1)(y˜ E(µ)1)(cid:3) (7) i i − − 2 − − (cid:2) (cid:3) where y˜ denotes (E(y1),...,E(yi−1),yi,E(yi+1),...,E(yn2))T. Taking the derivative with re- spect to y yields, i f(y ) = ∂E logπ( s y)π(y θ)/∂y = n Aeyi E(σ−2) (C−1)(y˜ E(µ)1) i −q(yi) { k} | | i i − − − i = n Aeyi E(σ−2) (cid:2)(C−1) (y˜ E(µ)1(cid:3)) i ij j − − − j X = Aeyi E(σ−2)(C−1) y +n +E(σ−2)(C−1) E(µ) E(σ−2) (C−1) (y˜ E(µ)1) (8) ii i i ii ij j − − − − j6=i X 9 Given (8) we find yˆ such that f(yˆ) = 0. We use Newton’s method to obtain a numerical i i solution where the starting value for Newton’s method is obtained by omitting the linear term in equation (8) and solving the resulting simplified equation exactly. We then take the second derivative with respect to y , H(y ) = ∂f(y )/∂y = Aeyi E(σ−2)(C−1) and given i i i i ii − − this and the solution yˆ, the Laplace method yields a normal distribution N(yˆ, H(yˆ)−1) i i i − for q(y ) which approximates the VB update. The variational-Laplace approximation to the i posterior of the latent field y is then y N(µy,Σy) where µy = (yˆ1,...,yˆn2)T and Σy is a ∼ diagonal matrix with the ith element being H(yˆ)−1. i − Given the update steps derived above, the approximate posterior distribution under the variational-Laplace approximation takes the form (5) where the component densities are standard distributions y N(µ(q),Σ(q)) y y ∼ µ N(µ(q),σ2(q)) ∼ µ µ σ2 G−1(α(q),β(q)). ∼ Theparametersdeterminingthesedistributionsarecalledthe’variationalparameters’. These parameters are obtained through the sequence of update steps derived above which are used to determine equations expressing each variational parameter in terms of the remaining variational parameters; beginning with initial values for these parameters the equations are iterated to convergence of F. 
With respect to the parameter $\rho$ of the power exponential correlation function, we have found that its inclusion as an unknown parameter in the variational approach leads to convergence problems in a number of trial examples. To deal with this problem, we estimate this parameter prior to running the VB algorithm using the method of minimum contrast (Diggle and Gratton, 1984); that is, non-linear least squares is used to fit the theoretical covariance function to a non-parametric estimate of the covariance function. The mean field VB algorithm incorporating the Laplace method for the LGCP model is presented in detail in Algorithm 2.

2.4. INLA

The integrated nested Laplace approximation (INLA) is another approach for constructing a deterministic approximation to the posterior distribution that can be applied to the fairly broad class of latent Gaussian models. The details underlying the approach have been described in a number of recent papers, including the seminal work of Rue et al. (2009). We provide here only a brief overview of aspects that are relevant for use with the LGCP model.

For spatial models, INLA makes extensive use of the Gaussian Markov random field (GMRF), which is a Gaussian distribution having a sparse precision matrix. Algorithms for fitting models incorporating a GMRF can be made efficient through the use of numerical methods for sparse matrices. In the case of the LGCP model, the latent GRF, which is a spatially continuous process, is approximated by a GMRF on a discrete lattice so that these numerical methods can be applied. We consider this approximation for the case where the GRF is a Matérn field with $\nu$ known, so that $\theta = (\mu, \sigma^2, \phi)$ and the log intensity is characterized by $Y$ as before. An approximation $\tilde{\pi}(\theta \mid \{s_k\})$ to the marginal posterior distribution $\pi(\theta \mid \{s_k\})$ is first obtained using the Laplace method. An approximation $\tilde{\pi}(y_i \mid \theta, \{s_k\})$ to the density of the full conditional distribution of each component of the