A Conceptual Introduction to Hamiltonian Monte Carlo

Michael Betancourt

Abstract. Hamiltonian Monte Carlo has proven a remarkable empirical success, but only recently have we begun to develop a rigorous understanding of why it performs so well on difficult problems and how it is best applied in practice. Unfortunately, that understanding is confined within the mathematics of differential geometry which has limited its dissemination, especially to the applied communities for which it is particularly important.

In this review I provide a comprehensive conceptual account of these theoretical foundations, focusing on developing a principled intuition behind the method and its optimal implementations rather than any exhaustive rigor. Whether a practitioner or a statistician, the dedicated reader will acquire a solid grasp of how Hamiltonian Monte Carlo works, when it succeeds, and, perhaps most importantly, when it fails.

Michael Betancourt is a research scientist in the Applied Statistics Center at Columbia University. Much of this review was completed as a Research Fellow at the Centre for Research in Statistical Methodology, University of Warwick, Coventry CV4 7AL, UK (e-mail: [email protected]).

CONTENTS

1 Computing Expectations By Exploring Probability Distributions
  1.1 Computing Expectations in Practice
  1.2 Parsimonious Expectation Computation
  1.3 The Geometry of High-Dimensional Spaces
  1.4 The Geometry of High-Dimensional Probability Distributions
2 Markov Chain Monte Carlo
  2.1 Estimating Expectations with Markov Chains
  2.2 Ideal Behavior
  2.3 Pathological Behavior
  2.4 The Metropolis-Hastings Algorithm
3 The Foundations of Hamiltonian Monte Carlo
  3.1 Informing Effective Markov Transitions
  3.2 Phase Space and Hamilton’s Equations
  3.3 The Idealized Hamiltonian Markov Transition
4 Efficient Hamiltonian Monte Carlo
  4.1 The Natural Geometry of Phase Space
  4.2 Optimizing the Choice of Kinetic Energy
  4.3 Optimizing the Choice of Integration Time
5 Implementing Hamiltonian Monte Carlo in Practice
  5.1 Symplectic Integrators
  5.2 Correcting for Symplectic Integrator Error
  5.3 Optimal Choice of Symplectic Integrator
6 The Robustness of Hamiltonian Monte Carlo
  6.1 Diagnosing Poorly-Chosen Kinetic Energies
  6.2 Diagnosing Regions of High Curvature
  6.3 Limitations of Diagnostics
7 Conclusion
8 Acknowledgements
A Technical Details of Practical Implementations
  A.1 Notation
  A.2 Static Implementations
  A.3 Efficient Static Implementations
  A.4 Dynamic Implementations
  A.5 The No-U-Turn Sampler and the Current State of Stan
References

Hamiltonian Monte Carlo has followed a long and winding path into modern statistical computing. The method was originally developed in the late 1980s as Hybrid Monte Carlo to tackle calculations in Lattice Quantum Chromodynamics (Duane et al., 1987), a field focused on understanding the structure of the protons and neutrons that comprise nuclei, atoms, and ultimately the world around us. Within a few years Radford Neal recognized the potential of the method for problems in applied statistics in his pioneering work on Bayesian neural networks (Neal, 1995). Over the next decade the method began to make appearances in textbooks, notably MacKay (2003), who first used the term Hamiltonian Monte Carlo instead of Hybrid Monte Carlo, and Bishop (2006). Neal’s influential review (Neal, 2011), however, really introduced the approach into the mainstream of statistical computing. With the rise of high-performance software implementations such as Stan (Stan Development Team, 2017), the method has now become a pervasive tool across many scientific, medical, and industrial applications.

Only recently, however, have we begun to understand why the success of Hamiltonian Monte Carlo has been so extensive. Instead of relying on fragile heuristics, the method is built upon a rich theoretical foundation that makes it uniquely suited to the high-dimensional problems of applied interest (Betancourt et al., 2014). Unfortunately, this theoretical foundation is formulated in terms of differential geometry, an advanced field of mathematics that is rarely included in statistical pedagogy. Consequently this formal construction is often out of reach of theoretical and applied statisticians alike.

The aim of this paper is to introduce the intuition behind the success of Hamiltonian Monte Carlo while avoiding the mathematical details; in particular, I assume only a basic familiarity with probability and calculus. Such an approach necessarily sacrifices rigor, but I hope the concepts will be sufficiently intuitive to satisfy readers working in applied fields. I highly encourage those interested in learning more about Hamiltonian Monte Carlo, or even contributing to its development, to follow up on the references discussed in Betancourt et al., 2014, Section 2.

Our story will begin with an introduction to the geometry of high-dimensional probability distributions and how that geometry frustrates efficient statistical computing. We will then consider Markov chain Monte Carlo from this geometric perspective, motivating the features necessary to scale the approach to such high-dimensional problems.
By developing a method that inherently satisfies these criteria we will very naturally be led to Hamiltonian Monte Carlo. Finally I will discuss how this understanding can be extended to motivate not just the method itself but also its efficient practical implementation, including optimized tuning as well as inherent diagnostics of pathological behavior.

1. COMPUTING EXPECTATIONS BY EXPLORING PROBABILITY DISTRIBUTIONS

The ultimate undertaking in statistical computing is evaluating expectations with respect to some distinguished target probability distribution. For example, we might be interested in extracting information from a posterior distribution over model configuration space in Bayesian inference, or computing coverage of an estimator with respect to the likelihood over data space in frequentist statistics. Here we will be agnostic, considering only a target distribution, π, on a D-dimensional sample space, Q, and the corresponding expectations of functions, E_π[f].

Probability distributions, and the corresponding expectations, are rather abstract objects, however, and if we want to use them in practice then we need a more explicit means of specifying them. Here we will assume that the sample space is smooth, in which case we can represent the target distribution with a probability density function and expectations as integrals. Care must be taken, however, as this representation hides some of the more subtle behaviors of high-dimensional spaces that are critical to understanding how to compute these integrals, and hence the desired expectations, efficiently.

1.1 Computing Expectations in Practice

We begin by assuming that the target sample space, Q, can be parameterized by the real numbers such that every point q ∈ Q can be specified with D real numbers. Given a parameter space, Q, we can then specify the target distribution as a smooth probability density function, π(q), while expectations reduce to integrals over parameter space,

E_π[f] = ∫_Q dq π(q) f(q).

Parameterizations are not unique: we can always take another parameterization, Q′, over which we specify the target distribution with a different probability density function, π′(q′), while expectations reduce to the new integral,

E_π[f] = ∫_{Q′} dq′ π′(q′) f(q′).

Critically, however, the expectation values themselves are invariant to any particular choice of parameterization, so the integrals must be equal,

E_π[f] = ∫_Q dq π(q) f(q) = ∫_{Q′} dq′ π′(q′) f(q′).

In this review we will consider only a single parameterization for computing expectations, but we must be careful to ensure that any such computation does not depend on the irrelevant details of that parameterization, such as the particular shape of the probability density function.

Once we have chosen a parameterization, the abstract expectation becomes the concrete integral. Unfortunately, for any nontrivial target distribution we will not be able to evaluate these integrals analytically, and we must instead resort to numerical methods which only approximate them. The accuracy of these approximations, and hence the utility of any given algorithm, however, is limited by our finite computational power.
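To make this invariance concrete, the following sketch numerically evaluates the same expectation under two different parameterizations of a hypothetical one-dimensional exponential target. The choice of target, the function f(q) = q, and the use of SciPy’s quadrature routine are illustrative assumptions, not part of the original exposition.

```python
# A minimal sketch, assuming a one-dimensional exponential target, pi(q) = exp(-q)
# on q > 0, and the function f(q) = q, whose exact expectation is 1.
import numpy as np
from scipy.integrate import quad

def target_density(q):
    return np.exp(-q)              # density in the original parameterization

def f(q):
    return q                       # function whose expectation we want

# Expectation as an integral over the original parameterization.
expectation_q, _ = quad(lambda q: target_density(q) * f(q), 0.0, np.inf)

# Reparameterize with q = exp(q'); the new density picks up the Jacobian dq/dq' = exp(q').
def target_density_prime(q_prime):
    return target_density(np.exp(q_prime)) * np.exp(q_prime)

expectation_q_prime, _ = quad(lambda qp: target_density_prime(qp) * f(np.exp(qp)),
                              -np.inf, np.inf)

print(expectation_q, expectation_q_prime)  # both approximately 1.0
```

The two density functions have very different shapes, yet the two integrals agree up to quadrature error, exactly as the equality above requires.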
For a method to scale to the complex problems at the frontiers of applied statistics, it has to make effective use of each and every evaluation of the target density, π(q), and relevant functions, f(q). Optimizing these evaluations is a subtle problem frustrated by the natural geometry of probability distributions, especially over high-dimensional parameter spaces.

1.2 Parsimonious Expectation Computation

One way to ensure computational inefficiency is to waste computational resources evaluating the target density and relevant functions in regions of parameter space that have negligible contribution to the desired expectation. In order to avoid these regions, and focus our computation only on significant regions of parameter space, we first need to identify how the target density and target function contribute to the overall expectation.

Because integration is a linear operation, scaling the integrand proportionately scales the integral. Consequently, a common intuition is to focus on regions where the integrand is largest. This intuition suggests that we consider regions where the target density and target function take on their largest values.

In practice we are often interested in computing expectations with respect to many target functions; for example, in Bayesian inference we typically summarize our uncertainty with both means and variances, or multiple quantiles. Any method that depends on the specific details of any one function will then have to be repeatedly adjusted for each new function we encounter, expanding a single computational problem into many. Consequently, from here on we will assume that any relevant function is sufficiently uniform in parameter space that its variation does not strongly affect the integrand. Keep in mind, however, that if only a single expectation is in fact of interest then exploiting the structure of that function can provide significant improvements (Mira, Solgi and Imparato, 2013; Oates, Girolami and Chopin, 2016).

This assumption implies that the variation in the integrand is dominated by the target density, and hence we should consider the neighborhood around the mode where the density is maximized. This intuition is consistent with the many statistical methods that utilize the mode, such as maximum likelihood estimators and Laplace approximations, although it conflicts with our desire to avoid the specific details of the target density. Indeed, this intuition is fatally naive as it misses a critical detail.

Expectation values are given by accumulating the integrand over a volume of parameter space and, while the density is largest around the mode, there is not much volume there. To identify the regions of parameter space that dominate expectations we need to consider the behavior of both the density and the volume. In high-dimensional spaces the volume behaves very differently from the density, resulting in a tension that concentrates the significant regions of parameter space away from either extreme.

Fig 1. To understand how the distribution of volume behaves with increasing dimension we can consider a rectangular partitioning centered around a distinguished point, such as the mode. (a) In one dimension the relative weight of the center partition is 1/3, (b) in two dimensions it is 1/9, (c) and in three dimensions it is only 1/27. Very quickly the volume in the center partition becomes negligible compared to the neighboring volume.

1.3 The Geometry of High-Dimensional Spaces

One of the characteristic properties of high-dimensional spaces is that there is much more volume outside any given neighborhood than inside of it. Although this may at first appear strange, we can build intuition to demystify this behavior by considering a few simple cases.

For example, consider partitioning parameter space into rectangular boxes centered around the mode (Figure 1). In one dimension there are only two partitions neighboring the center partition, leaving a significant volume around the mode. Adding one more dimension, however, introduces eight neighboring partitions, and in three dimensions there are already 26. In general there are 3^D − 1 neighboring partitions in a D-dimensional space, and for even small D the volume neighboring the mode dominates the volume immediately around the mode. This pattern only amplifies if we consider more partitions further away from the mode that multiply even faster.
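A few lines of code restate this counting argument numerically; the three-partitions-per-dimension grid mirrors Figure 1, and the particular dimensions printed are illustrative choices.

```python
# A minimal sketch of the partitioning argument, assuming three partitions per
# dimension as in Figure 1; the dimensions chosen here are purely illustrative.
for D in [1, 2, 3, 10, 20]:
    neighbors = 3**D - 1            # partitions adjacent to the central one
    central_fraction = 1.0 / 3**D   # relative volume of the central partition
    print(f"D = {D:2d}: {neighbors} neighboring partitions, "
          f"central partition holds {central_fraction:.2e} of the volume")
```

Already in ten dimensions the central partition accounts for less than two parts in a hundred thousand of the total volume.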
Alternatively, we can take a spherical view of parameter space and consider the relative volume a distance δ inside and outside of a spherical shell (Figure 2). In one dimension the interior and exterior volumes are equal, but in two and three dimensions more and more volume concentrates on the outside of the shell. Centering the shell at the mode we can see once again that the volume in any neighborhood containing the mode becomes more and more negligible as the dimension of the parameter space increases.

Fig 2. The dominance of volume away from any point in parameter space can also be seen from a spherical perspective, where we consider the volume contained a radial distance δ both interior and exterior to a D-dimensional spherical shell, shown here with dashed lines. (a) In one dimension the spherical shell is a line and the volumes interior and exterior are equivalent. (b) In two dimensions the spherical shell becomes a circle and there is more volume immediately outside the shell than immediately inside. (c) The exterior volume grows even larger relative to the interior volume in three dimensions, where the spherical shell is now the surface of a sphere. In fact, with increasing dimension the exterior volume grows exponentially large relative to the interior volume, and very quickly the volume around the mode is dwarfed by the volume away from the mode.

Generically, then, volume is largest out in the tails of the target distribution away from the mode, and this disparity grows exponentially with the dimension of parameter space. Consequently, the massive volume over which we integrate can compensate to give a significant contribution to the target expectation despite the smaller density. In order to identify the neighborhoods that contribute most to expectations, we need to carefully balance the behavior of both the density and the volume.

1.4 The Geometry of High-Dimensional Probability Distributions

The neighborhood immediately around the mode features large densities, but in more than a few dimensions the small volume of that neighborhood prevents it from having much contribution to any expectation. On the other hand, the complementary neighborhood far away from the mode features a much larger volume, but the vanishing densities lead to similarly negligible contributions to expectations. The only significant contributions come from the neighborhood between these two extremes known as the typical set (Figure 3). Importantly, because probability densities and volumes transform oppositely under any reparameterization, the typical set is an invariant object that does not depend on the irrelevant details of any particular choice of parameters.

Fig 3. In high dimensions a probability density, π(q), will concentrate around its mode, but the volume over which we integrate that density, dq, is much larger away from the mode. Contributions to any expectation are determined by the product of density and volume, π(q) dq, which then concentrates in a nearly-singular neighborhood called the typical set (grey).
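The typical set is easy to see in a familiar example. The sketch below draws samples from a hypothetical D-dimensional standard Gaussian target and records how far each sample falls from the mode at the origin; the choice of target, the sample size, and the dimensions printed are all illustrative assumptions.

```python
# A minimal sketch, assuming a D-dimensional standard Gaussian target; the mode
# sits at the origin yet the samples concentrate at a distance of roughly sqrt(D).
import numpy as np

rng = np.random.default_rng(1)

for D in [1, 10, 100, 1000]:
    samples = rng.standard_normal(size=(10_000, D))  # draws from the target
    radii = np.linalg.norm(samples, axis=1)          # distance of each draw from the mode
    print(f"D = {D:4d}: radii concentrate around {radii.mean():7.2f} "
          f"(sd {radii.std():.2f}), while the mode sits at radius 0")
```

Even though the density is always largest at the mode, essentially no samples land near it once the dimension exceeds a handful; they cluster instead in the thin shell that constitutes the typical set.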
As the dimension of parameter space increases, the tension between the density and the volume grows and the regions where the density and volume are both large enough to yield a significant contribution become more and more narrow. Consequently the typical set becomes more singular with increasing dimension, a manifestation of concentration of measure. The immediate consequence of concentration of measure is that the only significant contributions to any expectation come from the typical set; evaluating the integrand outside of the typical set has negligible effect on expectations and hence is a waste of precious computational resources. In other words, we can accurately estimate expectations by averaging over the typical set instead of the entirety of parameter space. Consequently, in order to compute expectations efficiently, we have to be able to identify, and then focus our computational resources into, the typical set (Figure 4).

This helps to explain, for example, why brute force methods like naive quadrature scale so poorly with dimension. A grid of length N distributed uniformly in a D-dimensional parameter space requires N^D points and hence N^D evaluations of the integrand. Unless N is incredibly large, however, it is unlikely that any of these points will intersect the narrow typical set, and the exponentially-growing cost of averaging over the grid yields worse and worse approximations to expectations. In general, framing algorithms by how they quantify the typical set is a powerful way to quickly intuit how an algorithm will perform in practice.

Of course, understanding why we want to focus on the typical set is only the first step. How to construct an algorithm that can quantify the typical set of an arbitrary target distribution is another problem altogether. There are many strategies for this task, but one of the most generic, and hence most useful in applied practice, is Markov chain Monte Carlo (Robert and Casella, 1999; Brooks et al., 2011).

2. MARKOV CHAIN MONTE CARLO

Markov chain Monte Carlo uses a Markov chain to stochastically explore the typical set, generating a random grid across the region of high probability from which we can construct accurate expectation estimates. Given sufficient computational resources a properly designed Markov chain will eventually explore the typical set of any distribution. The more practical, and much more challenging question, however, is whether a given Markov chain will explore a typical set in the finite time available in a real analysis.

Fig 4. In high-dimensional parameter spaces probability mass, π(q) dq, and hence the dominant contributions to expectations, concentrates in a neighborhood called the typical set. In order to accurately estimate expectations we have to be able to identify where the typical set lies in parameter space so that we can focus our computational resources where they are most effective.

In this section I’ll introduce a more substantial definition of Markov chain Monte Carlo and discuss both its ideal and pathological behaviors. Finally we’ll consider how to implement Markov chain Monte Carlo in practice and see how fragile an endeavor it can be.
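As a preview of the estimators constructed in the next section, the sketch below runs a simple Markov chain whose transition preserves a hypothetical one-dimensional standard Gaussian target and averages a function over its history. The autoregressive transition, the target, and all tuning values are illustrative assumptions; they are emphatically not the Hamiltonian transitions developed later in this review.

```python
# A minimal sketch, assuming a one-dimensional standard Gaussian target and the
# autoregressive transition q' = rho * q + sqrt(1 - rho^2) * eps, eps ~ N(0, 1),
# which preserves that target; every numerical choice here is illustrative.
import numpy as np

rng = np.random.default_rng(0)
rho = 0.9                       # autocorrelation of the hypothetical transition
n_iterations = 50_000

q = 20.0                        # initialize far outside the typical set
history = []
for _ in range(n_iterations):
    q = rho * q + np.sqrt(1 - rho**2) * rng.standard_normal()
    history.append(q)

# Discard the early iterations during which the chain is still drifting towards
# the typical set, then average the target function over the remaining history.
history = np.array(history[1000:])
print("estimated E[q^2]:", np.mean(history**2))   # exact value is 1
```

No matter where the chain is initialized, it first relaxes into the typical set and thereafter its history behaves like the random grid described above, yielding an accurate estimate of the expectation.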
2.1 Estimating Expectations with Markov Chains

A Markov chain is a progression of points in parameter space generated by sequentially applying a random map known as a Markov transition. Alternatively, we can think of a Markov transition as a conditional probability density, T(q′ | q), defining to which point, q′, we are most likely to jump from the initial point, q (Figure 5).

An arbitrary Markov chain will simply wander through parameter space and will not be of any particular use in computing expectations. Something very special happens, however, if the Markov transition preserves the target distribution,

π(q) = ∫_Q dq′ π(q′) T(q | q′).

More intuitively, this condition implies that if we generated an ensemble of samples from the target distribution and applied the transition then we would get a new ensemble that was still distributed according to the target distribution.

Fig 5. (a) A Markov chain is a sequence of points in parameter space generated by a Markov transition density (green) that defines the probability of a new point given the current point. (b) Sampling from that distribution yields a new state in the Markov chain and a new distribution from which to sample. (c) Repeating this process generates a Markov chain that meanders through parameter space.

Fig 6. When a Markov transition (green) preserves the target distribution, it concentrates towards the typical set (red), no matter where it is applied. Consequently, the resulting Markov chain will drift into and then across the typical set regardless of its initial state, providing a powerful quantification of the typical set from which we can derive accurate expectation estimators.

So long as this condition holds, at every initial point the Markov transition will concentrate towards the typical set. Consequently, no matter where we begin in parameter space the corresponding Markov chain will eventually drift into, and then across, the typical set (Figure 6).

Given sufficient time, the history of the Markov chain, {q_0, ..., q_N}, denoted samples generated by the Markov chain, becomes a convenient quantification of the typical set. In particular, we can estimate expectations across the typical set, and hence expectations across the entire parameter space, by averaging the target function over this history,

f̂_N = (1/N) ∑_{n=0}^{N} f(q_n).

As we run the Markov chain for longer and longer, it will better explore the typical set and, up to some technicalities, these Markov chain Monte Carlo estimators will converge