Foundations and Trends® in Machine Learning
Vol. 1, Nos. 1–2 (2008) 1–305
© 2008 M. J. Wainwright and M. I. Jordan
DOI: 10.1561/2200000001

Graphical Models, Exponential Families, and Variational Inference

Martin J. Wainwright¹ and Michael I. Jordan²

1 Department of Statistics, and Department of Electrical Engineering and Computer Science, University of California, Berkeley 94720, USA, [email protected]
2 Department of Statistics, and Department of Electrical Engineering and Computer Science, University of California, Berkeley 94720, USA, [email protected]

Abstract

The formalism of probabilistic graphical models provides a unifying framework for capturing complex dependencies among random variables, and building large-scale multivariate statistical models. Graphical models have become a focus of research in many statistical, computational and mathematical fields, including bioinformatics, communication theory, statistical physics, combinatorial optimization, signal and image processing, information retrieval and statistical machine learning. Many problems that arise in specific instances — including the key problems of computing marginals and modes of probability distributions — are best studied in the general setting. Working with exponential family representations, and exploiting the conjugate duality between the cumulant function and the entropy for exponential families, we develop general variational representations of the problems of computing likelihoods, marginal probabilities and most probable configurations. We describe how a wide variety of algorithms — among them sum-product, cluster variational methods, expectation-propagation, mean field methods, max-product and linear programming relaxation, as well as conic programming relaxations — can all be understood in terms of exact or approximate forms of these variational representations. The variational approach provides a complementary alternative to Markov chain Monte Carlo as a general source of approximation methods for inference in large-scale statistical models.

1 Introduction

Graphical models bring together graph theory and probability theory in a powerful formalism for multivariate statistical modeling. In various applied fields including bioinformatics, speech processing, image processing and control theory, statistical models have long been formulated in terms of graphs, and algorithms for computing basic statistical quantities such as likelihoods and score functions have often been expressed in terms of recursions operating on these graphs; examples include phylogenies, pedigrees, hidden Markov models, Markov random fields, and Kalman filters. These ideas can be understood, unified, and generalized within the formalism of graphical models. Indeed, graphical models provide a natural tool for formulating variations on these classical architectures, as well as for exploring entirely new families of statistical models. Accordingly, in fields that involve the study of large numbers of interacting variables, graphical models are increasingly in evidence.

Graph theory plays an important role in many computationally oriented fields, including combinatorial optimization, statistical physics, and economics. Beyond its use as a language for formulating models, graph theory also plays a fundamental role in assessing computational complexity and feasibility. In particular, the running time of an algorithm or the magnitude of an error bound can often be characterized in terms of structural properties of a graph.
This statement is also true in the context of graphical models. Indeed, as we discuss, the computational complexity of a fundamental method known as the junction tree algorithm — which generalizes many of the recursive algorithms on graphs cited above — can be characterized in terms of a natural graph-theoretic measure of interaction among variables. For suitably sparse graphs, the junction tree algorithm provides a systematic solution to the problem of computing likelihoods and other statistical quantities associated with a graphical model.

Unfortunately, many graphical models of practical interest are not "suitably sparse," so that the junction tree algorithm no longer provides a viable computational framework. One popular source of methods for attempting to cope with such cases is the Markov chain Monte Carlo (MCMC) framework, and indeed there is a significant literature on the application of MCMC methods to graphical models [e.g., 28, 93, 202]. Our focus in this survey is rather different: we present an alternative computational methodology for statistical inference that is based on variational methods. These techniques provide a general class of alternatives to MCMC, and have applications outside of the graphical model framework. As we will see, however, they are particularly natural in their application to graphical models, due to their relationships with the structural properties of graphs.

The phrase "variational" itself is an umbrella term that refers to various mathematical tools for optimization-based formulations of problems, as well as associated techniques for their solution. The general idea is to express a quantity of interest as the solution of an optimization problem. The optimization problem can then be "relaxed" in various ways, either by approximating the function to be optimized or by approximating the set over which the optimization takes place. Such relaxations, in turn, provide a means of approximating the original quantity of interest.

The roots of both MCMC methods and variational methods lie in statistical physics. Indeed, the successful deployment of MCMC methods in statistical physics motivated and predated their entry into statistics. However, the development of MCMC methodology specifically designed for statistical problems has played an important role in sparking widespread application of such methods in statistics [88]. A similar development in the case of variational methodology would be of significant interest. In our view, the most promising avenue toward a variational methodology tuned to statistics is to build on existing links between variational analysis and the exponential family of distributions [4, 11, 43, 74]. Indeed, the notions of convexity that lie at the heart of the statistical theory of the exponential family have immediate implications for the design of variational relaxations. Moreover, these variational relaxations have particularly interesting algorithmic consequences in the setting of graphical models, where they again lead to recursions on graphs.

Thus, we present a story with three interrelated themes. We begin in Section 2 with a discussion of graphical models, providing both an overview of the general mathematical framework, and also presenting several specific examples. All of these examples, as well as the majority of current applications of graphical models, involve distributions in the exponential family.
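As a compact preview of the machinery developed in Section 3, the following LaTeX sketch records the central variational principle in the exponential family setting; the symbols (θ for canonical parameters, φ for sufficient statistics, A for the cumulant function, and M for the set of realizable mean parameters) are introduced formally there.

    % Exponential family: p_\theta(x) = \exp\{\langle\theta, \phi(x)\rangle - A(\theta)\}.
    % Conjugate duality between the cumulant function A and the entropy
    % yields the variational representation developed in Section 3:
    \[
      A(\theta) \;=\; \sup_{\mu \in \mathcal{M}}
        \bigl\{ \langle \theta, \mu \rangle - A^{*}(\mu) \bigr\},
    \]
    % where the conjugate dual A^* coincides with the negative entropy on
    % the interior of \mathcal{M}. Approximating \mathcal{M}, or A^*, or
    % both yields the mean field, Bethe, and convex relaxations surveyed
    % in later sections.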
Accordingly, Section 3 is devoted to a discussion of exponential families, focusing on the mathematical links to convex analysis, and thus anticipating our development of variational methods. In particular, the principal object of interest in our exposition is a certain conjugate dual relation associated with exponential families. From this foundation of conjugate duality, we develop a general variational representation for computing likelihoods and marginal probabilities in exponential families. Subsequent sections are devoted to the exploration of various instantiations of this variational principle, both in exact and approximate forms, which in turn yield various algorithms for computing exact and approximate marginal probabilities, respectively. In Section 4, we discuss the connection between the Bethe approximation and the sum-product algorithm, including both its exact form for trees and approximate form for graphs with cycles. We also develop the connections between Bethe-like approximations and other algorithms, including generalized sum-product, expectation-propagation and related moment-matching methods. In Section 5, we discuss the class of mean field methods, which arise from a qualitatively different approximation to the exact variational principle, with the added benefit of generating lower bounds on the likelihood. In Section 6, we discuss the role of variational methods in parameter estimation, including both the fully observed and partially observed cases, as well as both frequentist and Bayesian settings. Both Bethe-type and mean field methods are based on nonconvex optimization problems, which typically have multiple solutions. In contrast, Section 7 discusses variational methods based on convex relaxations of the exact variational principle, many of which are also guaranteed to yield upper bounds on the log likelihood. Section 8 is devoted to the problem of mode computation, with particular emphasis on the case of discrete random variables, in which context computing the mode requires solving an integer programming problem. We develop connections between (reweighted) max-product algorithms and hierarchies of linear programming relaxations. In Section 9, we discuss the broader class of conic programming relaxations, and show how they can be understood in terms of semidefinite constraints imposed via moment matrices. We conclude with a discussion in Section 10.

The scope of this survey is limited in the following sense: given a distribution represented as a graphical model, we are concerned with the problem of computing marginal probabilities (including likelihoods), as well as the problem of computing modes. We refer to such computational tasks as problems of "probabilistic inference," or "inference" for short. As with presentations of MCMC methods, such a limited focus may appear to aim most directly at applications in Bayesian statistics. While Bayesian statistics is indeed a natural terrain for deploying many of the methods that we present here, we see these methods as having applications throughout statistics, within both the frequentist and Bayesian paradigms, and we indicate some of these applications at various junctures in the survey.

2 Background

We begin with background on graphical models. The key idea is that of factorization: a graphical model consists of a collection of probability distributions that factorize according to the structure of an underlying graph.
Here, we are using the terminology "distribution" loosely; our notation p should be understood as a mass function (density with respect to counting measure) in the discrete case, and a density with respect to Lebesgue measure in the continuous case. Our focus in this section is the interplay between probabilistic notions such as conditional independence on one hand, and on the other hand, graph-theoretic notions such as cliques and separation.

2.1 Probability Distributions on Graphs

We begin by describing the various types of graphical formalisms that are useful. A graph G = (V, E) is formed by a collection of vertices V = {1, 2, ..., m}, and a collection of edges E ⊂ V × V. Each edge consists of a pair of vertices (s, t) ∈ E, and may either be undirected, in which case there is no distinction between edge (s, t) and edge (t, s), or directed, in which case we write (s → t) to indicate the direction. See Appendix A.1 for more background on graphs and their properties.

In order to define a graphical model, we associate with each vertex s ∈ V a random variable X_s taking values in some space X_s. Depending on the application, this state space X_s may either be continuous (e.g., X_s = R) or discrete (e.g., X_s = {0, 1, ..., r − 1}). We use lowercase letters (e.g., x_s ∈ X_s) to denote particular elements of X_s, so that the notation {X_s = x_s} corresponds to the event that the random variable X_s takes the value x_s ∈ X_s. For any subset A of the vertex set V, we define the subvector X_A := (X_s, s ∈ A), with the notation x_A := (x_s, s ∈ A) corresponding to the analogous quantity for values of the random vector X_A. Similarly, we define ⊗_{s∈A} X_s to be the Cartesian product of the state spaces for each of the elements of X_A.

2.1.1 Directed Graphical Models

Given a directed graph with edges (s → t), we say that t is a child of s, or conversely, that s is a parent of t. For any vertex s ∈ V, let π(s) denote the set of all parents of a given node s ∈ V. (If a vertex s has no parents, then the set π(s) should be understood to be empty.) A directed cycle is a sequence (s_1, s_2, ..., s_k) such that (s_i → s_{i+1}) ∈ E for all i = 1, ..., k − 1, and (s_k → s_1) ∈ E. See Figure 2.1 for an illustration of these concepts.

[Fig. 2.1 (a) A simple directed graphical model with four variables (X_1, X_2, X_3, X_4). Vertices {1, 2, 3} are all parents of vertex 4, written π(4) = {1, 2, 3}. (b) A more complicated directed acyclic graph (DAG) that defines a partial order on its vertices. Note that vertex 6 is a child of vertex 2, and vertex 1 is an ancestor of 6. (c) A forbidden directed graph (nonacyclic) that includes the directed cycle (2 → 4 → 5 → 2).]

Now suppose that G is a directed acyclic graph (DAG), meaning that every edge is directed, and that the graph contains no directed cycles. For any such DAG, we can define a partial order on the vertex set V by the notion of ancestry: we say that node s is an ancestor of u if there is a directed path (s, t_1, t_2, ..., t_k, u) (see Figure 2.1(b)). Given a DAG, for each vertex s and its parent set π(s), let p_s(x_s | x_{π(s)}) denote a nonnegative function over the variables (x_s, x_{π(s)}), normalized such that ∫ p_s(x_s | x_{π(s)}) dx_s = 1. In terms of these local functions, a directed graphical model consists of a collection of probability distributions (densities or mass functions) that factorize in the following way:

    p(x_1, x_2, ..., x_m) = ∏_{s∈V} p_s(x_s | x_{π(s)}).    (2.1)
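As a concrete illustration of the factorization (2.1), the following Python sketch evaluates the joint probability for a toy three-node chain DAG 1 → 2 → 3 with binary variables; the chain structure and all table entries are hypothetical numbers chosen for illustration, not taken from the survey.

    # Sketch: the directed factorization (2.1) on a toy chain DAG 1 -> 2 -> 3.
    # Each row of a conditional table p_s(. | x_parents) sums to 1.

    p1 = {0: 0.6, 1: 0.4}                                 # p_1(x_1), pi(1) empty
    p2 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}       # p_2(x_2 | x_1)
    p3 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}       # p_3(x_3 | x_2)

    def joint(x1, x2, x3):
        """p(x_1, x_2, x_3) = p_1(x_1) * p_2(x_2 | x_1) * p_3(x_3 | x_2)."""
        return p1[x1] * p2[x1][x2] * p3[x2][x3]

    # Because each local conditional is normalized, the product in (2.1)
    # is automatically a normalized distribution: the sum over all 2^3
    # configurations is 1, with no global normalization constant needed.
    total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
    assert abs(total - 1.0) < 1e-12
    print(joint(0, 1, 1))  # probability of one configuration: 0.6 * 0.3 * 0.5

This automatic normalization is a distinguishing feature of the directed factorization, and stands in contrast to the undirected case discussed next.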
It can be verified that our use of notation is consistent, in that p_s(x_s | x_{π(s)}) is, in fact, the conditional probability of {X_s = x_s} given {X_{π(s)} = x_{π(s)}} for the distribution p(·) defined by the factorization (2.1). This follows by an inductive argument that makes use of the normalization condition imposed on p_s(·), and the partial ordering induced by the ancestor relationship of the DAG.

2.1.2 Undirected Graphical Models

In the undirected case, the probability distribution factorizes according to functions defined on the cliques of the graph. A clique C is a fully connected subset of the vertex set V, meaning that (s, t) ∈ E for all s, t ∈ C. Let us associate with each clique C a compatibility function ψ_C : (⊗_{s∈C} X_s) → R_+. Recall that ⊗_{s∈C} X_s denotes the Cartesian product of the state spaces of the random vector X_C. Consequently, the compatibility function ψ_C is a local quantity, defined only for elements x_C within the clique.

With this notation, an undirected graphical model — also known as a Markov random field (MRF), or a Gibbs distribution — is a collection of distributions that factorize as

    p(x_1, x_2, ..., x_m) = (1/Z) ∏_{C∈C} ψ_C(x_C),    (2.2)

where Z is a constant chosen to ensure that the distribution is normalized. The set C is often taken to be the set of all maximal cliques of the graph, i.e., the set of cliques that are not properly contained within any other clique. This condition can be imposed without loss of generality, because any representation based on nonmaximal cliques can always be converted to one based on maximal cliques by redefining the compatibility function on a maximal clique to be the product over the compatibility functions on the subsets of that clique. However, there may be computational benefits to using a nonmaximal representation — in particular, algorithms may be able to exploit specific features of the factorization special to the nonmaximal representation. For this reason, we do not necessarily restrict compatibility functions to maximal cliques only, but instead define the set C to contain all cliques. (Factor graphs, discussed in the following section, allow for a finer-grained specification of factorization properties.)

It is important to understand that for a general undirected graph the compatibility functions ψ_C need not have any obvious or direct relation to marginal or conditional distributions defined over the graph cliques. This property should be contrasted with the directed factorization (2.1), where the factors correspond to conditional probabilities over the child-parent sets.

2.1.3 Factor Graphs

For large graphs, the factorization properties of a graphical model, whether undirected or directed, may be difficult to visualize from the usual depictions of graphs. The formalism of factor graphs provides an alternative graphical representation, one which emphasizes the factorization of the distribution [143, 157].

Let F represent an index set for the set of factors defining a graphical model distribution. In the undirected case, this set indexes the collection C of cliques, while in the directed case F indexes the set of parent–child neighborhoods. We then consider a bipartite graph G' = (V, F, E'), where V is the original set of vertices, and E' is a new edge set, joining only vertices s ∈ V to factors a ∈ F. In particular, edge (s, a) ∈ E' if and only if x_s participates in the factor indexed by a ∈ F. See Figure 2.2(b) for an illustration.
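To make the role of the normalization constant Z in (2.2) concrete, here is a minimal Python sketch that computes Z by brute-force enumeration for a three-node cycle with pairwise cliques; the compatibility values are arbitrary positive numbers chosen for illustration. This exhaustive sum grows exponentially in the number of variables, which is precisely why computing Z is a central computational challenge for undirected models.

    from itertools import product

    # Sketch: brute-force normalization of the undirected factorization (2.2)
    # on a 3-node cycle with binary variables and pairwise cliques
    # C = {(1,2), (2,3), (1,3)}. The psi values are arbitrary toy numbers.
    psi = {
        (1, 2): {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0},
        (2, 3): {(0, 0): 1.5, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 1.0},
        (1, 3): {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 2.0},
    }

    def unnormalized(x):
        """Product of clique compatibility functions, prod_C psi_C(x_C)."""
        out = 1.0
        for (s, t), table in psi.items():
            out *= table[(x[s - 1], x[t - 1])]
        return out

    # Z is the sum of the unnormalized product over all 2^3 configurations;
    # unlike the directed case, the psi_C do not normalize themselves.
    Z = sum(unnormalized(x) for x in product((0, 1), repeat=3))
    p = lambda x: unnormalized(x) / Z  # the normalized distribution (2.2)
    print(Z, p((1, 1, 0)))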
For undirected models, the factor graph representation is of particular value when C consists of more than the maximal cliques. Indeed, the compatibility functions for the nonmaximal cliques do not have an explicit representation in the undirected graph itself; the factor graph makes this finer-grained factorization explicit.
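Continuing the bipartite picture above, this short Python sketch builds the factor-graph edge set E' by connecting each variable vertex s to every factor a whose scope contains it, following the rule that (s, a) ∈ E' if and only if x_s participates in factor a. The factor names and scopes are hypothetical, reusing the pairwise cliques of the three-node example.

    # Sketch: constructing the bipartite factor graph G' = (V, F, E').
    # Each factor is identified by its scope (the variables it touches);
    # edge (s, a) is present iff x_s participates in factor a.
    scopes = {"a": (1, 2), "b": (2, 3), "c": (1, 3)}  # F = {a, b, c}
    V = {1, 2, 3}

    edges = [(s, a) for a, scope in scopes.items() for s in scope]
    print(sorted(edges))
    # [(1, 'a'), (1, 'c'), (2, 'a'), (2, 'b'), (3, 'b'), (3, 'c')]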
