Contents:

List of Figures ... xix
List of Algorithms ... xxi
List of Generative Stories ... xxiii
Preface ... xxv
Acknowledgments ... xxix

Preliminaries ... 1
  Probability Measures ... 1
  Random Variables ... 2
    Continuous and Discrete Random Variables ... 3
    Joint Distribution over Multiple Random Variables ... 4
  Conditional Distributions ... 5
    Bayes' Rule ... 7
    Independent and Conditionally Independent Random Variables ... 7
    Exchangeable Random Variables ... 8
  Expectations of Random Variables ... 9
  Models ... 11
    Parametric vs. Nonparametric Models ... 12
    Inference with Models ... 12
    Generative Models ... 14
    Independence Assumptions in Models ... 16
    Directed Graphical Models ... 17
  Learning from Data Scenarios ... 19
  Bayesian and Frequentist Philosophy (Tip of the Iceberg) ... 22
  Summary ... 23
  Exercises ... 24

Introduction ... 25
  Overview: Where Bayesian Statistics and NLP Meet ... 26
  First Example: The Latent Dirichlet Allocation Model ... 29
    The Dirichlet Distribution ... 34
    Inference ... 38
    Summary ... 39
  Second Example: Bayesian Text Regression ... 39
  Conclusion and Summary ... 41
  Exercises ... 42

Priors ... 43
  Conjugate Priors ... 44
    Conjugate Priors and Normalization Constants ... 47
    The Use of Conjugate Priors with Latent Variable Models ... 48
    Mixture of Conjugate Priors ... 49
    Renormalized Conjugate Distributions ... 51
    Discussion: To Be or Not To Be Conjugate? ... 52
    Summary ... 53
  Priors Over Multinomial and Categorical Distributions ... 53
    The Dirichlet Distribution Re-visited ... 54
    The Logistic Normal Distribution ... 58
    Discussion ... 64
    Summary ... 65
  Non-informative Priors ... 65
    Uniform and Improper Priors ... 66
    Jeffreys Prior ... 67
    Discussion ... 68
  Conjugacy and Exponential Models ... 69
  Multiple Parameter Draws in Models ... 70
  Structural Priors ... 72
  Conclusion and Summary ... 73
  Exercises ... 75

Bayesian Estimation ... 77
  Learning with Latent Variables: Two Views ... 78
  Bayesian Point Estimation ... 79
    Maximum a Posteriori Estimation ... 79
    Posterior Approximations Based on the MAP Solution ... 87
    Decision-theoretic Point Estimation ... 88
    Discussion and Summary ... 90
  Empirical Bayes ... 90
  Asymptotic Behavior of the Posterior ... 92
  Summary ... 93
  Exercises ... 94

Sampling Methods ... 95
  MCMC Algorithms: Overview ... 96
  NLP Model Structure for MCMC Inference ... 97
    Partitioning the Latent Variables ... 98
  Gibbs Sampling ... 99
    Collapsed Gibbs Sampling ... 102
    Operator View ... 106
    Parallelizing the Gibbs Sampler ... 109
    Summary ... 110
  The Metropolis-Hastings Algorithm ... 111
    Variants of Metropolis-Hastings ... 112
  Slice Sampling ... 113
    Auxiliary Variable Sampling ... 113
    The Use of Slice Sampling and Auxiliary Variable Sampling in NLP ... 115
  Simulated Annealing ... 116
  Convergence of MCMC Algorithms ... 116
  Markov Chain: Basic Theory ... 118
  Sampling Algorithms Not in the MCMC Realm ... 120
  Monte Carlo Integration ... 123
  Discussion ... 124
    Computability of Distribution vs. Sampling ... 124
    Nested MCMC Sampling ... 125
    Runtime of MCMC Samplers ... 125
    Particle Filtering ... 126
  Conclusion and Summary ... 127
  Exercises ... 129

Variational Inference ... 131
  Variational Bound on Marginal Log-likelihood ... 131
  Mean-field Approximation ... 134
  Mean-field Variational Inference Algorithm ... 135
    Dirichlet-multinomial Variational Inference ... 137
    Connection to the Expectation-maximization Algorithm ... 141
  Empirical Bayes with Variational Inference ... 143
  Discussion ... 144
    Initialization of the Inference Algorithms ... 144
    Convergence Diagnosis ... 145
    The Use of Variational Inference for Decoding ... 146
    Variational Inference as KL Divergence Minimization ... 147
    Online Variational Inference ... 147
  Summary ... 148
  Exercises ... 149

Nonparametric Priors ... 151
  The Dirichlet Process: Three Views ... 152
    The Stick-breaking Process ... 153
    The Chinese Restaurant Process ... 155
  Dirichlet Process Mixtures ... 157
    Inference with Dirichlet Process Mixtures ... 158
    Dirichlet Process Mixture as a Limit of Mixture Models ... 161
  The Hierarchical Dirichlet Process ... 161
  The Pitman-Yor Process ... 163
    Pitman-Yor Process for Language Modeling ... 165
    Power-law Behavior of the Pitman-Yor Process ... 166
  Discussion ... 167
    Gaussian Processes ... 168
    The Indian Buffet Process ... 168
    Nested Chinese Restaurant Process ... 169
    Distance-dependent Chinese Restaurant Process ... 169
    Sequence Memoizers ... 170
  Summary ... 171
  Exercises ... 172

Bayesian Grammar Models ... 173
  Bayesian Hidden Markov Models ... 174
    Hidden Markov Models with an Infinite State Space ... 175
  Probabilistic Context-free Grammars ... 177
    PCFGs as a Collection of Multinomials ... 180
    Basic Inference Algorithms for PCFGs ... 180
    Hidden Markov Models as PCFGs ... 184
  Bayesian Probabilistic Context-free Grammars ... 185
    Priors on PCFGs ... 185
    Monte Carlo Inference with Bayesian PCFGs ... 186
    Variational Inference with Bayesian PCFGs ... 187
  Adaptor Grammars ... 189
    Pitman-Yor Adaptor Grammars ... 190
    Stick-breaking View of PYAG ... 192
    Inference with PYAG ... 192
  Hierarchical Dirichlet Process PCFGs (HDP-PCFGs) ... 196
    Extensions to the HDP-PCFG Model ... 197
  Dependency Grammars ... 198
    State-split Nonparametric Dependency Models ... 198
  Synchronous Grammars ... 200
  Multilingual Learning ... 201
    Part-of-speech Tagging ... 203
    Grammar Induction ... 204
  Further Reading ... 205
  Summary ... 207
  Exercises ... 208

Closing Remarks ... 209

Basic Concepts ... 211
  Basic Concepts in Information Theory ... 211
    Entropy and Cross Entropy ... 211
    Kullback-Leibler Divergence ... 212
  Other Basic Concepts ... 212
    Jensen's Inequality ... 212
    Transformation of Continuous Random Variables ... 213
    The Expectation-maximization Algorithm ... 213

Distribution Catalog ... 215
  The Multinomial Distribution ... 215
  The Dirichlet Distribution ... 216
  The Poisson Distribution ... 217
  The Gamma Distribution ... 217
  The Multivariate Normal Distribution ... 218
  The Laplace Distribution ... 218
  The Logistic Normal Distribution ... 219
  The Inverse Wishart Distribution ... 220

Bibliography ... 221
Author's Biography ... 241
Index ... 243
Description: Natural language processing (NLP) went through a profound transformation in the mid-1980s, when it shifted to make heavy use of corpora and data-driven techniques to analyze language. Since then, the use of statistical techniques in NLP has evolved in several ways. One such example took place in the late 1990s and early 2000s, when full-fledged Bayesian machinery was introduced to NLP. This Bayesian approach to NLP has come to address various shortcomings of the frequentist approach and to enrich it, especially in the unsupervised setting, where statistical learning is done without target prediction examples.
We cover the methods and algorithms that are needed to fluently read Bayesian learning papers in NLP and to do research in the area. These methods and algorithms are partly borrowed from machine learning and statistics and partly developed "in-house" in NLP. We cover inference techniques such as Markov chain Monte Carlo sampling and variational inference, Bayesian estimation, and nonparametric modeling. We also cover fundamental concepts in Bayesian statistics such as prior distributions, conjugacy, and generative modeling. Finally, we cover some of the fundamental modeling techniques in NLP, such as grammar modeling, and their use with Bayesian analysis.