
Introduction to Statistical Machine Learning

524 pages · 2016 · 12.49 MB · English

Preview: Introduction to Statistical Machine Learning

Introduction to Statistical Machine Learning
Masashi Sugiyama

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Morgan Kaufmann Publishers is an Imprint of Elsevier

Acquiring Editor: Todd Green
Editorial Project Manager: Amy Invernizzi
Project Manager: Mohanambal Natarajan
Designer: Maria Ines Cruz

Morgan Kaufmann is an imprint of Elsevier
225 Wyman Street, Waltham, MA 02451, USA

Copyright © 2016 by Elsevier Inc. All rights of reproduction in any form reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods or professional practices may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information or methods described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
ISBN: 978-0-12-802121-7

For information on all Morgan Kaufmann publications visit our website at www.mkp.com

Contents

Biography ... xxxi
Preface ... xxxiii

PART 1 INTRODUCTION

CHAPTER 1 Statistical Machine Learning ... 3
  1.1 Types of Learning ... 3
  1.2 Examples of Machine Learning Tasks ... 4
    1.2.1 Supervised Learning ... 4
    1.2.2 Unsupervised Learning ... 5
    1.2.3 Further Topics ... 6
  1.3 Structure of This Textbook ... 8

PART 2 STATISTICS AND PROBABILITY

CHAPTER 2 Random Variables and Probability Distributions ... 11
  2.1 Mathematical Preliminaries ... 11
  2.2 Probability ... 13
  2.3 Random Variable and Probability Distribution ... 14
  2.4 Properties of Probability Distributions ... 16
    2.4.1 Expectation, Median, and Mode ... 16
    2.4.2 Variance and Standard Deviation ... 18
    2.4.3 Skewness, Kurtosis, and Moments ... 19
  2.5 Transformation of Random Variables ... 22

CHAPTER 3 Examples of Discrete Probability Distributions ... 25
  3.1 Discrete Uniform Distribution ... 25
  3.2 Binomial Distribution ... 26
  3.3 Hypergeometric Distribution ... 27
  3.4 Poisson Distribution ... 31
  3.5 Negative Binomial Distribution ... 33
  3.6 Geometric Distribution ... 35

CHAPTER 4 Examples of Continuous Probability Distributions ... 37
  4.1 Continuous Uniform Distribution ... 37
  4.2 Normal Distribution ... 37
  4.3 Gamma Distribution, Exponential Distribution, and Chi-Squared Distribution ... 41
  4.4 Beta Distribution ... 44
  4.5 Cauchy Distribution and Laplace Distribution ... 47
  4.6 t-Distribution and F-Distribution ... 49

CHAPTER 5 Multidimensional Probability Distributions ... 51
  5.1 Joint Probability Distribution ... 51
  5.2 Conditional Probability Distribution ... 52
  5.3 Contingency Table ... 53
  5.4 Bayes' Theorem ... 53
  5.5 Covariance and Correlation ... 55
  5.6 Independence ... 56

CHAPTER 6 Examples of Multidimensional Probability Distributions ... 61
  6.1 Multinomial Distribution ... 61
  6.2 Multivariate Normal Distribution ... 62
  6.3 Dirichlet Distribution ... 63
  6.4 Wishart Distribution ... 70

CHAPTER 7 Sum of Independent Random Variables ... 73
  7.1 Convolution ... 73
  7.2 Reproductive Property ... 74
  7.3 Law of Large Numbers ... 74
  7.4 Central Limit Theorem ... 77

CHAPTER 8 Probability Inequalities ... 81
  8.1 Union Bound ... 81
  8.2 Inequalities for Probabilities ... 82
    8.2.1 Markov's Inequality and Chernoff's Inequality ... 82
    8.2.2 Cantelli's Inequality and Chebyshev's Inequality ... 83
  8.3 Inequalities for Expectation ... 84
    8.3.1 Jensen's Inequality ... 84
    8.3.2 Hölder's Inequality and Schwarz's Inequality ... 85
    8.3.3 Minkowski's Inequality ... 86
    8.3.4 Kantorovich's Inequality ... 87
  8.4 Inequalities for the Sum of Independent Random Variables ... 87
    8.4.1 Chebyshev's Inequality and Chernoff's Inequality ... 88
    8.4.2 Hoeffding's Inequality and Bernstein's Inequality ... 88
    8.4.3 Bennett's Inequality ... 89

CHAPTER 9 Statistical Estimation ... 91
  9.1 Fundamentals of Statistical Estimation ... 91
  9.2 Point Estimation ... 92
    9.2.1 Parametric Density Estimation ... 92
    9.2.2 Nonparametric Density Estimation ... 93
    9.2.3 Regression and Classification ... 93
    9.2.4 Model Selection ... 94
  9.3 Interval Estimation ... 95
    9.3.1 Interval Estimation for Expectation of Normal Samples ... 95
    9.3.2 Bootstrap Confidence Interval ... 96
    9.3.3 Bayesian Credible Interval ... 97

CHAPTER 10 Hypothesis Testing ... 99
  10.1 Fundamentals of Hypothesis Testing ... 99
  10.2 Test for Expectation of Normal Samples ... 100
  10.3 Neyman-Pearson Lemma ... 101
  10.4 Test for Contingency Tables ... 102
  10.5 Test for Difference in Expectations of Normal Samples ... 104
    10.5.1 Two Samples without Correspondence ... 104
    10.5.2 Two Samples with Correspondence ... 105
  10.6 Nonparametric Test for Ranks ... 107
    10.6.1 Two Samples without Correspondence ... 107
    10.6.2 Two Samples with Correspondence ... 108
  10.7 Monte Carlo Test ... 108

PART 3 GENERATIVE APPROACH TO STATISTICAL PATTERN RECOGNITION

CHAPTER 11 Pattern Recognition via Generative Model Estimation ... 113
  11.1 Formulation of Pattern Recognition ... 113
  11.2 Statistical Pattern Recognition ... 115
  11.3 Criteria for Classifier Training ... 117
    11.3.1 MAP Rule ... 117
    11.3.2 Minimum Misclassification Rate Rule ... 118
    11.3.3 Bayes Decision Rule ... 119
    11.3.4 Discussion ... 121
  11.4 Generative and Discriminative Approaches ... 121

CHAPTER 12 Maximum Likelihood Estimation ... 123
  12.1 Definition ... 123
  12.2 Gaussian Model ... 125
  12.3 Computing the Class-Posterior Probability ... 127
  12.4 Fisher's Linear Discriminant Analysis (FDA) ... 130
  12.5 Hand-Written Digit Recognition ... 133
    12.5.1 Preparation ... 134
    12.5.2 Implementing Linear Discriminant Analysis ... 135
    12.5.3 Multiclass Classification ... 136

CHAPTER 13 Properties of Maximum Likelihood Estimation ... 139
  13.1 Consistency ... 139
  13.2 Asymptotic Unbiasedness ... 140
  13.3 Asymptotic Efficiency ... 141
    13.3.1 One-Dimensional Case ... 141
    13.3.2 Multidimensional Cases ... 141
  13.4 Asymptotic Normality ... 143
  13.5 Summary ... 145

CHAPTER 14 Model Selection for Maximum Likelihood Estimation ... 147
  14.1 Model Selection ... 147
  14.2 KL Divergence ... 148
  14.3 AIC ... 150
  14.4 Cross Validation ... 154
  14.5 Discussion ... 154

CHAPTER 15 Maximum Likelihood Estimation for Gaussian Mixture Model ... 157
  15.1 Gaussian Mixture Model ... 157
  15.2 MLE ... 158
  15.3 Gradient Ascent Algorithm ... 161
  15.4 EM Algorithm ... 162

CHAPTER 16 Nonparametric Estimation ... 169
  16.1 Histogram Method ... 169
  16.2 Problem Formulation ... 170
  16.3 KDE ... 174
    16.3.1 Parzen Window Method ... 174
    16.3.2 Smoothing with Kernels ... 175
    16.3.3 Bandwidth Selection ... 176
  16.4 NNDE ... 178
    16.4.1 Nearest Neighbor Distance ... 178
    16.4.2 Nearest Neighbor Classifier ... 179

CHAPTER 17 Bayesian Inference ... 185
  17.1 Bayesian Predictive Distribution ... 185
    17.1.1 Definition ... 185
    17.1.2 Comparison with MLE ... 186
    17.1.3 Computational Issues ... 188
  17.2 Conjugate Prior ... 188
  17.3 MAP Estimation ... 189
  17.4 Bayesian Model Selection ... 193

CHAPTER 18 Analytic Approximation of Marginal Likelihood ... 197
  18.1 Laplace Approximation ... 197
    18.1.1 Approximation with Gaussian Density ... 197
    18.1.2 Illustration ... 199
    18.1.3 Application to Marginal Likelihood Approximation ... 200
    18.1.4 Bayesian Information Criterion (BIC) ... 200
  18.2 Variational Approximation ... 202
    18.2.1 Variational Bayesian EM (VBEM) Algorithm ... 202
    18.2.2 Relation to Ordinary EM Algorithm ... 203

CHAPTER 19 Numerical Approximation of Predictive Distribution ... 205
  19.1 Monte Carlo Integration ... 205
  19.2 Importance Sampling ... 207
  19.3 Sampling Algorithms ... 208
    19.3.1 Inverse Transform Sampling ... 208
    19.3.2 Rejection Sampling ... 212
    19.3.3 Markov Chain Monte Carlo (MCMC) Method ... 214

CHAPTER 20 Bayesian Mixture Models ... 221
  20.1 Gaussian Mixture Models ... 221
    20.1.1 Bayesian Formulation ... 221
    20.1.2 Variational Inference ... 223
    20.1.3 Gibbs Sampling ... 228
  20.2 Latent Dirichlet Allocation (LDA) ... 229
    20.2.1 Topic Models ... 230
    20.2.2 Bayesian Formulation ... 231
    20.2.3 Gibbs Sampling ... 232

PART 4 DISCRIMINATIVE APPROACH TO STATISTICAL MACHINE LEARNING

CHAPTER 21 Learning Models ... 237
  21.1 Linear-in-Parameter Model ... 237
  21.2 Kernel Model ... 239
  21.3 Hierarchical Model ... 242

CHAPTER 22 Least Squares Regression ... 245
  22.1 Method of LS ... 245
  22.2 Solution for Linear-in-Parameter Model ... 246
  22.3 Properties of LS Solution ... 250
  22.4 Learning Algorithm for Large-Scale Data ... 251
  22.5 Learning Algorithm for Hierarchical Model ... 252

CHAPTER 23 Constrained LS Regression ... 257
  23.1 Subspace-Constrained LS ... 257
  23.2 ℓ₂-Constrained LS ... 259
  23.3 Model Selection ... 262

CHAPTER 24 Sparse Regression ... 267
  24.1 ℓ₁-Constrained LS ... 267
  24.2 Solving ℓ₁-Constrained LS ... 268
  24.3 Feature Selection by Sparse Learning ... 272
  24.4 Various Extensions ... 272
    24.4.1 Generalized ℓ₁-Constrained LS ... 273
    24.4.2 ℓₚ-Constrained LS ... 273
    24.4.3 ℓ₁+ℓ₂-Constrained LS ... 274
    24.4.4 ℓ₁,₂-Constrained LS ... 276
    24.4.5 Trace Norm Constrained LS ... 278

CHAPTER 25 Robust Regression ... 279
  25.1 Nonrobustness of ℓ₂-Loss Minimization ... 279
  25.2 ℓ₁-Loss Minimization ... 280
  25.3 Huber Loss Minimization ... 282
    25.3.1 Definition ... 282
    25.3.2 Stochastic Gradient Algorithm ... 283
    25.3.3 Iteratively Reweighted LS ... 283
    25.3.4 ℓ₁-Constrained Huber Loss Minimization ... 286
  25.4 Tukey Loss Minimization ... 290

CHAPTER 26 Least Squares Classification ... 295
  26.1 Classification by LS Regression ... 295
  26.2 0/1-Loss and Margin ... 297
  26.3 Multiclass Classification ... 300

CHAPTER 27 Support Vector Classification ... 303
  27.1 Maximum Margin Classification ... 303
    27.1.1 Hard Margin Support Vector Classification ... 303
    27.1.2 Soft Margin Support Vector Classification ... 305
  27.2 Dual Optimization of Support Vector Classification ... 306
  27.3 Sparseness of Dual Solution ... 308
  27.4 Nonlinearization by Kernel Trick ... 311
  27.5 Multiclass Extension ... 312
  27.6 Loss Minimization View ... 314
    27.6.1 Hinge Loss Minimization ... 315
    27.6.2 Squared Hinge Loss Minimization ... 316
    27.6.3 Ramp Loss Minimization ... 318

CHAPTER 28 Probabilistic Classification ... 321
  28.1 Logistic Regression ... 321
    28.1.1 Logistic Model and MLE ... 321
    28.1.2 Loss Minimization View ... 324
