Machine Learning for Text

505 pages · 2018 · English

Preview Machine Learning for Text

Charu C. Aggarwal
Machine Learning for Text

Charu C. Aggarwal
IBM T. J. Watson Research Center
Yorktown Heights, NY, USA

ISBN 978-3-319-73530-6
ISBN 978-3-319-73531-3 (eBook)
https://doi.org/10.1007/978-3-319-73531-3
Library of Congress Control Number: 2018932755
© Springer International Publishing AG, part of Springer Nature 2018

Preface

"If it is true that there is always more than one way of construing a text, it is not true that all interpretations are equal." – Paul Ricoeur

The rich area of text analytics draws ideas from information retrieval, machine learning, and natural language processing. Each of these areas is an active and vibrant field in its own right, and numerous books have been written in each of them. Many of these books cover some aspects of text analytics, but none covers all the areas that a book on learning from text is expected to cover. A need therefore exists for a focused book on machine learning from text. This book is a first attempt to integrate the complexities of machine learning, information retrieval, and natural language processing in a holistic way, in order to create a coherent and integrated treatment of the area. The chapters are therefore divided into three categories:

1. Fundamental algorithms and models: Many fundamental applications in text analytics, such as matrix factorization, clustering, and classification, have uses in domains beyond text. Nevertheless, these methods need to be tailored to the specialized characteristics of text. Chapters 1 through 8 discuss core analytical methods in the context of machine learning from text.

2. Information retrieval and ranking: Many aspects of information retrieval and ranking are closely related to text analytics. For example, ranking SVMs and link-based ranking are often used for learning from text. Chapter 9 provides an overview of information retrieval methods from the point of view of text mining.

3. Sequence- and natural language-centric text mining: Although multidimensional representations can be used for basic applications in text analytics, the true richness of the text representation can be leveraged by treating text as sequences. Chapters 10 through 14 discuss these advanced topics, such as sequence embedding, deep learning, information extraction, summarization, opinion mining, text segmentation, and event extraction.

Because of the diversity of topics covered in this book, some careful decisions have been made on the scope of coverage. A complicating factor is that many machine learning techniques depend on the use of basic natural language processing and information retrieval methodologies. This is particularly true of the sequence-centric approaches discussed in Chapters 10 through 14, which are more closely related to natural language processing. Examples of analytical methods that rely on natural language processing include information extraction, event extraction, opinion mining, and text summarization, which frequently leverage basic natural language processing tools like linguistic parsing or part-of-speech tagging. Needless to say, natural language processing is a full-fledged field in its own right, with excellent books dedicated to it. Therefore, a question arises as to how much discussion should be provided on techniques that lie at the interface of natural language processing and text mining without deviating from the primary scope of this book. Our general principle in making these choices has been to focus on the mining and machine learning aspects.
If a specific natural language processing or information retrieval method (e.g., part-of-speech tagging) is not directly about text analytics, we illustrate how to use such techniques as black boxes rather than discussing their internal algorithmic details. Basic techniques like part-of-speech tagging have matured in algorithmic development and have been commoditized to the extent that many open-source tools are available with little difference in relative performance. Therefore, we provide only working definitions of such concepts in the book, and the primary focus is on their utility as off-the-shelf tools in mining-centric settings. Each chapter provides pointers to the relevant books and open-source software in order to offer additional help to the student and practitioner.

The book is written for graduate students, researchers, and practitioners. The exposition has been simplified to a large extent, so that a graduate student with a reasonable understanding of linear algebra and probability theory can follow the book easily. Numerous exercises are available, along with a solution manual to aid classroom teaching.

Throughout this book, a vector or a multidimensional data point is annotated with a bar, such as X̄ or ȳ. A vector or multidimensional point may be denoted by either lowercase or uppercase letters, as long as it has a bar. Vector dot products are denoted by centered dots, such as X̄ · Ȳ. A matrix is denoted in uppercase letters without a bar, such as R. Throughout the book, the n × d document-term matrix is denoted by D, with n documents and d dimensions. The individual documents in D are therefore represented as d-dimensional row vectors, which are the bag-of-words representations. On the other hand, vectors with one component for each data point are usually n-dimensional column vectors. An example is the n-dimensional column vector ȳ of class variables of n data points.
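For concreteness, here is a small hypothetical instance of this notation with n = 3 documents and d = 4 terms; the specific entries and labels are illustrative only and do not appear in the book:

\[
% Hypothetical document-term matrix with n = 3 documents and d = 4 terms (rows are bag-of-words vectors).
D =
\left(\begin{array}{cccc}
2 & 0 & 1 & 0 \\
0 & 1 & 0 & 3 \\
1 & 1 & 0 & 0
\end{array}\right),
\qquad
\overline{y} =
\left(\begin{array}{c} +1 \\ -1 \\ +1 \end{array}\right),
\qquad
\overline{X}_1 \cdot \overline{X}_3 = 2 \cdot 1 + 0 \cdot 1 + 1 \cdot 0 + 0 \cdot 0 = 2.
\]

Here the three rows of D are the bag-of-words row vectors of the three documents, ȳ is the n-dimensional column vector of their class labels, and the dot product of the first and third document vectors is computed component by component, following the X̄ · Ȳ convention described above.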
Yorktown Heights, NY, USA
Charu C. Aggarwal

Contents

1 Machine Learning for Text: An Introduction
  1.1 Introduction
    1.1.1 Chapter Organization
  1.2 What Is Special About Learning from Text?
  1.3 Analytical Models for Text
    1.3.1 Text Preprocessing and Similarity Computation
    1.3.2 Dimensionality Reduction and Matrix Factorization
    1.3.3 Text Clustering
      1.3.3.1 Deterministic and Probabilistic Matrix Factorization Methods
      1.3.3.2 Probabilistic Mixture Models of Documents
      1.3.3.3 Similarity-Based Algorithms
      1.3.3.4 Advanced Methods
    1.3.4 Text Classification and Regression Modeling
      1.3.4.1 Decision Trees
      1.3.4.2 Rule-Based Classifiers
      1.3.4.3 Naïve Bayes Classifier
      1.3.4.4 Nearest Neighbor Classifiers
      1.3.4.5 Linear Classifiers
      1.3.4.6 Broader Topics in Classification
    1.3.5 Joint Analysis of Text with Heterogeneous Data
    1.3.6 Information Retrieval and Web Search
    1.3.7 Sequential Language Modeling and Embeddings
    1.3.8 Text Summarization
    1.3.9 Information Extraction
    1.3.10 Opinion Mining and Sentiment Analysis
    1.3.11 Text Segmentation and Event Detection
  1.4 Summary
  1.5 Bibliographic Notes
    1.5.1 Software Resources
  1.6 Exercises

2 Text Preparation and Similarity Computation
  2.1 Introduction
    2.1.1 Chapter Organization
  2.2 Raw Text Extraction and Tokenization
    2.2.1 Web-Specific Issues in Text Extraction
  2.3 Extracting Terms from Tokens
    2.3.1 Stop-Word Removal
    2.3.2 Hyphens
    2.3.3 Case Folding
    2.3.4 Usage-Based Consolidation
    2.3.5 Stemming
  2.4 Vector Space Representation and Normalization
  2.5 Similarity Computation in Text
    2.5.1 Is idf Normalization and Stemming Always Useful?
  2.6 Summary
  2.7 Bibliographic Notes
    2.7.1 Software Resources
  2.8 Exercises

3 Matrix Factorization and Topic Modeling
  3.1 Introduction
    3.1.1 Chapter Organization
    3.1.2 Normalizing a Two-Way Factorization into a Standardized Three-Way Factorization
  3.2 Singular Value Decomposition
    3.2.1 Example of SVD
    3.2.2 The Power Method of Implementing SVD
    3.2.3 Applications of SVD/LSA
    3.2.4 Advantages and Disadvantages of SVD/LSA
  3.3 Nonnegative Matrix Factorization
    3.3.1 Interpretability of Nonnegative Matrix Factorization
    3.3.2 Example of Nonnegative Matrix Factorization
    3.3.3 Folding in New Documents
    3.3.4 Advantages and Disadvantages of Nonnegative Matrix Factorization
  3.4 Probabilistic Latent Semantic Analysis
    3.4.1 Connections with Nonnegative Matrix Factorization
    3.4.2 Comparison with SVD
    3.4.3 Example of PLSA
    3.4.4 Advantages and Disadvantages of PLSA
  3.5 A Bird's Eye View of Latent Dirichlet Allocation
    3.5.1 Simplified LDA Model
    3.5.2 Smoothed LDA Model
  3.6 Nonlinear Transformations and Feature Engineering
    3.6.1 Choosing a Similarity Function
      3.6.1.1 Traditional Kernel Similarity Functions
      3.6.1.2 Generalizing Bag-of-Words to N-Grams
      3.6.1.3 String Subsequence Kernels
      3.6.1.4 Speeding Up the Recursion
      3.6.1.5 Language-Dependent Kernels
    3.6.2 Nyström Approximation
    3.6.3 Partial Availability of the Similarity Matrix
  3.7 Summary
  3.8 Bibliographic Notes
    3.8.1 Software Resources
  3.9 Exercises

4 Text Clustering
  4.1 Introduction
    4.1.1 Chapter Organization
  4.2 Feature Selection and Engineering
    4.2.1 Feature Selection
      4.2.1.1 Term Strength
      4.2.1.2 Supervised Modeling for Unsupervised Feature Selection
      4.2.1.3 Unsupervised Wrappers with Supervised Feature Selection
    4.2.2 Feature Engineering
      4.2.2.1 Matrix Factorization Methods
      4.2.2.2 Nonlinear Dimensionality Reduction
      4.2.2.3 Word Embeddings
  4.3 Topic Modeling and Matrix Factorization
    4.3.1 Mixed Membership Models and Overlapping Clusters
    4.3.2 Non-overlapping Clusters and Co-clustering: A Matrix Factorization View
      4.3.2.1 Co-clustering by Bipartite Graph Partitioning
  4.4 Generative Mixture Models for Clustering
    4.4.1 The Bernoulli Model
    4.4.2 The Multinomial Model
    4.4.3 Comparison with Mixed Membership Topic Models
    4.4.4 Connections with Naïve Bayes Model for Classification
  4.5 The k-Means Algorithm
    4.5.1 Convergence and Initialization
    4.5.2 Computational Complexity
    4.5.3 Connection with Probabilistic Models
  4.6 Hierarchical Clustering Algorithms
    4.6.1 Efficient Implementation and Computational Complexity
    4.6.2 The Natural Marriage with k-Means
  4.7 Clustering Ensembles
    4.7.1 Choosing the Ensemble Component
    4.7.2 Combining the Results from Different Components
  4.8 Clustering Text as Sequences
    4.8.1 Kernel Methods for Clustering
      4.8.1.1 Kernel k-Means
      4.8.1.2 Explicit Feature Engineering
      4.8.1.3 Kernel Trick or Explicit Feature Engineering?
    4.8.2 Data-Dependent Kernels: Spectral Clustering
  4.9 Transforming Clustering into Supervised Learning
    4.9.1 Practical Issues
  4.10 Clustering Evaluation
    4.10.1 The Pitfalls of Internal Validity Measures
    4.10.2 External Validity Measures
      4.10.2.1 Relationship of Clustering Evaluation to Supervised Learning
      4.10.2.2 Common Mistakes in Evaluation
  4.11 Summary
  4.12 Bibliographic Notes
    4.12.1 Software Resources
  4.13 Exercises

5 Text Classification: Basic Models
  5.1 Introduction
    5.1.1 Types of Labels and Regression Modeling
    5.1.2 Training and Testing
    5.1.3 Inductive, Transductive, and Deductive Learners
    5.1.4 The Basic Models
    5.1.5 Text-Specific Challenges in Classifiers
      5.1.5.1 Chapter Organization
  5.2 Feature Selection and Engineering
    5.2.1 Gini Index
    5.2.2 Conditional Entropy
    5.2.3 Pointwise Mutual Information
    5.2.4 Closely Related Measures
    5.2.5 The χ²-Statistic
    5.2.6 Embedded Feature Selection Models
    5.2.7 Feature Engineering Tricks
  5.3 The Naïve Bayes Model
    5.3.1 The Bernoulli Model
      5.3.1.1 Prediction Phase
      5.3.1.2 Training Phase
    5.3.2 Multinomial Model
    5.3.3 Practical Observations
    5.3.4 Ranking Outputs with Naïve Bayes
    5.3.5 Example of Naïve Bayes
      5.3.5.1 Bernoulli Model
      5.3.5.2 Multinomial Model
    5.3.6 Semi-Supervised Naïve Bayes
  5.4 Nearest Neighbor Classifier
    5.4.1 Properties of 1-Nearest Neighbor Classifiers
    5.4.2 Rocchio and Nearest Centroid Classification
    5.4.3 Weighted Nearest Neighbors
      5.4.3.1 Bagged and Subsampled 1-Nearest Neighbors as Weighted Nearest Neighbor Classifiers
    5.4.4 Adaptive Nearest Neighbors: A Powerful Family
  5.5 Decision Trees and Random Forests
    5.5.1 Basic Procedure for Decision Tree Construction
    5.5.2 Splitting a Node
      5.5.2.1 Prediction
    5.5.3 Multivariate Splits
    5.5.4 Problematic Issues with Decision Trees in Text Classification
    5.5.5 Random Forests
    5.5.6 Random Forests as Adaptive Nearest Neighbor Methods
  5.6 Rule-Based Classifiers
    5.6.1 Sequential Covering Algorithms
      5.6.1.1 Learn-One-Rule
      5.6.1.2 Rule Pruning
    5.6.2 Generating Rules from Decision Trees
    5.6.3 Associative Classifiers
    5.6.4 Prediction
  5.7 Summary
  5.8 Bibliographic Notes
    5.8.1 Software Resources
  5.9 Exercises

6 Linear Classification and Regression for Text
  6.1 Introduction
    6.1.1 Geometric Interpretation of Linear Models
    6.1.2 Do We Need the Bias Variable?
    6.1.3 A General Definition of Linear Models with Regularization
    6.1.4 Generalizing Binary Predictions to Multiple Classes
    6.1.5 Characteristics of Linear Models for Text
      6.1.5.1 Chapter Notations
      6.1.5.2 Chapter Organization
  6.2 Least-Squares Regression and Classification
    6.2.1 Least-Squares Regression with L2-Regularization
      6.2.1.1 Efficient Implementation
      6.2.1.2 Approximate Estimation with Singular Value Decomposition
      6.2.1.3 Relationship with Principal Components Regression
      6.2.1.4 The Path to Kernel Regression
    6.2.2 LASSO: Least-Squares Regression with L1-Regularization
      6.2.2.1 Interpreting LASSO as a Feature Selector
    6.2.3 Fisher's Linear Discriminant and Least-Squares Classification
      6.2.3.1 Linear Discriminant with Multiple Classes
      6.2.3.2 Equivalence of Fisher Discriminant and Least-Squares Regression
      6.2.3.3 Regularized Least-Squares Classification and LLSF
      6.2.3.4 The Achilles Heel of Least-Squares Classification
  6.3 Support Vector Machines
    6.3.1 The Regularized Optimization Interpretation
    6.3.2 The Maximum Margin Interpretation
    6.3.3 Pegasos: Solving SVMs in the Primal
      6.3.3.1 Sparsity-Friendly Updates
    6.3.4 Dual SVM Formulation
    6.3.5 Learning Algorithms for Dual SVMs
    6.3.6 Adaptive Nearest Neighbor Interpretation of Dual SVMs
  6.4 Logistic Regression
    6.4.1 The Regularized Optimization Interpretation
    6.4.2 Training Algorithms for Logistic Regression
    6.4.3 Probabilistic Interpretation of Logistic Regression
      6.4.3.1 Probabilistic Interpretation of Stochastic Gradient Descent Steps
      6.4.3.2 Relationships Among Primal Updates of Linear Models
    6.4.4 Multinomial Logistic Regression and Other Generalizations
    6.4.5 Comments on the Performance of Logistic Regression
  6.5 Nonlinear Generalizations of Linear Models
    6.5.1 Kernel SVMs with Explicit Transformation
    6.5.2 Why Do Conventional Kernels Promote Linear Separability?
    6.5.3 Strengths and Weaknesses of Different Kernels
      6.5.3.1 Capturing Linguistic Knowledge with Kernels
    6.5.4 The Kernel Trick
    6.5.5 Systematic Application of the Kernel Trick
  6.6 Summary
  6.7 Bibliographic Notes
    6.7.1 Software Resources
  6.8 Exercises

7 Classifier Performance and Evaluation
  7.1 Introduction
    7.1.1 Chapter Organization
  7.2 The Bias-Variance Trade-Off
    7.2.1 A Formal View
    7.2.2 Telltale Signs of Bias and Variance
  7.3 Implications of Bias-Variance Trade-Off on Performance
    7.3.1 Impact of Training Data Size
    7.3.2 Impact of Data Dimensionality
    7.3.3 Implications for Model Choice in Text
  7.4 Systematic Performance Enhancement with Ensembles
    7.4.1 Bagging and Subsampling
    7.4.2 Boosting
  7.5 Classifier Evaluation
    7.5.1 Segmenting into Training and Testing Portions
      7.5.1.1 Hold-Out
      7.5.1.2 Cross-Validation
    7.5.2 Absolute Accuracy Measures
      7.5.2.1 Accuracy of Classification
      7.5.2.2 Accuracy of Regression
    7.5.3 Ranking Measures for Classification and Information Retrieval
      7.5.3.1 Receiver Operating Characteristic
      7.5.3.2 Top-Heavy Measures for Ranked Lists
  7.6 Summary
  7.7 Bibliographic Notes
    7.7.1 Connection of Boosting to Logistic Regression
