Download link for computers connected to selected institutions: https://rd.springer.com/book/10.1007/978-3-319-73531-3

Machine Learning for Text

Charu C. Aggarwal
IBM T. J. Watson Research Center
Yorktown Heights, NY, USA

ISBN 978-3-319-73530-6
ISBN 978-3-319-73531-3 (eBook)
https://doi.org/10.1007/978-3-319-73531-3
Library of Congress Control Number: 2018932755

© Springer International Publishing AG 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by the registered company Springer International Publishing AG part of Springer Nature.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

To my wife Lata, my daughter Sayani, and my late parents Dr. Prem Sarup and Mrs. Pushplata Aggarwal.

Preface

“If it is true that there is always more than one way of construing a text, it is not true that all interpretations are equal.” – Paul Ricoeur

The rich area of text analytics draws ideas from information retrieval, machine learning, and natural language processing. Each of these areas is an active and vibrant field in its own right, and numerous books have been written in each of these different areas. As a result, many of these books have covered some aspects of text analytics, but they have not covered all the areas that a book on learning from text is expected to cover.

At this point, a need exists for a focussed book on machine learning from text. This book is a first attempt to integrate all the complexities in the areas of machine learning, information retrieval, and natural language processing in a holistic way, in order to create a coherent and integrated book in the area. Therefore, the chapters are divided into three categories:

1. Fundamental algorithms and models: Many fundamental applications in text analytics, such as matrix factorization, clustering, and classification, have uses in domains beyond text. Nevertheless, these methods need to be tailored to the specialized characteristics of text. Chapters 1 through 8 will discuss core analytical methods in the context of machine learning from text.

2. Information retrieval and ranking: Many aspects of information retrieval and ranking are closely related to text analytics.
For example, ranking SVMs and link-based ranking are often used for learning from text. Chapter 9 will provide an overview of information retrieval methods from the point of view of text mining.

3. Sequence- and natural language-centric text mining: Although multidimensional representations can be used for basic applications in text analytics, the true richness of the text representation can be leveraged by treating text as sequences. Chapters 10 through 14 will discuss advanced topics such as sequence embedding, deep learning, information extraction, summarization, opinion mining, text segmentation, and event extraction.

Because of the diversity of topics covered in this book, some careful decisions have been made on the scope of coverage. A complicating factor is that many machine learning techniques depend on the use of basic natural language processing and information retrieval methodologies. This is particularly true of the sequence-centric approaches discussed in Chapters 10 through 14, which are more closely related to natural language processing. Examples of analytical methods that rely on natural language processing include information extraction, event extraction, opinion mining, and text summarization, which frequently leverage basic natural language processing tools like linguistic parsing or part-of-speech tagging. Needless to say, natural language processing is a full-fledged field in its own right (with excellent books dedicated to it). Therefore, a question arises on how much discussion should be provided on techniques that lie on the interface of natural language processing and text mining without deviating from the primary scope of this book. Our general principle in making these choices has been to focus on mining and machine learning aspects.
If a specific natural language or information retrieval method (e.g., part-of-speech tagging) is not directly about text analytics, we have illustrated how to use such techniques (as black-boxes) rather than discussing the internal algorithmic details of these methods. Basic techniques like part-of-speech tagging have matured in algorithmic development, and have been commoditized to the extent that many open-source tools are available with little difference in relative performance. Therefore, we only provide working definitions of such concepts in the book, and the primary focus will be on their utility as off-the-shelf tools in mining-centric settings. The book provides pointers to the relevant books and open-source software in each chapter in order to enable additional help to the student and practitioner.

The book is written for graduate students, researchers, and practitioners. The exposition has been simplified to a large extent, so that a graduate student with a reasonable understanding of linear algebra and probability theory can understand the book easily. Numerous exercises are available along with a solution manual to aid in classroom teaching.

Throughout this book, a vector or a multidimensional data point is annotated with a bar, such as $\overline{X}$ or $\overline{y}$. A vector or multidimensional point may be denoted by either small letters or capital letters, as long as it has a bar. Vector dot products are denoted by centered dots, such as $\overline{X} \cdot \overline{Y}$. A matrix is denoted in capital letters without a bar, such as $R$. Throughout the book, the $n \times d$ document-term matrix is denoted by $D$, with $n$ documents and $d$ dimensions. The individual documents in $D$ are therefore represented as $d$-dimensional row vectors, which are the bag-of-words representations. On the other hand, vectors with one component for each data point are usually $n$-dimensional column vectors. An example is the $n$-dimensional column vector $\overline{y}$ of class variables of $n$ data points.

Yorktown Heights, NY, USA
Charu C. Aggarwal

Acknowledgments

I would like to thank my family including my wife, daughter, and my parents for their love and support. I would also like to thank my manager Nagui Halim for his support during the writing of this book.

This book has benefitted from significant feedback and several collaborations that I have had with numerous colleagues over the years. I would like to thank Quoc Le, Chih-Jen Lin, Chandan Reddy, Saket Sathe, Shai Shalev-Shwartz, Jiliang Tang, Suhang Wang, and ChengXiang Zhai for their feedback on various portions of this book and for answering specific queries on technical matters. I would particularly like to thank Saket Sathe for commenting on several portions, and also for providing some sample output from a neural network to use in the book. For their collaborations, I would like to thank Tarek F. Abdelzaher, Jing Gao, Quanquan Gu, Manish Gupta, Jiawei Han, Alexander Hinneburg, Thomas Huang, Nan Li, Huan Liu, Ruoming Jin, Daniel Keim, Arijit Khan, Latifur Khan, Mohammad M. Masud, Jian Pei, Magda Procopiuc, Guojun Qi, Chandan Reddy, Saket Sathe, Jaideep Srivastava, Karthik Subbian, Yizhou Sun, Jiliang Tang, Min-Hsuan Tsai, Haixun Wang, Jianyong Wang, Min Wang, Suhang Wang, Joel Wolf, Xifeng Yan, Mohammed Zaki, ChengXiang Zhai, and Peixiang Zhao. I would particularly like to thank Professor ChengXiang Zhai for my earlier collaborations with him in text mining. I would also like to thank my advisor James B. Orlin for his guidance during my early years as a researcher.

Finally, I would like to thank Lata Aggarwal for helping me with some of the figures created using PowerPoint graphics in this book.