
Random Forests with R PDF

107 Pages·2020·2.183 MB·English

Preview Random Forests with R

Use R!

Robin Genuer, Jean-Michel Poggi

Random Forests with R

Use R! Series Editors:
Robert Gentleman, 23andMe Inc., South San Francisco, USA
Kurt Hornik, Department of Finance, Accounting and Statistics, WU Wirtschaftsuniversität Wien, Vienna, Austria
Giovanni Parmigiani, Dana-Farber Cancer Institute, Boston, USA

This series of inexpensive and focused books on R will publish shorter books aimed at practitioners. Books can discuss the use of R in a particular subject area (e.g., epidemiology, econometrics, psychometrics) or as it relates to statistical topics (e.g., missing data, longitudinal data). In most cases, books will combine LaTeX and R so that the code for figures and tables can be put on a website. Authors should assume a background as supplied by Dalgaard's Introductory Statistics with R or other introductory books so that each book does not repeat basic material.

More information about this series at http://www.springer.com/series/6991

Robin Genuer, ISPED, University of Bordeaux, Bordeaux, France
Jean-Michel Poggi, Lab. Maths Orsay (LMO), Paris-Saclay University, Orsay, France

ISSN 2197-5736, ISSN 2197-5744 (electronic)
Use R!
ISBN 978-3-030-56484-1, ISBN 978-3-030-56485-8 (eBook)
https://doi.org/10.1007/978-3-030-56485-8

Translation from the French language edition: Les forêts aléatoires avec R by Robin Genuer and Jean-Michel Poggi © Presses Universitaires de Rennes 2019 All Rights Reserved

© Springer Nature Switzerland AG 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Random forests are a statistical learning method introduced by Leo Breiman in 2001. They are extensively used in many fields of application, due for sure not only to their excellent predictive performance, but also to their flexibility, with a few restrictions on the nature of the data. Indeed, random forests are adapted to both supervised classification problems and regression problems. In addition, they allow to consider qualitative and quantitative explanatory variables together without preprocessing. Moreover, they can be used to process standard data for which the number of observations is higher than the number of variables, while also performing very well in the high dimensional case, where the number of variables is quite large in comparison to the number of observations. Consequently, they are now among preferred methods in the toolbox of statisticians and other data scientists.

Who Is This Book for?

This book is an application-oriented statistical presentation of random forests. It is therefore primarily intended not only for students in academic fields such as statistical education, but also for practitioners in statistics and machine learning.
A scientific undergraduate degree is quite sufficient to take full advantage of the concepts, methods, and tools covered by the book. In terms of computer science skills, little background knowledge is required, though an introduction to the R language is recommended.

Book Content

Random forests are part of the family of tree-based methods; accordingly, after an introductory chapter, Chap. 2 presents CART trees. The next three chapters are devoted to random forests. They focus on their presentation (Chap. 3), on the variable importance tool (Chap. 4), and on the variable selection problem (Chap. 5), respectively.

The structure of the chapters (except the introduction) is always the same. After a presentation of the concepts and methods, we illustrate their implementation on a running example. Then, various complements are provided before examining additional examples. Throughout the book, each result is given together with the R code that can be used to reproduce it. All lines of code are available online (https://RFwithR.robin.genuer.fr), making things easy. Thus, the book offers readers essential information and concepts, together with examples and the software tools needed to analyze data using random forests.

Orsay, France, June 2020
Robin Genuer
Jean-Michel Poggi

Acknowledgements

Our first thanks go to Eva Hiripi who suggested the idea of publishing the English version of the French original edition of the book entitled "Les forêts aléatoires avec R" (Presses Universitaires de Rennes Ed.). We would like to thank our colleagues who shared our thoughts about these topics through numerous collaborations, in particular, Sylvain Arlot, Servane Gey, Christine Tuleau-Malot, and Nathalie Villa-Vialaneix.

We would also like to thank Nicolas Bousquet and Fabien Navarro who saved us a lot of time by providing us a first raw translation, thanks to the tool they developed for a large-scale automatic translation conducted in 2018. Of course, the authors are entirely responsible for the final translated version.
Finally, we thank three anonymous reviewers for their useful comments and insightful suggestions.

Contents

1 Introduction to Random Forests with R
  1.1 Preamble
  1.2 Notation
  1.3 Statistical Objectives
  1.4 Packages
  1.5 Datasets
    1.5.1 Running Example: Spam Detection
    1.5.2 Ozone Pollution
    1.5.3 Genomic Data for a Vaccine Study
    1.5.4 Dust Pollution
2 CART
  2.1 The Principle
  2.2 Maximal Tree Construction
  2.3 Pruning
  2.4 The rpart Package
  2.5 Competing and Surrogate Splits
    2.5.1 Competing Splits
    2.5.2 Surrogate Splits
    2.5.3 Interpretability
  2.6 Examples
    2.6.1 Predicting Ozone Concentration
    2.6.2 Analyzing Genomic Data
3 Random Forests
  3.1 General Principle
    3.1.1 Instability of a Tree
    3.1.2 From a Tree to an Ensemble: Bagging
  3.2 Random Forest Random Inputs
  3.3 The randomForest Package
  3.4 Out-Of-Bag Error
  3.5 Parameters Setting for Prediction
    3.5.1 The Number of Trees: ntree
    3.5.2 The Number of Variables Chosen at Each Node: mtry
  3.6 Examples
    3.6.1 Predicting Ozone Concentration
    3.6.2 Analyzing Genomic Data
    3.6.3 Analyzing Dust Pollution
4 Variable Importance
  4.1 Notions of Importance
  4.2 Variable Importance Behavior
    4.2.1 Behavior According to n and p
    4.2.2 Behavior for Groups of Correlated Variables
  4.3 Tree Diversity and Variables Importance
  4.4 Influence of Parameters on Variable Importance
  4.5 Examples
    4.5.1 An Illustration by Simulation in Regression
    4.5.2 Predicting Ozone Concentration
    4.5.3 Analyzing Genomic Data
    4.5.4 Air Pollution by Dust: What Is the Local Contribution?
5 Variable Selection
  5.1 Generalities
  5.2 Principle
  5.3 Procedure
  5.4 The VSURF Package
  5.5 Parameter Setting for Selection
  5.6 Examples
    5.6.1 Predicting Ozone Concentration
    5.6.2 Analyzing Genomic Data
References
Index
