Tree-Based Methods for Statistical Learning in R

The book follows up most ideas and mathematical concepts with code-based examples in the R statistical language, with an emphasis on using as few external packages as possible. For example, users will be exposed to writing their own random forest and gradient tree boosting functions using simple for loops and basic tree fitting software (like rpart and party/partykit), and more. The core chapters also end with a detailed section on relevant software in both R and other open-source alternatives (e.g., Python, Spark, and Julia), and example usage on real data sets. While the book mostly uses R, it is meant to be equally accessible and useful to non-R programmers. Readers of this book will gain a solid foundation (and appreciation) for tree-based methods and how they can be used to solve practical problems and challenges data scientists often face in applied work.

Features:

• Thorough coverage, from the ground up, of tree-based methods (e.g., CART, conditional inference trees, bagging, boosting, and random forests).
• A companion website containing additional supplementary material and the code to reproduce every example and figure in the book.
• A companion R package, called treemisc, which contains several data sets and functions used throughout the book (e.g., there’s an implementation of gradient tree boosting with LAD loss that shows how to perform the line search step by updating the terminal node estimates of a fitted rpart tree; an illustrative sketch of this idea appears below).
• Interesting examples that are of practical use; for example, how to construct partial dependence plots from a fitted model in Spark MLlib (using only Spark operations), or post-processing tree ensembles via the LASSO to reduce the number of trees while maintaining, or even improving, performance.

CHAPMAN & HALL/CRC DATA SCIENCE SERIES

Reflecting the interdisciplinary nature of the field, this book series brings together researchers, practitioners, and instructors from statistics, computer science, machine learning, and analytics. The series will publish cutting-edge research, industry applications, and textbooks in data science. The inclusion of concrete examples, applications, and methods is highly encouraged. The scope of the series includes titles in the areas of machine learning, pattern recognition, predictive analytics, business analytics, Big Data, visualization, programming, software, learning analytics, data wrangling, interactive graphics, and reproducible research.

Published Titles

An Introduction to IoT Analytics
Harry G. Perros

Data Analytics: A Small Data Approach
Shuai Huang and Houtao Deng

Public Policy Analytics: Code and Context for Data Science in Government
Ken Steif

Supervised Machine Learning for Text Analysis in R
Emil Hvitfeldt and Julia Silge

Massive Graph Analytics
Edited by David Bader

Data Science: An Introduction
Tiffany-Anne Timbers, Trevor Campbell and Melissa Lee

Tree-Based Methods for Statistical Learning in R
Brandon M. Greenwell

Urban Informatics: Using Big Data to Understand and Serve Communities
Daniel T. O’Brien

For more information about this series, please visit: https://www.routledge.com/Chapman--HallCRC-Data-Science-Series/book-series/CHDSS

Tree-Based Methods for Statistical Learning in R
Brandon M. Greenwell
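To give a flavor of the code-based examples referenced in the features above, here is a minimal sketch of gradient tree boosting with LAD loss built on rpart, where the line search is carried out by overwriting each terminal node's fitted value with the median residual of the observations falling in that node. This is an illustration of the idea only, not the treemisc implementation; the function name lad_boost, the argument names, and the assumption of a numeric response y and an NA-free predictor data frame X are all made up for this sketch.

library(rpart)

# Illustrative sketch only (NOT the treemisc implementation): gradient tree
# boosting with LAD loss, where the "line search" replaces each terminal
# node's fitted value with the median residual in that node.
lad_boost <- function(X, y, n_trees = 100, shrinkage = 0.1, max_depth = 3) {
  fhat <- rep(median(y), times = length(y))   # initial fit: overall median
  trees <- vector("list", length = n_trees)
  for (b in seq_len(n_trees)) {
    r <- y - fhat                             # current residuals
    dat <- data.frame(X, grad = sign(r))      # negative gradient of LAD loss
    tree <- rpart(grad ~ ., data = dat, method = "anova",
                  control = rpart.control(maxdepth = max_depth, cp = 0,
                                          xval = 0))
    # Line search: tree$where maps each training row (assuming no NAs) to a
    # row of tree$frame (its terminal node); overwrite that node's value with
    # the median residual so predict() returns LAD-optimal node estimates
    med <- tapply(r, tree$where, FUN = median)
    tree$frame$yval[as.integer(names(med))] <- med
    fhat <- fhat + shrinkage * predict(tree, newdata = dat)
    trees[[b]] <- tree
  }
  list(init = median(y), trees = trees, shrinkage = shrinkage)
}

Predictions for new data would then be the initial median plus shrinkage times the sum of the per-tree predictions; the treemisc version handles the details more carefully.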
First edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742

and by CRC Press
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

CRC Press is an imprint of Taylor & Francis Group, LLC

© 2022 Brandon M. Greenwell

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected].

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data
Names: Greenwell, Brandon M., author.
Title: Tree-Based Methods for Statistical Learning in R / Brandon M. Greenwell.
Description: First edition. | Boca Raton, FL : CRC Press, 2022. | Series: Chapman & Hall/CRC data science series | Includes bibliographical references and index.
Identifiers: LCCN 2021059606 (print) | LCCN 2021059607 (ebook) | ISBN 9780367532468 (hbk) | ISBN 9780367543822 (pbk) | ISBN 9781003089032 (ebk)
Subjects: LCSH: Decision making--Mathematics. | Decision trees--Data processing. | Trees (Graph theory)--Data processing. | R (Computer program language)
Classification: LCC T57.95 .G73 2022 (print) | LCC T57.95 (ebook) | DDC 658.4/03--dc23/eng/20220217
LC record available at https://lccn.loc.gov/2021059606
LC ebook record available at https://lccn.loc.gov/2021059607

ISBN: 978-0-367-53246-8 (hbk)
ISBN: 978-0-367-54382-2 (pbk)
ISBN: 978-1-003-08903-2 (ebk)

DOI: 10.1201/9781003089032

Typeset in LM Roman by KnowledgeWorks Global Ltd.

Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.

To my son, Oliver.

Contents

Preface

1 Introduction
  1.1 Select topics in statistical and machine learning
    1.1.1 Statistical jargon and conventions
    1.1.2 Supervised learning
      1.1.2.1 Description
      1.1.2.2 Prediction
      1.1.2.3 Classification vs. regression
      1.1.2.4 Discrimination vs. prediction
      1.1.2.5 The bias-variance tradeoff
    1.1.3 Unsupervised learning
  1.2 Why trees?
    1.2.1 A brief history of decision trees
    1.2.2 The anatomy of a simple decision tree
      1.2.2.1 Example: survival on the Titanic
  1.3 Why R?
    1.3.1 No really, why R?
    1.3.2 Software information and conventions
  1.4 Some example data sets
    1.4.1 Swiss banknotes
    1.4.2 New York air quality measurements
    1.4.3 The Friedman 1 benchmark problem
    1.4.4 Mushroom edibility
    1.4.5 Spam or ham?
    1.4.6 Employee attrition
    1.4.7 Predicting home prices in Ames, Iowa
    1.4.8 Wine quality ratings
    1.4.9 Mayo Clinic primary biliary cholangitis study
  1.5 There ain’t no such thing as a free lunch
  1.6 Outline of this book

I Decision trees

2 Binary recursive partitioning with CART
  2.1 Introduction
  2.2 Classification trees
    2.2.1 Splits on ordered variables
      2.2.1.1 So which is it in practice, Gini or entropy?
    2.2.2 Example: Swiss banknotes
    2.2.3 Fitted values and predictions
    2.2.4 Class priors and misclassification costs
      2.2.4.1 Altered priors
      2.2.4.2 Example: employee attrition
  2.3 Regression trees
    2.3.1 Example: New York air quality measurements
  2.4 Categorical splits
    2.4.1 Example: mushroom edibility
    2.4.2 Be wary of categoricals with high cardinality
    2.4.3 To encode, or not to encode?
  2.5 Building a decision tree
    2.5.1 Cost-complexity pruning
      2.5.1.1 Example: mushroom edibility
    2.5.2 Cross-validation
      2.5.2.1 The 1-SE rule
  2.6 Hyperparameters and tuning
  2.7 Missing data and surrogate splits
    2.7.1 Other missing value strategies
  2.8 Variable importance
  2.9 Software and examples
    2.9.1 Example: Swiss banknotes
    2.9.2 Example: mushroom edibility
    2.9.3 Example: predicting home prices
    2.9.4 Example: employee attrition
    2.9.5 Example: letter image recognition
  2.10 Discussion
    2.10.1 Advantages of CART
    2.10.2 Disadvantages of CART
  2.11 Recommended reading

3 Conditional inference trees
  3.1 Introduction
  3.2 Early attempts at unbiased recursive partitioning
  3.3 A quick digression into conditional inference
    3.3.1 Example: X and Y are both univariate continuous
    3.3.2 Example: X and Y are both nominal categorical
    3.3.3 Which test statistic should you use?
  3.4 Conditional inference trees
    3.4.1 Selecting the splitting variable
      3.4.1.1 Example: New York air quality measurements
      3.4.1.2 Example: Swiss banknotes
    3.4.2 Finding the optimal split point
      3.4.2.1 Example: New York air quality measurements
    3.4.3 Pruning
    3.4.4 Missing values
    3.4.5 Choice of α, g(), and h()
    3.4.6 Fitted values and predictions
    3.4.7 Imbalanced classes
    3.4.8 Variable importance
  3.5 Software and examples
    3.5.1 Example: New York air quality measurements
    3.5.2 Example: wine quality ratings
    3.5.3 Example: Mayo Clinic liver transplant data
  3.6 Final thoughts

4 The hitchhiker’s GUIDE to modern decision trees
  4.1 Introduction
  4.2 A GUIDE for regression
    4.2.1 Piecewise constant models
      4.2.1.1 Example: New York air quality measurements
    4.2.2 Interaction tests
    4.2.3 Non-constant fits
      4.2.3.1 Example: predicting home prices
      4.2.3.2 Bootstrap bias correction
  4.3 A GUIDE for classification
    4.3.1 Linear/oblique splits
      4.3.1.1 Example: classifying the Palmer penguins
    4.3.2 Priors and misclassification costs
    4.3.3 Non-constant fits
      4.3.3.1 Kernel-based and k-nearest neighbor fits
  4.4 Pruning
  4.5 Missing values
  4.6 Fitted values and predictions
  4.7 Variable importance
  4.8 Ensembles
  4.9 Software and examples
    4.9.1 Example: credit card default
  4.10 Final thoughts

II Tree-based ensembles

5 Ensemble algorithms
  5.1 Bootstrap aggregating (bagging)
    5.1.1 When does bagging work?
    5.1.2 Bagging from scratch: classifying email spam
    5.1.3 Sampling without replacement