Practical Statistics for Data Scientists
50+ Essential Concepts Using R and Python
Second Edition
Peter Bruce, Andrew Bruce, and Peter Gedeck

Beijing  Boston  Farnham  Sebastopol  Tokyo

Practical Statistics for Data Scientists
by Peter Bruce, Andrew Bruce, and Peter Gedeck

Copyright © 2020 Peter Bruce, Andrew Bruce, and Peter Gedeck. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editor: Nicole Tache
Production Editor: Kristen Brown
Copyeditor: Piper Editorial
Proofreader: Arthur Johnson
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

May 2017: First Edition
May 2020: Second Edition

Revision History for the Second Edition
2020-04-10: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492072942 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Practical Statistics for Data Scientists, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work.
Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-492-07294-2
[LSI]

Peter Bruce and Andrew Bruce would like to dedicate this book to the memories of our parents, Victor G. Bruce and Nancy C. Bruce, who cultivated a passion for math and science; and to our early mentors John W. Tukey and Julian Simon and our lifelong friend Geoff Watson, who helped inspire us to pursue a career in statistics.

Peter Gedeck would like to dedicate this book to Tim Clark and Christian Kramer, with deep thanks for their scientific collaboration and friendship.

Table of Contents

Preface  xiii

1. Exploratory Data Analysis  1
    Elements of Structured Data  2
    Further Reading  4
    Rectangular Data  4
    Data Frames and Indexes  6
    Nonrectangular Data Structures  6
    Further Reading  7
    Estimates of Location  7
    Mean  9
    Median and Robust Estimates  10
    Example: Location Estimates of Population and Murder Rates  12
    Further Reading  13
    Estimates of Variability  13
    Standard Deviation and Related Estimates  14
    Estimates Based on Percentiles  16
    Example: Variability Estimates of State Population  18
    Further Reading  19
    Exploring the Data Distribution  19
    Percentiles and Boxplots  20
    Frequency Tables and Histograms  22
    Density Plots and Estimates  24
    Further Reading  26
    Exploring Binary and Categorical Data  27
    Mode  29
    Expected Value  29
    Probability  30
    Further Reading  30
    Correlation  30
    Scatterplots  34
    Further Reading  36
    Exploring Two or More Variables  36
    Hexagonal Binning and Contours (Plotting Numeric Versus Numeric Data)  36
    Two Categorical Variables  39
    Categorical and Numeric Data  41
    Visualizing Multiple Variables  43
    Further Reading  46
    Summary  46

2. Data and Sampling Distributions  47
    Random Sampling and Sample Bias  48
    Bias  50
    Random Selection  51
    Size Versus Quality: When Does Size Matter?  52
    Sample Mean Versus Population Mean  53
    Further Reading  53
    Selection Bias  54
    Regression to the Mean  55
    Further Reading  57
    Sampling Distribution of a Statistic  57
    Central Limit Theorem  60
    Standard Error  60
    Further Reading  61
    The Bootstrap  61
    Resampling Versus Bootstrapping  65
    Further Reading  65
    Confidence Intervals  65
    Further Reading  68
    Normal Distribution  69
    Standard Normal and QQ-Plots  71
    Long-Tailed Distributions  73
    Further Reading  75
    Student’s t-Distribution  75
    Further Reading  78
    Binomial Distribution  78
    Further Reading  80
    Chi-Square Distribution  80
    Further Reading  81
    F-Distribution  82
    Further Reading  82
    Poisson and Related Distributions  82
    Poisson Distributions  83
    Exponential Distribution  84
    Estimating the Failure Rate  84
    Weibull Distribution  85
    Further Reading  86
    Summary  86

3. Statistical Experiments and Significance Testing  87
    A/B Testing  88
    Why Have a Control Group?  90
    Why Just A/B? Why Not C, D,…?  91
    Further Reading  92
    Hypothesis Tests  93
    The Null Hypothesis  94
    Alternative Hypothesis  95
    One-Way Versus Two-Way Hypothesis Tests  95
    Further Reading  96
    Resampling  96
    Permutation Test  97
    Example: Web Stickiness  98
    Exhaustive and Bootstrap Permutation Tests  102
    Permutation Tests: The Bottom Line for Data Science  102
    Further Reading  103
    Statistical Significance and p-Values  103
    p-Value  106
    Alpha  107
    Type 1 and Type 2 Errors  109
    Data Science and p-Values  109
    Further Reading  110
    t-Tests  110
    Further Reading  112
    Multiple Testing  112
    Further Reading  116
    Degrees of Freedom  116
    Further Reading  118
    ANOVA  118
    F-Statistic  121
    Two-Way ANOVA  123
    Further Reading  124
    Chi-Square Test  124
    Chi-Square Test: A Resampling Approach  124
    Chi-Square Test: Statistical Theory  127
    Fisher’s Exact Test  128
    Relevance for Data Science  130
    Further Reading  131
    Multi-Arm Bandit Algorithm  131
    Further Reading  134
    Power and Sample Size  135
    Sample Size  136
    Further Reading  138
    Summary  139

4. Regression and Prediction  141
    Simple Linear Regression  141
    The Regression Equation  143
    Fitted Values and Residuals  146
    Least Squares  148
    Prediction Versus Explanation (Profiling)  149
    Further Reading  150
    Multiple Linear Regression  150
    Example: King County Housing Data  151
    Assessing the Model  153
    Cross-Validation  155
    Model Selection and Stepwise Regression  156
    Weighted Regression  159
    Further Reading  161
    Prediction Using Regression  161
    The Dangers of Extrapolation  161
    Confidence and Prediction Intervals  161
    Factor Variables in Regression  163
    Dummy Variables Representation  164
    Factor Variables with Many Levels  167
    Ordered Factor Variables  169
    Interpreting the Regression Equation  169
    Correlated Predictors  170
    Multicollinearity  172
    Confounding Variables  172
    Interactions and Main Effects  174
    Regression Diagnostics  176
    Outliers  177
    Influential Values  179
    Heteroskedasticity, Non-Normality, and Correlated Errors  182