
OpenIntro Statistics
Fourth Edition

David Diez
Data Scientist
OpenIntro

Mine Çetinkaya-Rundel
Associate Professor of the Practice, Duke University
Professional Educator, RStudio

Christopher D Barr
Investment Analyst
Varadero Capital

Copyright © 2019. Fourth Edition. Updated: April 12th, 2022.

This book may be downloaded as a free PDF at openintro.org/os. This textbook is also available under a Creative Commons license, with the source files hosted on Github.

Table of Contents

1 Introduction to data
  1.1 Case study: using stents to prevent strokes
  1.2 Data basics
  1.3 Sampling principles and strategies
  1.4 Experiments
2 Summarizing data
  2.1 Examining numerical data
  2.2 Considering categorical data
  2.3 Case study: malaria vaccine
3 Probability
  3.1 Defining probability
  3.2 Conditional probability
  3.3 Sampling from a small population
  3.4 Random variables
  3.5 Continuous distributions
4 Distributions of random variables
  4.1 Normal distribution
  4.2 Geometric distribution
  4.3 Binomial distribution
  4.4 Negative binomial distribution
  4.5 Poisson distribution
5 Foundations for inference
  5.1 Point estimates and sampling variability
  5.2 Confidence intervals for a proportion
  5.3 Hypothesis testing for a proportion
6 Inference for categorical data
  6.1 Inference for a single proportion
  6.2 Difference of two proportions
  6.3 Testing for goodness of fit using chi-square
  6.4 Testing for independence in two-way tables
7 Inference for numerical data
  7.1 One-sample means with the t-distribution
  7.2 Paired data
  7.3 Difference of two means
  7.4 Power calculations for a difference of means
  7.5 Comparing many means with ANOVA
8 Introduction to linear regression
  8.1 Fitting a line, residuals, and correlation
  8.2 Least squares regression
  8.3 Types of outliers in linear regression
  8.4 Inference for linear regression
9 Multiple and logistic regression
  9.1 Introduction to multiple regression
  9.2 Model selection
  9.3 Checking model conditions using graphs
  9.4 Multiple regression case study: Mario Kart
  9.5 Introduction to logistic regression
A Exercise solutions
B Data sets within the text
C Distribution tables

Preface

OpenIntro Statistics covers a first course in statistics, providing a rigorous introduction to applied statistics that is clear, concise, and accessible. This book was written with the undergraduate level in mind, but it’s also popular in high schools and graduate courses.

We hope readers will take away three ideas from this book in addition to forming a foundation of statistical thinking and methods.

• Statistics is an applied field with a wide range of practical applications.
• You don’t have to be a math guru to learn from real, interesting data.
• Data are messy, and statistical tools are imperfect. But, when you understand the strengths and weaknesses of these tools, you can use them to learn about the world.

Textbook overview

The chapters of this book are as follows:

1. Introduction to data. Data structures, variables, and basic data collection techniques.
2. Summarizing data. Data summaries, graphics, and a teaser of inference using randomization.
3. Probability. Basic principles of probability.
4. Distributions of random variables. The normal model and other key distributions.
5. Foundations for inference. General ideas for statistical inference in the context of estimating the population proportion.
6. Inference for categorical data.
   Inference for proportions and tables using the normal and chi-square distributions.
7. Inference for numerical data. Inference for one or two sample means using the t-distribution, statistical power for comparing two groups, and also comparisons of many means using ANOVA.
8. Introduction to linear regression. Regression for a numerical outcome with one predictor variable. Most of this chapter could be covered after Chapter 1.
9. Multiple and logistic regression. Regression for numerical and categorical data using many predictors.

OpenIntro Statistics supports flexibility in choosing and ordering topics. If the main goal is to reach multiple regression (Chapter 9) as quickly as possible, then the following are the ideal prerequisites:

• Chapter 1, Section 2.1, and Section 2.2 for a solid introduction to data structures and statistical summaries that are used throughout the book.
• Section 4.1 for a solid understanding of the normal distribution.
• Chapter 5 to establish the core set of inference tools.
• Section 7.1 to give a foundation for the t-distribution.
• Chapter 8 for establishing ideas and principles for single predictor regression.

Examples and exercises

Examples are provided to establish an understanding of how to apply methods.

EXAMPLE 0.1
This is an example. When a question is asked here, where can the answer be found?
The answer can be found here, in the solution section of the example!

When we think the reader should be ready to try determining the solution to an example, we frame it as Guided Practice.

GUIDED PRACTICE 0.2
The reader may check or learn the answer to any Guided Practice problem by reviewing the full solution in a footnote.[1]

Exercises are also provided at the end of each section as well as review exercises at the end of each chapter. Solutions are given for odd-numbered exercises in Appendix A.

Additional resources

Video overviews, slides, statistical software labs, data sets used in the textbook, and much more are readily available at openintro.org/os.

We also have improved the ability to access data in this book through the addition of Appendix B, which provides additional information for each of the data sets used in the main text and is new in the Fourth Edition. Online guides to each of these data sets are also provided at openintro.org/data and through a companion R package.

We appreciate all feedback as well as reports of any typos through the website. A short-link to report a new typo or review known typos is openintro.org/os/typos.

For those focused on statistics at the high school level, consider Advanced High School Statistics, which is a version of OpenIntro Statistics that has been heavily customized by Leah Dorazio for high school courses and AP® Statistics.

Acknowledgements

This project would not be possible without the passion and dedication of many more people beyond those on the author list. The authors would like to thank the OpenIntro Staff for their involvement and ongoing contributions. We are also very grateful to the hundreds of students and instructors who have provided us with valuable feedback since we first started posting book content in 2009.

We also want to thank the many teachers who helped review this edition, including Laura Acion, Matthew E. Aiello-Lammens, Jonathan Akin, Stacey C. Behrensmeyer, Juan Gomez, Jo Hardin, Nicholas Horton, Danish Khan, Peter H.M. Klaren, Jesse Mostipak, Jon C. New, Mario Orsi, Steve Phelps, and David Rockoff. We appreciate all of their feedback, which helped us tune the text in significant ways and greatly improved this book.

[1] Guided Practice problems are intended to stretch your thinking, and you can check yourself by reviewing the footnote solution for any Guided Practice.

Chapter 1
Introduction to data

1.1 Case study: using stents to prevent strokes
1.2 Data basics
1.3 Sampling principles and strategies
1.4 Experiments

Scientists seek to answer questions using rigorous methods and careful observations. These observations – collected from the likes of field notes, surveys, and experiments – form the backbone of a statistical investigation and are called data. Statistics is the study of how best to collect, analyze, and draw conclusions from data, and in this first chapter, we focus on both the properties of data and on the collection of data.

For videos, slides, and other resources, please visit www.openintro.org/os

1.1 Case study: using stents to prevent strokes

Section 1.1 introduces a classic challenge in statistics: evaluating the efficacy of a medical treatment. Terms in this section, and indeed much of this chapter, will all be revisited later in the text. The plan for now is simply to get a sense of the role statistics can play in practice.

In this section we will consider an experiment that studies effectiveness of stents in treating patients at risk of stroke. Stents are devices put inside blood vessels that assist in patient recovery after cardiac events and reduce the risk of an additional heart attack or death. Many doctors have hoped that there would be similar benefits for patients at risk of stroke. We start by writing the principal question the researchers hope to answer:

Does the use of stents reduce the risk of stroke?

The researchers who asked this question conducted an experiment with 451 at-risk patients. Each volunteer patient was randomly assigned to one of two groups:

Treatment group. Patients in the treatment group received a stent and medical management. The medical management included medications, management of risk factors, and help in lifestyle modification.

Control group.
Patients in the control group received the same medical management as the treatment group, but they did not receive stents.

Researchers randomly assigned 224 patients to the treatment group and 227 to the control group. In this study, the control group provides a reference point against which we can measure the medical impact of stents in the treatment group.

Researchers studied the effect of stents at two time points: 30 days after enrollment and 365 days after enrollment. The results of 5 patients are summarized in Figure 1.1. Patient outcomes are recorded as “stroke” or “no event”, representing whether or not the patient had a stroke at the end of a time period.

Patient   group       0-30 days   0-365 days
1         treatment   no event    no event
2         treatment   stroke      stroke
3         treatment   no event    no event
...       ...         ...         ...
450       control     no event    no event
451       control     no event    no event

Figure 1.1: Results for five patients from the stent study.

Considering data from each patient individually would be a long, cumbersome path towards answering the original research question. Instead, performing a statistical data analysis allows us to consider all of the data at once. Figure 1.2 summarizes the raw data in a more helpful way. In this table, we can quickly see what happened over the entire study. For instance, to identify the number of patients in the treatment group who had a stroke within 30 days, we look on the left-side of the table at the intersection of the treatment and stroke: 33.

            0-30 days            0-365 days
            stroke   no event    stroke   no event
treatment   33       191         45       179
control     13       214         28       199
Total       46       405         73       378

Figure 1.2: Descriptive statistics for the stent study.

GUIDED PRACTICE 1.1
Of the 224 patients in the treatment group, 45 had a stroke by the end of the first year. Using these two numbers, compute the proportion of patients in the treatment group who had a stroke by the end of their first year.
(Please note: answers to all Guided Practice exercises are provided using footnotes.)[1]

We can compute summary statistics from the table. A summary statistic is a single number summarizing a large amount of data. For instance, the primary results of the study after 1 year could be described by two summary statistics: the proportion of people who had a stroke in the treatment and control groups.

Proportion who had a stroke in the treatment (stent) group: 45/224 = 0.20 = 20%.
Proportion who had a stroke in the control group: 28/227 = 0.12 = 12%.

These two summary statistics are useful in looking for differences in the groups, and we are in for a surprise: an additional 8% of patients in the treatment group had a stroke! This is important for two reasons. First, it is contrary to what doctors expected, which was that stents would reduce the rate of strokes. Second, it leads to a statistical question: do the data show a “real” difference between the groups?

This second question is subtle. Suppose you flip a coin 100 times. While the chance a coin lands heads in any given coin flip is 50%, we probably won’t observe exactly 50 heads. This type of fluctuation is part of almost any type of data generating process. It is possible that the 8% difference in the stent study is due to this natural variation. However, the larger the difference we observe (for a particular sample size), the less believable it is that the difference is due to chance. So what we are really asking is the following: is the difference so large that we should reject the notion that it was due to chance?

While we don’t yet have our statistical tools to fully address this question on our own, we can comprehend the conclusions of the published analysis: there was compelling evidence of harm by stents in this study of stroke patients.

Be careful: Do not generalize the results of this study to all patients and all stents.
This study looked at patients with very specific characteristics who volunteered to be a part of this study and who may not be representative of all stroke patients. In addition, there are many types of stents and this study only considered the self-expanding Wingspan stent (Boston Scientific). However, this study does leave us with an important lesson: we should keep our eyes open for surprises.

[1] The proportion of the 224 patients who had a stroke within 365 days: 45/224 = 0.20.
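
For readers who like to check the arithmetic by computer, the two summary statistics from Figure 1.2 and the coin-flip thought experiment in Section 1.1 can be sketched in a few lines of Python. This is only an illustration and not part of the book's materials: the names STENT_STUDY, stroke_proportion, and count_heads are hypothetical, and the seed is arbitrary. The stroke counts are copied from the 0-365 days columns of Figure 1.2.

```python
import random

# Counts copied from Figure 1.2 (0-365 days): (strokes, group size).
# The dictionary name and layout are assumptions made for this sketch.
STENT_STUDY = {
    "treatment": (45, 224),
    "control": (28, 227),
}


def stroke_proportion(strokes, group_size):
    """Return the fraction of a group that had a stroke."""
    return strokes / group_size


def count_heads(n_flips, rng):
    """Flip a fair coin n_flips times and return the number of heads."""
    return sum(rng.random() < 0.5 for _ in range(n_flips))


if __name__ == "__main__":
    p_treatment = stroke_proportion(*STENT_STUDY["treatment"])
    p_control = stroke_proportion(*STENT_STUDY["control"])
    print(f"treatment:  {p_treatment:.2f}")              # prints 0.20
    print(f"control:    {p_control:.2f}")                # prints 0.12
    print(f"difference: {p_treatment - p_control:.2f}")  # prints 0.08

    # Repeat the 100-flip experiment several times. The counts vary from
    # run to run even though the chance of heads is always 50%.
    rng = random.Random(2019)  # arbitrary seed so reruns give the same counts
    heads = [count_heads(100, rng) for _ in range(10)]
    print("heads in 10 runs of 100 flips:", heads)
```

The head counts will typically scatter around 50 without landing on it every time, which is the natural variation the text describes. The sketch does not judge whether the 8% difference in the stent study is larger than chance alone would produce; that question needs the inference tools developed in later chapters.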
