
Probability and statistics for data science : math + R + data PDF

445 Pages·2020·6.318 MB·English


Probability and Statistics for Data Science: Math + R + Data

CHAPMAN & HALL/CRC DATA SCIENCE SERIES

Reflecting the interdisciplinary nature of the field, this book series brings together researchers, practitioners, and instructors from statistics, computer science, machine learning, and analytics. The series will publish cutting-edge research, industry applications, and textbooks in data science. The inclusion of concrete examples, applications, and methods is highly encouraged. The scope of the series includes titles in the areas of machine learning, pattern recognition, predictive analytics, business analytics, Big Data, visualization, programming, software, learning analytics, data wrangling, interactive graphics, and reproducible research.

Published Titles
Feature Engineering and Selection: A Practical Approach for Predictive Models, Max Kuhn and Kjell Johnson
Probability and Statistics for Data Science: Math + R + Data, Norman Matloff

Norman Matloff

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2020 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed on acid-free paper
International Standard Book Number-13: 978-1-138-39329-5 (Paperback)
International Standard Book Number-13: 978-0-367-26093-4 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data
Names: Matloff, Norman S., author.
Title: Probability and statistics for data science / Norman Matloff.
Description: Boca Raton : CRC Press, Taylor & Francis Group, 2019.
Identifiers: LCCN 2019008196 | ISBN 9781138393295 (pbk. : alk. paper)
Subjects: LCSH: Probabilities--Textbooks. | Mathematical statistics--Textbooks. | Probabilities--Data processing. | Mathematical statistics--Data processing.
Classification: LCC QA273 .M38495 2019 | DDC 519.5--dc23
LC record available at https://lccn.loc.gov/2019008196

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

Contents

About the Author
To the Instructor
To the Reader

I Fundamentals of Probability

1 Basic Probability Models
1.1 Example: Bus Ridership
1.2 A “Notebook” View: the Notion of a Repeatable Experiment
1.2.1 Theoretical Approaches
1.2.2 A More Intuitive Approach
1.3 Our Definitions
1.4 “Mailing Tubes”
1.5 Example: Bus Ridership Model (cont’d.)
1.6 Example: ALOHA Network
1.6.1 ALOHA Network Model Summary
1.6.2 ALOHA Network Computations
1.7 ALOHA in the Notebook Context
1.8 Example: A Simple Board Game
1.9 Bayes’ Rule
1.9.1 General Principle
1.9.2 Example: Document Classification
1.10 Random Graph Models
1.10.1 Example: Preferential Attachment Model
1.11 Combinatorics-Based Computation
1.11.1 Which Is More Likely in Five Cards, One King or Two Hearts?
1.11.2 Example: Random Groups of Students
1.11.3 Example: Lottery Tickets
1.11.4 Example: Gaps between Numbers
1.11.5 Multinomial Coefficients
1.11.6 Example: Probability of Getting Four Aces in a Bridge Hand
1.12 Exercises

2 Monte Carlo Simulation
2.1 Example: Rolling Dice
2.1.1 First Improvement
2.1.2 Second Improvement
2.1.3 Third Improvement
2.2 Example: Dice Problem
2.3 Use of runif() for Simulating Events
2.4 Example: Bus Ridership (cont’d.)
2.5 Example: Board Game (cont’d.)
2.6 Example: Broken Rod
2.7 How Long Should We Run the Simulation?
2.8 Computational Complements
2.8.1 More on the replicate() Function
2.9 Exercises

3 Discrete Random Variables: Expected Value
3.1 Random Variables
3.2 Discrete Random Variables
3.3 Independent Random Variables
3.4 Example: The Monty Hall Problem
3.5 Expected Value
3.5.1 Generality—Not Just for Discrete Random Variables
3.5.2 Misnomer
3.5.3 Definition and Notebook View
3.6 Properties of Expected Value
3.6.1 Computational Formula
3.6.2 Further Properties of Expected Value
3.7 Example: Bus Ridership
3.8 Example: Predicting Product Demand
3.9 Expected Values via Simulation
3.10 Casinos, Insurance Companies and “Sum Users,” Compared to Others
3.11 Mathematical Complements
3.11.1 Proof of Property E
3.12 Exercises

4 Discrete Random Variables: Variance
4.1 Variance
4.1.1 Definition
4.1.2 Central Importance of the Concept of Variance
4.1.3 Intuition Regarding the Size of Var(X)
4.1.3.1 Chebychev’s Inequality
4.1.3.2 The Coefficient of Variation
4.2 A Useful Fact
4.3 Covariance
4.4 Indicator Random Variables, and Their Means and Variances
4.4.1 Example: Return Time for Library Books, Version I
4.4.2 Example: Return Time for Library Books, Version II
4.4.3 Example: Indicator Variables in a Committee Problem
4.5 Skewness
4.6 Mathematical Complements
4.6.1 Proof of Chebychev’s Inequality
4.7 Exercises

5 Discrete Parametric Distribution Families
5.1 Distributions
5.1.1 Example: Toss Coin Until First Head
5.1.2 Example: Sum of Two Dice
5.1.3 Example: Watts-Strogatz Random Graph Model
5.1.3.1 The Model
5.2 Parametric Families of Distributions
5.3 The Case of Importance to Us: Parametric Families of pmfs
5.4 Distributions Based on Bernoulli Trials
5.4.1 The Geometric Family of Distributions
5.4.1.1 R Functions
5.4.1.2 Example: A Parking Space Problem
5.4.2 The Binomial Family of Distributions
5.4.2.1 R Functions
5.4.2.2 Example: Parking Space Model
5.4.3 The Negative Binomial Family of Distributions
5.4.3.1 R Functions
5.4.3.2 Example: Backup Batteries
5.5 Two Major Non-Bernoulli Models
5.5.1 The Poisson Family of Distributions
5.5.1.1 R Functions
5.5.1.2 Example: Broken Rod
5.5.2 The Power Law Family of Distributions
5.5.2.1 The Model
5.5.3 Fitting the Poisson and Power Law Models to Data
5.5.3.1 Poisson Model
5.5.3.2 Straight-Line Graphical Test for the Power Law
5.5.3.3 Example: DNC E-mail Data
5.6 Further Examples
5.6.1 Example: The Bus Ridership Problem
5.6.2 Example: Analysis of Social Networks
5.7 Computational Complements
5.7.1 Graphics and Visualization in R
5.8 Exercises

6 Continuous Probability Models
6.1 A Random Dart
6.2 Individual Values Now Have Probability Zero
6.3 But Now We Have a Problem
