
The Data Science Workshop PDF

823 Pages·2020·24.161 MB·English

Preview The Data Science Workshop

The Data Science Workshop
Second Edition

Learn how you can build machine learning models and create your own real-world data science projects

Anthony So, Thomas V. Joseph, Robert Thas John, Andrew Worsley, and Dr. Samuel Asare

Copyright © 2020 Packt Publishing

All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Authors: Anthony So, Thomas V. Joseph, Robert Thas John, Andrew Worsley, and Dr. Samuel Asare
Reviewers: Tianxiang Liu, Tiffany Ford, and Pritesh Tiwari
Managing Editor: Snehal Tambe
Acquisitions Editor: Sarah Lawton
Production Editor: Salma Patel
Editorial Board: Megan Carlisle, Samuel Christa, Mahesh Dhyani, Heather Gopsill, Manasa Kumar, Alex Mazonowicz, Monesh Mirpuri, Bridget Neale, Dominic Pereira, Shiny Poojary, Abhishek Rane, Brendan Rodrigues, Erol Staveley, Ankita Thakur, Nitesh Thakur, and Jonathan Wray

First published: January 2020
Second edition: August 2020
Production reference: 1280820
ISBN: 978-1-80056-692-7

Published by Packt Publishing Ltd.
Livery Place, 35 Livery Street
Birmingham B3 2PB, UK

Table of Contents

Preface

Chapter 1: Introduction to Data Science in Python
  Introduction
  Application of Data Science
  What Is Machine Learning?
    Supervised Learning
    Unsupervised Learning
    Reinforcement Learning
  Overview of Python
  Types of Variable
    Numeric Variables
    Text Variables
    Python List
    Python Dictionary
  Exercise 1.01: Creating a Dictionary That Will Contain Machine Learning Algorithms
  Python for Data Science
  The pandas Package
    DataFrame and Series
    CSV Files
    Excel Spreadsheets
    JSON
  Exercise 1.02: Loading Data of Different Formats into a pandas DataFrame
  Scikit-Learn
    What Is a Model?
    Model Hyperparameters
    The sklearn API
  Exercise 1.03: Predicting Breast Cancer from a Dataset Using sklearn
  Activity 1.01: Train a Spam Detector Algorithm
  Summary

Chapter 2: Regression
  Introduction
  Simple Linear Regression
  The Method of Least Squares
  Multiple Linear Regression
  Estimating the Regression Coefficients (β₀, β₁, β₂ and β₃)
  Logarithmic Transformations of Variables
  Correlation Matrices
  Conducting Regression Analysis Using Python
  Exercise 2.01: Loading and Preparing the Data for Analysis
  The Correlation Coefficient
  Exercise 2.02: Graphical Investigation of Linear Relationships Using Python
  Exercise 2.03: Examining a Possible Log-Linear Relationship Using Python
  The Statsmodels formula API
  Exercise 2.04: Fitting a Simple Linear Regression Model Using the Statsmodels formula API
  Analyzing the Model Summary
  The Model Formula Language
  Intercept Handling
  Activity 2.01: Fitting a Log-Linear Model Using the Statsmodels Formula API
  Multiple Regression Analysis
  Exercise 2.05: Fitting a Multiple Linear Regression Model Using the Statsmodels Formula API
  Assumptions of Regression Analysis
  Activity 2.02: Fitting a Multiple Log-Linear Regression Model
  Explaining the Results of Regression Analysis
  Regression Analysis Checks and Balances
  The F-test
  The t-test
  Summary

Chapter 3: Binary Classification
  Introduction
  Understanding the Business Context
  Business Discovery
  Exercise 3.01: Loading and Exploring the Data from the Dataset
  Testing Business Hypotheses Using Exploratory Data Analysis
  Visualization for Exploratory Data Analysis
  Exercise 3.02: Business Hypothesis Testing for Age versus Propensity for a Term Loan
  Intuitions from the Exploratory Analysis
  Activity 3.01: Business Hypothesis Testing to Find Employment Status versus Propensity for Term Deposits
  Feature Engineering
  Business-Driven Feature Engineering
  Exercise 3.03: Feature Engineering – Exploration of Individual Features
  Exercise 3.04: Feature Engineering – Creating New Features from Existing Ones
  Data-Driven Feature Engineering
  A Quick Peek at Data Types and a Descriptive Summary
  Correlation Matrix and Visualization
  Exercise 3.05: Finding the Correlation in Data to Generate a Correlation Plot Using Bank Data
  Skewness of Data
  Histograms
  Density Plots
  Other Feature Engineering Methods
  Summarizing Feature Engineering
  Building a Binary Classification Model Using the Logistic Regression Function
  Logistic Regression Demystified
  Metrics for Evaluating Model Performance
  Confusion Matrix
  Accuracy
  Classification Report
  Data Preprocessing
  Exercise 3.06: A Logistic Regression Model for Predicting the Propensity of Term Deposit Purchases in a Bank
  Activity 3.02: Model Iteration 2 – Logistic Regression Model with Feature Engineered Variables
  Next Steps
  Summary

Chapter 4: Multiclass Classification with RandomForest
  Introduction
  Training a Random Forest Classifier
  Evaluating the Model's Performance
  Exercise 4.01: Building a Model for Classifying Animal Type and Assessing Its Performance
  Number of Trees Estimator
  Exercise 4.02: Tuning n_estimators to Reduce Overfitting
  Maximum Depth
  Exercise 4.03: Tuning max_depth to Reduce Overfitting
  Minimum Sample in Leaf
  Exercise 4.04: Tuning min_samples_leaf
  Maximum Features
  Exercise 4.05: Tuning max_features
  Activity 4.01: Train a Random Forest Classifier on the ISOLET Dataset
  Summary

Chapter 5: Performing Your First Cluster Analysis
  Introduction
  Clustering with k-means
  Exercise 5.01: Performing Your First Clustering Analysis on the ATO Dataset
  Interpreting k-means Results
  Exercise 5.02: Clustering Australian Postcodes by Business Income and Expenses
  Choosing the Number of Clusters
  Exercise 5.03: Finding the Optimal Number of Clusters
  Initializing Clusters
  Exercise 5.04: Using Different Initialization Parameters to Achieve a Suitable Outcome
  Calculating the Distance to the Centroid
  Exercise 5.05: Finding the Closest Centroids in Our Dataset
  Standardizing Data
  Exercise 5.06: Standardizing the Data from Our Dataset
  Activity 5.01: Perform Customer Segmentation Analysis in a Bank Using k-means
  Summary

Chapter 6: How to Assess Performance
  Introduction
  Splitting Data
  Exercise 6.01: Importing and Splitting Data
  Assessing Model Performance for Regression Models
  Data Structures – Vectors and Matrices
    Scalars
    Vectors
    Matrices
  R² Score
  Exercise 6.02: Computing the R² Score of a Linear Regression Model
  Mean Absolute Error
  Exercise 6.03: Computing the MAE of a Model
  Exercise 6.04: Computing the Mean Absolute Error of a Second Model
    Other Evaluation Metrics
  Assessing Model Performance for Classification Models
  Exercise 6.05: Creating a Classification Model for Computing Evaluation Metrics
  The Confusion Matrix
  Exercise 6.06: Generating a Confusion Matrix for the Classification Model
  More on the Confusion Matrix
  Precision
  Exercise 6.07: Computing Precision for the Classification Model
  Recall
  Exercise 6.08: Computing Recall for the Classification Model
  F1 Score
  Exercise 6.09: Computing the F1 Score for the Classification Model
  Accuracy
  Exercise 6.10: Computing Model Accuracy for the Classification Model
  Logarithmic Loss
  Exercise 6.11: Computing the Log Loss for the Classification Model
  Receiver Operating Characteristic Curve
  Exercise 6.12: Computing and Plotting ROC Curve for a Binary Classification Problem
  Area Under the ROC Curve
  Exercise 6.13: Computing the ROC AUC for the Caesarian Dataset
