
Machine Learning Using R PDF

580 Pages·2017·11.47 MB·English

Preview Machine Learning Using R

Machine Learning Using R
A Comprehensive Guide to Machine Learning

Karthik Ramasubramanian, New Delhi, Delhi, India
Abhishek Singh, New Delhi, Delhi, India

ISBN-13 (pbk): 978-1-4842-2333-8
ISBN-13 (electronic): 978-1-4842-2334-5
DOI 10.1007/978-1-4842-2334-5
Library of Congress Control Number: 2016961515
Copyright © 2017 Karthik Ramasubramanian and Abhishek Singh

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Managing Director: Welmoed Spahr
Acquisitions Editor: Celestin Suresh John
Development Editor: James Markham
Technical Reviewer: Jojo Moolayil
Editorial Board: Steve Anglin, Pramila Balen, Laura Berendson, Aaron Black, Louise Corrigan, Jonathan Gennick, Robert Hutchinson, Celestin Suresh John, Nikhil Karkal, James Markham, Susan McDermott, Matthew Moodie, Natalie Pao, Gwenan Spearing
Coordinating Editor: Sanchita Mandal
Copy Editor: Lori Jacobs
Compositor: SPi Global
Indexer: SPi Global
Cover Image: Freepik

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail [email protected], or visit www.springer.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

For information on translations, please e-mail [email protected], or visit www.apress.com.

Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales–eBook Licensing web page at www.apress.com/bulk-sales.

Any source code or other supplementary materials referenced by the author in this text is available to readers at www.apress.com. For detailed information about how to locate your book’s source code, go to www.apress.com/source-code/.

Printed on acid-free paper

To our parents for being the guiding light and a strong pillar of support. And to our decade-long friendship.
Contents at a Glance

About the Authors
About the Technical Reviewer
Acknowledgments
Chapter 1: Introduction to Machine Learning and R
Chapter 2: Data Preparation and Exploration
Chapter 3: Sampling and Resampling Techniques
Chapter 4: Data Visualization in R
Chapter 5: Feature Engineering
Chapter 6: Machine Learning Theory and Practices
Chapter 7: Machine Learning Model Evaluation
Chapter 8: Model Performance Improvement
Chapter 9: Scalable Machine Learning and Related Technologies
Index

Contents

About the Authors
About the Technical Reviewer
Acknowledgments

Chapter 1: Introduction to Machine Learning and R
1.1 Understanding the Evolution
    1.1.1 Statistical Learning
    1.1.2 Machine Learning (ML)
    1.1.3 Artificial Intelligence (AI)
    1.1.4 Data Mining
    1.1.5 Data Science
1.2 Probability and Statistics
    1.2.1 Counting and Probability Definition
    1.2.2 Events and Relationships
    1.2.3 Randomness, Probability, and Distributions
    1.2.4 Confidence Interval and Hypothesis Testing
1.3 Getting Started with R
    1.3.1 Basic Building Blocks
    1.3.2 Data Structures in R
    1.3.3 Subsetting
    1.3.4 Functions and Apply Family
1.4 Machine Learning Process Flow
    1.4.1 Plan
    1.4.2 Explore
    1.4.3 Build
    1.4.4 Evaluate
1.5 Other Technologies
1.6 Summary
1.7 References

Chapter 2: Data Preparation and Exploration
2.1 Planning the Gathering of Data
    2.1.1 Variables Types
    2.1.2 Data Formats
    2.1.3 Data Sources
2.2 Initial Data Analysis (IDA)
    2.2.1 Discerning a First Look
    2.2.2 Organizing Multiple Sources of Data into One
    2.2.3 Cleaning the Data
    2.2.4 Supplementing with More Information
    2.2.5 Reshaping
2.3 Exploratory Data Analysis
    2.3.1 Summary Statistics
    2.3.2 Moment
2.4 Case Study: Credit Card Fraud
    2.4.1 Data Import
    2.4.2 Data Transformation
    2.4.3 Data Exploration
2.5 Summary
2.6 References

Chapter 3: Sampling and Resampling Techniques
3.1 Introduction to Sampling
3.2 Sampling Terminology
    3.2.1 Sample
    3.2.2 Sampling Distribution
    3.2.3 Population Mean and Variance
    3.2.4 Sample Mean and Variance
    3.2.5 Pooled Mean and Variance
    3.2.6 Sample Point
    3.2.7 Sampling Error
    3.2.8 Sampling Fraction
    3.2.9 Sampling Bias
    3.2.10 Sampling Without Replacement (SWOR)
    3.2.11 Sampling with Replacement (SWR)
3.3 Credit Card Fraud: Population Statistics
    3.3.1 Data Description
    3.3.2 Population Mean
    3.3.3 Population Variance
    3.3.4 Pooled Mean and Variance
3.4 Business Implications of Sampling
    3.4.1 Features of Sampling
    3.4.2 Shortcomings of Sampling
3.5 Probability and Non-Probability Sampling
    3.5.1 Types of Non-Probability Sampling
3.6 Statistical Theory on Sampling Distributions
    3.6.1 Law of Large Numbers: LLN
    3.6.2 Central Limit Theorem
3.7 Probability Sampling Techniques
    3.7.1 Population Statistics
    3.7.2 Simple Random Sampling
    3.7.3 Systematic Random Sampling
    3.7.4 Stratified Random Sampling
    3.7.5 Cluster Sampling
    3.7.6 Bootstrap Sampling
3.8 Monte Carlo Method: Acceptance-Rejection Method
3.9 A Qualitative Account of Computational Savings by Sampling
3.10 Summary

Chapter 4: Data Visualization in R
4.1 Introduction to the ggplot2 Package
4.2 World Development Indicators
4.3 Line Chart
4.4 Stacked Column Charts
4.5 Scatterplots
4.6 Boxplots
4.7 Histograms and Density Plots
4.8 Pie Charts
4.9 Correlation Plots
4.10 HeatMaps
4.11 Bubble Charts
4.12 Waterfall Charts
4.13 Dendogram
4.14 Wordclouds
4.15 Sankey Plots
4.16 Time Series Graphs
4.17 Cohort Diagrams
4.18 Spatial Maps
4.19 Summary
4.20 References

Chapter 5: Feature Engineering
5.1 Introduction to Feature Engineering
    5.1.1 Filter Methods
    5.1.2 Wrapper Methods
    5.1.3 Embedded Methods
5.2 Understanding the Working Data
    5.2.1 Data Summary
    5.2.2 Properties of Dependent Variable
    5.2.3 Features Availability: Continuous or Categorical
    5.2.4 Setting Up Data Assumptions
5.3 Feature Ranking
5.4 Variable Subset Selection
    5.4.1 Filter Method
    5.4.2 Wrapper Methods
    5.4.3 Embedded Methods
5.5 Dimensionality Reduction
5.6 Feature Engineering Checklist
5.7 Summary
5.8 References

Chapter 6: Machine Learning Theory and Practices
6.1 Machine Learning Types
    6.1.1 Supervised Learning
    6.1.2 Unsupervised Learning

Description:
This book is inspired by the Machine Learning Model Building Process Flow, which gives the reader the ability to understand an ML algorithm and apply the entire process of building an ML model from the raw data. This new paradigm of teaching Machine Learning will bring about a radical change in perc