
Data Wrangling with R PDF

237 Pages·2016·7.044 MB·English

Preview Data Wrangling with R

Use R!

Bradley C. Boehmke
Data Wrangling with R

Use R!
Series Editors: Robert Gentleman, Kurt Hornik, Giovanni Parmigiani
More information about this series at http://www.springer.com/series/6991

Wickham: ggplot2
Moore: Applied Survival Analysis Using R
Luke: A User’s Guide to Network Analysis in R
Monogan: Political Analysis Using R
Cano/M. Moguerza/Prieto Corcoba: Quality Control with R
Schwarzer/Carpenter/Rücker: Meta-Analysis with R
Gondro: Primer to Analysis of Genomic Data Using R
Chapman/Feit: R for Marketing Research and Analytics
Willekens: Multistate Analysis of Life Histories with R
Cortez: Modern Optimization with R
Kolaczyk/Csárdi: Statistical Analysis of Network Data with R
Swenson/Nathan: Functional and Phylogenetic Ecology in R
Nolan/Temple Lang: XML and Web Technologies for Data Sciences with R
Nagarajan/Scutari/Lèbre: Bayesian Networks in R
van den Boogaart/Tolosana-Delgado: Analyzing Compositional Data with R
Bivand/Pebesma/Gómez-Rubio: Applied Spatial Data Analysis with R (2nd ed. 2013)
Eddelbuettel: Seamless R and C++ Integration with Rcpp
Knoblauch/Maloney: Modeling Psychophysical Data in R
Lin/Shkedy/Yekutieli/Amaratunga/Bijnens: Modeling Dose-Response Microarray Data in Early Drug Development Experiments Using R
Cano/M. Moguerza/Redchuk: Six Sigma with R
Soetaert/Cash/Mazzia: Solving Differential Equations in R

Bradley C. Boehmke, Ph.D.
Air Force Institute of Technology
Dayton, OH, USA

ISSN 2197-5736 ISSN 2197-5744 (electronic)
Use R!
ISBN 978-3-319-45598-3 ISBN 978-3-319-45599-0 (eBook)
DOI 10.1007/978-3-319-45599-0
Library of Congress Control Number: 2016953509
© Springer International Publishing Switzerland 2016

This work is subject to copyright.
All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Welcome to Data Wrangling with R! In this book, I will help you learn the essentials of preprocessing data leveraging the R programming language to easily and quickly turn noisy data into usable pieces of information. Data wrangling, which is also commonly referred to as data munging, transformation, manipulation, janitor work, etc., can be a painstakingly laborious process. In fact, it has been stated that up to 80% of data analysis is spent on the process of cleaning and preparing data (cf. Wickham 2014; Dasu and Johnson 2003).
However, being a prerequisite to the rest of the data analysis workflow (visualization, modeling, reporting), it’s essential that you become fluent and efficient in data wrangling techniques.

This book will guide you through the data wrangling process along with giving you a solid foundation of the basics of working with data in R. My goal is to teach you how to easily wrangle your data, so you can spend more time focused on understanding the content of your data via visualization, modeling, and reporting your results. By the time you finish reading this book, you will have learned:
• How to work with the different types of data such as numerics, characters, regular expressions, factors, and dates.
• The difference between the various data structures and how to create, add additional components to, and how to subset each data structure.
• How to acquire and parse data from locations you may not have been able to access before such as web scraping or leveraging APIs.
• How to develop your own functions and use loop control structures to reduce code redundancy.
• How to use pipe operators to simplify your code and make it more readable.
• How to reshape the layout of your data, and manipulate, summarize, and join data sets.

Not only will you learn many base R functions, you’ll also learn how to use some of the latest data wrangling packages such as tidyr, dplyr, httr, stringr, lubridate, readr, rvest, magrittr, xlsx, readxl and others. In essence, you will have the data wrangling toolbox required for modern day data analysis.

Who This Book Is for

This book is meant to establish the baseline R vocabulary and knowledge for the primary data wrangling processes. This captures a wide range of programming activities which covers the full spectrum from understanding basic data objects in R to writing your own functions, applying loops, and web scraping. As a result, this book can be beneficial to all levels of R programmers.
Beginner R programmers will gain a basic understanding of the functionality of R along with learning how to work with data using R. Intermediate and advanced R programmers will likely find the early chapters reiterating established knowledge; however, these programmers will benefit from the mid and latter chapters by learning newer and more efficient data wrangling techniques.

What You Need for This Book

Obviously to gain and retain knowledge from this book, it is highly recommended that you follow along and practice the code examples yourself. Furthermore, this book assumes that you will actually be performing data wrangling in R; therefore, it is assumed that you have or plan to have R installed on your computer. You will find the latest version of R for Linux, Mac OS, and Windows at https://cran.r-project.org. It is also recommended that you use an integrated development environment (IDE) as it will simplify and organize your coding environment greatly. There are several to choose from; however, I highly recommend the RStudio IDE which you can download at https://www.rstudio.com.

Reader Feedback

Reader comments are greatly appreciated. Please send any feedback regarding typos, mistakes, confusing statements, or opportunities for improvement to wran- [email protected].

Bibliography

Dasu, T., & Johnson, T. (2003). Exploratory Data Mining and Data Cleaning (Vol. 479). John Wiley & Sons.
Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(i10).

Contents

Part I Introduction
1 The Role of Data Wrangling
2 Introduction to R
2.1 Open Source
2.2 Flexibility
2.3 Community
3 The Basics
3.1 Installing R and RStudio
3.2 Understanding the Console
3.2.1 Script Editor
3.2.2 Workspace Environment
3.2.3 Console
3.2.4 Misc. Displays
3.2.5 Workspace Options and Shortcuts
3.3 Getting Help
3.3.1 General Help
3.3.2 Getting Help on Functions
3.3.3 Getting Help from the Web
3.4 Working with Packages
3.4.1 Installing Packages
3.4.2 Loading Packages
3.4.3 Getting Help on Packages
3.4.4 Useful Packages
3.5 Assignment and Evaluation
3.6 R as a Calculator
3.6.1 Vectorization
3.7 Styling Guide
3.7.1 Notation and Naming
3.7.2 Organization
3.7.3 Syntax

Part II Working with Different Types of Data in R
4 Dealing with Numbers
4.1 Integer vs. Double
4.1.1 Creating Integer and Double Vectors
4.1.2 Converting Between Integer and Double Values
4.2 Generating Sequence of Non-random Numbers
4.2.1 Specifying Numbers Within a Sequence
4.2.2 Generating Regular Sequences
4.3 Generating Sequence of Random Numbers
4.3.1 Uniform Numbers
4.3.2 Normal Distribution Numbers
4.3.3 Binomial Distribution Numbers
4.3.4 Poisson Distribution Numbers
4.3.5 Exponential Distribution Numbers
4.3.6 Gamma Distribution Numbers
4.4 Setting the Seed for Reproducible Random Numbers
4.5 Comparing Numeric Values
4.5.1 Comparison Operators
4.5.2 Exact Equality
4.5.3 Floating Point Comparison
4.6 Rounding Numbers
5 Dealing with Character Strings
5.1 Character String Basics
5.1.1 Creating Strings
5.1.2 Converting to Strings
5.1.3 Printing Strings
5.1.4 Counting String Elements and Characters
5.2 String Manipulation with Base R
5.2.1 Case Conversion
5.2.2 Simple Character Replacement
5.2.3 String Abbreviations
5.2.4 Extract/Replace Substrings
5.3 String Manipulation with stringr
5.3.1 Basic Operations
5.3.2 Duplicate Characters Within a String
5.3.3 Remove Leading and Trailing Whitespace
5.3.4 Pad a String with Whitespace
5.4 Set Operations for Character Strings
5.4.1 Set Union
5.4.2 Set Intersection
5.4.3 Identifying Different Elements
5.4.4 Testing for Element Equality
5.4.5 Testing for Exact Equality
5.4.6 Identifying If Elements Are Contained in a String
5.4.7 Sorting a String
6 Dealing with Regular Expressions
6.1 Regex Syntax
6.1.1 Metacharacters
6.1.2 Sequences
6.1.3 Character Classes
6.1.4 POSIX Character Classes
6.1.5 Quantifiers
6.2 Regex Functions
6.2.1 Main Regex Functions in R
6.2.2 Regex Functions in stringr
6.3 Additional Resources
7 Dealing with Factors
7.1 Creating, Converting and Inspecting Factors
7.2 Ordering Levels
7.3 Revalue Levels
7.4 Dropping Levels
8 Dealing with Dates
8.1 Getting Current Date and Time
8.2 Converting Strings to Dates
8.2.1 Convert Strings to Dates
8.2.2 Create Dates by Merging Data
8.3 Extract and Manipulate Parts of Dates
8.4 Creating Date Sequences
8.5 Calculations with Dates
8.6 Dealing with Time Zones and Daylight Savings
8.7 Additional Resources

Part III Managing Data Structures in R
9 Data Structure Basics
9.1 Identifying the Structure
9.2 Attributes
