Information Technology / Database

BIG DATA ANALYTICS
A Practical Guide for Managers

Kim H. Pries
Robert Dunnigan

With this book, managers and decision makers are given the tools to make more informed decisions about big data purchasing initiatives. Big Data Analytics: A Practical Guide for Managers not only supplies descriptions of common tools, but also surveys the various products and vendors that supply the big data market.

Comparing and contrasting the different types of analysis commonly conducted with big data, this accessible reference presents clear-cut explanations of the general workings of big data tools. Instead of spending time on HOW to install specific packages, it focuses on the reasons WHY readers would install a given package.

The book provides authoritative guidance on a range of tools, including open source and proprietary systems. It details the strengths and weaknesses of incorporating big data analysis into decision-making and explains how to leverage the strengths while mitigating the weaknesses.

• Describes the benefits of distributed computing in simple terms
• Includes substantial vendor/tool material, especially for open source decisions
• Covers prominent software packages, including Hadoop and Oracle Endeca
• Examines GIS and machine learning applications
• Considers privacy and surveillance issues

The book further explores basic statistical concepts that, when misapplied, can be the source of errors. Time and again, big data is treated as an oracle that discovers results nobody would have imagined. While big data can serve this valuable function, all too often these results are incorrect yet are still reported unquestioningly. The probability of having erroneous results increases as a larger number of variables are compared unless preventative measures are taken.
The approach taken by the authors is to explain these concepts so managers can ask better questions of their analysts and vendors about the appropriateness of the methods used to arrive at a conclusion. Because the world of science and medicine has been grappling with similar issues in the publication of studies, the authors draw on their efforts and apply them to big data.

K23000
ISBN: 978-1-4822-3451-0

an informa business
www.crcpress.com
www.auerbach-publications.com

6000 Broken Sound Parkway, NW, Suite 300, Boca Raton, FL 33487
711 Third Avenue, New York, NY 10017
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN, UK

BIG DATA ANALYTICS
A Practical Guide for Managers

Kim H. Pries
Robert Dunnigan

MATLAB® and Simulink® are trademarks of The MathWorks, Inc. and are used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This book’s use or discussion of MATLAB® and Simulink® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® and Simulink® software.

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2015 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Version Date: 20141024

International Standard Book Number-13: 978-1-4822-3452-7 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use.
The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com

Contents

Preface .... xiii
Acknowledgments .... xv
Authors .... xvii

Chapter 1 Introduction .... 1
  So What Is Big Data? .... 1
  Growing Interest in Decision Making .... 4
  What This Book Addresses .... 6
  The Conversation about Big Data .... 7
  Technological Change as a Driver of Big Data .... 12
  The Central Question: So What? .... 13
  Our Goals as Authors .... 18
  References .... 19

Chapter 2 The Mother of Invention’s Triplets: Moore’s Law, the Proliferation of Data, and Data Storage Technology .... 21
  Moore’s Law .... 22
  Parallel Computing, between and within Machines .... 25
  Quantum Computing .... 31
  Recap of Growth in Computing Power .... 31
  Storage, Storage Everywhere .... 32
  Grist for the Mill: Data Used and Unused .... 39
  Agriculture .... 40
  Automotive .... 42
  Marketing in the Physical World .... 45
  Online Marketing .... 49
  Asset Reliability and Efficiency .... 54
  Process Tracking and Automation .... 56
  Toward a Definition of Big Data .... 58
  Putting Big Data in Context .... 62
  Key Concepts of Big Data and Their Consequences .... 64
  Summary .... 67
  References .... 67

Chapter 3 Hadoop .... 73
  Power through Distribution .... 75
  Cost Effectiveness of Hadoop .... 79
  Not Every Problem Is a Nail .... 81
  Some Technical Aspects .... 81
  Troubleshooting Hadoop .... 83
  Running Hadoop .... 84
  Hadoop File System .... 84
  MapReduce .... 86
  Pig and Hive .... 90
  Installation .... 91
  Current Hadoop Ecosystem .... 91
  Hadoop Vendors .... 94
  Cloudera .... 94
  Amazon Web Services (AWS) .... 95
  Hortonworks .... 97
  IBM .... 97
  Intel .... 99
  MapR .... 100
  Microsoft .... 100
  Running Pig Latin Using Powershell .... 101
  Pivotal .... 103
  References .... 104

Chapter 4 HBase and Other Big Data Databases .... 105
  Evolution from Flat File to the Three V’s .... 105
  Flat File .... 106
  Hierarchical Database .... 110
  Network Database .... 110
  Relational Database .... 111
  Object-Oriented Databases .... 114
  Relational-Object Databases .... 114
  Transition to Big Data Databases .... 115
  What Is Different about HBase? .... 116
  What Is Bigtable? .... 119
  What Is MapReduce? .... 120
  What Are the Various Modalities for Big Data Databases? .... 122
  Graph Databases .... 123
  How Does a Graph Database Work? .... 123
  What Is the Performance of a Graph Database? .... 124
  Document Databases .... 124
  Key-Value Databases .... 131
  Column-Oriented Databases .... 138
  HBase .... 138
  Apache Accumulo .... 142
  References .... 149

Chapter 5 Machine Learning .... 151
  Machine Learning Basics .... 151
  Classifying with Nearest Neighbors .... 153
  Naive Bayes .... 154
  Support Vector Machines .... 155
  Improving Classification with Adaptive Boosting .... 156
  Regression .... 157
  Logistic Regression .... 158
  Tree-Based Regression .... 160
  K-Means Clustering .... 161
  Apriori Algorithm .... 162
  Frequent Pattern-Growth .... 164
  Principal Component Analysis (PCA) .... 165
  Singular Value Decomposition .... 166
  Neural Networks .... 168
  Big Data and MapReduce .... 173
  Data Exploration .... 175
  Spam Filtering .... 176
  Ranking .... 177
  Predictive Regression .... 177
  Text Regression .... 178
  Multidimensional Scaling .... 179
  Social Graphing .... 182
  References .... 191

Chapter 6 Statistics .... 193
  Statistics, Statistics Everywhere .... 193
  Digging into the Data .... 195
  Standard Deviation: The Standard Measure of Dispersion .... 200
  The Power of Shapes: Distributions .... 201
  Distributions: Gaussian Curve .... 205
  Distributions: Why Be Normal? .... 214
  Distributions: The Long Arm of the Power Law .... 220
  The Upshot? Statistics Are Not Bloodless .... 227
  Fooling Ourselves: Seeing What We Want to See in the Data .... 228
  We Can Learn Much from an Octopus .... 232
  Hypothesis Testing: Seeking a Verdict .... 234
  Two-Tailed Testing .... 240
  Hypothesis Testing: A Broad Field .... 241
  Moving On to Specific Hypothesis Tests .... 242
  Regression and Correlation .... 247
  p Value in Hypothesis Testing: A Successful Gatekeeper? .... 254
  Specious Correlations and Overfitting the Data .... 268
  A Sample of Common Statistical Software Packages .... 273
  Minitab .... 273
  SPSS .... 274
  R .... 275
  SAS .... 277
  Big Data Analytics .... 277
  Hadoop Integration .... 278
  Angoss .... 278
  Statistica .... 279
  Capabilities .... 279
  Summary .... 280
  References .... 282

Chapter 7 Google .... 285
  Big Data Giants .... 285
  Google .... 286
  Go .... 292
  Android .... 293
  Google Product Offerings .... 294
  Google Analytics .... 299