Introducing Data Science: Big Data, Machine Learning and More, Using Python tools PDF

322 Pages·2016·11.37 MB·English

Preview Introducing Data Science: Big Data, Machine Learning and More, Using Python tools

Introducing Data Science
Big Data, Machine Learning, and More, Using Python Tools
Davy Cielen, Arno D. B. Meysman, Mohamed Ali
Manning, Shelter Island

For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: [email protected]

©2016 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Development editor: Dan Maharry
Technical development editors: Michael Roberts, Jonathan Thoms
Copyeditor: Katie Petito
Proofreader: Alyson Brener
Technical proofreader: Ravishankar Rajagopalan
Typesetter: Dennis Dalinnik
Cover designer: Marija Tudor

ISBN: 9781633430037
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – EBM – 21 20 19 18 17 16

brief contents

1 Data science in a big data world
2 The data science process
3 Machine learning
4 Handling large data on a single computer
5 First steps in big data
6 Join the NoSQL movement
7 The rise of graph databases
8 Text mining and text analytics
9 Data visualization to the end user

contents

preface
acknowledgments
about this book
about the authors
about the cover illustration

1 Data science in a big data world
  1.1 Benefits and uses of data science and big data
  1.2 Facets of data
    Structured data
    Unstructured data
    Natural language
    Machine-generated data
    Graph-based or network data
    Audio, image, and video
    Streaming data
  1.3 The data science process
    Setting the research goal
    Retrieving data
    Data preparation
    Data exploration
    Data modeling or model building
    Presentation and automation
  1.4 The big data ecosystem and data science
    Distributed file systems
    Distributed programming framework
    Data integration framework
    Machine learning frameworks
    NoSQL databases
    Scheduling tools
    Benchmarking tools
    System deployment
    Service programming
    Security
  1.5 An introductory working example of Hadoop
  1.6 Summary

2 The data science process
  2.1 Overview of the data science process
    Don’t be a slave to the process
  2.2 Step 1: Defining research goals and creating a project charter
    Spend time understanding the goals and context of your research
    Create a project charter
  2.3 Step 2: Retrieving data
    Start with data stored within the company
    Don’t be afraid to shop around
    Do data quality checks now to prevent problems later
  2.4 Step 3: Cleansing, integrating, and transforming data
    Cleansing data
    Correct errors as early as possible
    Combining data from different data sources
    Transforming data
  2.5 Step 4: Exploratory data analysis
  2.6 Step 5: Build the models
    Model and variable selection
    Model execution
    Model diagnostics and model comparison
  2.7 Step 6: Presenting findings and building applications on top of them
  2.8 Summary

3 Machine learning
  3.1 What is machine learning and why should you care about it?
    Applications for machine learning in data science
    Where machine learning is used in the data science process
    Python tools used in machine learning
  3.2 The modeling process
    Engineering features and selecting a model
    Training your model
    Validating a model
    Predicting new observations
  3.3 Types of machine learning
    Supervised learning
    Unsupervised learning
  3.4 Semi-supervised learning
  3.5 Summary

4 Handling large data on a single computer
  4.1 The problems you face when handling large data
  4.2 General techniques for handling large volumes of data
    Choosing the right algorithm
    Choosing the right data structure
    Selecting the right tools
  4.3 General programming tips for dealing with large data sets
    Don’t reinvent the wheel
    Get the most out of your hardware
    Reduce your computing needs
  4.4 Case study 1: Predicting malicious URLs
    Step 1: Defining the research goal
    Step 2: Acquiring the URL data
    Step 4: Data exploration
    Step 5: Model building
  4.5 Case study 2: Building a recommender system inside a database
    Tools and techniques needed
    Step 1: Research question
    Step 3: Data preparation
    Step 5: Model building
    Step 6: Presentation and automation
  4.6 Summary

5 First steps in big data
  5.1 Distributing data storage and processing with frameworks
    Hadoop: a framework for storing and processing large data sets
    Spark: replacing MapReduce for better performance

Description:
Summary: Introducing Data Science teaches you how to accomplish the fundamental tasks that occupy data scientists. Using the Python language and common Python libraries, you'll experience firsthand the challenges of dealing with data at scale and gain a solid foundation in data science. Purchase of t