Big Data for Chimps
Philip Kromer and Russell Jurney

Big Data for Chimps
by Philip Kromer and Russell Jurney

Copyright © 2016 Philip Kromer and Russell Jurney. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Acquisitions Editor: Mike Loukides
Editors: Meghan Blanchette and Amy Jollymore
Production Editor: Matthew Hacker
Copyeditor: Jasmine Kwityn
Proofreader: Rachel Monaghan
Indexer: Wendy Catalano
Interior Designer: David Futato
Cover Designer: Ellie Volckhausen
Illustrator: Rebecca Demarest

October 2015: First Edition

Revision History for the First Edition
2015-09-25: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491923948 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Big Data for Chimps, the cover image of a chimpanzee, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-92394-8
[LSI]

Preface

Big Data for Chimps presents a practical, actionable view of big data. This view is centered on tested best practices and gives readers street-fighting smarts with Hadoop.

Readers will come away with a useful, conceptual idea of big data. Insight is data in context. The key to understanding big data is scalability: infinite amounts of data can rest upon distinct pivot points. We will teach you how to manipulate data about these pivot points.

Finally, the book contains examples with real data and real problems that bring the concepts and their business applications to life.

What This Book Covers

Big Data for Chimps shows you how to solve important problems in large-scale data processing using simple, fun, and elegant tools.

Finding patterns in massive event streams is an important, hard problem. Most of the time, there aren’t earthquakes, but the patterns that will let you predict one in advance lie within the data from those quiet periods. How do you compare the trillions of subsequences in billions of events, each to each other, to find the very few that matter? Once you have those patterns, how do you react to them in real time?

We’ve chosen case studies that anyone can understand and that are general enough to apply to whatever problems you’re looking to solve.
Our goal is to provide you with the following:

• The ability to think at scale, equipping you with a deep understanding of how to break a problem into efficient data transformations, and of how data must flow through the cluster to effect those transformations
• Detailed example programs applying Hadoop to interesting problems in context
• Advice and best practices for efficient software development

All of the examples use real data, and describe patterns found in many problem domains, as you:

• Create statistical summaries
• Identify patterns and groups in the data
• Search, filter, and herd records in bulk

The emphasis on simplicity and fun should make this book especially appealing to beginners, but this is not an approach you’ll outgrow. We’ve found it’s the most powerful and valuable approach for creative analytics. One of our maxims is “robots are cheap, humans are important”: write readable, scalable code now and find out later whether you want a smaller cluster. The code you see is adapted from programs we write at Infochimps and Data Syndrome to solve enterprise-scale business problems, and these simple high-level transformations meet our needs.

Many of the chapters include exercises. If you’re a beginning user, we highly recommend you work through at least one exercise from each chapter. Deep learning will come less from having the book in front of you as you read it than from having the book next to you while you write code inspired by it. There are sample solutions and result datasets on the book’s website.

Who This Book Is For

We’d like you to be familiar with at least one programming language, but it doesn’t have to be Python or Pig. Familiarity with SQL will help a bit, but isn’t essential. Some exposure to working with data in a business intelligence or analysis setting will be helpful.

Most importantly, you should have an actual project in mind that requires a big-data toolkit to solve: a problem that requires scaling out across multiple machines.
If you don’t already have a project in mind but really want to learn about the big-data toolkit, take a look at Chapter 3, which uses baseball data. It makes a great dataset for fun exploration.