
Pro Hadoop PDF

442 Pages·2009·7.67 MB·English

Preview Pro Hadoop

Books for Professionals by Professionals®
The Expert's Voice® in Open Source

Pro Hadoop

Dear Reader,

Pro Hadoop is a guide to using Hadoop Core, a wonderful tool that allows you to use ordinary hardware to solve extraordinary problems. In the course of my work, I have needed to build applications that would not fit on a single affordable machine, creating custom scaling and distribution tools in the process. With the advent of Hadoop and MapReduce, I have been able to focus on my applications instead of worrying about how to scale them.

It took some time before I had learned enough about Hadoop Core to actually be effective. This book is a distillation of that knowledge, and a book I wish was available to me when I first started using Hadoop Core.

I begin by showing you how to get started with Hadoop and the Hadoop Core shared file system, HDFS. Then you will see how to write and run functional and effective MapReduce jobs on your clusters, as well as how to tune your jobs and clusters for optimum performance. I provide recipes for unit testing and details on how to debug MapReduce jobs. I also include examples of using advanced features such as map-side joins and chain mapping.

To bring everything together, I take you through the step-by-step development of a nontrivial MapReduce application. This will give you insight into a real-world Hadoop project.

It is my sincere hope that this book provides you an enjoyable learning experience and with the knowledge you need to be the local Hadoop Core wizard.
Jason Venner

Build scalable, distributed applications in the cloud

The Apress Roadmap: Pro Amazon EC2 and WS; Beginning Scala; Pro Hadoop; Beginning Google App Engine; The Definitive Guide to Terracotta

ISBN 978-1-4302-1942-2
US $39.99
Shelve in: Software Engineering/Software Development
User level: Intermediate–Advanced
Source code online: www.apress.com

Pro Hadoop
Jason Venner

Copyright © 2009 by Jason Venner

All rights reserved. No part of this work may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval system, without the prior written permission of the copyright owner and the publisher.

ISBN-13 (pbk): 978-1-4302-1942-2
ISBN-13 (electronic): 978-1-4302-1943-9

Printed and bound in the United States of America 9 8 7 6 5 4 3 2 1

Trademarked names may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, we use the names only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

Java™ and all Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc., in the US and other countries. Apress, Inc., is not affiliated with Sun Microsystems, Inc., and this book was written without endorsement from Sun Microsystems, Inc.
Lead Editor: Matthew Moodie
Technical Reviewer: Steve Cyrus
Editorial Board: Clay Andres, Steve Anglin, Mark Beckner, Ewan Buckingham, Tony Campbell, Gary Cornell, Jonathan Gennick, Michelle Lowman, Matthew Moodie, Duncan Parkes, Jeffrey Pepper, Frank Pohlmann, Douglas Pundick, Ben Renow-Clarke, Dominic Shakeshaft, Matt Wade, Tom Welsh
Project Manager: Richard Dal Porto
Copy Editors: Marilyn Smith, Nancy Sixsmith
Associate Production Director: Kari Brooks-Copony
Production Editor: Laura Cheu
Compositor: Linda Weidemann, Wolf Creek Publishing Services
Proofreader: Linda Seifert
Indexer: Becky Hornyak
Artist: Kinetic Publishing Services
Cover Designer: Kurt Krames
Manufacturing Director: Tom Debolski

Distributed to the book trade worldwide by Springer-Verlag New York, Inc., 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax 201-348-4505, e-mail [email protected], or visit http://www.springeronline.com.

For information on translations, please contact Apress directly at 2855 Telegraph Avenue, Suite 600, Berkeley, CA 94705. Phone 510-549-5930, fax 510-549-5939, e-mail [email protected], or visit http://www.apress.com.

Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales–eBook Licensing web page at http://www.apress.com/info/bulksales.

The information in this book is distributed on an “as is” basis, without warranty. Although every precaution has been taken in the preparation of this work, neither the author(s) nor Apress shall have any liability to any person or entity with respect to any loss or damage caused or alleged to be caused directly or indirectly by the information contained in this work.

The source code for this book is available to readers at http://www.apress.com. You may need to answer questions pertaining to this book in order to successfully download the code.
This book is dedicated to Joohn Choe. He had the idea, walked me through much of the process, trusted me to write the book, and helped me through the rough spots.

Contents at a Glance

About the Author
About the Technical Reviewer
Acknowledgments
Introduction
Chapter 1: Getting Started with Hadoop Core
Chapter 2: The Basics of a MapReduce Job
Chapter 3: The Basics of Multimachine Clusters
Chapter 4: HDFS Details for Multimachine Clusters
Chapter 5: MapReduce Details for Multimachine Clusters
Chapter 6: Tuning Your MapReduce Jobs
Chapter 7: Unit Testing and Debugging
Chapter 8: Advanced and Alternate MapReduce Techniques
Chapter 9: Solving Problems with Hadoop
Chapter 10: Projects Based On Hadoop and Future Directions
Appendix A: The JobConf Object in Detail
Index

Contents

About the Author
About the Technical Reviewer
Acknowledgments
Introduction

Chapter 1: Getting Started with Hadoop Core
    Introducing the MapReduce Model
    Introducing Hadoop
        Hadoop Core MapReduce
        The Hadoop Distributed File System
    Installing Hadoop
        The Prerequisites
        Getting Hadoop Running
        Checking Your Environment
    Running Hadoop Examples and Tests
        Hadoop Examples
        Hadoop Tests
    Troubleshooting
    Summary

Chapter 2: The Basics of a MapReduce Job
    The Parts of a Hadoop MapReduce Job
        Input Splitting
        A Simple Map Function: IdentityMapper
        A Simple Reduce Function: IdentityReducer
    Configuring a Job
        Specifying Input Formats
        Setting the Output Parameters
        Configuring the Reduce Phase
    Running a Job
    Creating a Custom Mapper and Reducer
        Setting Up a Custom Mapper
        After the Job Finishes
        Creating a Custom Reducer
        Why Do the Mapper and Reducer Extend MapReduceBase?
        Using a Custom Partitioner
    Summary

Chapter 3: The Basics of Multimachine Clusters
    The Makeup of a Cluster
    Cluster Administration Tools
    Cluster Configuration
        Hadoop Configuration Files
        Hadoop Core Server Configuration
    A Sample Cluster Configuration
        Configuration Requirements
        Configuration Files for the Sample Cluster
        Distributing the Configuration
        Verifying the Cluster Configuration
        Formatting HDFS
        Starting HDFS
        Correcting Errors
        The Web Interface to HDFS
        Starting MapReduce
        Running a Test Job on the Cluster
    Summary

Chapter 4: HDFS Details for Multimachine Clusters
    Configuration Trade-Offs
    HDFS Installation for Multimachine Clusters
        Building the HDFS Configuration
        Distributing Your Installation Data
        Formatting Your HDFS
        Starting Your HDFS Installation
        Verifying HDFS Is Running

Description:
You've heard the hype about Hadoop: it runs petabyte-scale data mining tasks insanely fast, it runs gigantic tasks on clouds for absurdly cheap, it's been heavily committed to by tech giants like IBM, Yahoo!, and the Apache Project, and it's completely open-source. But what exactly is it, and more