Avoiding Data Pitfalls

How to Steer Clear of Common Blunders When Working with Data and Presenting Analysis and Visualizations

Ben Jones

Copyright © 2020 Ben Jones. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the Web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993, or fax (317) 572-4002.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Cataloging-in-Publication Data

Names: Jones, Ben, 1978– author.
Title: Avoiding data pitfalls : how to steer clear of common blunders when working with data and presenting analysis and visualizations / Ben Jones.
Description: Hoboken, New Jersey : John Wiley & Sons, Inc., [2020] | Includes index.
Identifiers: LCCN 2019033232 (print) | LCCN 2019033233 (ebook) | ISBN 9781119278160 (paperback) | ISBN 9781119278191 (adobe pdf) | ISBN 9781119278177 (epub)
Subjects: LCSH: Information visualization. | Quantitative research.
Classification: LCC QA76.9.I52 J6639 2020 (print) | LCC QA76.9.I52 (ebook) | DDC 001.4/226—dc23
LC record available at https://lccn.loc.gov/2019033232
LC ebook record available at https://lccn.loc.gov/2019033233

Cover Design: Wiley
Cover Image: © deomis/iStock.com

Printed in the United States of America.

Contents

Preface

Chapter 1 The Seven Types of Data Pitfalls
    Seven Types of Data Pitfalls
    Pitfall 1: Epistemic Errors: How We Think About Data
    Pitfall 2: Technical Traps: How We Process Data
    Pitfall 3: Mathematical Miscues: How We Calculate Data
    Pitfall 4: Statistical Slipups: How We Compare Data
    Pitfall 5: Analytical Aberrations: How We Analyze Data
    Pitfall 6: Graphical Gaffes: How We Visualize Data
    Pitfall 7: Design Dangers: How We Dress up Data
    Avoiding the Seven Pitfalls
    “I’ve Fallen and I Can’t Get Up”

Chapter 2 Pitfall 1: Epistemic Errors
    How We Think About Data
    Pitfall 1A: The Data-Reality Gap
    Pitfall 1B: All Too Human Data
    Pitfall 1C: Inconsistent Ratings
    Pitfall 1D: The Black Swan Pitfall
    Pitfall 1E: Falsifiability and the God Pitfall
    Avoiding the Swan Pitfall and the God Pitfall

Chapter 3 Pitfall 2: Technical Trespasses
    How We Process Data
    Pitfall 2A: The Dirty Data Pitfall
    Pitfall 2B: Bad Blends and Joins

Chapter 4 Pitfall 3: Mathematical Miscues
    How We Calculate Data
    Pitfall 3A: Aggravating Aggregations
    Pitfall 3B: Missing Values
    Pitfall 3C: Tripping on Totals
    Pitfall 3D: Preposterous Percents
    Pitfall 3E: Unmatching Units

Chapter 5 Pitfall 4: Statistical Slipups
    How We Compare Data
    Pitfall 4A: Descriptive Debacles
    Pitfall 4B: Inferential Infernos
    Pitfall 4C: Slippery Sampling
    Pitfall 4D: Insensitivity to Sample Size

Chapter 6 Pitfall 5: Analytical Aberrations
    How We Analyze Data
    Pitfall 5A: The Intuition/Analysis False Dichotomy
    Pitfall 5B: Exuberant Extrapolations
    Pitfall 5C: Ill-Advised Interpolations
    Pitfall 5D: Funky Forecasts
    Pitfall 5E: Moronic Measures

Chapter 7 Pitfall 6: Graphical Gaffes
    How We Visualize Data
    Pitfall 6A: Challenging Charts
    Pitfall 6B: Data Dogmatism
    Pitfall 6C: The Optimize/Satisfice False Dichotomy

Chapter 8 Pitfall 7: Design Dangers
    How We Dress up Data
    Pitfall 7A: Confusing Colors
    Pitfall 7B: Omitted Opportunities
    Pitfall 7C: Usability Uh-Ohs

Chapter 9 Conclusion
    Avoiding Data Pitfalls Checklist
    The Pitfall of the Unheard Voice

Index

Preface

There’s a passage in The Dhammapada, which is a collection of sayings of the Buddha in verse form, which has this to say:

    If you see someone wise, who can steer you away from the wrong path, follow that person, as you would one who can reveal hidden treasures. Only good can come out of it. Let them admonish, or instruct, or restrain you from what is wrong.
    — The Dhammapada, Verse 76

Most ancient wisdom texts have a similar exhortation to find a wise person and follow their advice. If you listen to their words carefully, you’ll avoid horrible mistakes and the associated pain and discomfort that these mistakes will cause in your life. We all need a mentor, guide, or guru from time to time.

Unfortunately, I may not be that wise person you’re looking for. I’d say I’m more like a person who has taken his bumps and bruises while searching for such a wise individual myself. So I’m more like the person in the words to a favorite song of mine:

    No, I don’t claim to be a wise man, a poet or a saint. I’m just another man who’s searching for a better way.
    — Jon Bon Jovi, “Bang a Drum”

I figured it just might be helpful, however, to capture and convey the types of mistakes I’ve made, and seen others make, many times over the course of my data working career. I’ve worked with data in manufacturing and transactional environments, on the shop floor, in the boardroom, and in newsrooms all over the country, and I’ve trained and taught people how to work with data in corporate conference rooms, virtual chat rooms, and in the halls of academia, from coast to coast.

But who, exactly, am I trying to help with this book? Well, me, for one.
Every time I write a blog post, record a tutorial, or craft a presentation, I find myself going back to these materials over and over. Whenever I do, I stop and wonder how the past version of myself was so much smarter than the current version of me, and I tip my hat for the favor.

And I hope it will be helpful for you. If you’re starting out in your data journey, you’ll come across many of these pitfalls, I promise you. My hope is that you’ll be more likely to recognize them for what they are when you see them – sometimes from the top and other times from the bottom.

And if you’ve been working with data for some time, you’ll read a section here or there, and you’ll nod knowingly, glancing down at a scar or two that you earned by falling headfirst into the pit with me. And your brow may furrow when you read about other pitfalls, a sinking feeling coming over you that you may have made that mistake without recognizing it. If so, know that I feel your pain.

It’s really important, though, that we learn to pick ourselves up and dust off our jeans, clean off those scuff marks, ice any bruises we may have suffered, and carry on, a bit wiser for the experience. Equal in importance is that we show others the same grace. It’s just too easy to get it wrong, and it’s bound to happen. Even experts fall into data pitfalls on a fairly regular basis.

Just like a well-worn path that marks the best route across a terrain, the more we admit and talk about our shortcomings, the less likely others are to fall into the same trap we fell into. We’ll be leaving warning signs for them.

You and I may have to swallow our pride when we raise our hand and share the mea culpa, but we can think of it as a gift for future generations. They may shake their heads and wonder how we messed up so badly, but you and I will know that the only reason they’ve evolved as far as they have is that we paved the way for them with our blunders.
What’s more important than the protection of my ego or yours is the evolution of our species into an efficient and effective data working force for the planet. We’re not there, yet. Not even close.

I’d like to dedicate this book to my father, Richard Jones. My father had a truly wonderful mind, and, like the rest of us, he fell into many pitfalls in his life. When I confronted him about some things he had done that were particularly hurtful to me, he admitted it, and he apologized. I’ll never forget that gift he gave me. It was very liberating. I’ve been much better at admitting my own failures ever since – personal and professional. Thank you, Dad. I love you.

I signed the contract to write this book right before his health took a turn for the worse. He passed away of glioblastoma half a year later, and I lost all sight of the reason for wanting to write this book in the first place. My editors and the team at John Wiley & Sons were very gracious to me during this process. It took a while – four years to be exact – but eventually I rediscovered my passion for this topic and finished the book. On the positive side, all that extra time meant that I was able to discover even more pitfalls to add to these chapters, mostly by falling into them.

So I hope this book proves to be helpful for you. May you stride forward onto the road paved with data, alert to pitfalls around every turn. May you show yourself compassion whenever you get it wrong, and may you share the lessons of mistakes made with others around you. Ultimately, may you reach great heights on the