
Statistical and machine-learning data mining: techniques for better predictive modeling PDF

691 Pages·2017·3.192 MB·English
by Bruce Ratner

Preview Statistical and machine-learning data mining: techniques for better predictive modeling

Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data
Third Edition

Bruce Ratner

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2017 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper

International Standard Book Number-13: 978-1-4987-9760-3 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users.
For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data
Names: Ratner, Bruce, author.
Title: Statistical and machine-learning data mining / Bruce Ratner.
Description: Third Edition. | Boca Raton, FL : CRC Press, 2017. | Revised edition of the author's Statistical and machine-learning data mining, c2003.
Identifiers: LCCN 2016048787 | ISBN 9781498797603 (978-1-4987-9760-3)
Subjects: LCSH: Database marketing--Statistical methods. | Data mining--Statistical methods. | Big data--Statistical methods.
Classification: LCC HF5415.126 .R38 2017 | DDC 658.8/72--dc23
LC record available at https://lccn.loc.gov/2016048787

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com

This book is dedicated to:
My father, Isaac: always encouraging, my role model who taught me by doing, not saying, other than to think positive.
My mother, Leah: always nurturing, my friend who taught me to love love and hate hate.
My daughter, Amanda: always for me, my crowning and most significant result.

Contents

Preface to Third Edition
Preface of Second Edition
Acknowledgments
Author

1. Introduction
1.1 The Personal Computer and Statistics
1.2 Statistics and Data Analysis
1.3 EDA
1.4 The EDA Paradigm
1.5 EDA Weaknesses
1.6 Small and Big Data
1.6.1 Data Size Characteristics
1.6.2 Data Size: Personal Observation of One
1.7 Data Mining Paradigm
1.8 Statistics and Machine Learning
1.9 Statistical Data Mining
References

2. Science Dealing with Data: Statistics and Data Science
2.1 Introduction
2.2 Background
2.3 The Statistics and Data Science Comparison
2.3.1 Statistics versus Data Science
2.4 Discussion: Are Statistics and Data Science Different?
2.4.1 Analysis: Are Statistics and Data Science Different?
2.5 Summary
2.6 Epilogue
References

3. Two Basic Data Mining Methods for Variable Assessment
3.1 Introduction
3.2 Correlation Coefficient
3.3 Scatterplots
3.4 Data Mining
3.4.1 Example 3.1
3.4.2 Example 3.2
3.5 Smoothed Scatterplot
3.6 General Association Test
3.7 Summary
References

4. CHAID-Based Data Mining for Paired-Variable Assessment
4.1 Introduction
4.2 The Scatterplot
4.2.1 An Exemplar Scatterplot
4.3 The Smooth Scatterplot
4.4 Primer on CHAID
4.5 CHAID-Based Data Mining for a Smoother Scatterplot
4.5.1 The Smoother Scatterplot
4.6 Summary
Reference

5. The Importance of Straight Data Simplicity and Desirability for Good Model-Building Practice
5.1 Introduction
5.2 Straightness and Symmetry in Data
5.3 Data Mining Is a High Concept
5.4 The Correlation Coefficient
5.5 Scatterplot of (xx3, yy3)
5.6 Data Mining the Relationship of (xx3, yy3)
5.6.1 Side-by-Side Scatterplot
5.7 What Is the GP-Based Data Mining Doing to the Data?
5.8 Straightening a Handful of Variables and a Baker's Dozen of Variables
5.9 Summary
References

6. Symmetrizing Ranked Data: A Statistical Data Mining Method for Improving the Predictive Power of Data
6.1 Introduction
6.2 Scales of Measurement
6.3 Stem-and-Leaf Display
6.4 Box-and-Whiskers Plot
6.5 Illustration of the Symmetrizing Ranked Data Method
6.5.1 Illustration 1
6.5.1.1 Discussion of Illustration 1
6.5.2 Illustration 2
6.5.2.1 Titanic Dataset
6.5.2.2 Looking at the Recoded Titanic Ordinal Variables CLASS_, AGE_, GENDER_, CLASS_AGE_, and CLASS_GENDER_
6.5.2.3 Looking at the Symmetrized-Ranked Titanic Ordinal Variables rCLASS_, rAGE_, rGENDER_, rCLASS_AGE_, and rCLASS_GENDER_
6.5.2.4 Building a Preliminary Titanic Model
6.6 Summary
References

7. Principal Component Analysis: A Statistical Data Mining Method for Many-Variable Assessment
7.1 Introduction
7.2 EDA Reexpression Paradigm
7.3 What Is the Big Deal?
7.4 PCA Basics
7.5 Exemplary Detailed Illustration
7.5.1 Discussion
7.6 Algebraic Properties of PCA
7.7 Uncommon Illustration
7.7.1 PCA of R_CD Elements (X1, X2, X3, X4, X5, X6)
7.7.2 Discussion of the PCA of R_CD Elements
7.8 PCA in the Construction of Quasi-Interaction Variables
7.8.1 SAS Program for the PCA of the Quasi-Interaction Variable
7.9 Summary

8. Market Share Estimation: Data Mining for an Exceptional Case
8.1 Introduction
8.2 Background
8.3 Data Mining for an Exceptional Case
8.3.1 Exceptional Case: Infant Formula YUM
8.4 Building the RAL-YUM Market Share Model
8.4.1 Decile Analysis of YUM_3mos MARKET-SHARE Model
8.4.2 Conclusion of YUM_3mos MARKET-SHARE Model
8.5 Summary
Appendix 8.A Dummify PROMO_Code
Appendix 8.B PCA of PROMO_Code Dummy Variables
Appendix 8.C Logistic Regression YUM_3mos on PROMO_Code Dummy Variables
Appendix 8.D Creating YUM_3mos_wo_PROMO_CodeEff
Appendix 8.E Normalizing a Variable to Lie Within [0, 1]
References

9. The Correlation Coefficient: Its Values Range between Plus and Minus 1, or Do They?
9.1 Introduction
9.2 Basics of the Correlation Coefficient
9.3 Calculation of the Correlation Coefficient
9.4 Rematching
9.5 Calculation of the Adjusted Correlation Coefficient
9.6 Implication of Rematching
9.7 Summary

10. Logistic Regression: The Workhorse of Response Modeling
10.1 Introduction
10.2 Logistic Regression Model
10.2.1 Illustration
10.2.2 Scoring an LRM
10.3 Case Study
10.3.1 Candidate Predictor and Dependent Variables
10.4 Logits and Logit Plots
10.4.1 Logits for Case Study
10.5 The Importance of Straight Data
