Charu C. Aggarwal Saket Sathe Outlier Ensembles An Introduction Outlier Ensembles Charu C. Aggarwal Saket Sathe (cid:129) Outlier Ensembles An Introduction 123 CharuC. Aggarwal SaketSathe IBMT. J.Watson Research Center IBMT. J.Watson Research Center Yorktown Heights, NY Yorktown Heights, NY USA USA ISBN978-3-319-54764-0 ISBN978-3-319-54765-7 (eBook) DOI 10.1007/978-3-319-54765-7 LibraryofCongressControlNumber:2017934615 ©SpringerInternationalPublishingAG2017 Thisworkissubjecttocopyright.AllrightsarereservedbythePublisher,whetherthewholeorpart of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission orinformationstorageandretrieval,electronicadaptation,computersoftware,orbysimilarordissimilar methodologynowknownorhereafterdeveloped. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publicationdoesnotimply,evenintheabsenceofaspecificstatement,thatsuchnamesareexemptfrom therelevantprotectivelawsandregulationsandthereforefreeforgeneraluse. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authorsortheeditorsgiveawarranty,expressorimplied,withrespecttothematerialcontainedhereinor for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictionalclaimsinpublishedmapsandinstitutionalaffiliations. Printedonacid-freepaper ThisSpringerimprintispublishedbySpringerNature TheregisteredcompanyisSpringerInternationalPublishingAG Theregisteredcompanyaddressis:Gewerbestrasse11,6330Cham,Switzerland Charu C. Aggarwal dedicates this book to his wife Lata and his daughter Sayani. Saket Sathe dedicates this book to his wife Aishwarya and his son Devansh. Preface Talent wins games, but teamwork and intelligence wins championships. MichaelJordan Ensembleanalysisisawidelyusedclassofmeta-algorithmsformanydatamining problems such as classification and clustering. Numerous ensemble-based algo- rithms have been proposed in the literature for these problems. Compared to the clustering and classification problems, ensemble analysis has been studied in a limited way in the context of outlier detection. It is only in recent years that the problem of outlier ensembles has been recognized formally, and some techniques for these problems have been proposed. However, the material in the field has not been formally recognized, and it is expected that this book will play a significant roleincreatingastructuredexpositionofthetopic.Thisbookdiscussesavarietyof methods for outlier ensembles and organizes them by the specific principles with whichaccuracyimprovementsareachieved.Inaddition,adiscussionisprovidedon the techniques with which such methods can be made more effective. A formal classification of these methods has been provided, and the circumstances in which they work well are discussed. A discussion is also provided on how outlier ensembles relate (both theoretically and practically) to the ensemble techniques used commonly for other data mining problems such as classification. The simi- laritiesand(subtle)differencesintheensembletechniquesfortheclassificationand outlier detection problems are discussed. These subtle differences do impact the design of ensemble algorithms for the latter problem. Ensemble techniques are increasingly finding their way into the curricula of many data mining courses. This book can be used for such courses. Many illus- trative examples and exercises are provided in order to facilitate classroom teach- ing.Afamiliarityisassumedtotheoutlierdetectionproblemandalsotothegeneric problem of ensemble analysis in classification. This is because many of the ensemblemethodsdiscussedinthisbookareadaptationsfromtheircounterpartsin the classification domain. Some techniques discussed in this book, such as vii viii Preface wagging, randomized feature weighting, and geometric subsampling, provide new insights that are notavailable elsewhere. Wehavealso provided ananalysis ofthe performanceofvarioustypesofbasedetectorsandtheirrelativeeffectiveness.This book combines a textbook-style discussion of older topics with new insights. Therefore, we believe that the book will also be of interest to researchers and practitioners for leveraging ensemble methods into optimal algorithmic design. Yorktown Heights, NY, USA Charu C. Aggarwal January 2017 Saket Sathe Acknowledgements We would like to thank our families for their support during the writing of this book. Charu C. Aggarwal would like to thank his wife Lata and his daughter Sayani. Saket Sathe would like to thank his wife Aishwarya for her love and support. He would like to especially thank his four-year-old son Devansh for his affection and playfulness. We are grateful to the support of our management at IBM. In particular, we would like to thank Nagui Halim and Deepak Turaga for their support during the writing of this book. We would like to thank our numerous collaborators whose insights have helped make this book a success. Charu Aggarwal would like to thank Tarek F. Abdelzaher, Jing Gao, Quanquan Gu, Manish Gupta, Jiawei Han, Alexander Hinneburg, Thomas Huang, Nan Li, Huan Liu, Ruoming Jin, Daniel Keim, Arijit Khan, Latifur Khan, Mohammad M. Masud, Jian Pei, Magda Procopiuc, Guojun Qi, Chandan Reddy, Jaideep Srivastava, Karthik Subbian, Yizhou Sun, Jiliang Tang, Min-Hsuan Tsai, Haixun Wang, Jianyong Wang, Min Wang, Suhang Wang, Joel Wolf, Xifeng Yan, Philip Yu, Mohammed Zaki, ChengXiang Zhai, and Peixiang Zhao. Charu Aggarwal would also like to thank LataAggarwalforherhelpinsomeofthediagramsdrawnusingPowerpointinthe book. Saket Sathe would like to thank Karl Aberer, Timos Sellis, Dipanjan Chakraborty,ArunVishwanath,HoyoungJeung,SueAnnChen,andTianGuofor their collaborations over the years. We also thank Leman Akoglu for several general discussions on ensemble methods. ix Contents 1 An Introduction to Outlier Ensembles.... .... .... .... ..... .... 1 1.1 Introduction .... .... ..... .... .... .... .... .... ..... .... 1 1.1.1 Motivations for Ensemble Methods in Outlier Analysis ... 3 1.1.2 Common Settings for Existing Ensemble Methods... .... 4 1.1.3 Types of Ensemble Methods... .... .... .... ..... .... 6 1.1.4 Overview of Outlier Ensemble Design ... .... ..... .... 7 1.2 Categorization by Component Independence .... .... ..... .... 8 1.2.1 Sequential Ensembles .... .... .... .... .... ..... .... 9 1.2.2 Independent Ensembles... .... .... .... .... ..... .... 11 1.3 Categorization by Constituent Components . .... .... ..... .... 12 1.3.1 Model-Centered Ensembles.... .... .... .... ..... .... 12 1.3.2 Data-Centered Ensembles . .... .... .... .... ..... .... 14 1.3.3 Discussion of Categorization Schemes ... .... ..... .... 16 1.4 Categorization by Theoretical Approach.... .... .... ..... .... 17 1.4.1 Variance Reduction in Outlier Ensembles. .... ..... .... 18 1.4.2 Bias Reduction in Outlier Ensembles .... .... ..... .... 18 1.5 Defining Combination Functions . .... .... .... .... ..... .... 19 1.5.1 Normalization Issues. .... .... .... .... .... ..... .... 19 1.5.2 Combining Scores from Different Models. .... ..... .... 20 1.6 Research Overview and Book Organization. .... .... ..... .... 23 1.6.1 Overview of Book... .... .... .... .... .... ..... .... 27 1.7 Conclusions and Discussion. .... .... .... .... .... ..... .... 30 References.. .... .... .... ..... .... .... .... .... .... ..... .... 31 2 Theory of Outlier Ensembles ... .... .... .... .... .... ..... .... 35 2.1 Introduction .... .... ..... .... .... .... .... .... ..... .... 35 2.2 The Bias-Variance Trade-Offfor Outlier Detection.... ..... .... 37 2.2.1 Relationship of Ensemble Analysis to Bias-Variance Trade-Off .... ..... .... .... .... .... .... ..... .... 42 2.2.2 Out-of-Sample Issues .... .... .... .... .... ..... .... 43 xi xii Contents 2.2.3 Understanding How Ensemble Analysis Works ..... .... 44 2.2.4 Data-Centric View Versus Model-Centric View ..... .... 50 2.3 Examples and Applications of the Bias-Variance Tradeoff... .... 58 2.3.1 Bagging and Subsampling. .... .... .... .... ..... .... 59 2.3.2 Feature Bagging .... .... .... .... .... .... ..... .... 60 2.3.3 Boosting . .... ..... .... .... .... .... .... ..... .... 61 2.4 Experimental Illustration of Bias-Variance Theory.... ..... .... 61 2.4.1 Understanding the Effects of Ensembles on Data-Centric Bias and Variance.. .... .... ..... .... 62 2.4.2 Experimental Examples of Bias-Variance Decomposition ..... .... .... .... .... .... ..... .... 68 2.5 Conclusions .... .... ..... .... .... .... .... .... ..... .... 72 References.. .... .... .... ..... .... .... .... .... .... ..... .... 73 3 Variance Reduction in Outlier Ensembles. .... .... .... ..... .... 75 3.1 Introduction .... .... ..... .... .... .... .... .... ..... .... 75 3.2 Motivations for Basic Variance Reduction Framework. ..... .... 78 3.3 Variance Reduction Is Not a Panacea.. .... .... .... ..... .... 83 3.3.1 When Does Data-Centric Variance Reduction Help?.. .... 84 3.3.2 When Does Model-Centric Variance Reduction Help? .... 91 3.3.3 The Subtle Differences Between AUCs and MSEs... .... 93 3.4 Variance Reduction Methods .... .... .... .... .... ..... .... 93 3.4.1 Feature Bagging (FB) for High-Dimensional Outlier Detection. .... ..... .... .... .... .... .... ..... .... 94 3.4.2 Rotated Bagging (RB).... .... .... .... .... ..... .... 99 3.4.3 Projected Clustering and Subspace Histograms. ..... .... 100 3.4.4 The Point-Wise Bagging and Subsampling Class of Methods ... ..... .... .... .... .... .... ..... .... 107 3.4.5 Wagging (WAG).... .... .... .... .... .... ..... .... 130 3.4.6 Data-Centric and Model-Centric Perturbation .. ..... .... 131 3.4.7 Parameter-Centric Ensembles .. .... .... .... ..... .... 131 3.4.8 Explicit Randomization of Base Models.. .... ..... .... 132 3.5 Some New Techniques for Variance Reduction .. .... ..... .... 134 3.5.1 Geometric Subsampling (GS) .. .... .... .... ..... .... 134 3.5.2 Randomized Feature Weighting (RFW) .. .... ..... .... 136 3.6 Forcing Stability by Reducing Impact of Abnormal Detector Executions . .... .... ..... .... .... .... .... .... ..... .... 137 3.6.1 Performance Analysis of Trimmed Combination Methods . .... ..... .... .... .... .... .... ..... .... 140 3.6.2 Discussion of Commonly Used Combination Methods.... 143 3.7 Performance Analysis of Methods .... .... .... .... ..... .... 145 3.7.1 Data Set Descriptions .... .... .... .... .... ..... .... 145 3.7.2 Comparison of Variance Reduction Methods .. ..... .... 147
Description: