THE`SE DE DOCTORAT DE L’UNIVERSITE´ PARIS-SUD SPE´CIALITE´ : INFORMATIQUE pr´esent´ee par Xiangliang ZHANG pour obtenir le grade de DOCTEUR DE L’UNIVERSITE´ PARIS-SUD Sujet de la th`ese : Contributions to Large Scale Data Clustering and Streaming with Affinity Propagation. Application to Autonomic Grids. Soutenue le 28/07/2010, devant le jury compos´e de : Mr Cyril Furtlehner Examinateur (Charg´e de Recherche INRIA au LRI, France) Mme C´ecile Germain-Renaud Directrice de these (Professeur Universit´e Paris-Sud, France) Mr Aris Gionis Rapporteur (Research Scientist, Yahoo! Labs Research, Spain) Mr Charles Loomis Examinateur (Research Engineer, CNRS au LAL, France) Mr Eric Moulines Rapporteur (Professeur, T´el´ecom ParisTech ENST, France) Mme Brigitte Rozoy Examinateur (Professeur Universit´e Paris-Sud, France) Mme Mich`ele Sebag Directrice de these (Directeur de Recherche CNRS au LRI, France) Abstract In this dissertation, we present our study of the clustering issues on large-scale data and streaming data, with the applicative purpose of building autonomic grid computing system. Motivated by the application requirements, we employ a new clustering method called Affinity Propagation (AP) for modeling the grid jobs, the first step towards the long-term goal: Autonomic Grid Computing System. AP fits our clustering requirements by i) guaranteeing better clustering performance; ii) providing representative exemplars as actual data items (especially suitable for non-numerical data space). However, AP suffers from the quadratic computational cost caused by the clustering process. This problem severely hinders its usage on large scale datasets. We firstly proposed the weighted AP(WAP) method. WAP integrates the neighboring points together and keeps spatial structure between them and other points. It makes sure that the clustering results of WAP on integrated points are equal to that of AP on non-integrated points. The merit of WAP is the lower computational complexity. Hierarchical AP (Hi-AP) is the algorithm we proposed to solve the large-scale clustering problem. It uses AP and our extension of weighted AP (WAP) in the Divide-and-Conquer schema. In detail, it partitions the dataset, runs AP on each subset, and applies WAP to the collection of exemplars constructed from each subset. Through theoretical proof and experimental validation, Hi-AP was shown to significantly decrease the computational cost (from quadratic to quasi-linear), with negligible increasing of distortion. Streaming AP (StrAP) is our proposed understandable, stable and computationally efficient data stream clustering method. The online updating clustering model is maintained by StrAP through i) when a data item arrives, checking its fitness against the model; ii) if fitting, simply updating the corresponding cluster in the model, otherwise putting it into the reservoir. Restart criteria are used to monitor the changes of stream distribution. If changes are detected, the stream model is rebuild by applying WAP on the current model and the data in the reservoir. StrAP was validated on the Intrusion Detection benchmark data and was shown tooutperformthereference methodDenStreaminterms of clustering quality. Based on Hi-AP and StrAP, we proposed a multi-scale online grid monitoring system called G-StrAP. This monitoring system provides an understandabledescriptionofthejobflowrunningonthegridandenablesthe system administratorto spotonline somesources offailures. It hastheonline level to provide the EGEE administrator with a real-time dashboard of the jobdataflowandenablethediscovery of anomalies. It hasalso offlinelevel to inspect the global temporal patterns of the data flow, and help to detect the long-run trends in the EGEE traffic. Applying G-StrAP on 5-million job tracefromEGEEgrid,throughthemonitoringoutputs,itwasshownthatG- StrAP discovers device problems (e.g., clogging of LogMonitor) with good clustering quality (clustering accuracy > 85% and purity > 90%). ii Acknowledgements It is a pleasure to acknowledge my supervisors and some of the colleagues and friends who have contributed to this dissertation. My first, most earnest acknowledgment must go to my supervisors, Mich`ele Sebag and C´ecile Germain-Renaud. Their encouragement, supervi- sion and support enabled me to grow up as a PhD for doing research. During my PhD studies, they teach me how to do research, give me suggestions when I met problems, and support me to attend summer schools and to visit research partner. I benefited a lot from their profound knowledge and rigorous attitude toward scientific research. I would like to thank the reviewers of my dissertation, Dr. Aris Gionis, Dr. Charles Loomis and Prof. Eric Moulines. Their comments and suggestions are very helpful for improving this dissertation. I am grateful to Cyril Furtlehner and Julien Perez for their valuable discuss, which gives me a lot of inspiration. I am really happy to collaborate with you. I thank team leader Marc Schoenauer, my kind colleagues C´edric Hart- land, Alexandre Devert, Munteanu Alexandru Lonut, Fei Jiang, Raymond Ros, etc. They helped me on both my research work and my life in these years. I also wish to thank Dr. Francoise Carre from INSERM for providing me the 6-month financial support to help me finish my thesis. I give my thanks to my parents who have always unconditionally sup- ported me and cared me throughout my studies. Last, I thank my husband for his everlasting support, encouragement, and companionship in my study and life. ii Contents 1 Introduction 1 1.1 Mining data streams . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Application field: Autonomic Computing . . . . . . . . . . . . 2 1.3 Our contributions . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 State of the art 7 2.1 Data Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1.1 Clustering for Exploratory Data Analysis . . . . . . . . 8 2.1.2 Formal Background and Clustering Criterion . . . . . . 9 2.1.3 Distance and Similarity measures . . . . . . . . . . . . 10 2.1.4 Main Categories of Clustering Algorithms . . . . . . . 11 2.1.4.1 Partitioning methods . . . . . . . . . . . . . . 12 2.1.4.2 Hierarchical methods . . . . . . . . . . . . . . 12 2.1.4.3 Density-based methods . . . . . . . . . . . . . 15 2.1.4.4 Grid-based methods . . . . . . . . . . . . . . 17 2.1.4.5 Model-based methods . . . . . . . . . . . . . 20 2.1.4.6 Spectral clustering methods . . . . . . . . . . 22 2.1.5 Selecting the Number of Clusters . . . . . . . . . . . . 25 2.2 Scalability of Clustering Methods . . . . . . . . . . . . . . . . 31 2.2.1 Divide-and-Conquer strategy . . . . . . . . . . . . . . . 31 2.2.2 BIRCH for large-scale data by using CF-tree . . . . . . 32 2.2.3 Scalability of spectral clustering . . . . . . . . . . . . . 33 2.2.4 Online clustering . . . . . . . . . . . . . . . . . . . . . 34 2.3 Data Stream Mining . . . . . . . . . . . . . . . . . . . . . . . 34 2.3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . 34 2.3.2 Change detection . . . . . . . . . . . . . . . . . . . . . 35 2.3.3 Data stream clustering . . . . . . . . . . . . . . . . . . 38 2.3.3.1 One-scan Divide-and-Conquer approaches . . 38 2.3.3.2 Online tracking and offline clustering . . . . . 40 2.3.3.3 Decision tree learner of data streams . . . . . 42 2.3.3.4 Binary data streams clustering . . . . . . . . 43 iii CONTENTS 2.3.4 Dealing with streaming time series . . . . . . . . . . . 44 3 The Hierarchical AP (Hi-AP): clustering large-scale data 47 3.1 Affinity Propagation . . . . . . . . . . . . . . . . . . . . . . . 47 3.1.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.1.2 Pros and Cons . . . . . . . . . . . . . . . . . . . . . . 51 3.2 Weighted Affinity Propagation . . . . . . . . . . . . . . . . . . 51 3.3 Hi-AP Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.4 Distortion Regret of Hi-AP . . . . . . . . . . . . . . . . . . . 55 3.4.1 Distribution of µ¯ µˆ . . . . . . . . . . . . . . . . . 56 n n | − | 3.4.2 The extreme value distribution . . . . . . . . . . . . . 58 3.4.3 Hi-AP Distortion Loss . . . . . . . . . . . . . . . . . . 59 3.5 Validation of Hi-AP . . . . . . . . . . . . . . . . . . . . . . . 63 3.5.1 Experiments goals and settings . . . . . . . . . . . . . 63 3.5.2 Experimental results . . . . . . . . . . . . . . . . . . . 64 3.6 Partial conclusion . . . . . . . . . . . . . . . . . . . . . . . . . 65 4 Streaming AP (StrAP): clustering data streams 67 4.1 StrAP Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 67 4.1.1 AP-based Model and Update . . . . . . . . . . . . . . 68 4.1.2 Restart Criterion . . . . . . . . . . . . . . . . . . . . . 69 4.1.2.1 MaxR and Page-Hinkley (PH) test . . . . . . 69 4.1.2.2 Definition of p in PH test . . . . . . . . . . . 71 t 4.1.2.3 Online adaption of threshold λ in PH test . . 72 4.1.3 Model Rebuild . . . . . . . . . . . . . . . . . . . . . . 75 4.1.4 Evaluation Criterion . . . . . . . . . . . . . . . . . . . 77 4.1.5 Parameter setting of StrAP . . . . . . . . . . . . . . 78 4.2 Grid monitoring G-StrAP system . . . . . . . . . . . . . . . 78 4.2.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . 78 4.2.2 Monitoring Outputs . . . . . . . . . . . . . . . . . . . 79 5 Validation of StrAP and Grid monitoring system G-StrAP 81 5.1 Validation of Hi-AP on EGEE jobs . . . . . . . . . . . . . . . 81 5.2 Validation of StrAP Algorithm . . . . . . . . . . . . . . . . . 83 5.2.1 Data used . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.2.2 Experimentation on Synthetic Data Stream . . . . . . 85 5.2.3 Experimentation on Intrusion Detection Dataset . . . . 87 5.2.4 Online performance and comparison with DenStream . 91 5.3 Discussion of StrAP . . . . . . . . . . . . . . . . . . . . . . . 93 5.4 G-StrAP Grid Monitoring System . . . . . . . . . . . . . . . 94 5.4.1 Related work . . . . . . . . . . . . . . . . . . . . . . . 94 iv CONTENTS 5.4.2 The gLite Workload Management System . . . . . . . 95 5.4.3 Job Streams . . . . . . . . . . . . . . . . . . . . . . . . 96 5.4.4 Data Pre-processing and Experimental Settings . . . . 98 5.4.5 Clustering Quality . . . . . . . . . . . . . . . . . . . . 99 5.4.6 Rupture Steps . . . . . . . . . . . . . . . . . . . . . . . 102 5.4.7 Online Monitoring on the First Level . . . . . . . . . . 102 5.4.8 Off-line Analysis on the Second Level . . . . . . . . . . 104 6 Conclusion and Perspectives 107 6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 6.2 Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 6.2.1 Algorithmic perspectives . . . . . . . . . . . . . . . . . 109 6.2.2 Applicative perspectives . . . . . . . . . . . . . . . . . 110 Appendices A Schematic proof of Proposition 3.3.4 113 Bibliography 114 v CONTENTS vi
Description: