
MSc Dissertation Koliopoulos Aris-Kyriakos mbaxkak4 PDF

121 Pages·2014·3.12 MB·English

Preview MSc Dissertation Koliopoulos Aris-Kyriakos mbaxkak4

Big Data Mining: Towards implementing Weka-on-Spark

A dissertation submitted to The University of Manchester for the degree of Master of Science in the Faculty of Engineering and Physical Sciences

2014

Koliopoulos Aris Kyriakos
School of Computer Science

List of Contents

List of Contents
List of Figures
List of Tables
List of Abbreviations
Abstract
Declaration
Intellectual Property Statement
Acknowledgements
1 Introduction
  1.1 Distributed Computing Frameworks
  1.2 Data Mining Tools
  1.3 Project Objectives
  1.4 Implementation Summary
  1.5 Evaluation Strategy
  1.6 Project Achievements
  1.7 Overview of Dissertation
2 Literature Review
  2.1 Data Mining
    2.1.1 Classification
    2.1.2 Regression
    2.1.3 Clustering
    2.1.4 Association Rule Learning
    2.1.5 Data Mining System Development
      2.1.5.1 Toolkit Based Approaches
      2.1.5.2 Statistical Language Based Approaches
      2.1.5.3 Approach Selection
    2.1.6 Partitioning and Parallel Performance
  2.2 Distributed Computing Frameworks
    2.2.1 MapReduce
    2.2.2 Hadoop
      2.2.2.1 Beyond Hadoop
    2.2.3 Iterative MapReduce
    2.2.4 Distributed Systems for In-Memory Computations
      2.2.4.1 In-Memory Data Grids
      2.2.4.2 Piccolo
      2.2.4.3 GraphLab
      2.2.4.4 Spark
    2.2.5 Distributed Computing Framework Selection
  2.3 Distributed Data Mining
    2.3.1 Data Mining on MapReduce
    2.3.2 R on MapReduce
    2.3.3 Distributed Weka
    2.3.4 MLBase
  2.4 Summary
3 System Architecture
  3.1 Required Architectural Components
  3.2 Multi-tier Architecture
    3.2.1 Infrastructure Layer
    3.2.2 Distributed Storage Layer
    3.2.3 Batch Execution Layer
      3.2.3.1 Spark and Main-memory Caching
    3.2.4 Application Layer
    3.2.5 CLI
  3.3 Cluster Monitoring
  3.4 Summary
4 Execution Model
  4.1 Weka on MapReduce
  4.2 Task Initialisation
  4.3 Headers
  4.4 Classification and Regression
    4.4.1 Model Training
    4.4.2 Model Testing and Evaluation
  4.5 Association Rules
  4.6 Clustering
  4.7 Summary
5 System Evaluation
  5.1 Evaluation Metrics
  5.2 System Configuration
  5.3 Evaluation Results
    5.3.1 Execution Time
    5.3.2 Scaling Efficiency
      5.3.2.1 Weak Scaling
      5.3.2.2 Strong Scaling
    5.3.3 Main-Memory Caching
      5.3.3.1 Caching overheads
      5.3.3.2 Caching and Performance
    5.3.4 IO Utilisation
  5.4 Caching Strategy Selection Algorithm
  5.5 Summary
6 Concluding remarks
  6.1 Summary
  6.2 Further Work
    6.2.1 Clustering
    6.2.2 Stream Processing
    6.2.3 Declarative Data Mining
  6.3 Conclusion
References
Appendix 1 – Benchmarking Data
Appendix 2 – Installation Guide
Appendix 3 – User Guide
Appendix 4 – Main-Memory Monitoring using CloudWatch

Total Word Count: 23003

List of Figures

Figure 1.1: Cluster Architecture [5]
Figure 1.2: System Architecture
Figure 2.1: The Data Mining Process
Figure 2.2: Supervised Learning Process [22]
Figure 2.3: MapReduce Execution Overview [8]
Figure 2.4: Hadoop Tech Stack [36]
Figure 2.5: HaLoop and MapReduce [35]
Figure 2.6: GraphLab Consistency Mechanisms [44]
Figure 2.7: RDD Lineage Graph [14]
Figure 2.8: Ricardo [52]
Figure 2.9: Distributed Weka [59]
Figure 2.10: MLBase Architecture [60]
Figure 3.1: System Architecture
Figure 3.2: HDFS Architecture [64]
Figure 3.3: Initialisation Process
Figure 4.1: Execution Model
Figure 4.2: WekaOnSpark's main thread
Figure 4.3: Task Executor
Figure 4.4: Lineage Graph
Figure 4.5: Header creation MapReduce job
Figure 4.6: Header Creation Map Function
Figure 4.7: Header Creation Reduce Function
Figure 4.8: Model Training Map Function
Figure 4.9: Model Aggregation Reduce Function
Figure 4.10: Classifier Evaluation Map Function
Figure 4.11: Evaluation Reduce Function
Figure 4.12: Association Rules job on Spark
Figure 4.13: Candidate Generation Map Function
Figure 4.14: Candidate Generation and Validation Reduce Function
Figure 4.15: Validation Phase Map Function
Figure 5.1: Execution times for SVM
Figure 5.2: Weak Scaling Efficiencies
Figure 5.3: Strong Scaling for SVM
Figure 5.4: Strong Scaling for Linear Regression
Figure 5.5: Strong Scaling for FP-Growth
Figure 5.6: Strong Scaling on Weka-On-Hadoop
Figure 5.7: Main-memory time-line
Figure 5.8: Main-Memory Use Reduction
Figure 5.9: Execution Time Overhead
Figure 5.10: Average per-instance disk writes
Figure 5.11: Network Traffic
Figure 5.12: Per-instance average of network and disk utilisation
Figure 5.13: CPU utilisation
Figure 5.14: Storage Level Selection Process

List of Tables

Table 5.1: Execution Times for SVM on Weka-On-Spark
Table 5.2: Execution Times for SVM on Weka-On-Hadoop
Table 5.3: Speed-up
Table 5.4: CPU Utilisation of Weka-On-Spark
Table 5.5: CPU Utilisation of Weka-On-Hadoop
Table 5.6: Main-memory utilisation of Weka-On-Spark
Table 5.7: Main-memory utilisation of Weka-On-Hadoop
Table 5.8: RDD size as percentage of the original on-disk value (I)
Table 5.9: RDD size as percentage of the original on-disk value (II)
Table 5.10: Execution Times
Table 5.11: Failed Tasks

List of Abbreviations

AMI – Amazon Machine Images
API – Application Programming Interface
AWS – Amazon Web Services
BDM – Big Data Mining
CLI – Command Line Interface
CPU – Central Processing Unit
EBS – Elastic Block Store
EC2 – Elastic Compute Cloud
ECU – EC2 Compute Unit
EMR – Elastic MapReduce
GC – Garbage Collector
GUI – Graphical User Interface
HDFS – Hadoop Distributed File System
IMDG – In-Memory Data Grids
IO – Input/Output
JNI – Java Native Interface
JVM – Java Virtual Machines
MPI – Message Passing Interface
RDD – Resilient Distributed Datasets
SSD – Solid State Drives
SSH – Secure Shell
SVM – Support Vector Machines
VM – Virtual Machine
WEKA – Waikato Environment for Knowledge Analysis
YARN – Yet Another Resource Negotiator

Abstract

The volume of data generated and collected across all domains is growing exponentially. Knowledge discovery and decision making demand the ability to process and extract insights from “Big” Data in a scalable and efficient manner. The traditional cluster-based Big Data platform Hadoop provides a scalable solution but imposes performance overheads because it supports only on-disk data. The Data Mining algorithms used in knowledge discovery usually require multiple iterations over the dataset and thus multiple slow disk accesses.
In contrast, modern clusters possess increasing amounts of main memory, which can deliver performance benefits when main-memory caching mechanisms are used efficiently.

Apache Spark is an innovative distributed computing framework that supports in-memory computations. The objective of this dissertation is to design and develop a scalable Data Mining framework to run on top of Spark and to identify and document the advantages and disadvantages of main-memory caching on Data Mining workloads.

The workloads consisted of distributed implementations of Weka's Data Mining algorithms. Benchmarking was performed by testing seven different caching strategies on different workloads, measuring elapsed time and monitoring resource utilisation.

The project contributions are three-fold:

1. Design and development of a distributed Data Mining framework that achieves near-linear scaling in executing Data Mining workloads in parallel;
2. Analysis of the behaviour of distributed main-memory caching mechanisms on different Data Mining execution scenarios;
3. Design and development of an automated caching strategy selection mechanism that assesses dataset and cluster characteristics and selects an appropriate caching scheme.
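
The idea behind the third contribution can be illustrated with a minimal sketch. The Python below is not the dissertation's selection algorithm, which this preview does not show. It is an assumed rule of thumb that compares an estimated in-memory footprint of the dataset with the cluster's aggregate cache capacity. The storage-level names match constants in Spark's StorageLevel API (Scala/Java); ClusterProfile, select_storage_level, and the expansion factors object_overhead and serialized_ratio are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass

# Names of storage levels defined by Apache Spark's StorageLevel (Scala/Java API).
MEMORY_ONLY = "MEMORY_ONLY"
MEMORY_ONLY_SER = "MEMORY_ONLY_SER"
MEMORY_AND_DISK_SER = "MEMORY_AND_DISK_SER"
DISK_ONLY = "DISK_ONLY"


@dataclass
class ClusterProfile:
    """Hypothetical cluster summary; not taken from the dissertation."""
    nodes: int
    cache_bytes_per_node: int  # memory each node can devote to caching


def select_storage_level(dataset_bytes, cluster,
                         object_overhead=3.0, serialized_ratio=1.0):
    """Return a storage-level name for a dataset of the given on-disk size.

    object_overhead and serialized_ratio are ASSUMED expansion factors for
    deserialised and serialised in-memory representations, respectively.
    """
    if dataset_bytes <= 0 or cluster.nodes <= 0:
        raise ValueError("dataset size and node count must be positive")

    capacity = cluster.nodes * cluster.cache_bytes_per_node

    # Deserialised objects are fastest to access but take the most space.
    if dataset_bytes * object_overhead <= capacity:
        return MEMORY_ONLY
    # Serialised caching trades CPU time for a smaller footprint.
    if dataset_bytes * serialized_ratio <= capacity:
        return MEMORY_ONLY_SER
    # Partial fit: keep what fits in memory and spill the rest to disk.
    if capacity > 0:
        return MEMORY_AND_DISK_SER
    # No cache memory available at all.
    return DISK_ONLY
```

For example, under these assumed factors a cluster of four nodes with 4 GiB of cache each would hold a 4 GiB dataset as MEMORY_ONLY, a 10 GiB dataset as MEMORY_ONLY_SER, and would spill a 40 GiB dataset to disk with MEMORY_AND_DISK_SER.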

