ebook img

Data Engineering: Mining, Information and Intelligence PDF

451 Pages·2010·6.158 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Data Engineering: Mining, Information and Intelligence

International Series in Operations Research & Management Science Volume 132 SeriesEditor FrederickS.Hillier StanfordUniversity,CA,USA SeriesEditorialConsultant CamillePrice StephenF.AustinStateUniversity,TX,USA Forfurthervolumes: http://www.springer.com/series/6161 Data Engineering Mining, Information and Intelligence Edited by Yupo Chan John R. Talburt Terry M. Talley 123 Editors YupoChan JohnR.Talburt DepartmentofSystemsEngineering DepartmentofInformationScience UniversityofArkansas UniversityofArkansas,LittleRock DonagheyCollegeofInfoSci. 2801SouthUniversityAve. 2801SouthUniversityAvenue LittleRockAR72204-1099 LittleRockAR72204-1099 USA USA [email protected] [email protected] TerryM.Talley AcxiomCorporation P.O.Box2000 ConwayAR72033-2000 USA [email protected] ISBN978-1-4419-0175-0 e-ISBN978-1-4419-0176-7 DOI10.1007/978-1-4419-0176-7 SpringerNewYorkDordrechtHeidelbergLondon LibraryofCongressControlNumber:2009934450 (cid:2)c SpringerScience+BusinessMedia,LLC2010 Allrightsreserved.Thisworkmaynotbetranslatedorcopiedinwholeorinpartwithoutthewritten permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA),except for brief excerpts in connection with reviews orscholarly analysis. Usein connection with any form of information storage and retrieval, electronic adaptation, computer software,orbysimilarordissimilarmethodologynowknownorhereafterdevelopedisforbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not theyaresubjecttoproprietaryrights. Printedonacid-freepaper SpringerispartofSpringerScience+BusinessMedia(www.springer.com) Preface Background Many of us live in a world which is immersed in an overabundance of data. Not all this data, however, is either welcome or useful to the individual. The key is to bring the appropriate data together in a comprehensive and comprehensible way, then distill the data into information that is useful - information that will provide intelligence and insight for actionable strategies. This volume describes applied research targeted at the task of collecting data and distilling useful information from that data. The work emanates primarily from applied research completed through collaborations between Acxiom Corpo- ration and its academic research partners under the aegis of the Acxiom Labora- tory for Applied Research (ALAR). Acxiom Corporation is a global leader in customer and information- management solutions with international offices and operational units in the Unit- ed States, Europe, Australia, and China. ALAR was created as a means to engage the intellectual resources of the academic community in applied research that ad- dresses specific industry-defined problems and interests. The ALAR collabora- tions began in 2001 with several of the local universities near Acxiom’s corporate headquarters in Little Rock, Arkansas, such as the University of Arkansas at Little Rock, the University of Central Arkansas, and the University of Arkansas, Fa- yetteville. Since then, the partnership has expanded to include other universities such as the Massachusetts Institute of Technology and Vanderbilt University. ALAR holds an annual conference for its university partners, in which papers are solicited, refereed and compiled into Proceedings. The Laboratory also spon- sors a focused annual conference for an invited audience where research results are presented. The top working papers from these conferences comprise the basis of this monograph. A working paper, once accepted, is required to go through an- other review for quality assurance. Primary Audience The issues discussed in these working papers include data integration, data man- agement, data mining, and data visualization. These subjects are of interest to all those interested in data engineering, whether they are from academia, market- research firms, or business-intelligence companies. Notable research products are vi Preface captured in this volume, consisting of reorganized chapters and extended results from those reported in the chosen working papers. The findings reported here rep- resent not only the key issues facing Acxiom, but also other information- management concerns in the world. Because of the background from which these chapters are generated, the volume is ideally suited for researchers, practitioners, and postgraduate students alike. The contributions in this volume have their roots in problems arising from industry, rather than emanating from a basic research perspective. Each chapter includes Exercises at the end, making the book suitable for classroom use. Together with extensive references, and subject index, it can serve the academic, the research and industrial audiences. Organization The chapters in this book are roughly ordered to follow the logical sequence of the transformation of data from raw input data streams to refined information. The first half of the book addresses challenges associated with data integration and da- ta management. These chapters particularly focus on two key problems within da- ta integration: entity resolution and the challenges associated with very large scale data integration. The second half of the book is focused on the distillation of knowledge or information from data, with a particular focus on data mining, ana- lytics and data visualization. Specifically, we have organized our discussions using the following taxonomy: • Data Integration and Information Quality • Grid Computing • Data Mining • Visualization In other words, we start out with how we organize the data and end up with how to view the information extracted out of the data. Following this taxonomy, we will further delineate (and in many cases reinforce) the future directions in data inte- gration, grid computing, data mining and visualization. Data integration and information quality are complementary areas that are ma- turing together. As is common in science, fields of study become increasingly in- ter-disciplinary as they mature. Information quality was essentially born from the 1980’s movement to aggregate disparate operational data stored into a single re- pository (data warehouse), with the goal of extracting non-obvious business intel- ligence through data mining and other analytical techniques. From these efforts it became clear that independently held and maintained data stores were not easily integrated into a single, logically-consistent knowledgebase. It exposed problems with data accuracy, completeness, consistency, timeliness, and a myriad of other issues. In its earliest form, information quality was focused on methods and tech- niques to clean “dirty data” to the point that it could be properly integrated. There has been a dramatic increase in the interest and visibility of grid comput- ing during the preparation of this book. Once largely confined to the research community and a few targeted business domains, many of the concepts and tech- niques of grid computing have emerged as mainstream trends. Adoption of high throughput application paradigms such as Map-Reduce and high performance file systems such as the Google File System are becoming increasingly common Preface vii across a variety of application domains. While some of the initial interest perhaps resulted from the publicity around the phenomenal demonstrated success of Google in exploiting massive processing power on very large amounts of data, the interest has been sustained and increased by the fact that these techniques have proven effectively and efficiently applicable to a wide range of problems. With the advent of processing chips containing many processing cores and with the de- velopment and refinement of both virtual machine hardware and software, parallel processing is increasingly considered the standard computing paradigm. A focus of this book is to identify the correct person or object based on the data attributes associated with them. Future research into Entity Resolution could con- centrate on integrating data provenance policies, i.e., when to insert, update, and delete based on sources. For law enforcement, for example, it could further refine the path-based prospecting approach, and positive or negative associations be- tween entities. For instance, who bought guns from dealer X and also lived with known suspect Y? The last part of our book examines how useful information can be shown visu- ally to the user, including the authenticity of a document. In the “Image Water- marking” chapter, for example, a new, temper-proof method for image watermark- ing is developed using the phase spectrum of digital images. The watermarked image is obtained after multi-level “Inverse Difference Pyramid (IDP) decomposi- tion with 2D Complex Hadamard Transform (CHT). The watermark data inserted in the consecutive decomposition layers do not interact, and the watermark could be detected without using the original image. The main advantages of the pre- sented method, which result from the high stability of their phase spectrum, are the following. Perhaps the most important are the perceptual transparency of the inserted resistant watermark; its large information capacity; and the high resis- tance against noises in the communication channel and pirates’ attacks for its re- moval. The conclusion chapter lays out our vision of how each of these four areas— Data Integration and Information Quality, Grid Computing, Data Mining, and Visualization—are heading. Again, we combine both academic and industrial perspectives. In the editors’ and authors’ opinion, this volume complements other in the Int. Series In Operations Research & Management Science in the fol- lowing ways. Most modeling procedures that are covered comprehensively in this series are built upon an available database. Yet few concern themselves with the compilation of the database. Still fewer people worry about the quality of the data once they are compiled. Recent interests shown in Analytics suggest that busi- nesses and other enterprises are increasingly concerned with the extraction of ac- tionable decisions from data. We view the current volume as providing a much needed link in the long chain that connects data and decision-making. Acknowledgements The editors would like to thank the authors for their hard work and patience during the compilation of this volume. We are also grateful to Acxiom Corporation for their active engagement in and support of academic research through the Acxiom Laboratory for Applied Research (ALAR) without which this book would not have been possible. We would also like to thank Natalie Rego for her assistance in proof reading and editing suggestions. Table of Contents 1 Introduction........................................................................................................1 1.1 Common Problem........................................................................................1 1.2 Data Integration and Data Management......................................................3 1.2.1 Information Quality Overview.............................................................3 1.2.2 Customer Data Integration...................................................................4 1.2.3 Data Management.................................................................................8 1.2.4 Practical Problems to Data Integration and Management...................9 1.3 Analytics....................................................................................................10 1.3.1 Model Development...........................................................................10 1.3.2 Current Modeling and Optimization Techniques...............................11 1.3.3 Specific Algorithms and Techniques for Improvement......................12 1.3.4 Incremental or Evolutionary Updates.................................................13 1.3.5 Visualization.......................................................................................15 1.4 Conclusion.................................................................................................15 1.5 References..................................................................................................16 2 A Declarative Approach to Entity Resolution................................................17 2.1 Introduction................................................................................................17 2.2 Background................................................................................................18 2.2.1 Entity Resolution Definition...............................................................18 2.2.2 Entity Resolution Defense..................................................................18 2.2.3 Entity Resolution Terminology..........................................................19 2.2.4 Declarative Languages.......................................................................20 2.3 The Declarative Taxonomy: The Nouns....................................................20 2.3.1 Attributes............................................................................................21 2.3.2 References..........................................................................................21 2.3.3 Paths and Match Functions.................................................................22 2.3.4 Entities................................................................................................24 2.3.5 Super Groups......................................................................................25 2.3.6 Matching Graphs................................................................................26 2.4 A Declarative Taxonomy: The Adjectives.................................................27 2.4.1 Attribute Adjectives...........................................................................27 2.4.2 Reference Adjectives..........................................................................29 2.5 The Declarative Taxonomy: The Verbs.....................................................29 2.5.1 Attribute Verbs...................................................................................29 2.5.2 Reference Verbs.................................................................................30 2.5.3 Entity Verbs........................................................................................32 x Table of Contents 2.6 A Declarative Representation....................................................................33 2.6.1 The XML Schema..............................................................................34 2.6.2 A Representation for the Operations..................................................36 2.7 Conclusion.................................................................................................37 2.8 Exercises....................................................................................................37 2.9 References.................................................................................................37 3 Transitive Closure of Data Records: Application and Computation...........39 3.1 Introduction...............................................................................................39 3.1.1 Motivation..........................................................................................40 3.1.2 Literature Review...............................................................................42 3.2 Problem Definition....................................................................................43 3.3 Sequential Algorithms...............................................................................45 3.3.1 A Breadth First Search Based Algorithm...........................................45 3.3.2 A Sorting and Disjoint Set Based Algorithm.....................................47 3.3.3 Experiment.........................................................................................51 3.4 Parallel and Distributed Algorithms..........................................................53 3.4.1 An Overview of a Parallel and Distributed Scheme...........................53 3.4.2 Generate Matching Pairs....................................................................55 3.4.3 Conversion Process............................................................................55 3.4.4 Closure Process..................................................................................56 3.4.5 A MPI Based Parallel and Distributed Algorithm..............................62 3.4.6 Experiment.........................................................................................64 3.5 Conclusion.................................................................................................70 3.6 Exercises....................................................................................................71 3.7 Acknowledgments.....................................................................................73 3.8 References.................................................................................................74 4 Semantic Data Matching: Principles and Performance................................77 4.1 Introduction...............................................................................................77 4.2 Problem Statement: Data Matching for Customer Data Integration..........78 4.3 Semantic Data Matching............................................................................78 4.3.1 Background on Latent Semantic Analysis.........................................78 4.3.2 Analysis..............................................................................................80 4.4 Effect of Shared Terms..............................................................................81 4.4.1 Fundamental Limitations on Data Matching......................................81 4.4.2 Experiments........................................................................................82 4.5 Results.......................................................................................................83 4.6 Conclusion.................................................................................................87 4.7 Exercises....................................................................................................89 4.8 Acknowledgments.....................................................................................89 4.9 References.................................................................................................89 5 Application of the Near Miss Strategy and Edit Distance to Handle Dirty Data......................................................................................................................91 5.1 Introduction...............................................................................................91 5.2 Background................................................................................................92 Table of Contents xi 5.2.1 Techniques used for General Spelling Error Correction....................93 5.2.2 Domain-Specific Correction...............................................................95 5.3 Individual Name Spelling Correction Algorithm: the Personal Name Recognition Strategy (PNRS)..........................................................................96 5.3.1 Experiment Results.............................................................................98 5.4 Conclusion.................................................................................................99 5.5 Exercises....................................................................................................99 5.6 References................................................................................................100 6 A Parallel General-Purpose Synthetic Data Generator..............................103 6.1 Introduction..............................................................................................103 6.2 SDDL.......................................................................................................104 6.2.1 Min/Max Constraints........................................................................105 6.2.2 Distribution Constraints...................................................................106 6.2.3 Formula Constraints.........................................................................106 6.2.4 Iterations...........................................................................................106 6.2.5 Query Pools......................................................................................108 6.3 Pools........................................................................................................108 6.4 Parallel Data Generation..........................................................................110 6.4.1 Generation Algorithm 1....................................................................111 6.4.2 Generation Algorithm 2....................................................................112 6.5 Performance and Applications.................................................................113 6.6 Conclusion and Future Directions............................................................114 6.7 Exercises..................................................................................................116 6.8 References................................................................................................117 7 A Grid Operating Environment for CDI......................................................119 7.1 Introduction..............................................................................................119 7.2 Grid-Based Service Deployment.............................................................120 7.2.1 Evolution of the Acxiom Grid (A Case Study)................................120 7.2.2 Services Grid....................................................................................122 7.2.3 Grid Management.............................................................................124 7.3 Grid-Based Batch Processing..................................................................127 7.3.1 Workflow Grid.................................................................................127 7.3.2 I/O Constraints.................................................................................133 7.3.3 Data Grid..........................................................................................135 7.3.4 Database Grid...................................................................................137 7.3.5 Data Management.............................................................................138 7.4 Conclusion...............................................................................................140 7.5 Exercises..................................................................................................141 8 Parallel File Systems.......................................................................................143 8.1 Introduction..............................................................................................143 8.2 Commercial Data and Access Patterns....................................................144 8.2.1 Large File Access Patterns...............................................................145 8.2.2 File System Interfaces......................................................................146 xii Table of Contents 8.3 Basics of Parallel File Systems................................................................147 8.3.1 Common Storage System Hardware................................................148 8.4 Design Challenges...................................................................................149 8.4.1 Performance.....................................................................................150 8.4.2 Consistency Semantics.....................................................................150 8.4.3 Fault Tolerance.................................................................................151 8.4.4 Interoperability.................................................................................152 8.4.5 Management Tools...........................................................................153 8.4.6 Traditional Design Challenges.........................................................154 8.5 Case Studies.............................................................................................154 8.5.1 Multi-Path File System (MPFS).......................................................154 8.5.2 Parallel Virtual File System (PVFS)................................................157 8.5.3 The Google File System (GFS)........................................................160 8.5.4 pNFS................................................................................................163 8.6 Conclusion...............................................................................................167 8.7 Exercises..................................................................................................167 8.8 References...............................................................................................168 9 Performance Modeling of Enterprise Grids................................................169 9.1 Introduction and Background..................................................................169 9.1.1 Performance Modeling.....................................................................169 9.1.2 Capacity Planning Tools and Methodology.....................................171 9.2 Measurement Collection and Preliminary Analysis.................................173 9.3 Workload Characterization......................................................................174 9.3.1 K-means Clustering..........................................................................176 9.3.2 Hierarchical Workload Characterization..........................................181 9.3.3 Other Issues in Workload Characterization......................................182 9.4 Baseline System Models and Tool Construction.....................................184 9.4.1 Analytic Models...............................................................................184 9.4.2 Simulation Tools for Enterprise Grid Systems.................................191 9.5 Enterprise Grid Capacity Planning Case Study.......................................192 9.5.1 Data Collection and Preliminary Analysis.......................................194 9.5.2 Workload Characterization...............................................................194 9.5.3 Development and Validation of the Baseline Model........................195 9.5.4 Model Predictions............................................................................196 9.6 Summary..................................................................................................199 9.7 Exercises..................................................................................................199 9.8 References...............................................................................................200 10 Delay Characteristics of Packet Switched Networks.................................203 10.1 Introduction...........................................................................................203 10.2 High-Speed Packet Switching Systems.................................................204 10.2.1 Packet Switched General Organization..........................................204 10.2.2 Switching Fabric Structures for Packet Switches...........................205 10.2.3 Queuing Schemes for Packet Switches..........................................206 10.3 Technical Background...........................................................................207 10.3.1 Packet Scheduling in Packet Switches...........................................207 10.3.2 Introduction to Network Calculus..................................................208

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.