Table Of Content

Approximation of OLAP queries on data warehouses Phuong Thao Cao To cite this version: Phuong Thao Cao. Approximation of OLAP queries on data warehouses. Other [cs.OH]. Université Paris Sud - Paris XI, 2013. English. NNT: 2013PA112091. tel-00905292 HAL Id: tel-00905292 https://theses.hal.science/tel-00905292 Submitted on 18 Nov 2013 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. universite´ paris-sud ećole doctorale informatique paris-sud Laboratoire de recherche en informatique discipline: informatique the`se de doctorat soutenue le 20/06/2013 par Phuong Thao CAO Approximation of OLAP queries on data warehouses Jury Directeur de thèse Pr. Michel De Rougemont Université Panthéon-Assas Co-directeur de thèse Pr. Nicolas Spyratos Université Paris-Sud Président Pr. Khaldoun Al Agha Université Paris-Sud Rapporteur Pr. Mohand-Said Hacid Université Lyon 1 Rapporteur Pr. Kavé Salamatian Université Savoie Examinateur Pr. Halim Msilti Université Panthéon-Assas i Abstract WestudytheapproximateanswerstoOLAPqueriesondatawarehouses. Weconsider the relative answers to OLAP queries on a schema, as distributions with the L 1 distance and approximate the answers without storing the entire data warehouse. We first introduce three specific methods: the uniform sampling, the measure -based sampling and the statistical model. We introduce also an edit distance between data warehouses with edit operations adapted for data warehouses. Then, in the OLAP data exchange, we study how to sample each source and combine the samples to approximate any OLAP query. We next consider a streaming context, where a data warehouse is built by streams of different sources. We show a lower bound on the size of the memory necessary to approximate queries. In this case, we approximate OLAP queries with a finite memory. We describe also a method to discover the statistical dependencies, a new notion we introduce. We are looking for them based on the decision tree. We apply the method to two data warehouses. The first one simulates the data of sensors, which provide weather parameters over time and location from different sources. The second one is the collection of RSS from the web sites on Internet. Keywords: OLAP, approximate query answering, OLAP data exchange, streaming data, edit distance, sampling algorithm, statistical dependencies, statistical model. ii Résumé Nous étudions les réponses proches a` des requêtes OLAP sur les entrepoˆts de données. Nous considérons les réponses relatives aux requêtes OLAP sur un schéma, comme les distributions avec la distance L et rapprocher les réponses sans stocker totale- 1 ment l’entrepôt de données. Nous présentons d’abord trois méthodes spécifiques: l’échantillonnage uniforme, l’échantillonnage basé sur la mesure et le modèle statistique. Nous introduisons également une distance d’édition entre les entrepôts de données avec des opérations d’édition adaptées aux entrepoˆts de données. Puis, dans l’échange de données OLAP, nous étudions comment échantillonner chaque source et combiner les échantillons pour rapprocher toutes requêtes OLAP. Nous examinons ensuite un contexte streaming, ou` un entrepôt de données est construit par les flux de différentes sources. Nous montrons une borne inférieure de la taille de la mémoire nécessaire aux requêtes approximatives. Dans ce cas, nous avons les réponses pour les requêtes OLAP avec une mémoire finie. Nous décrivons également une méthode pour découvrir les dépendances statistique, une nouvelle notion que nous introduisons. Nous recherchons ces dépendances en basant sur l’arbre de décision. Nous appliquons la méthode a` deux entrepoˆts de données. Le premier simule les données de capteurs, qui fournissent des paramètres météorologiques au fil du temps et de l’emplacement a` partir de différentes sources. Le deuxième est la collecte de RSS a` partir des sites web sur Internet. Mots-clés: OLAP, réponses proches de la requête, échange de données OLAP, des flux de données, distance d’édition, algorithme d’échantillonnage, dépendances statistiques, modèle statistique. iii Acknowledgments First and foremost, I am very grateful to professor Dr. Michel de Rougemont, my supervisor, for his guidance and encouragement throughout this thesis work. His rich experience and support has been invaluable to me. I would like to thank my supervisor, professor Dr. Nicolas Spyratos, for introducing me to database team, for review my work and making useful suggestions. Deep thanks to Tsuyoshi Sugibuchi for his priceless technical support and advices. Also, I would like to take this opportunity to personally thank all of my colleagues and friends in database team for their help. I would like to acknowledge professor Dr. Philippe Dague, Director of LRI, for his financial support. I would like to acknowledge professor Dr. Mohand-Said Hacid, professor Dr. Kavé Salamatian, professor Dr. Halim Msilti and professor Dr. Khaldoun Al Agha for being my thesis committee members and for their advice and comments. Finally, 1 love to thank my parents and my family for their love, encouragement and support. Without their understanding, my research could not have been realized. Contents Contents iv List of Figures vii List of Tables viii 1 Introduction 1 1.1 State of the art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Thesis plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Preliminaries 5 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 On-Line Analytical Processing (OLAP) . . . . . . . . . . . . . . . . . 5 2.2.1 Schemas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2.2 Relational representation . . . . . . . . . . . . . . . . . . . . . 6 2.2.3 Data warehouses . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.4 Queries and answers . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.1 Approximate query answering . . . . . . . . . . . . . . . . . . 9 2.3.2 Sampling methods . . . . . . . . . . . . . . . . . . . . . . . . 10 2.4 Probabilistic tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.4.1 Chernoff bounds . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.4.2 Hoeffding bounds . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.4.3 Number of samples . . . . . . . . . . . . . . . . . . . . . . . . 13 2.5 Contexts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.5.1 Data exchange . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.5.2 Streaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3 Approximation by Sampling a Data Warehouse 17 3.1 Uniform sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1.1 Close answers for each component . . . . . . . . . . . . . . . . 19 iv CONTENTS v 3.1.2 Close answers of an OLAP query . . . . . . . . . . . . . . . . 20 3.2 Measure-based sampling . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.2.1 Close answers for each component . . . . . . . . . . . . . . . . 22 3.2.2 Close answers of an OLAP query . . . . . . . . . . . . . . . . 23 3.3 Important comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4 Edit Distance on Data Warehouses 27 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.2 Edit operations and cost functions . . . . . . . . . . . . . . . . . . . . 27 4.3 Distance between two data warehouses . . . . . . . . . . . . . . . . . 28 4.4 Distance between two answers . . . . . . . . . . . . . . . . . . . . . . 29 4.5 Properties of distance between two data warehouses . . . . . . . . . . 29 4.6 Continuity theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 5 Statistical Model 35 5.1 Statistical dependencies . . . . . . . . . . . . . . . . . . . . . . . . . 35 5.2 Relation between the statistical model and the query answering . . . 36 5.3 Approximation by the statistical model . . . . . . . . . . . . . . . . . 36 5.3.1 Analysis of the measure on one dimension . . . . . . . . . . . 36 5.3.2 Analysis of the measure on two dimensions . . . . . . . . . . . 42 5.3.3 Analysis of the measure on k dimensions . . . . . . . . . . . . 44 5.4 Learning a statistical model by sampling . . . . . . . . . . . . . . . . 47 5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 6 OLAP Data Exchange 51 6.1 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 6.2 Approximation with the uniform distribution . . . . . . . . . . . . . . 51 6.3 Approximation with the measure-based distribution . . . . . . . . . . 53 6.4 Approximation by the statistical model in data exchange . . . . . . . 54 6.4.1 Conditions for the approximation . . . . . . . . . . . . . . . . 54 6.4.2 Different evaluation methods . . . . . . . . . . . . . . . . . . . 55 6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 7 Streaming Data 59 7.1 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 7.2 Lower Bounds on the space complexity for unbounded domains . . . 60 7.3 Bounded domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 7.4 Design of counters for the analysis on more than one dimension . . . 63 7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 vi CONTENTS 8 Mining Statistical Dependencies 65 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 8.2 Entropy and information gain . . . . . . . . . . . . . . . . . . . . . . 66 8.3 Criteria of mining statistical dependencies . . . . . . . . . . . . . . . 67 8.4 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 8.5 Graphic representation of statistical dependencies . . . . . . . . . . . 69 8.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 9 Implementation 71 9.1 Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 9.1.1 Mondrian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 9.1.2 JPivot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 9.1.3 Mondrian Schema Workbench . . . . . . . . . . . . . . . . . . 75 9.2 Sensors data warehouse . . . . . . . . . . . . . . . . . . . . . . . . . . 76 9.2.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 9.2.2 Approximate answers for the different sampling methods . . . 79 9.2.3 Approximate answer from the statistical model . . . . . . . . 81 9.2.4 Mining of statistical dependencies . . . . . . . . . . . . . . . . 86 9.3 RSS data warehouse . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 9.3.1 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 9.3.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 9.3.3 Schemas of sets of samples . . . . . . . . . . . . . . . . . . . . 89 9.3.4 Quality of approximation in the different methods . . . . . . . 90 9.3.5 Mining of statistical dependencies . . . . . . . . . . . . . . . . 94 9.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 10 Conclusion and Future Work 101 11 Approximation of OLAP queries on data warehouses SYNTHESE EN FRANCAISE 103 Bibliography 111 List of Figures 2.1 An OLAP schema. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Representations of answers. . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3 OLAP data exchange . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.4 OLAP streaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 5.1 Relation between statistical model and query answering. . . . . . . . . . 37 6.1 An OLAP data exchange context. . . . . . . . . . . . . . . . . . . . . . . 52 6.2 Statistical computation by the target. . . . . . . . . . . . . . . . . . . . . 56 6.3 Statistical computation by the sources. . . . . . . . . . . . . . . . . . . . 56 6.4 Approximate answer at each source. . . . . . . . . . . . . . . . . . . . . . 57 7.1 Streaming model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 7.2 Design of counters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 8.1 Generalize the decision tree . . . . . . . . . . . . . . . . . . . . . . . . . 66 8.2 Statistical decision tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 9.1 Mondrian architecture [5]. . . . . . . . . . . . . . . . . . . . . . . . . . . 72 9.2 Jpivot interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 9.3 Analysis of sun over country and manufacturer. . . . . . . . . . . . . . . 75 9.4 Interface of Mondrian Schema Workbench. . . . . . . . . . . . . . . . . . 76 9.5 OLAP schema of sensors. . . . . . . . . . . . . . . . . . . . . . . . . . . 77 9.6 Distribution of sun on Marseillle and London. . . . . . . . . . . . . . . . 80 9.7 Q2: Analysis of sun on manufacturer on the set Sensor (Exact answer). 80 9.8 Analysis of sun on manufacturer. . . . . . . . . . . . . . . . . . . . . . . 81 9.9 Statistical computation by the target. . . . . . . . . . . . . . . . . . . . . 82 9.10 Statistical computation by the sources. . . . . . . . . . . . . . . . . . . . 82 9.11 Approximate answer at each source. . . . . . . . . . . . . . . . . . . . . . 83 9.12 Exact answer and QS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 C 9.13 Quality of mining of city (cid:47) sun. . . . . . . . . . . . . . . . . . . . . . . . 88 9.14 RSS streams. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 9.15 RSS OLAP schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 9.16 Analysis of preference on source. . . . . . . . . . . . . . . . . . . . . . . . 92 9.17 Analysis of preference on domain . . . . . . . . . . . . . . . . . . . . . . 93 9.18 Analysis of preference on region and domain. . . . . . . . . . . . . . . . . 94 9.19 Quality of mining of (region, domain) (cid:47)preference. . . . . . . . . . . . . . 96 vii

Description:

Jun 20, 2013 the relative answers to OLAP queries on a schema, as distributions with the L1 distance and .. 9.4 Interface of Mondrian Schema Workbench.

Approximation of OLAP queries on data warehouses - HAL-Inria PDF

124 Pages·2016·2.87 MB·English

by Phuong Thao Cao

Checking for file health...

Save to my drive

Quick download

Download

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Approximation of OLAP queries on data warehouses - HAL-Inria

Description:

Jun 20, 2013 the relative answers to OLAP queries on a schema, as distributions with the L1 distance and .. 9.4 Interface of Mondrian Schema Workbench.

See more

The list of books you might like

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.