ebook img

MANAGING AND MINING UNCERTAIN DATA - Charu Aggarwal PDF

513 Pages·2008·3.77 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview MANAGING AND MINING UNCERTAIN DATA - Charu Aggarwal

MANAGING AND MINING UNCERTAIN DATA MANAGING AND MINING UNCERTAIN DATA Editedby CHARUC.AGGARWAL IBMT.J.WatsonResearchCenter,Hawthorne,NY10532 KluwerAcademicPublishers Boston/Dordrecht/London Preface Uncertaindatamanagementhasseenarevivalininterestinrecentyearsbe- causeofanumberofnewfieldswhichutilizethiskindofdata. Forexample,in fields such as privacy-preserving data mining, additional errors may be added to data in order to mask the identity of the records. Often the data may be imputed using statistical methods such as forecasting. In such cases, the data is uncertain in nature. Such data sets may often be probabilistic in nature. In othercases,databasesmayshowexistentialuncertaintyinwhichoneormore records may be present or absent from the data set. Such data sets lead to a numberofuniquechallengesinprocessingandmanagingtheunderlyingdata. The field of uncertain data management has been studied in the traditional database literature, but the field has seen a revival in recent years because of newwaysofcollectingdata. Thefieldofuncertaindatamanagementpresents a number of challenges in terms of collecting, modeling, representing, query- ing, indexing and mining the data. We further note that many of these issues areinter-relatedandcannoteasilybeaddressedindependently. Whilemanyof theseissueshavebeenaddressedinrecentresearch,theresearchinthisareais oftenquitevariedinitsscope. Forexample, eventheunderlyingassumptions of uncertainty are different across different papers. It is often difficult for re- searchers and students to find a single place containing a coherent discussion onthetopic. Thisbookisdesignedtoprovideacoherenttreatmentofthetopicofuncer- taindatamanagementbyprovidingsurveysofthekeytopicsinthisfield. The book is structured as an edited volume containing surveys by prominent re- searchersinthefield. Thechoiceofchaptersiscarefullydesigned,sothatthe overall content of the uncertain data management and mining field is covered reasonablywell. Eachchaptercontainsthekeyresearchcontentonaparticular topic,alongwithpossibleresearchdirections. Thisincludesabroadoverview of the topic, the different models and systems for uncertain data, discussions ondatabaseissuesformanaginguncertaindata,andminingissueswithuncer- taindata. Twoofthemostprominentsystemsforuncertaindatahavealsobeen describedinthebookinordertoprovideanideahowrealuncertaindataman- agement systems might work. The idea is to structurally organize the topic, vi MANAGINGANDMININGUNCERTAINDATA and provide insights which are not easily available otherwise. It is hoped that this structural organization and survey approach will be a great help to stu- dents,researchers,andpractitionersinthefieldofuncertaindatamanagement andmining. Contents Preface v ListofFigures xv ListofTables xxi 1 AnIntroductiontoUncertainDataAlgorithmsandApplications 1 CharuC.Aggarwal 1. Introduction 1 2. AlgorithmsforUncertainData 3 3. Conclusions 6 References 7 2 ModelsforIncompleteandProbabilisticInformation 9 ToddJ.Green 1. Introduction 9 2. IncompleteInformationandRepresentationSystems 13 RA 3. -CompletenessandFiniteCompleteness 14 4. ClosureUnderRelationalOperations 18 5. AlgebraicCompletion 19 6. ProbabilisticDatabasesandRepresentationSystems 21 7. Probabilistic?-TablesandProbabilisticOr-SetTables 22 8. Probabilisticc-tables 24 9. QueriesonAnnotatedRelations 25 10. K-Relations 27 11. PolynomialsforProvenance 30 12. QueryContainment 33 13. RelatedWork 34 14. ConclusionandFurtherWork 34 References 37 15. Appendix 41 3 RelationalModelsandAlgebraforUncertainData 45 SumitSarkarandDebabrataDey 1. Introduction 45 viii MANAGINGANDMININGUNCERTAINDATA 2. DifferentProbabilisticDataModels 48 2.1 Point-ValuedProbabilityMeasuresAssignedtoEachTuple 48 2.2 Point-ValuedProbabilityMeasuresAssignedtoAttributesandAt- tributeSets 52 2.3 Interval-ValuedProbabilityMeasuresAssignedtoAttribute-Values 54 2.4 Interval-valuedProbabilityMeasuresAssignedtoTuples 56 3. ProbabilisticRelationalAlgebra 56 3.1 BasicDefinitions 56 3.2 PrimaryandForeignKeys 58 3.3 RelationalOperations 59 3.4 RelationalAlgebra 62 3.5 IncompleteDistributionandNullValues 63 4. AlgebraicImplicationsoftheDifferentRepresentationsandAssociated Assumptions 67 4.1 PointValuedProbabilityMeasuresinTuples 68 4.2 Point-ValuedProbabilityMeasuresinAttributes 69 4.3 Interval-ValuedProbabilityMeasuresinAttributes 70 4.4 Interval-ValuedProbabilityMeasuresinTuples 71 4.5 ObservationsonTupleIndependence 72 5. ConcludingRemarks 72 References 75 4 GraphicalModelsforUncertainData 77 AmolDeshpande,LiseGetoorandPrithvirajSen 1. Introduction 78 2. GraphicalModels: Overview 80 2.1 DirectedGraphicalModels: BayesianNetworks 82 2.2 UndirectedGraphicalModels: MarkovNetworks 84 2.3 InferenceQueries 85 3. RepresentingUncertaintyusingGraphicalModels 87 3.1 PossibleWorldSemantics 88 3.2 SharedFactors 91 3.3 RepresentingProbabilisticRelations 91 4. QueryEvaluationoverUncertainData 92 4.1 Example 94 4.2 GeneratingFactorsduringQueryEvaluation 96 4.3 QueryEvaluationasInference 99 4.4 Optimizations 100 5. RelatedWorkandDiscussion 101 5.1 SafePlans 101 5.2 RepresentingUncertaintyusingLineage 102 5.3 ProbabilisticRelationalModels 102 5.4 LiftedInference 104 5.5 ScalableInferenceusingaRelationalDatabase 104 6. Conclusions 105 References 107 Contents ix 5 Trio: ASystemforData,Uncertainty,andLineage 113 JenniferWidom 1. ULDBs: Uncertainty-LineageDatabases 116 1.1 Alternatives 117 1.2 ‘?’ (Maybe)Annotations 117 1.3 Confidences 118 1.4 Lineage 119 1.5 RelationalQueries 121 2. TriQL:TheTrioQueryLanguage 122 2.1 OperationalSemantics 122 2.2 QueryingConfidences 124 2.3 QueryingLineage 124 2.4 DuplicateElimination 125 2.5 Aggregation 126 2.6 ReorganizingAlternatives 127 2.7 HorizontalSubqueries 128 2.8 Query-DefinedResultConfidences 128 2.9 OtherTriQLQueryConstructs 129 3. DataModificationsinTrio 130 3.1 Inserts 130 3.2 Deletes 131 3.3 Updates 131 3.4 DataModificationsandVersioning 133 4. ConfidenceComputation 133 5. AdditionalTrioFeatures 135 6. TheTrioSystem 136 6.1 EncodingULDBData 138 6.2 BasicQueryTranslationScheme 140 6.3 DuplicateElimination 141 6.4 Aggregation 141 6.5 ReorganizingAlternatives 142 6.6 HorizontalSubqueries 142 6.7 Built-InPredicatesandFunctions 143 6.8 Query-DefinedResultConfidences 144 6.9 RemainingConstructs 144 References 147 6 MayBMS:ASystemforManagingLargeProbabilisticDatabases 149 ChristophKoch 1. Introduction 149 2. ProbabilisticDatabases 151 3. QueryLanguageDesiderata 152 4. TheAlgebra 153 5. RepresentingProbabilisticData 159 6. ConceptualQueryEvaluation 164 7. TheMayBMSQueryandUpdateLanguage 170 8. TheMayBMSSystem 175 9. ConclusionsandOutlook 178 x MANAGINGANDMININGUNCERTAINDATA References 181 7 UncertaintyinDataIntegration 185 AnishDasSarma,XinDongandAlonHalevy 1. Introduction 185 2. OverviewoftheSystem 187 2.1 Uncertaintyindataintegration 187 2.2 Systemarchitecture 188 2.3 Sourceofprobabilities 189 2.4 Outlineofthechapter 190 3. UncertaintyinMappings 190 3.1 Motivatingprobabilisticmappings 190 3.2 DefinitionandSemantics 192 3.3 QueryAnswering 197 3.4 CreatingP-mappings 201 3.5 BroaderClassesofMappings 204 3.6 OtherTypesofApproximateSchemaMappings 206 4. UncertaintyinMediatedSchema 207 4.1 P-Med-SchemaMotivatingExample 207 4.2 ProbabilisticMediatedSchema 210 4.3 P-med-schemaCreation 212 4.4 Consolidation 214 4.5 Otherapproaches 216 5. FutureDirections 217 References 219 8 SketchingAggregatesoverProbabilisticStreams 223 ErikVee 1. Introduction 223 1.1 Aggregatesoverprobabilisticstreams 225 1.2 Organization 226 2. TheProbabilisticStreamModel 226 2.1 Problemdefinitions 228 2.2 FrequencyMomentsandQuantiles 229 3. Overviewoftechniquesandsummaryofresults 231 4. UniversalSampling 235 5. Frequencymoments: DISTINCTandREPEAT-RATE 236 5.1 DISTINCT 236 5.2 REPEAT-RATE 238 6. Heavy-Hitters,Quantiles,andMEDIAN 240 7. ABinningTechniqueforMINandMAX 241 8. EstimatingAVGusinggeneratingfunctions 244 8.1 Generatingfunctions 244 8.2 EstimatingAVG 245 8.3 ApproximatingAVGbySUM/COUNT 248 9. Discussion 252 References 253

Description:
MANAGING AND MINING UNCERTAIN DATA Edited by CHARU C. AGGARWAL IBM T. J. Watson Research Center, Hawthorne, NY 10532 Kluwer Academic Publishers Boston/Dordrecht/London
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.