ebook img

DATA STREAMS: MODELS AND ALGORITHMS - Charu C. Aggarwal PDF

372 Pages·2008·3.89 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview DATA STREAMS: MODELS AND ALGORITHMS - Charu C. Aggarwal

DATA STREAMS: MODELS AND ALGORITHMS DATA STREAMS: MODELS AND ALGORITHMS Editedby CHARUC.AGGARWAL IBMT.J.WatsonResearchCenter,YorktownHeights,NY10598 KluwerAcademicPublishers Boston/Dordrecht/London Contents ListofFigures xi ListofTables xv Preface xvii 1 AnIntroductiontoDataStreams 1 CharuC.Aggarwal 1. Introduction 1 2. StreamMiningAlgorithms 2 3. ConclusionsandSummary 6 References 7 2 OnClusteringMassiveDataStreams: ASummarizationParadigm 9 CharuC.Aggarwal,JiaweiHan,JianyongWangandPhilipS.Yu 1. Introduction 10 2. TheMicro-clusteringBasedStreamMiningFramework 12 3. ClusteringEvolvingDataStreams: AMicro-clusteringApproach 17 3.1 Micro-clusteringChallenges 18 3.2 Online Micro-cluster Maintenance: The CluStream Algo- rithm 19 3.3 HighDimensionalProjectedStreamClustering 22 4. ClassificationofDataStreams: AMicro-clusteringApproach 23 4.1 On-DemandStreamClassification 24 5. OtherApplicationsofMicro-clusteringandResearchDirections 26 6. PerformanceStudyandExperimentalResults 27 7. Discussion 36 References 36 3 ASurveyofClassificationMethodsinDataStreams 39 MohamedMedhatGaber,ArkadyZaslavskyandShonaliKrishnaswamy 1. Introduction 39 2. ResearchIssues 41 3. SolutionApproaches 43 4. ClassificationTechniques 44 4.1 EnsembleBasedClassification 45 4.2 VeryFastDecisionTrees(VFDT) 46 vi DATASTREAMS:MODELSANDALGORITHMS 4.3 OnDemandClassification 48 4.4 OnlineInformationNetwork(OLIN) 48 4.5 LWClassAlgorithm 49 4.6 ANNCADAlgorithm 51 4.7 SCALLOPAlgorithm 51 5. Summary 52 References 53 4 FrequentPatternMininginDataStreams 61 RuomingJinandGaganAgrawal 1. Introduction 61 2. Overview 62 3. NewAlgorithm 67 4. WorkonOtherRelatedProblems 79 5. ConclusionsandFutureDirections 80 References 81 5 ASurveyofChangeDiagnosis 85 AlgorithmsinEvolvingData Streams CharuC.Aggarwal 1. Introduction 86 2. TheVelocityDensityMethod 88 2.1 SpatialVelocityProfiles 93 2.2 EvolutionComputationsinHighDimensionalCase 95 2.3 Ontheuseofclusteringforcharacterizingstreamevolution 96 3. OntheEffectofEvolutioninDataMiningAlgorithms 97 4. Conclusions 100 References 101 6 Multi-DimensionalAnalysisofData 103 StreamsUsingStreamCubes JiaweiHan,Y.DoraCai,YixinChen,GuozhuDong,JianPei,BenjaminW.Wah,and JianyongWang 1. Introduction 104 2. ProblemDefinition 106 3. ArchitectureforOn-lineAnalysisofDataStreams 108 3.1 Tiltedtimeframe 108 3.2 Criticallayers 110 3.3 Partialmaterializationofstreamcube 111 4. StreamDataCubeComputation 112 4.1 Algorithmsforcubecomputation 115 5. PerformanceStudy 117 6. RelatedWork 120 7. PossibleExtensions 121 8. Conclusions 122 References 123 Contents vii 7 LoadSheddinginDataStreamSystems 127 BrianBabcock,MayurDatarandRajeevMotwani 1. LoadSheddingforAggregationQueries 128 1.1 ProblemFormulation 129 1.2 LoadSheddingAlgorithm 133 1.3 Extensions 141 2. LoadSheddinginAurora 142 3. LoadSheddingforSlidingWindowJoins 144 4. LoadSheddingforClassificationQueries 145 5. Summary 146 References 146 8 TheSliding-WindowComputationModelandResults 149 MayurDatarandRajeevMotwani 0.1 MotivationandRoadMap 150 1. ASolutiontotheBasicCountingProblem 152 1.1 TheApproximationScheme 154 2. SpaceLowerBoundforBasicCountingProblem 157 3. Beyond0’sand1’s 158 4. ReferencesandRelatedWork 163 5. Conclusion 164 References 166 9 ASurveyofSynopsisConstruction 169 inDataStreams CharuC.Aggarwal,PhilipS.Yu 1. Introduction 169 2. SamplingMethods 172 2.1 RandomSamplingwithaReservoir 174 2.2 ConciseSampling 176 3. Wavelets 177 3.1 RecentResearchonWaveletDecompositioninDataStreams 182 4. Sketches 184 4.1 FixedWindowSketchesforMassiveTimeSeries 185 4.2 VariableWindowSketchesofMassiveTimeSeries 185 4.3 SketchesandtheirapplicationsinDataStreams 186 4.4 Sketcheswithp-stabledistributions 190 4.5 TheCount-MinSketch 191 4.6 RelatedCountingMethods: HashFunctionsforDetermining DistinctElements 193 4.7 AdvantagesandLimitationsofSketchBasedMethods 194 5. Histograms 196 5.1 OnePassConstructionofEqui-depthHistograms 198 5.2 ConstructingV-OptimalHistograms 198 5.3 WaveletBasedHistogramsforQueryAnswering 199 5.4 SketchBasedMethodsforMulti-dimensionalHistograms 200 6. DiscussionandChallenges 200 viii DATASTREAMS:MODELSANDALGORITHMS References 202 10 ASurveyofJoinProcessingin 209 DataStreams JunyiXieandJunYang 1. Introduction 209 2. ModelandSemantics 210 3. StateManagementforStreamJoins 213 3.1 ExploitingConstraints 214 3.2 ExploitingStatisticalProperties 216 4. FundamentalAlgorithmsforStreamJoinProcessing 225 5. OptimizingStreamJoins 227 6. Conclusion 230 Acknowledgments 232 References 232 11 IndexingandQueryingDataStreams 237 AhmetBulut,AmbujK.Singh 1. Introduction 238 2. IndexingStreams 239 2.1 Preliminariesanddefinitions 239 2.2 Featureextraction 240 2.3 Indexmaintenance 244 2.4 DiscreteWaveletTransform 246 3. QueryingStreams 248 3.1 Monitoringanaggregatequery 248 3.2 Monitoringapatternquery 251 3.3 Monitoringacorrelationquery 252 4. RelatedWork 254 5. FutureDirections 255 5.1 Distributedmonitoringsystems 255 5.2 Probabilisticmodelingofsensornetworks 256 5.3 Contentdistributionnetworks 256 6. ChapterSummary 257 References 257 12 DimensionalityReductionand 261 ForecastingonStreams SpirosPapadimitriou,JimengSun,andChristosFaloutsos 1. Relatedwork 264 2. Principalcomponentanalysis(PCA) 265 3. Auto-regressivemodelsandrecursiveleastsquares 267 4. MUSCLES 269 5. Trackingcorrelationsandhiddenvariables: SPIRIT 271 6. PuttingSPIRITtowork 276 7. Experimentalcasestudies 278 Contents ix 8. Performanceandaccuracy 283 9. Conclusion 286 Acknowledgments 286 References 287 13 ASurveyofDistributedMiningofDataStreams 289 SrinivasanParthasarathy,AmolGhotingandMatthewEricOtey 1. Introduction 289 2. OutlierandAnomalyDetection 291 3. Clustering 295 4. Frequentitemsetmining 296 5. Classification 297 6. Summarization 298 7. MiningDistributedDataStreamsinResourceConstrainedEnviron- ments 299 8. SystemsSupport 300 References 304 14 AlgorithmsforDistributed 309 DataStreamMining KanishkaBhaduri,KamalikaDas,KrishnamoorthySivakumar,HillolKargupta,Ran WolffandRongChen 1. Introduction 310 2. Motivation: WhyDistributedDataStreamMining? 311 3. ExistingDistributedDataStreamMiningAlgorithms 312 4. Alocalalgorithmfordistributeddatastreammining 315 4.1 LocalAlgorithms: definition 315 4.2 Algorithmdetails 316 4.3 Experimentalresults 318 4.4 Modificationsandextensions 320 5. BayesianNetworkLearningfromDistributedDataStreams 321 5.1 DistributedBayesianNetworkLearningAlgorithm 322 5.2 Selectionofsamplesfortransmissiontoglobalsite 323 5.3 OnlineDistributedBayesianNetworkLearning 324 5.4 ExperimentalResults 326 6. Conclusion 326 References 329 15 ASurveyofStreamProcessing 333 ProblemsandTechniques inSensorNetworks SharmilaSubramaniam,DimitriosGunopulos 1. Challenges 334 x DATASTREAMS:MODELSANDALGORITHMS 2. TheDataCollectionModel 335 3. DataCommunication 335 4. QueryProcessing 337 4.1 AggregateQueries 338 4.2 JoinQueries 340 4.3 Top-k Monitoring 341 4.4 ContinuousQueries 341 5. CompressionandModeling 342 5.1 DataDistributionModeling 343 5.2 OutlierDetection 344 6. Application: TrackingofObjectsusingSensorNetworks 345 7. Summary 347 References 348 Index 353 List of Figures 2.1 Micro-clusteringExamples 11 2.2 SomeSimpleTimeWindows 11 2.3 VaryingHorizonsfortheclassificationprocess 23 2.4 Qualitycomparison(NetworkIntrusiondataset,horizon=256, stream speed=200) 30 2.5 Quality comparison (Charitable Donation dataset, hori- zon=4,stream speed=200) 30 2.6 Accuracycomparison(NetworkIntrusiondataset,stream speed=80, buffer size=1600,k =80,init number=400) 31 fit 2.7 Distributionofthe(smallest)besthorizon(NetworkIn- trusiondataset,Timeunits=2500,buffer size=1600,k =80, fit init number=400) 31 2.8 Accuracy comparison (Synthetic dataset B300kC5D20, stream speed=100,buffer size=500,k =25,init number=400) 31 fit 2.9 Distributionofthe(smallest)besthorizon(Syntheticdataset B300kC5D20,Timeunits=2000,buffer size=500,k =25, fit init number=400) 32 2.10 StreamProc. Rate(Charit. Donationdata,stream speed=2000) 33 2.11 StreamProc. Rate(Ntwk. Intrusiondata,stream speed=2000) 33 2.12 ScalabilitywithDataDimensionality(stream speed=2000) 34 2.13 ScalabilitywithNumberofClusters(stream speed=2000) 34 3.1 Theensemblebasedclassificationmethod 53 3.2 VFDTLearningSystems 54 3.3 OnDemandClassification 54 3.4 OnlineInformationNetworkSystem 55 3.5 AlgorithmOutputGranularity 55 3.6 ANNCADFramework 56 3.7 SCALLOPProcess 56 4.1 Karpetal. AlgorithmtoFindFrequentItems 68 4.2 ImprovingAlgorithmwithAnAccuracyBound 71

Description:
Contents List of Figures xi List of Tables xv Preface xvii 1 An Introduction to Data Streams 1 Charu C. Aggarwal 1. Introduction 1 2. Stream Mining Algorithms 2
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.