DATA MINING IN TIME SERIES DATABASES SERIES IN MACHINE PERCEPTION AND ARTIFICIAL INTELLIGENCE* Editors: H. Bunke (Univ. Bern, Switzerland) P. S. P. Wang (Northeastern Univ., USA) Vol. 43: Agent Engineering (Eds. Jiming Liu, Ning Zhong, Yuan Y. Tang and Patrick S. P. Wang) Vol. 44: Multispectral Image Processing and Pattern Recognition (Eds. J. Shen, P. S. P. Wang and T. Zhang) Vol. 45: Hidden Markov Models: Applications in Computer Vision (Eds. H. Bunke and T. Caelli) Vol. 46: Syntactic Pattern Recognition for Seismic Oil Exploration (K. Y. Huang) Vol. 47: Hybrid Methods in Pattern Recognition (Eds. H. Bunke and A. Kandel) Vol. 48: Multimodal Interface for Human-Machine Communications (Eds. P. C. Yuen, Y. Y. Tang and P. S. P. Wang) Vol. 49: Neural Networks and Systolic Array Design (Eds. D. Zhang and S. K. Pal) Vol. 50: Empirical Evaluation Methods in Computer Vision (Eds. H. I. Christensen and P. J. Phillips) Vol. 51: Automatic Diatom Identification (Eds. H. du Buf and M. M. Bayer) Vol. 52: Advances in Image Processing and Understanding A Festschrift for Thomas S. Huwang (Eds. A. C. Bovik, C. W. Chen and D. Goldgof) Vol. 53: Soft Computing Approach to Pattern Recognition and Image Processing (Eds. A. Ghosh and S. K. Pal) Vol. 54: Fundamentals of Robotics — Linking Perception to Action (M. Xie) Vol. 55: Web Document Analysis: Challenges and Opportunities (Eds. A. Antonacopoulos and J. Hu) Vol. 56: Artificial Intelligence Methods in Software Testing (Eds. M. Last, A. Kandel and H. Bunke) Vol. 57: Data Mining in Time Series Databases (Eds. M. Last, A. Kandel and H. Bunke) Vol. 58: Computational Web Intelligence: Intelligent Technology for Web Applications (Eds. Y. Zhang, A. Kandel, T. Y. Lin and Y. Yao) Vol. 59: Fuzzy Neural Network Theory and Application (P. Liu and H. Li) *For the complete list of titles in this series, please write to the Publisher. Series in Machine Perception and Artificial Intelligence - Vol, 57 DATA MINING IN TIME SERIES DATABASES Editors Mark Last Ben-Gurion LIniversity of the Negeu, Israel Abraham Kandel Zl-Auiv University, Israel University of South Florida, Tampa, LISA Horst Bunke University of Bern, Switzerland vp World Scientific NEW JERSEY * LONDON * SINGAPORE BElJlNG SHANGHAI HONG KONG TAIPEI CHENNAI Published by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library. DATA MINING IN TIME SERIES DATABASES Series in Machine Perception and Artificial Intelligence (Vol. 57) Copyright © 2004 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher. For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher. ISBN 981-238-290-9 Typeset by Stallion Press Email: [email protected] Printed in Singapore by World Scientific Printers (S) Pte Ltd April22,2004 16:59 WSPC/TrimSize:9inx6inforReviewVolume fm Dedicated to The Honorable Congressman C. W. Bill Young House of Representatives For his vision and continuous support in creating the National Institute for Systems Test and Productivity at the Computer Science and Engineering Department, University of South Florida This page intentionally left blank April22,2004 16:59 WSPC/TrimSize:9inx6inforReviewVolume fm Preface Traditional data mining methods are designed to deal with “static” databases, i.e. databases where the ordering of records (or other database objects)hasnothingtodowiththepatternsofinterest.Thoughtheassump- tion of order irrelevance may be sufficiently accurate in some applications, therearecertainlymanyothercases,wheresequentialinformation,suchas a time-stamp associated with every record, can significantly enhance our knowledge about the mined data. One example is a series of stock values: a specific closing price recorded yesterday has a completely different mean- ing than the same value a year ago. Since most today’s databases already include temporal data in the form of “date created”, “date modified”, and other time-related fields, the only problem is how to exploit this valuable information to our benefit. In other words, the question we are currently facing is: How to mine time series data? The purpose of this volume is to present some recent advances in pre- processing, mining, and interpretation of temporal data that is stored by modern information systems. Adding the time dimension to a database produces a Time Series Database (TSDB) and introduces new aspects and challenges to the tasks of data mining and knowledge discovery. These new challenges include: finding the most efficient representation of time series data, measuring similarity of time series, detecting change points in time series, and time series classification and clustering. Some of these problems have been treated in the past by experts in time series analysis. However, statisticalmethodsoftimeseriesanalysisarefocusedonsequencesofvalues representing a single numeric variable (e.g., price of a specific stock). In a real-world database, a time-stamped record may include several numerical andnominalattributes,whichmaydependnotonlyonthetimedimension but also on each other. To make the data mining task even more com- plicated, the objects in a time series may represent some complex graph structures rather than vectors of feature-values. vii April22,2004 16:59 WSPC/TrimSize:9inx6inforReviewVolume fm viii Preface Our book covers the state-of-the-art research in several areas of time series data mining. Specific problems challenged by the authors of this volume are as follows. Representation of Time Series. Efficient and effective representation of time series is a key to successful discovery of time-related patterns. The most frequently used representation of single-variable time series is piecewise linear approximation, where the original points are reduced to a set of straight lines (“segments”). Chapter 1 by Eamonn Keogh, Selina Chu, David Hart, and Michael Pazzani provides an extensive and compar- ative overview of existing techniques for time series segmentation. In the view of shortcomings of existing approaches, the same chapter introduces an improved segmentation algorithm called SWAB (Sliding Window and Bottom-up). Indexing and Retrieval of Time Series. Since each time series is char- acterized by a large, potentially unlimited number of points, finding two identicaltimeseriesforanyphenomenonishopeless.Thus,researchershave beenlookingforsetsofsimilardatasequencesthatdifferonlyslightlyfrom eachother.Theproblemofretrievingsimilarseriesarisesinmanyareassuch as marketing and stock data analysis, meteorological studies, and medical diagnosis. An overview of current methods for efficient retrieval of time series is presented in Chapter 2 by Magnus Lie Hetland. Chapter 3 (by Eugene Fink and Kevin B. Pratt) presents a new method for fast compres- sionandindexingoftimeseries.Arobustsimilaritymeasureforretrievalof noisy time series is described and evaluated by Michail Vlachos, Dimitrios Gunopulos, and Gautam Das in Chapter 4. Change Detection in Time Series. The problem of change point detec- tion in a sequence of values has been studied in the past, especially in the context of time series segmentation (see above). However, the nature of real-world time series may be much more complex, involving multivariate and even graph data. Chapter 5 (by Gil Zeira, Oded Maimon, Mark Last, and Lior Rokach) covers the problem of change detection in a classification modelinducedbyadataminingalgorithmfromtimeseriesdata.Achange detection procedure for detecting abnormal events in time series of graphs ispresentedbyHorstBunkeandMiroKraetzlinChapter6.Theprocedure is applied to abnormal event detection in a computer network. Classification of Time Series. Rather than partitioning a time series into segments, one can see each time series, or any other sequence of data points, as a single object. Classification and clustering of such complex April22,2004 16:59 WSPC/TrimSize:9inx6inforReviewVolume fm Preface ix “objects” may be particularly beneficial for the areas of process con- trol, intrusion detection, and character recognition. In Chapter 7, Carlos J. Alonso Gonza´lez and Juan J. Rodr´ıguez Diez present a new method for early classification of multivariate time series. Their method is capable of learning from series of variable length and able of providing a classification whenonlypartoftheseriesispresentedtotheclassifier.Anovelconceptof representingtimeseriesbymedianstrings(seeChapter8,byXiaoyiJiang, Horst Bunke, and Janos Csirik) opens new opportunities for applying clas- sification and clustering methods of data mining to sequential data. As indicated above, the area of mining time series databases still includes many unexplored and insufficiently explored issues. Specific sug- gestionsforfutureresearchcanbefoundinindividualchapters.Ingeneral, we believe that interesting and useful results can be obtained by applying the methods described in this book to real-world sets of sequential data. Acknowledgments The preparation of this volume was partially supported by the National Institute for Systems Test and Productivity at the University of South FloridaunderU.S.SpaceandNavalWarfareSystemsCommandgrantnum- ber N00039-01-1-2248. Wealsowouldliketoacknowledgethegeneroussupportandcooperation of: Ben-Gurion University of the Negev, Department of Information Sys- tems Engineering, University of South Florida, Department of Computer Science and Engineering, Tel-Aviv University, College of Engineering, The Fulbright Foundation, The US-Israel Educational Foundation. January 2004 Mark Last Abraham Kandel Horst Bunke