Under consideration for publication in Knowledge and Information Systems

On Clustering Massive Text and Categorical Data Streams

Charu C. Aggarwal(1), Philip S. Yu(2)
(1) IBM T.J. Watson Research Center, Hawthorne, NY 10532, USA; (2) University of Illinois at Chicago, Chicago, IL, USA

Received September 08, 2009; Revised May 31, 2009; Accepted June 20, 2009

Abstract. In this paper, we will study the data stream clustering problem in the context of text and categorical data domains. While the clustering problem has been studied recently for numeric data streams, the problems of text and categorical data present different challenges because of the large and unordered nature of the corresponding attributes. Therefore, we will propose algorithms for text and categorical data stream clustering. We will propose a condensation-based approach for stream clustering which summarizes the stream into a number of fine-grained cluster droplets. These summarized droplets can be used in conjunction with a variety of user queries to construct the clusters for different input parameters. Thus, this provides an online analytical processing approach to stream clustering. We also study the problem of detecting noisy and outlier records in real time. We will test the approach for a number of real and synthetic data sets, and show the effectiveness of the method over the baseline OSKM algorithm for stream clustering.

Keywords: Stream clustering, Text clustering, Text streams, Text stream clustering, Categorical data

1. Introduction

In this paper, we will study the problem of clustering text and categorical data streams. This problem is relevant in a number of web-related applications such as news group segmentation, text crawling, and target marketing for electronic commerce. Some applications of text and categorical data stream clustering are as follows:

– Many portals on the world wide web provide real-time news and other articles which require quick summarization and filtering. Such methods often require effective and efficient methods for text segmentation.
– Many web crawlers continuously harvest thousands of web pages, which are subsequently summarized by human effort. When the volume of such crawls is significant, it is not realistically possible to achieve this goal by human effort. In such applications, data stream clustering algorithms can be helpful in organizing the crawled resources into coherent sets of clusters.
– In many electronic commerce applications, large volumes of transactions are processed on the world wide web. Such transactions can take the form of categorical or market basket records. In such cases, it is often useful to perform real-time clustering for target marketing.

The data stream problem has received increasing attention in recent years because of technological innovations which have facilitated the creation and maintenance of such data. A number of data mining problems have recently been studied in the context of data streams (Aggarwal, 2003; Babcock et al, 2002; Domingos et al, 2000; O'Callaghan et al, 2002). The clustering problem is formally defined as follows: for a given set of data points, we wish to partition them into one or more groups of similar objects. The similarity among the objects is typically defined with the use of some distance measure or objective function.
In addition, data points which do not naturally fit into any particular cluster are referred to as outliers. Because of the applicability of the clustering and outlier detection problems, these problems have been well researched in the database and data mining communities (Bradley et al, 1998; Knorr et al, 1998; Ng et al, 1994; Rastogi et al, 2000; Zhang et al, 1996). The problems of text and categorical data clustering have also been studied recently (Cutting et al, 1992; Gibson et al, 1998; Guha et al, 1997; Silverstein et al, 1997) in the offline context. These methods cannot be easily extended to the data stream problem because of their high computational complexity. The data stream problem poses special challenges because of the large volume of incoming data. As a result, one cannot scan a data point more than once during the course of the computation. Furthermore, it is extremely important for the process to be computationally efficient, since the processing rate needs to be sufficiently fast to match the incoming rate of the data points.

The problem of stream clustering has been studied extensively for the case of continuous data (Aggarwal et al, 2003b; Aggarwal et al, 2008; Cao et al, 2006; Chen et al, 2007). However, such approaches are difficult to use directly for other data domains, since the cluster computation and tracking techniques may be completely different. A number of other techniques for the Topic Detection and Tracking (TDT) program have studied incremental clustering of text streams (Allan et al, 1998a; Allan et al, 1998b; Franz et al, 2001; Yang et al, 1998) in order to find new events and to track later articles related to new events. A recent piece of work (Yang et al, 1998) is able to perform topic-conditioned novelty detection, though this approach requires supervision with class labels. Some recent techniques for streaming text clustering (Banerjee et al, 2004; He et al, 2007; Surendran et al, 2006; Li et al, 2006; Zhang et al, 2005; Zhong, 2005) adapt a partition-based approach to perform rapid online clustering of streaming data. While these techniques are effective for online clustering, they lack the flexibility to allow the user to perform effective analysis of the clusters over different time horizons. A recent piece of work (Banerjee et al, 2007) studies the problem of online topic learning in text streams with the use of a variety of similarity measures. The machine learning community has also studied the problem of incremental clustering for non-text data using methods such as COBWEB and CLASSIT (Fisher, 1987; Gennari et al, 1989).

Our goal in this paper is to study the stream clustering problem for evolving data sets, so that a user has the maximum flexibility to examine the behavior of the clusters over different time horizons. In typical applications, users may wish to examine the clusters only over a particular region of the stream, or may wish to vary parameters such as the number of clusters in the data. In particular, the ability to provide flexibility to the user in specifying different parameters (such as the time horizon) is very challenging for a streaming application. This is because such parameters affect the clustering process from scratch, and it is difficult to maintain the results of clustering over multiple choices of parameters such as the user-specified horizon or the number of clusters. In order to achieve this goal, we will utilize a technique which leverages intermediate summary statistics for the clustering task.
Most real applications present data streams which are highly evolving in nature. For example, an electronic commerce application containing customer transactions of market basket data is likely to create data which is highly skewed from the temporal perspective. Similarly, many newsfeed services receive large volumes of documents for text clustering and categorization. Thus, real applications often exhibit temporal locality, which is not taken into account by most batch processing algorithms. The problems of clustering and outlier detection present a number of unique challenges in an evolving data stream environment. For example, the continuous evolution of clusters makes it essential to be able to quickly identify new clusters in the data. While the clustering process needs to be executed continuously in online fashion, it is also important to be able to provide end users with the ability to analyze the clusters in an offline fashion. In many cases, the users may wish to cluster the data based on specific temporal parameters such as the choice of a particular time horizon or a specified number of clusters.

In order to achieve this goal, we will construct a framework in which carefully chosen statistical summary data is stored at regular intervals. We will introduce the concept of cluster droplets, which provides an effective representation of the condensed characterization of the data. We will see that the summary data has specific characteristics such as additivity and subtractivity which are useful in computing the effectiveness over user-specified time horizons. The cluster droplets can be considered intermediate summary data which can be reused for a variety of purposes. Thus, this is an online analytical processing method in which efficient construction of the intermediate structures is essential for the success of the approach. While the data stream may have a very large volume, the condensed droplets are only a summary representation, and can be processed efficiently because of their compressed format. We will refer to our algorithm as ConStream, which corresponds to CONdensation based STREAM Clustering. A related problem to clustering is that of outlier detection. We will also examine the outlier detection problem in the context of data stream clustering. Previous work has studied anomalies only in the context of either supervised models (Agrawal, 2007) or non-stream cases (Peterson et al, 2008). In the case of stream clustering, new patterns in the data often appear as outliers, which eventually become clusters. We will show how such outlier patterns can be extracted from the data stream in real time.

We note that the nature of the underlying data can affect the procedures for clustering data streams. For example, the sparsity and distribution of the different values of the categorical data can affect the methodology which is utilized for the clustering process. Numerical data allows for easy tracking of the underlying information in the form of quantitative moment information. This is not the case for text or categorical data, in which such moment information cannot be easily defined. Therefore, a different methodology is required for the case of categorical data streams. In this paper, we will discuss the problem of clustering different kinds of categorical data sets, including the closely related text domain. In the presence of data evolution or temporal locality, the exploration of the stream over different time horizons may lead to different kinds of clusters.
These different kinds of clusters may have applicability in understanding the behavior of the stream over different periods in time. For this purpose, one may desire to periodically store a certain amount of summary information which can be utilized later for better understanding and analysis of the data stream. This kind of methodology is similar to online analytical processing algorithms in which summary information is created for the purpose of repeated querying. In the context of a data stream, such a methodology seems quite convenient, since a fast data stream cannot be repeatedly processed in order to answer different kinds of queries.

The clustering algorithm of this paper pre-stores condensed statistical information at regular intervals. This condensed statistical data satisfies two requirements:

– It should be easy to update the statistical information for a fast data stream. In this paper, we specifically choose the nature of the statistical information in such a way that it is possible to perform updates which are linear in the number of data points.
– The statistical information stored should allow the computation of various analytical measures required by the user. Such measures could include clusters or outliers over a specific time horizon. It is also often desirable to find the nature of the evolution over a given time horizon.

In the next sections, we will discuss methods to store statistical information which satisfy both of these conditions.

This paper is organized as follows. In the next section, we will discuss the classification of different kinds of clusters and outliers. In section 3, we will discuss the process of storing and maintaining the data structures necessary for the clustering algorithm. We will also discuss the differences which arise from using different kinds of data. In section 4, we will describe the process of utilizing different kinds of offline processing methods on the underlying summary data. Section 5 describes the empirical results. The conclusions and summary are discussed in section 6.

2. The Nature of Clusters and Outliers in Stream Processing

Data streams exhibit considerable variations in the underlying patterns over the course of their progression. Old clusters may become inactive, and eventually get replaced by new clusters. Similarly, when newly arriving data points do not naturally fit in any particular cluster, they need to be initially classified as outliers. However, as time progresses, these new points may create a distinctive pattern of activity which can be recognized as a new cluster. The temporal locality of the data stream is manifested by these new clusters. For example, the first web page belonging to a particular category in a news stream of current events may be recognized as an outlier, but may later form a cluster of documents of its own. On the other hand, new outliers may not necessarily result in the formation of new clusters. Such outliers are true short-term abnormalities in the data, since they do not result in the emergence of sustainable patterns.

In order to distinguish between the different kinds of clusters and abnormalities, we will introduce some notations and terminology. When a cluster is newly discovered during the arrival of the data stream, it is referred to as a trend-setter. From the point of view of the user, a trend-setter is an outlier, until the arrival of other points certifies the fact that it is actually a cluster. If and when a sufficient number of new points have arrived in the cluster, it is referred to as a mature cluster.
We will quantify these definitions more precisely slightly later. At a given moment in time, a mature cluster can either be active or inactive. A mature cluster is said to be active when it has continued to receive data points in the recent past. When a mature cluster is not active, it is said to be inactive. In some cases, a trend-setting cluster becomes inactive before it has had a chance to mature. Such a cluster typically contains a small number of transient data points. In a given application, this may typically be the result of a short-term abnormality.

Since the stream clustering process should provide a greater level of importance to recent clusters, we will provide a time-sensitive weightage to each data point. It is assumed that each data point has a time-dependent weight defined by the function f(t). The function f(t) is also referred to as the fading function. The fading function f(t) is a monotonically decreasing function which decays uniformly with time t. In order to formalize this concept, we will define the half-life of a point in the data stream.

Definition 1. The half-life $t_0$ of a point is defined as the time at which $f(t_0) = (1/2) \cdot f(0)$.

Conceptually, the aim of defining a half-life is to quantify the rate of decay of the importance of each data point in the stream clustering process. The decay rate is defined as the inverse of the half-life of the data stream. We denote the decay rate by $\lambda = 1/t_0$. We denote the weight function of each point in the data stream by $f(t) = 2^{-\lambda \cdot t}$. From the perspective of the clustering process, the weight of each data point is f(t). It is easy to see that this decay function creates a half-life of $1/\lambda$. It is also evident that by changing the value of $\lambda$, it is possible to change the rate at which the importance of the historical information in the data stream decays. The higher the value of $\lambda$, the lower the importance of the historical information compared to more recent data. For more stable data streams, it is desirable to pick a smaller value of $\lambda$, whereas for rapidly evolving data streams, it is desirable to pick a larger value of $\lambda$.

When a cluster is created during the streaming process by a newly arriving data point, it is allowed to remain as a trend-setting outlier for at least one half-life. During that period, if at least one more data point arrives, then the cluster becomes an active and mature cluster. On the other hand, if no new points arrive during a half-life, then the trend-setting outlier is recognized as a true anomaly in the data stream. At this point, this anomaly is removed from the list of current clusters. We refer to the process of removal as cluster death. Thus, a new cluster containing one data point dies when the (weighted) number of points in the cluster is 0.5. The same criterion is used to define the death of mature clusters. A necessary condition for this criterion to be met is that the inactivity period in the cluster has exceeded the half-life $1/\lambda$. The greater the number of points in the cluster, the greater the level by which the inactivity period would need to exceed its half-life in order to meet the criterion. This is a natural solution, since it is intuitively desirable to have stronger requirements (a longer inactivity period) for the death of a cluster containing a larger number of points.
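As a concrete illustration, the following minimal sketch (in Python, with illustrative names that are not taken from the paper) computes the fading weight $f(t) = 2^{-\lambda \cdot t}$ and checks the cluster-death criterion described above:

```python
# Sketch of the fading function and death criterion; names are illustrative.

def fading_weight(age, decay_rate):
    """Weight f(age) = 2^(-lambda * age) of a point that arrived `age` time units ago."""
    return 2.0 ** (-decay_rate * age)

def is_dead(weighted_count):
    """A cluster dies when its weighted point count falls to 0.5."""
    return weighted_count <= 0.5

half_life = 10.0
lam = 1.0 / half_life                       # decay rate is the inverse of the half-life
assert abs(fading_weight(half_life, lam) - 0.5) < 1e-12

# A trend-setting cluster with a single point dies after exactly one
# half-life of inactivity, since its weighted count then equals 0.5.
print(is_dead(1.0 * fading_weight(half_life, lam)))   # True
```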
In the next section, we will discuss the process for cluster droplet maintenance. The algorithm dynamically maintains a set of clusters by using an algorithm which scales effectively with the data size. In order to achieve better scalability of the maintenance process, it is necessary to construct data structures which allow for additive operations on the data points.

3. Storage and Update of Cluster Statistics

The data stream consists of a set of multi-dimensional records of dimensionality denoted by d. The format of the attribute values for each data point is defined based on the domain at hand. For the case of the categorical stream domain, each of these d dimensions corresponds to a categorical attribute value. It is assumed that the ith categorical dimension contains $v_i$ possible values. For the case of text records, each dimension corresponds to the frequency of a word in the document. Thus, the dimensionality is defined by the size of the lexicon. Only words which are included in a document have non-zero frequency. The corresponding attribute value is the numeric frequency of that word in the vector space representation.

For the purpose of achieving greater accuracy in the clustering process, it is necessary to maintain a high level of granularity in the underlying data structures. In order to achieve this goal, we will use a process in which condensed clusters of data points are maintained. We will refer to such groups as cluster droplets. We will discuss and define the cluster droplet differently for the case of text and categorical data streams respectively. First, we will define the cluster droplet for the categorical data domain:

Definition 2. A cluster droplet D(t, C) for a set of categorical data points C at time t is referred to as a tuple (DF2, DF1, n, w(t), l), in which each tuple component is defined as follows:

– The vector DF2 contains $\sum_{i \in \{1 \ldots d\}, j \in \{1 \ldots d\}, i \neq j} v_i \cdot v_j$ entries. For each pair of dimensions i and j, we maintain $v_i \cdot v_j$ values. We note that $v_i$ is the number of possible categorical values of dimension i and $v_j$ is the number of possible values of dimension j. Thus, for each of the $v_i \cdot v_j$ categorical value pairs of dimensions i and j, we maintain the (weighted) counts of the number of points for each value pair which are included in cluster C. In other words, for every possible pair of categorical values of dimensions i and j, we maintain the weighted number of points in the cluster in which these values co-occur.
– The vector DF1 contains $\sum_{i=1}^{d} v_i$ entries. For each i, we maintain a weighted count of each of the $v_i$ possible values of categorical attribute i occurring in the cluster.
– The entry n contains the number of data points in the cluster.
– The entry w(t) contains the sum of the weights of the data points at time t. We note that the value w(t) is a function of the time t and decays with time unless new data points are added to the droplet D(t).
– The entry l contains the time stamp of the last time that a data point was added to the cluster.

The reason for keeping the pairwise information on attribute values will become clear in the section on offline analysis, in which we exploit this information to perform analysis of inter-attribute correlations within clusters.
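A minimal sketch of the dense categorical droplet of Definition 2 might look as follows. The class and method names are our own; for compactness the sketch stores each unordered dimension pair once, rather than both ordered pairs $i \neq j$:

```python
import numpy as np

# Hypothetical dense representation of a categorical cluster droplet.
# DF1 holds one weighted count per (attribute, value); DF2 holds one
# weighted count per co-occurring value pair across attribute pairs.

class CategoricalDroplet:
    def __init__(self, num_values):                # num_values[i] = v_i
        self.v = num_values
        self.DF1 = [np.zeros(vi) for vi in num_values]
        self.DF2 = {(i, j): np.zeros((num_values[i], num_values[j]))
                    for i in range(len(num_values))
                    for j in range(i + 1, len(num_values))}
        self.n = 0                                 # raw point count
        self.w = 0.0                               # decayed weight sum w(t)
        self.l = 0.0                               # last update time

    def add(self, point, t, weight=1.0):
        # point[i] is the categorical value index taken by attribute i
        for i, val in enumerate(point):
            self.DF1[i][val] += weight
        for (i, j), counts in self.DF2.items():
            counts[point[i], point[j]] += weight
        self.n += 1
        self.w += weight
        self.l = t
```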
We note that the above definition of a droplet assumes a data set in which each categorical attribute takes on a small number of possible values. (Thus, the value of $v_i$ for each dimension i is relatively small.) However, in many cases, the data might actually be somewhat sparse. In such cases, the values of $v_i$ could be relatively large. In those instances, we use a sparse representation. Specifically, for each pair of dimensions i and j, we maintain a list of the categorical value pairs which have non-zero counts. In a second list, we store the actual counts of these pairs. In many cases, this results in considerable savings of storage space. For example, consider the dimension pair i and j, which contains $v_i$ and $v_j$ possible categorical values. Also, let us consider the case when $b_i \leq v_i$ and $b_j \leq v_j$ of them have a non-zero presence in the droplet. Thus, at most $b_i \cdot b_j$ categorical attribute pairs will co-occur in the points in the cluster. We maintain a list of these (at most) $b_{ij} \leq b_i \cdot b_j$ value pairs along with the corresponding counts. This requires a storage of $3 \cdot b_{ij}$ values. (Two entries are required for the identities of the value pairs and one is required for the count.) We note that if the numbers of distinct non-zero values $b_i$ and $b_j$ are substantially lower than the numbers of possible values $v_i$ and $v_j$ respectively, then it may be more economical to store $3 \cdot b_{ij}$ values instead of $v_i \cdot v_j$ entries. These correspond to the list of categorical values which have a non-zero presence, together with the corresponding weighted counts. This works best for correlated data, in which there are even fewer non-zero entries within a cluster droplet. Similarly, for the case of DF1, we only need to maintain $2 \cdot b_i$ entries for each dimension i.

Next, we consider the case of the text data set, which is an example of a sparse numeric data set. This is because most documents contain only a small fraction of the vocabulary with non-zero frequency. The only difference from the categorical data domain is the way in which the underlying cluster droplets are maintained.

Definition 3. A cluster droplet D(t, C) for a set of text data points C at time t is defined as a tuple (DF2, DF1, n, w(t), l). Each tuple component is defined as follows:

– The vector DF2 contains at most $3 \cdot wb \cdot (wb-1)/2$ entries. Here wb is the number of distinct words with non-zero presence in the cluster C. For each pair of dimensions, we maintain a list of the pairs of word ids with non-zero counts. We also maintain the sum of the weighted counts for such word pairs. In practice, the number of entries is substantially lower than $3 \cdot wb \cdot (wb-1)/2$, since we only need to keep those word pairs which co-occur in at least one document.
– The vector DF1 contains $2 \cdot wb$ entries. We maintain the identities of the words with non-zero counts. In addition, we maintain the sum of the weighted counts for each word occurring in the cluster.
– The entry n contains the number of data points in the cluster.
– The entry w(t) contains the sum of the weights of the data points at time t. We note that the value w(t) is a function of the time t and decays with time unless new data points are added to the droplet D(t).
– The entry l contains the time stamp of the last time that a data point was added to the cluster.

We note that the definitions of the cluster droplet for the case of categorical and text data sets are quite similar, except that the text data sets are sparse, whereas the categorical data sets are dense. The representation for each case needs to take this into account. In addition, the categorical data set contains unordered values for each attribute, which is not the case for the text data set.
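For the sparse text droplet of Definition 3, a hash-based representation suffices. The sketch below is illustrative: the definition only specifies that weighted counts are kept for co-occurring word pairs, so here we record one weighted co-occurrence count per pair, which is one plausible reading:

```python
from collections import defaultdict

# Hypothetical sparse text droplet: DF1 maps word id -> weighted frequency,
# DF2 maps (id1, id2) with id1 < id2 -> weighted co-occurrence count.
# Only non-zero entries are stored, mirroring the sparse representation.

class TextDroplet:
    def __init__(self):
        self.DF1 = defaultdict(float)
        self.DF2 = defaultdict(float)
        self.n = 0          # raw point count
        self.w = 0.0        # decayed weight sum w(t)
        self.l = 0.0        # last update time

    def add(self, doc, t, weight=1.0):
        # `doc` maps word id -> frequency in the vector space representation
        for word, freq in doc.items():
            self.DF1[word] += weight * freq
        words = sorted(doc)
        for a in range(len(words)):
            for b in range(a + 1, len(words)):
                self.DF2[(words[a], words[b])] += weight
        self.n += 1
        self.w += weight
        self.l = t
```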
The concept of the cluster droplet has some interesting properties that will be useful during the maintenance process. These properties relate to the additivity and decay behavior of the cluster droplet.

Observation 1. Consider the cluster droplets $D(t, C_1) = (DF2_1, DF1_1, n_1, w(t)_1, l_1)$ and $D(t, C_2) = (DF2_2, DF1_2, n_2, w(t)_2, l_2)$. Then the cluster droplet $D(t, C_1 \cup C_2)$ is defined by the tuple $(DF2_1 + DF2_2, DF1_1 + DF1_2, n_1 + n_2, w(t)_1 + w(t)_2, \max\{l_1, l_2\})$.

The cluster droplet for the union of two clusters is the sum of the individual entries. The only exception is the last entry, which is the maximum of the two last-update times. We note that the additivity property provides considerable convenience for data stream processing, since the entries can be updated efficiently using simple additive and maximization operations.

The second observation relates to the rate of decay of the condensed droplets. Since the weights of the data points decay with the passage of time, the corresponding entries also decay at the same rate. Correspondingly, we make the following observation:

Observation 2. Consider the cluster droplet $D(t, C) = (DF2, DF1, n, w(t), l)$. Then the entries of the same cluster droplet C at a time $t' > t$ (without the addition of new data points) are given by $D(t', C) = (DF2 \cdot 2^{-\lambda \cdot (t'-t)}, DF1 \cdot 2^{-\lambda \cdot (t'-t)}, n, w(t) \cdot 2^{-\lambda \cdot (t'-t)}, l)$.

The above observation is important in regulating the rate at which the data points in the cluster decay over time. The combination of the two observations discussed above is essential in maintaining the clusters over time.
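Both observations translate directly into code. The sketch below implements the additivity of Observation 1 and the multiplicative decay of Observation 2 for the illustrative TextDroplet above; the function names are our own:

```python
# decay_to: Observation 2 -- scale all weighted statistics by 2^(-lambda*(t - l)).
# merge:    Observation 1 -- entrywise sums, with the maximum last-update time.

def decay_to(droplet, t, lam):
    """Bring the droplet's statistics forward from time droplet.l to time t."""
    factor = 2.0 ** (-lam * (t - droplet.l))
    for k in droplet.DF1:
        droplet.DF1[k] *= factor
    for k in droplet.DF2:
        droplet.DF2[k] *= factor
    droplet.w *= factor
    # Note: droplet.l is left unchanged; it is updated only when a point is added.

def merge(d1, d2):
    """Droplet for the union of two clusters (Observation 1)."""
    out = TextDroplet()
    for src in (d1, d2):
        for k, v in src.DF1.items():
            out.DF1[k] += v
        for k, v in src.DF2.items():
            out.DF2[k] += v
    out.n = d1.n + d2.n
    out.w = d1.w + d2.w
    out.l = max(d1.l, d2.l)
    return out
```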
3.1. Cluster Droplet Maintenance

In this subsection, we will discuss the process of cluster droplet maintenance. The purpose of using cluster droplets is to create an effective intermediate representation from which a user may query using a variety of parameters. The maintenance algorithm continuously maintains the droplets $C_1 \ldots C_k$, which it updates as new data points arrive. For each cluster, the entire set of statistics in the droplet is maintained. The maximum number k of droplets maintained is dependent upon the amount of available main memory. Even if modest memory resources are available, the statistics can be maintained at a relatively high level of granularity.

Algorithm MaintainStreamClusters(Clusters: C_1 ... C_k, Incoming Data Point: X_q);
begin
  { Let the time at which X_q arrives be t_q }
  for i = 1 to k
    S(X_q, C_i) = ComputeSimilarity(X_q, C_i);
  mindex = argmax_{i in {1,...,k}} S(X_q, C_i);
  if (S(X_q, C_mindex) > thresh) AddPoints(X_q, t_q, C_mindex);
  else begin
    Create new cluster with solitary point X_q;
    Create cluster droplet statistics with solitary point X_q;
    Remove the least recently updated cluster to make room for the new cluster;
  end;
end;

Fig. 1. Maintenance of cluster droplets

Algorithm AddPoints(Data Point: X_q, Time stamp: t_q, Cluster: C_j);
begin
  { The decay factor needs to be updated, since it was last changed at the last update time of the cluster C_j }
  Let l be the last update time of cluster C_j;
  Use Observation 2 to apply the temporal decay factor $2^{-\lambda \cdot (t_q - l)}$ to update the statistics of cluster C_j;
  Use Observation 1 to add the data point X_q to C_j and update its statistics;
end

Fig. 2. Adding a new point to a cluster

At the beginning of algorithmic execution, we start with an empty set of clusters. As new data points arrive, unit clusters containing individual data points are created. Once a maximum number k of such clusters has been created, we can begin the process of online cluster maintenance. Thus, we initially start off with a trivial set of k clusters. These clusters are updated over time with the arrival of new data points.

When a new data point X arrives, its similarity to each cluster droplet is computed. For the case of text data sets, the cosine similarity measure (Cutting et al, 1992; Silverstein et al, 1997) between DF1 and X is used. Let $q_1 \ldots q_n$ be the frequencies of the words in DF1, and $u_1 \ldots u_n$ be the frequencies of the words in X. Then, the cosine measure between DF1 and X is defined as follows:

$$\mathrm{cosine}(DF1, X) = \frac{\sum_{r=1}^{n} q_r \cdot u_r}{\sqrt{\sum_{r=1}^{n} q_r^2} \cdot \sqrt{\sum_{r=1}^{n} u_r^2}} \qquad (1)$$

While the cosine measure is used for the purpose of this paper, other coefficients such as the Dice and Jaccard coefficients can also be supported. The following relationships may be used:

$$\mathrm{jaccard}(DF1, X) = \frac{\sum_{r=1}^{n} q_r \cdot u_r}{\sum_{r=1}^{n} q_r^2 + \sum_{r=1}^{n} u_r^2 - \sum_{r=1}^{n} q_r \cdot u_r} \qquad (2)$$

$$\mathrm{dice}(DF1, X) = \frac{2 \cdot \sum_{r=1}^{n} q_r \cdot u_r}{\sum_{r=1}^{n} q_r^2 + \sum_{r=1}^{n} u_r^2} \qquad (3)$$

We note that all of these similarity functions simply use different combinations of the same basic computations (the dot product and the $L_2$-modulus) as the cosine function, and can therefore be computed directly from the droplet statistics.

For the case of categorical data sets, the similarity measure is computed as follows: for each attribute i, we calculate the (weighted) fraction of records in the cluster which contain the same categorical value as the data point X. We note that for the cluster $C_j$, this percentage can be computed from the summary information contained in the cluster droplet $D(t, C_j)$. This is because the vector DF1 contains the weighted counts of each value for attribute i. The relevant weighted count is divided by the total weight w(t) in order to calculate the corresponding fractional presence of a particular categorical value of attribute i. Let the weighted fraction for attribute i and cluster $C_j$ be denoted by $wf_i(X, C_j)$. The average weighted fraction over all attributes for cluster $C_j$ is given by the similarity value $S(X, C_j) = \sum_{i=1}^{d} wf_i(X, C_j)/d$. The overall process of stream cluster maintenance is illustrated in Figure 1.
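Both similarity measures can be computed directly from the droplet statistics. The following sketch evaluates Equation (1) on the sparse DF1 representation, and the categorical similarity $S(X, C_j)$ on the dense droplet sketched earlier; as before, the helper names are illustrative:

```python
import math

def cosine(DF1, doc):
    # Equation (1), computed from the sparse droplet statistics:
    # DF1 and `doc` both map word id -> (weighted) frequency.
    dot = sum(w * doc.get(word, 0.0) for word, w in DF1.items())
    nq = math.sqrt(sum(w * w for w in DF1.values()))
    nu = math.sqrt(sum(f * f for f in doc.values()))
    return dot / (nq * nu) if nq > 0 and nu > 0 else 0.0

def categorical_similarity(droplet, point):
    # S(X, C_j) = (1/d) * sum_i wf_i(X, C_j), where each weighted fraction
    # wf_i is the weighted count of the point's value divided by w(t).
    d = len(point)
    return sum(droplet.DF1[i][point[i]] for i in range(d)) / (droplet.w * d)
```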
As illustrated in Figure 1, the similarity value $S(X, C_j)$ is computed over all clusters. The cluster with the maximum value of $S(X, C_j)$ is chosen as the relevant cluster for data insertion. Let us assume that this cluster is $C_{mindex}$. If the value of $S(X, C_{mindex})$ is larger than the threshold thresh, then the point X is assigned to the cluster $C_{mindex}$. The process of adding a data point to a cluster is denoted by AddPoints and is illustrated in Figure 2. On the other hand, when the similarity value is less than the threshold thresh, a new cluster is created containing the solitary data point X. This newly created cluster replaces the least recently updated cluster from the current set of clusters. We note that the newly created cluster is either a potential true outlier or the beginning of a new trend of data points. Further understanding of this new cluster may only be obtained with the progress of the data stream.

In the event that X is inserted into the cluster $C_{mindex}$, we need to perform two steps:

– We update the statistics to reflect the decay of the data points at the current moment in time. This update is performed using the computation discussed in Observation 2. Thus, the relevant updates are performed in a "lazy" fashion. In other words, the statistics for a cluster do not decay until a new point is added to it. Let the corresponding time stamp of the moment of addition be t. The last update time l is available from the cluster droplet statistics. We multiply the entries in the vectors DF2 and DF1 and the weight w(t) by the factor $2^{-\lambda \cdot (t-l)}$ in order to update the corresponding statistics. We note that the lazy update mechanism results in stale decay characteristics for most of the clusters. This does not, however, affect the afore-discussed computation of the similarity measures.
– In the second step, we add the statistics for the newly arriving data point to the statistics for $C_{mindex}$ by using the additive computation discussed in Observation 1.

In the event that the newly arriving data point does not naturally fit in any of the cluster droplets and an inactive cluster does exist, then we replace the least recently updated cluster with a new cluster containing the solitary data point.
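Putting the pieces together, one step of the maintenance loop of Figure 1 could be sketched as follows, using the illustrative helpers defined above; this is a sketch under those assumptions, not the authors' implementation:

```python
def maintain(droplets, doc, t, lam, thresh):
    # One step of Figure 1: route the point to its closest droplet, or
    # retire the least recently updated droplet to make room for a new one.
    best = max(droplets, key=lambda c: cosine(c.DF1, doc))
    if cosine(best.DF1, doc) > thresh:
        decay_to(best, t, lam)        # lazy decay (Observation 2)
        best.add(doc, t)              # additive update (Observation 1)
    else:
        stale = min(droplets, key=lambda c: c.l)
        droplets.remove(stale)
        fresh = TextDroplet()
        fresh.add(doc, t)             # trend-setting singleton cluster
        droplets.append(fresh)
```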