Figures
Abstract
The primary purpose of clustering analysis is to divide a dataset into a finite number of segments based on the similarities between items. In recent years, a significant amount of research has focused on the spatio-temporal aspects of clustering: clusters are no longer regarded as static objects, since they are influenced by changes in the underlying population. This paper describes an R package implementing the MONIC framework for tracing the evolution of clusters extracted from temporal datasets. The name of the package is clusTransition, which stands for Cluster Transition. The algorithm is based on re-clustering cumulative datasets that evolve at successive time-points and monitoring the transitions experienced by the clusters in these clustering solutions. This paper’s contribution is to demonstrate how the package clusTransition is developed in the R programming language; its workflow is discussed using hypothetical and real-life datasets.
Citation: Atif M, Leisch F (2022) clusTransition: An R package for monitoring transition in cluster solutions of temporal datasets. PLoS ONE 17(12): e0278146. https://doi.org/10.1371/journal.pone.0278146
Editor: Mohammad Mehdi Rashidi, Tongji University, CHINA
Received: July 16, 2022; Accepted: November 10, 2022; Published: December 15, 2022
Copyright: © 2022 Atif, Leisch. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper. All other data streams used in the manuscript are available in public repositories. DOI for Human Value Scale datasets: doi:10.21338/NSD-ESS-CUMULATIVE. Link for Human Value Scale datasets: https://ess-search.nsd.no/CDW/ConceptVariables Link for household Electric Power Consumption: https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption Link for Intel Lab sensor datasets: https://www.kaggle.com/datasets/divyansh22/intel-berkeley-research-lab-sensor-data.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
The prime goal of clustering analysis is the organization of a dataset into a finite number of segments according to the similarities between objects. Ideally, the objects in the same segment should be more similar to one another than to objects belonging to different partitions [1]. Each individual partition is known as a cluster, and the objects belonging to the same cluster are called its members [2, 3]. Clustering covers many real-world applications, ranging from business and economics, marketing, pattern recognition, medical sciences, and image processing to big data analysis [4]. For example, in market segmentation, better marketing strategies can be adopted by clustering customers with similar demographic or buying characteristics [5]. In a similar fashion, clustering can help in better understanding a disease and targeting appropriate treatment by sub-grouping patients into homogeneous sets based on psychological inventory scores [6]. Since the notion of clustering is not precisely defined, several algorithms/models have been proposed in the literature, and they may produce quite different clustering solutions [7, 8].
In recent years, a considerable amount of research has investigated the spatio-temporal properties of clustering. In these applications, clusters are no longer considered static objects, as they are affected by changes occurring in the underlying population [9, 15]. The inclusion of new data records in the original population over time may affect cluster memberships, and entirely different clustering solutions may be generated at later time-points. Such transitions in clustering solutions include the disappearance of specific clusters, the migration of elements from one cluster to another, the splitting of a cluster into several, the merging of several clusters into one, the survival of a cluster, and the emergence of new ones. Survived clusters can experience internal transitions, including changes in location, size, and density [10, 11]. Various topics such as spatio-temporal, evolutionary, stream, and incremental clustering address this issue by adapting to datasets that change over time. Tracing and understanding the phenomena behind these transitions is of practical importance for effective decision-making and can be helpful in fields such as marketing, fraud detection, networking, scientific publication, and health [12].
In many real-world applications, clustering of a data stream is performed continuously to identify changes occurring in the pattern of the underlying phenomena [13]. In a stream, new data items are continually generated and join the underlying population at regular intervals. Therefore, in order to control which part of the data contributes to the pattern being mined, the stream needs to be discretized into subsets based on some ordered attribute. This discretization into subsets is called the windowing approach and is mostly done based on time. Some of the most commonly used examples are the landmark, sliding, and damped window models [14]. These models are discussed in the next section.
The notion of evolutionary clustering for processing time-stamped datasets by producing a sequence of clustering solutions, one for each time-step of the temporal data, is introduced in [15]. The algorithm optimizes two competing criteria: each clustering in the sequence should be similar to the clustering at the previous time-step, while at the same time accurately reflecting the data arriving during the current time-step. This framework has been further extended to spectral clustering [16], density-based clustering [17], and the Hierarchical Dirichlet Process with the Hidden Markov model [18].
Using a fully online method, Hyde et al. [19] offer an algorithm that clusters evolving data streams into arbitrarily shaped clusters. The approach consists of two stages: the first stage finds micro-clusters in the datasets, and the second merges these micro-clusters into macro-clusters. In a similar vein, Fahy et al. [20] describe an Ant Colony Stream Clustering technique built on a density-based methodology that recognises clusters as collections of micro-clusters. To read a stream and create micro-clusters in the window pane, the method uses a tumbling window model. These clusters are then further refined by combining related clusters based on a similarity index. Fahy and Yang [21] further enhance this technique to address the multi-density issue in the density-based clustering strategy: it uses the local radius of each cluster to identify clusters and then tracks changes in the solutions. Multiple-view clustering challenges are addressed for the first time by Huang et al. [22] in the MVStream clustering method. To assign cluster labels to the data items, this technique creates support vectors, which include summary statistics, from various views of the data objects. Similarly, some studies have been conducted on measuring the similarities between trajectories in a dynamic environment [23–25].
2 Window models
In a landmark window model, all items that arrive after some specific time-point (the landmark time) are retained and cannot be discarded, irrespective of window size. The window size is uncontrolled and keeps increasing as time progresses [26, 27]. The data records arriving in the interval (t_{i-1}, t_i] are accumulated according to the equation given by:

D_i = \bigcup_{l=1}^{i} d_l, \quad i = 1, \ldots, n \qquad (1)

where n is the number of time-points and t_i is the current time-point. Implementation of the landmark window model will generate n window panes, where each pane contains the data items evolving from the starting time-point t1 to the current time-point ti.
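As an illustrative sketch of this accumulation (not the package's internal code), the landmark panes can be built by stacking all batches received up to the current time-point:

```r
# Landmark window: pane D_i accumulates every batch from d_1 up to d_i (Eq 1).
landmark_panes <- function(batches) {
  lapply(seq_along(batches), function(i) do.call(rbind, batches[seq_len(i)]))
}

# Example with three small batches of 2-d points
set.seed(1)
d <- lapply(1:3, function(i) matrix(rnorm(10), ncol = 2))
panes <- landmark_panes(d)
sapply(panes, nrow)  # 5 10 15 -- each pane grows by one batch
```

Because nothing is ever discarded, the memory footprint of the panes grows linearly with the stream.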
The sliding window model, on the other hand, is based on a fixed window size w that retains only those objects falling in the interval [t_{i-w+1}, t_i], while older cases are discarded. In such a model, as time progresses, the window slides forward while keeping its size w, including new data records and discarding older ones [27, 28]. The sliding window scenario can be described by the equations below:

D_1 = d_1 \qquad (2)

D_2 = \bigcup_{l=1}^{w} d_l \qquad (3)

D_3 = \bigcup_{l=2}^{w+1} d_l \qquad (4)

\vdots

D_m = \bigcup_{l=n-w+1}^{n} d_l \qquad (5)

where m is the number of window panes and is equal to n − w + 2, n is the number of time-points, and w is the sliding window size.
3 The change detection algorithm
In order to monitor and trace the evolution of clusters extracted by re-clustering cumulative datasets, [29] introduced a framework known as the ‘MONIC’ algorithm. The algorithm is based on clustering cumulative datasets arriving at discrete time-points t1, t2, …, tn. Initially, data are collected at time-point t1, and as time progresses new data records join the dataset at regular intervals. The initial datasets d1, d2, …, dn are accumulated and re-clustered at each time-point t1, t2, …, tn to monitor and detect cluster evolution over time.
The algorithm is mainly based on the idea of a non-symmetric overlap matrix between two clusterings extracted from cumulative datasets at two different time-points. Let ξ_i = {X_1, X_2, …, X_{k_1}} be the set of clusters extracted from dataset D_i at time point t_i, referred to as the first clustering. Similarly, let ξ_j = {Y_1, Y_2, …, Y_{k_2}} be the set of clusters extracted from dataset D_j at time point t_j (i < j), referred to as the second clustering. Then the overlap matrix can be defined as:

overlap(X_l, Y_m) = \frac{|X_l \cap Y_m|}{|X_l|}, \quad l = 1, \ldots, k_1; \; m = 1, \ldots, k_2 \qquad (6)

where k1 is the number of clusters in the first clustering ξi, and k2 is the number of clusters in the second clustering ξj. This generates a matrix of order k1 × k2, whose rows and columns correspond to the first and second clustering respectively. Each element of the matrix represents the similarity index between clusters Xl and Ym. The MONIC framework assumes hard clustering, where each observation belongs to one and only one cluster [30].
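The overlap matrix of Eq 6 can be sketched as follows for clusters stored as vectors of point identifiers; this is an illustration of the formula, not the package's internal implementation:

```r
# Overlap between two hard clusterings (Eq 6): entry (l, m) is the fraction
# of cluster X_l's members that reappear in cluster Y_m.
overlap_matrix <- function(first, second) {
  out <- matrix(0, length(first), length(second))
  for (l in seq_along(first))
    for (m in seq_along(second))
      out[l, m] <- length(intersect(first[[l]], second[[m]])) / length(first[[l]])
  out
}

X <- list(c("a", "b", "c"), c("d", "e"))  # first clustering,  k1 = 2
Y <- list(c("a", "b"), c("c", "d", "e"))  # second clustering, k2 = 2
overlap_matrix(X, Y)
```

Note the asymmetry: each row is normalised by the size of the corresponding cluster in the first clustering, so swapping the arguments generally changes the matrix.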
In the context of this algorithm, a transition is the change experienced by a cluster X ∈ ξi when it is perceived at the second clustering ξj. Such a change in the clustering solution is referred to as an external or internal transition. External transitions concern the relationship of the clusters found at clustering ξi to the clusters found at clustering ξj, whereas internal transitions are changes that occur in the structure of the survived clusters.
External transitions fall into five categories: survive, merge, split, disappear, and emerge. A cluster Xl ∈ ξi may survive into Ym ∈ ξj; several clusters {X1, X2, …} ⊆ ξi may merge to form Ym ∈ ξj; or a cluster Xl ∈ ξi may split into several daughter clusters {Y_{m_1}, Y_{m_2}, …} ⊆ ξj. If a cluster Xl ∈ ξi does not experience any of the above transitions, then it disappears. Similarly, if a cluster Ym ∈ ξj is not the result of any external transition from its ancestors, then it is a newly emerged candidate. The overlap between Xl ∈ ξi and Ym ∈ ξj serves as an indicator for identifying the external transition experienced by the clusters at clustering ξi. This value is compared with a minimum threshold, say τ ∈ [0.5, 1], to identify a match of X ∈ ξi in Y ∈ ξj. A cluster Xl ∈ ξi is said to survive in Ym ∈ ξj if Ym is the only cluster with an overlap greater than τsurvive. If at least two clusters from ξi (say X1 and X2) have an overlap greater than τsurvive with Ym ∈ ξj, then it is a case of merge, i.e. X1 and X2 merge to form Ym. Furthermore, a cluster Xl is said to split into daughter clusters Y_{m_1}, …, Y_{m_M} if its overlap with each daughter is greater than τsplit and their collective overlap is greater than τsurvive, i.e. for a split the following two conditions are required:

overlap(X_l, Y_{m_p}) \ge \tau_{split}, \quad p = 1, \ldots, M \qquad (7)

\sum_{p=1}^{M} overlap(X_l, Y_{m_p}) \ge \tau_{survive} \qquad (8)

where M is the number of daughter clusters from the second clustering.
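The external-transition rules above can be illustrated for a single row of the overlap matrix. This simplified sketch ignores the column-wise checks the full algorithm performs to separate survival from merge and to detect newly emerged clusters:

```r
# Classify the fate of one cluster from the first clustering, given its row
# of the overlap matrix (simplified sketch of the MONIC rules; tau values
# correspond to the user-supplied thresholds).
classify_row <- function(ov_row, tau_survive = 0.5, tau_split = 0.3) {
  if (max(ov_row) >= tau_survive) return("survived (or merged)")
  parts <- ov_row >= tau_split
  if (sum(parts) >= 2 && sum(ov_row[parts]) >= tau_survive) return("split")
  "disappeared"
}

classify_row(c(0.9, 0.05, 0.0))   # one strong match    -> survival candidate
classify_row(c(0.4, 0.45, 0.05))  # two partial matches -> split (Eqs 7 and 8)
classify_row(c(0.2, 0.1, 0.1))    # no match            -> disappeared
```

In the full algorithm, a survival candidate is demoted to a merge when another cluster from ξi also exceeds τsurvive for the same target cluster Ym.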
The overlap cannot, however, be used as an indicator for monitoring changes within the survived clusters themselves. The shift in location of a survived cluster (Xl → Ym) can be traced by calculating the Euclidean distance between their centroids, normalized by the minimum radius. This information can be summarized in the following formula:

location.difference = \frac{d(c_{X_l}, c_{Y_m})}{\min(r_{X_l}, r_{Y_m})} \qquad (9)

where c_{X_l} and c_{Y_m} are the centroids of clusters Xl and Ym respectively, and d(·, ·) is the Euclidean distance between them. The radius r of a cluster is computed as the maximum distance of an object from its cluster centroid. If the absolute value of location.difference is greater than τlocation, the algorithm detects a shift in location of the survived cluster.
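Eq 9 can be computed directly from the cluster members; a sketch assuming each cluster is a numeric matrix with one row per member:

```r
# Internal transition in location (Eq 9): Euclidean distance between the two
# centroids, normalised by the smaller of the two cluster radii.
location_difference <- function(X, Y) {
  cX <- colMeans(X); cY <- colMeans(Y)
  rX <- max(sqrt(rowSums(sweep(X, 2, cX)^2)))  # radius = farthest member
  rY <- max(sqrt(rowSums(sweep(Y, 2, cY)^2)))
  sqrt(sum((cX - cY)^2)) / min(rX, rY)
}

X <- rbind(c(0, 0), c(2, 0), c(0, 2), c(2, 2))  # centroid (1, 1), radius sqrt(2)
Y <- X + 1                                      # same shape, shifted to (2, 2)
location_difference(X, Y)                       # sqrt(2) / sqrt(2) = 1
```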
For a density transition, the average distance of objects from the cluster centroid is computed. The formula for the density of a cluster X with centroid c_X is given by:

density(X) = \frac{1}{|X|} \sum_{x \in X} d(x, c_X) \qquad (10)

The difference in density of a cluster Xl that survived into Ym is normalized by the minimum radius, i.e.

density.difference = \frac{density(X_l) - density(Y_m)}{\min(r_{X_l}, r_{Y_m})} \qquad (11)

If the absolute value of density.difference is less than τdensity, then there is no change in the density of the survived cluster. On the other hand, if the absolute value is greater than τdensity, a change in density is detected. If density.difference is positive, the cluster has become more compact than its ancestor; otherwise, it has become more diffuse.
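Eqs 10 and 11 can be sketched in the same way, using the sign convention stated above (a positive value means the survived cluster became more compact):

```r
# Density of a cluster (Eq 10): average member distance from the centroid.
cluster_density <- function(X) {
  cX <- colMeans(X)
  mean(sqrt(rowSums(sweep(X, 2, cX)^2)))
}

# Density transition (Eq 11): signed difference, normalised by the smaller
# radius; positive -> the survived cluster is more compact than its ancestor.
density_difference <- function(X, Y) {
  rX <- max(sqrt(rowSums(sweep(X, 2, colMeans(X))^2)))
  rY <- max(sqrt(rowSums(sweep(Y, 2, colMeans(Y))^2)))
  (cluster_density(X) - cluster_density(Y)) / min(rX, rY)
}

X <- rbind(c(-2, 0), c(2, 0), c(0, -2), c(0, 2))  # avg distance 2, radius 2
Y <- X / 2                                        # same shape, half the spread
density_difference(X, Y)                          # (2 - 1) / 1 = 1 -> more compact
```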
4 Package description
The state-of-the-art “MONIC” algorithm is implemented in R via the package clusTransition. The package can be used for tracing and monitoring the evolution of clustering solutions in cumulative datasets over time. In this section, we describe the functions, methods, and classes exported by the package. Fig 1 below demonstrates the workflow of the package.
The Transition function exported by the package offers three different options for importing datasets and then traces changes in the clustering solutions.
Table 1 below summarizes the functions, methods, and classes exported by the package along with its corresponding arguments and slots.
More details about these functions and classes are described below.
4.1 Function Transition()
The evolution of clusters can be traced using the primary function Transition(), which returns an object of class S4. A typical call to the Transition() function involves three essential pieces: the data input (listdata, listclus, or Overlap), the choice of window size swSize, and the threshold parameters. The swSize and k arguments need only be provided when datasets are imported via the listdata argument. The function has the following interface:
>Transition(listdata, listclus = NULL, Overlap = NULL, swSize = 1, typeind = 1,
+ survival_thresHold = 0.8, split_thresHold = 0.3, location_thresHold = 0.3,
+ density_thresHold = 0.3, k)
To keep the functions portable across many kinds of hard clustering algorithms, three different options, i.e. listdata, listclus, and Overlap, are provided for importing the data.
The listdata argument imports the raw data stream at discrete time points t1, t2, …, tn; each element of the list corresponds to the dataset at a single time point. A sequence of cluster solutions is generated from the stream using the k-means clustering algorithm. The number of clusters in each cumulative data matrix is specified by the argument k.
On the other hand, the listclus argument imports the clustering solutions at successive time-points to allow clustering algorithms other than k-means. Each element of listclus is a nested list that contains the clustering solution at the corresponding time point, i.e. ξ_i = {X_1, …, X_{k_i}}.
Overlap is a list of numeric matrices containing similarity measures between clusters extracted at consecutive time points. The similarity between clusters is computed using Eq 6. The Overlap method exported by the package can be used to compute these similarity matrices.
swSize indicates the size of the sliding window. The default value swSize = 1 implements the landmark window model and discretizes the stream according to Eq 1, whereas other numeric values discretize the stream in a sliding window scenario according to Eq 5. The sliding window size can only be provided if the listdata argument is chosen.
The survival_thresHold, split_thresHold, location_thresHold, and density_thresHold arguments are the minimum threshold values for, respectively, the survival of a cluster from X ∈ ξi to Y ∈ ξj, the split of a cluster X ∈ ξi into {Y_{m_1}, Y_{m_2}} ⊆ ξj, a shift in location, and a change in density of the survived clusters. These are user-defined parameters and belong to the interval (0, 1).
One of the most perplexing problems with most clustering algorithms is deciding the ideal number of partitions. This is a crucial parameter for partitioning, hierarchical, and model-based clustering algorithms: the number of clusters to be generated from a dataset has to be predefined. There are several ways of estimating the optimal number of clusters k, such as the silhouette, Gap, and elbow methods. The argument k is a numeric vector containing the relevant number of clusters at each time-point; its length is determined by swSize. This argument should only be provided if the listdata argument is chosen.
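For instance, the Gap statistic is available through clusGap() in the recommended cluster package. The call below is an illustrative way to pick k for a single window pane; the parameter values (K.max, B, nstart) are arbitrary choices for this sketch, not package defaults:

```r
# Estimate the number of clusters in one window pane via the Gap statistic.
library(cluster)

set.seed(100)
pane <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),   # cluster around (0, 0)
              matrix(rnorm(50, mean = 5), ncol = 2))   # cluster around (5, 5)

gap <- clusGap(pane, FUNcluster = kmeans, K.max = 6, B = 25, nstart = 10)
k_hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
k_hat  # estimated number of clusters for this pane
```

The resulting estimates, one per pane, form the vector passed to the k argument of Transition().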
Typing the name of the object holding the output of Transition() prints the external and internal transition results at each time point. External transitions include the number of clusters that survived, were absorbed by others, split into several, disappeared, or newly emerged at the second clustering. Internal transitions comprise changes in the location and density of the survived clusters.
Along with this information, the Monic object holds the cluster’s radius, membership, and distance between cluster centres.
4.2 OverLap class
An object of class OverLap contains summaries of the first and second clustering. It has eight slots that serve as input for tracking the evolution of clusters by the Transition() function. The slots include a numeric matrix containing the similarities between clusters of the first and second clustering (the overlap computed from Eq 6), the clusters’ membership vectors, radii, centres, and the average distance of items from the cluster centres (computed from Eq 10). An empty object can be created with the following interface:
>obj <- new("OverLap")
4.3 Overlap method
This method initializes the slots of an object of class OverLap by importing the clustering solutions ξ of the cumulative datasets D at two consecutive time points i and j. The clusters at each time point should be provided as a list of matrices, where each matrix contains the data belonging to one cluster. It has the following interface:
>Overlap <- Overlap(object, e1 = C1, e2 = C2)
where e1 is the set of clusters obtained at time point ti from cumulative dataset Di, e2 is the set of clusters obtained at time point tj from cumulative dataset Dj, and object is an object of class OverLap.
4.4 Function moplot()
This function plots three bar-plots and one line graph. The first, a stacked bar-plot, shows the SurvivalRatio and AbsorptionRatio; the second bar-plot shows the number of newly emerged clusters at each time stamp; and the third shows the number of disappearances at each time stamp. The line graph shows the passforward ratio and SurvivalRatio.
> plot(obj)
5 Simulation example
Let us assume that a data stream consists of datasets d1, d2, …, dn arriving at corresponding time-points t1, t2, …, tn respectively. To generate the initial dataset d1, we use a generator that takes into account the number of clusters (k), the size of each cluster, and the separation value between them [31]. The generator for the subsequent batches d2, d3, …, dn takes the centre of each cluster, the size of each cluster, and the covariance structure between them as input [32, 33].
As a working example, we generate a data stream sprouting at four consecutive time points. Fig 2 below demonstrates the scenario for generating datasets di, i = 1, 2, 3, 4 at the four time points. The new objects joining the underlying population are shown in red, whereas older records are displayed in black.
6 Pre-processing
Prior to implementing the change detection algorithm on cluster solutions over time, the user needs to pre-specify some relevant parameters. First, the user must choose a suitable windowing approach for the accumulation of the datasets evolving at successive time points. For this purpose, the package offers two windowing approaches, i.e. the landmark and sliding window models. Implementing the chosen windowing approach accumulates the datasets at the corresponding time points and generates window panes at successive time points. In the second phase, the optimal number of clusters in each window pane Di at the corresponding time point must be determined using an appropriate technique. For illustration purposes, we use worked examples based on the datasets simulated in section 5. The datasets are accumulated according to the landmark and sliding windowing approaches, and the optimal number of clusters is then estimated in each window pane Di.
The implementation of the landmark window model will produce four window panes. Each pane contains the datasets generated in the interval [t1, ti], where ti represents the current time point. Table 2 below shows the number of objects and the optimal number of clusters in each window pane Di, estimated from the Gap statistic at the corresponding time point ti.
Similarly, the implementation of a sliding window of size 3 will generate 3 window panes. Table 3 below demonstrates the number of objects and optimal number of clusters in each window pane Di.
7 Implementation of function Transition()
In this section, the implementation of the primary function Transition() is presented using working examples. The data stream simulated in section 5 is used for monitoring cluster evolution over time. The function provides three different options for importing the datasets, which are explained in the subsections below.
7.1 Looking at listdata argument
The argument listdata is a list of matrices or data frames containing the datasets d1, d2, …, dn evolving at corresponding time-points t1, t2, …, tn. The ith element of listdata comprises the set of data items di evolving at the corresponding time point ti. The Transition() function accumulates the datasets di according to the windowing approach specified via the swSize argument. The default value swSize = 1 implements the landmark window model, whereas other integer values implement the sliding window model. The accumulation of the datasets di generates window panes Di containing the cumulative datasets at successive time points. Each window pane Di is re-clustered using the cclust() function from the flexclust package [34]. The optimal number of clusters in each cumulative dataset Di should be decided by the user and imported via the argument k. Both the k and swSize arguments are used only if the listdata option is chosen for importing the datasets di. The argument typeind = 1 selects the listdata input. Monitoring and tracking the evolution of clusters using the landmark window model is shown in the example below.
7.1.1 Example (listdata argument with landmark window model).
The default value of swSize = 1 implements the landmark window model and generates n window panes of cumulative datasets Di according to Eq 1. In this working example, the datasets generated in section 5 are used. According to Table 2, the window panes D1, D2, D3, and D4 comprise 4, 4, 5, and 4 clusters respectively. Hence the Transition() function with arguments listdata = listdata, swSize = 1, typeind = 1, Survival_thrHold = 0.8, Split_thrHold = 0.3, and k = c(4,4,5,4) can be called as:
>library(clusTransition)
>listdata <- list(d1, d2, d3, d4)
>clusterTrace <- Transition(listdata = listdata, swSize = 1, typeind = 1,
+ Survival_thrHold = 0.8, Split_thrHold = 0.3, k = c(4,4,5,4))
This will generate two tables, displaying the number of clusters experiencing external and internal transitions at successive time points. The first table in the output comprises the number of clusters that experience external transitions at the corresponding time points tj. Similarly, the second table comprises the number of survived clusters that underwent internal transitions at the corresponding time points. The full summary of external and internal transitions is shown below.
The object clusterTrace returned by the Transition() function is an S4 object of class Monic. It contains the candidates that experience external and internal transitions at successive time points. The slots ending in x represent candidates that undergo external transitions from the first clustering ξi, whereas the slots ending in y represent the candidates that evolve as a result of the corresponding external transition at the second clustering ξj. For example, the candidates that experience external transitions at time point t3 can be retrieved as:
Let Cim ∈ ξi (first clustering) be a cluster that experiences some external transition and evolves as Cjn ∈ ξj (second clustering), where the first subscripts (i and j) represent the time point and the second subscripts (m and n) the cluster number. The Time Step [[3]] entry in the output represents the time point tj at the second clustering, and hence the time point ti (i = j − 1) at the first clustering ξi is one less. So in this particular example i = 2 and j = 3, and the above transition can be summarized as:
The algorithm detects that three clusters survive (C21→C31, C23→C34, and C24→C32) and one cluster splits (C22→{C33, C35}).
7.1.2 Example (listdata argument with sliding window model).
If one is interested in the sliding window model, where older records are discarded with the progression of time, this can be achieved through the swSize argument. In this synthetic example, swSize = 3 generates window panes containing the datasets that arrive in the interval [ti − 3 + 1, ti]. Table 3 shows that the number of clusters in window panes D1, D2, and D3 is 4, 5, and 6 respectively. Hence the Transition() function with arguments listdata = listdata, swSize = 3, typeind = 1, Survival_thrHold = 0.8, Split_thrHold = 0.3, and k = c(4,5,6) can be called as:
>clusterTrace <- Transition(listdata = listdata, swSize = 3, typeind = 1,
+ Survival_thrHold = 0.8, Split_thrHold = 0.3, k = c(4,5,6))
7.2 Looking at listclus argument
The listdata argument permits users to import the un-clustered datasets d1, d2, …, dn arriving at time-points t1, t2, …, tn. However, this restricts the package to only one clustering algorithm, i.e. k-means. To make the package more flexible for other types of hard clustering, the alternative argument listclus is provided. It imports the clustering solutions of each window pane as a list, i.e. listclus = {ξ1, ξ2, …, ξn}, and computes the similarity indices between them. The argument listclus is a list in which every element is itself a nested list of matrices or data frames; the ith element corresponds to the set of clusters extracted at time-point ti by applying an appropriate clustering algorithm to window pane Di. This is explained in the example below.
7.2.1 Example: Listclus argument.
Prior to applying the Transition() function, the user needs to extract clusters from each window pane Di. For this purpose, first accumulate the initially collected datasets d1, d2, …, dn according to a suitable window model, the landmark model in this example. This can be done by explicitly calling the merge() function from the base package. Running the R code given below generates four window panes.
>D1 <- d1
>D2 <- merge(d1, d2, all.x = TRUE, all.y = TRUE)
>D3 <- merge(D2, d3, all.x = TRUE, all.y = TRUE)
>D4 <- merge(D3, d4, all.x = TRUE, all.y = TRUE)
Fitting of clustering algorithm
Afterwards, choose the relevant number of clusters for each window pane Di and extract the clusters with an appropriate clustering algorithm. Save each clustering solution as a list of matrices or data frames. For illustration purposes, we obtain 4, 4, 5, and 4 clusters from the datasets D1, D2, D3, and D4 respectively.
>set.seed(100)
>fit1 <- kmeans(D1, 4)
>C1 <- list()
>for(i in 1:4)C1[[i]] <- D1[fit1$cluster == i,]
where C1 = {C11, C12, C13, C14} is a list of clusters extracted from D1 at time point t1. Similarly, extract clusters from all window panes at corresponding time point as:
>fit2 <- kmeans(D2, 4)
>C2 <- list()
>for(i in 1:4)C2[[i]] <- D2[fit2$cluster == i,]
>fit3 <- kmeans(D3, 5)
>C3 <- list()
>for(i in 1:5)C3[[i]] <- D3[fit3$cluster == i,]
>fit4 <- kmeans(D4, 4)
>C4 <- list()
>for(i in 1:4)C4[[i]] <- D4[fit4$cluster == i,]
Combine all these clustering solutions in a single list and apply the Transition() function with the arguments listclus = listclus, typeind = 3, Survival_thrHold = 0.8, and Split_thrHold = 0.3:
>listclus <- list(C1, C2, C3, C4)
>clusterTrace <- Transition(listclus = listclus, typeind = 3,
+ Survival_thrHold = 0.8, Split_thrHold = 0.3)
7.3 Looking at Overlap argument
The Overlap argument likewise permits the user to employ other clustering algorithms and trace the evolution of clusters over time. It imports a list of objects, as produced by the Overlap() method, containing the similarities between clusterings obtained at successive time points ti and tj (i < j) together with the summaries of these clusters. This option is selected by setting typeind = 2. The overlap matrices can be computed with the S4 method Overlap() exported by the clusTransition package. As with listclus, a clustering algorithm can be applied to the landmark- or sliding-window-modeled datasets to extract the cluster memberships at the corresponding time-points; the lists of clusters extracted from Di−1 and Di are then used to compute the overlap matrix between the clusterings. This is elaborated in the working example below.
7.3.1 Example: Overlap argument.
Let C1 = {C11, C12, C13, C14}, C2 = {C21, C22, C23, C24}, C3 = {C31, C32, C33, C34, C35}, and C4 = {C41, C42, C43, C44} be the set of clustering solutions obtained from corresponding datasets D1, D2, D3, and D4. These sets of clustering solutions are already obtained in the previous example. Then the objects of class OverLap can be created and initialized as:
>obj <- new("OverLap")
>Overlap1 <- Overlap(obj, e1 = C1, e2 = C2)
>Overlap2 <- Overlap(obj, e1 = C2, e2 = C3)
>Overlap3 <- Overlap(obj, e1 = C3, e2 = C4)
Combine all these objects in a list and apply Transition() function with arguments Overlap = Overlap, typeind = 2, Survival_thrHold = 0.8, Split_thrHold = 0.3 as:
>Overlap <- list(Overlap1, Overlap2, Overlap3)
>clusterTrace <- Transition(Overlap = Overlap, typeind = 2,
+ Survival_thrHold = 0.8, Split_thrHold = 0.3)
7.4 moplot() function
Fig 3 displays the graphical summary of the object of class Monic returned by the Transition() function. The stacked bar-plot in the top left corner displays the survival and absorption ratios at successive time points. The figure illustrates that all clusters survived at time point t1, hence the survival ratio is 1. At time point t2, 3 out of 4 clusters survived, giving a survival ratio of 0.75. Similarly, at time point t3, 3 out of 5 clusters survived while 2 merged, resulting in survival and absorption ratios of 0.60 and 0.40 respectively. Furthermore, no cluster disappeared and no newly emerged candidates were detected at any time point. This can also be seen from the pass-forward ratio, which is unity at all time points except t2, where one cluster splits into daughter candidates.
The new data items at each time stamp is shown by the red color, whereas the older data items are shown by black color.
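The summary plot described above is produced with a single call. A minimal sketch, assuming clusterTrace is the object of class Monic returned by Transition() in the earlier example:

```r
# Assumes the clusTransition package is installed and clusterTrace holds
# the Monic object returned by Transition() in the previous example.
library(clusTransition)

# Graphical summary: stacked bar-plots of the survival/absorption and
# pass-forward ratios, plus per-time-point scatter plots of the data
# (new items in red, older items in black).
moplot(clusterTrace)
```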
8 Real data example
To demonstrate the practicality of the package and to gain a deeper understanding of applications of cluster evolution, we investigate three real-life datasets. To capture the transformation in the social, political, and moral attitudes of European nations, the Human Values datasets were extracted from the European Social Survey [35]. The changes in the electricity consumption of inhabitants were traced using the Individual Household Electric Power Consumption dataset. Similarly, the Intel Lab sensor streaming dataset was used to illustrate further applications of the framework. The latter two data streams were obtained from the "UCI Machine Learning Repository".
8.1 Application to human values scale
As a case study, we extract eight datasets, each corresponding to a single round of the European Social Survey (ESS) conducted in the years 2002, 2004, 2006, 2008, 2010, 2012, 2014, and 2016, respectively. The dataset consists of 25024 individuals who responded to the Schwartz Value Survey (SVS) for computing basic human values and can be downloaded from the URL https://ess-search.nsd.no/CDW/ConceptVariables. The ten basic values are Benevolence, Universalism, Self-direction, Security, Conformity, Hedonism, Achievement, Tradition, Stimulation, and Power [35]. The k-means clustering algorithm was applied to the sliding-window-modeled datasets at each time point, and the number of clusters in the respective datasets was estimated with the well-known GAP statistic. Fig 4 describes the evolution of clusters at time points ti, i = 1, 2, 3, 4, 5, 6, 7, in the Human Value scale datasets and demonstrates that two clusters, C11 and C12, survived over time. The first imperative cluster was C11 (C11→C22→C32→C42), which emerged at t1 (2002) and survived until t4 (2006, 2010). Although the cluster survived until 2010, it experienced internal transitions, became more diffuse, and eventually disappeared at time point t5. The second vibrant cluster was C12 (C12→C24→C33→C41→C52→C63→C71), which survived through the entire time span. This was the most important cluster because it not only survived over time but also became denser; most of the new respondents of the SVS surveys over the years joined this cluster. A shift in location was observed for this cluster at time points t2 and t3, after which it remained stable. The first external transition was experienced by cluster C14, which split into two clusters and ultimately disappeared. The algorithm also detected a cluster C61 that emerged at t6 (2010, 2014) and passed forward while absorbing elements of cluster C62.
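The clustering step just described can be sketched as follows; a hypothetical data matrix D stands in for one window pane (names and sizes are illustrative, not taken from the ESS data), with clusGap() from the cluster package estimating k via the GAP statistic before k-means is run:

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(1)
D <- scale(matrix(rnorm(200 * 10), ncol = 10))  # stand-in for one window pane

# Estimate the number of clusters with the GAP statistic
gap <- clusGap(D, FUNcluster = kmeans, K.max = 8, B = 50, nstart = 10)
k   <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")

# k-means clustering with the estimated number of clusters
cl <- kmeans(D, centers = k, nstart = 10)
table(cl$cluster)
```

The resulting cluster memberships at each time point can then be collected, e.g. into the listclus or Overlap inputs of Transition().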
8.2 Application to Individual Household Electric Power Consumption
As a second example, the Individual Household Electric Power Consumption dataset for the years [2006, 2010] was used. This dataset comprises 2075259 records characterized by seven numerical attributes. The dataset is available at the UCI Machine Learning Repository [36] and can be downloaded from https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption. A sliding window model of size 2 was used to accumulate the stream at successive time points. In this section, we use the CLARA algorithm to extract clusters from the datasets at successive time points, while the average silhouette method was used to estimate the optimal value of k in each window pane. Fig 5 demonstrates the evolution of clusters at time points ti, i = 1, 2, 3, 4, 5, in the individual household electric power consumption datasets. The algorithm detects that all four clusters survived (C11→C21, C12→C22, C13→C23, and C14→C24), experiencing internal transitions and becoming diffuse during [2006, 2007]. A shift in location was detected for only one cluster, C13, whereas the other clusters were stable in location. Similarly, three clusters survived (C21→C31, C22→C33, and C24→C34), one cluster disappeared (C23→ ⊙), and one cluster emerged (⊙→C32) during [2007, 2008]. Two of the surviving clusters became diffuse, while one cluster became more compact than its predecessor. Likewise, one cluster survived (C33→C43), three disappeared (C31→ ⊙, C32→ ⊙, and C34→ ⊙), and three newly emerged clusters (⊙→C41, ⊙→C42, and ⊙→C44) were detected during [2008, 2009]. Afterwards, all four clusters disappeared (C41→ ⊙, C42→ ⊙, C43→ ⊙, and C44→ ⊙), and three new clusters emerged (⊙→C51, ⊙→C52, and ⊙→C53) during [2009, 2010].
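The clustering step for this example can be sketched analogously; a hypothetical window pane D is used for illustration, with the average silhouette width selecting k before CLARA extracts the clusters:

```r
library(cluster)  # provides clara()

set.seed(2)
D <- scale(matrix(rnorm(500 * 7), ncol = 7))  # stand-in for one window pane

# Average silhouette width for a range of candidate numbers of clusters
ks      <- 2:8
avg_sil <- sapply(ks, function(k) clara(D, k)$silinfo$avg.width)
k_opt   <- ks[which.max(avg_sil)]

# CLARA (Clustering LARge Applications) with the selected k
cl <- clara(D, k_opt)
table(cl$clustering)
```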
8.3 Intel Lab dataset
In this section, we use the publicly accessible dataset recorded from 54 sensors deployed at the Intel research laboratory between February 28th and April 5th, 2004. Each sensor records temperature, humidity, voltage, and light every thirty-one seconds. The dataset comprises 2.3 million readings collected from the 54 sensors. The sensors were designed to be energy-efficient and consume power only while sensing the environment and transmitting data. We select only a subset of measurements from this dataset, including readings from sensor 1 only. This subset consists of 43,047 readings and can be downloaded from the URL https://www.kaggle.com/datasets/divyansh22/intel-berkeley-research-lab-sensor-data.
We accumulate the dataset according to the landmark window model and, as the flow is uniform, consider 9000 records per time period. This implementation generates 5 window panes of cumulative datasets. The optimal number of clusters in the cumulative dataset at each time point was determined by the shadow statistic, and the Partitioning Around Medoids (PAM) algorithm was used to extract the clusters.
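The landmark accumulation and PAM step can be sketched as follows; a small simulated stream stands in for the sensor data so the example runs quickly, and k is fixed for illustration rather than estimated by the shadow statistic:

```r
library(cluster)  # provides pam()

set.seed(3)
stream    <- matrix(rnorm(1000 * 4), ncol = 4)  # small stand-in for the stream
pane_size <- 200                                # the paper uses 9000 per period

# Landmark window model: pane i holds all records from the start up to t_i
panes <- lapply(1:5, function(i) stream[1:(i * pane_size), , drop = FALSE])

# PAM clustering of each cumulative pane (k fixed at 3 for illustration)
clusterings <- lapply(panes, function(D) pam(D, k = 3))
sapply(clusterings, function(cl) nrow(cl$medoids))
```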
Fig 6 demonstrates the transitions of clusters at time points ti, i = 1, 2, 3, 4, 5, in the Intel Lab dataset. The algorithm detects that all six clusters survived (C11→C21, C12→C22, C13→C24, C14→C25, C15→C26, and C16→C23) while one new cluster emerged (⊙→C27) at time point t2. All surviving clusters experienced internal transitions and became more diffuse. Next, six clusters survived (C21→C31, C22→C32, C24→C33, C25→C34, C26→C35, and C27→C36) and one cluster disappeared (C23→ ⊙) at time point t3. Cluster C24 experienced a double internal transition, i.e. a shift in location and a change in density, while the other clusters only became diffuse. Likewise, five clusters survived (C31→C43, C32→C45, C34→C44, C35→C42, and C36→C47), one cluster disappeared (C33→ ⊙), and two clusters emerged (⊙→C41 and ⊙→C46) at time point t4. Similarly, five clusters survived (C42→C54, C43→C56, C44→C57, C45→C53, and C47→C55), two clusters merged ({C41, C46}→C51), and one cluster emerged (⊙→C52) at time point t5.
For further details on the significance and practical applications of monitoring changes in clustering solutions of streaming datasets, see Atif et al. [37].
9 Concluding remarks
In this paper, we introduced an R package, clusTransition, dedicated to tracing the evolution of cluster solutions in cumulative datasets. The package implements the state-of-the-art algorithm MONIC for modeling and tracing transitions of cluster solutions in dynamic datasets. The algorithm is based on re-clustering the cumulative datasets D1, D2, …, Dn arriving at the corresponding time points t1, t2, …, tn and monitoring the changes occurring in these cluster solutions. The changes comprise clusters that survive, split into several clusters, are absorbed by others, disappear, or newly emerge. Clusters that survive the external transitions may additionally experience a change in location or density, called an internal transition. We applied the clusTransition package to synthetic as well as real-life datasets to gain insight into the change detection framework.
10 Limitations of the package
The clusTransition package supports batch processing, where the stream is discretized and the gathered data are placed into a windowing model; the datasets are not clustered immediately upon arrival in real time. Moreover, the sliding and landmark window models either retain data items entirely or ignore them at subsequent time points. A damped window model, on the other hand, assigns each object an exponentially decreasing weight depending on its arrival time. Adding support for the damped window model to the package is planned for a future release.
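For illustration, a damped window model typically weights each object by an exponentially decaying function of its age, e.g. w(t) = 2^(-lambda (t - t0)), where t0 is the arrival time; a minimal sketch with hypothetical arrival times and decay rate:

```r
# Damped window model: exponentially decreasing weight by arrival time.
# lambda controls the decay rate; larger lambda forgets old items faster.
damped_weight <- function(t_now, t_arrival, lambda = 0.5) {
  2^(-lambda * (t_now - t_arrival))
}

arrival <- c(1, 2, 3, 4, 5)  # hypothetical arrival time points
round(damped_weight(t_now = 5, arrival), 3)
#> 0.250 0.354 0.500 0.707 1.000
```

The most recent item keeps full weight, while older items fade gradually instead of being dropped outright as in the sliding window model.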
The framework for monitoring cluster transitions presupposes hard clustering, which requires that each item be assigned to one and only one cluster. This assumption implies that the strategy cannot be applied to density-based or model-based clustering approaches, leaving the problem open for further investigation.
References
- 1. Wierzchoń S.T. and Kłopotek M. Modern Algorithms of Cluster Analysis. Studies in Big Data. Springer International Publishing; 2017. URL: https://books.google.com.pk/books?id=LeJEDwAAQBAJ
- 2. Rapkin B.D. and Luke D.A. Cluster analysis in community research: Epistemology and practice. Am J Commun Psychol. 1993; 21:247–277. https://doi.org/10.1007/BF00941623
- 3. Romesburg H.C. Cluster Analysis for Researchers. Morrisville, NC: Lulu.com (reprint of the 1984 edition, with minor revisions); 2004.
- 4. Fahad A., Alshatri N., Tari Z., Alamri A., Khalil I., Zomaya A. Y., et al. A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE Transactions on Emerging Topics in Computing. 2014; 2(3):267–279.
- 5. Montinaro M. and Sciascia I. Market segmentation models to obtain different kinds of customer loyalty. Journal of Applied Sciences. 2011; 11(4):655–662.
- 6. Borgen F. H. and Barnett D. C. Applying cluster analysis in counseling psychology research. Journal of Counseling Psychology. 1987; 34(4):456–468.
- 7. Punj G. and Stewart D. W. Cluster analysis in marketing research: Review and suggestions for application. Journal of Marketing Research. 1983; 20(2):134.
- 8. Zakharov K. Application of k-means clustering in psychological studies. The Quantitative Methods for Psychology. 2016; 12(2):87–100.
- 9. Landauer M., Wurzenberger M., Skopik F., Settanni G., and Filzmoser P. Dynamic log file analysis: An unsupervised cluster evolution approach for anomaly detection. Computers & Security. 2018; 79:94–116.
- 10. Oliveira M. and Gama J. MEC: Monitoring clusters' transitions. In Proceedings of STAIRS 2010: The Fifth Starting AI Researchers' Symposium. IOS Press; 2010. p. 212–224. ISBN 9781607506751.
- 11. Spiliopoulou M., Ntoutsi E., Theodoridis Y., and Schult R. MONIC and follow-ups on modeling and monitoring cluster transitions. Advanced Information Systems Engineering, Lecture Notes in Computer Science. 2013; 622–626.
- 12. Silva J. D. A., Hruschka E. R., and Gama J. An evolutionary algorithm for clustering data streams with a variable number of clusters. Expert Systems with Applications. 2017; 67:228–238.
- 13. Badiozamany S., Orsborn K., and Risch T. Framework for real-time clustering over sliding windows. In Proceedings of the 28th International Conference on Scientific and Statistical Database Management (SSDBM '16). 2016. https://doi.org/10.1145/2949689.2949696
- 14. Patroumpas K. and Sellis T. Window specification over data streams. Current Trends in Database Technology (EDBT 2006), Lecture Notes in Computer Science. 2006; 445–464.
- 15. Chakrabarti D., Kumar R., and Tomkins A. Evolutionary clustering. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06). New York, NY: Association for Computing Machinery; 2006. p. 554–560. https://doi.org/10.1145/1150402.1150467
- 16. Chi Y., Song X., Zhou D., Hino K., and Tseng B. L. On evolutionary spectral clustering. ACM Transactions on Knowledge Discovery from Data. 2009. https://doi.org/10.1145/1631162.1631165
- 17. Zhang Y., Liu H., and Deng B. Evolutionary clustering with DBSCAN. In Ninth International Conference on Natural Computation (ICNC). 2013; 923–928. https://doi.org/10.1109/ICNC.2013.6818108
- 18. Xu T., Zhang Z., Yu P. S., and Long B. Evolutionary clustering by hierarchical Dirichlet process with hidden Markov state. In Eighth IEEE International Conference on Data Mining. 2008; 658–667. https://doi.org/10.1109/ICDM.2008.24
- 19. Hyde R., Angelov P., and MacKenzie A.R. Fully online clustering of evolving data streams into arbitrarily shaped clusters. Information Sciences. 2017; 382:96–114.
- 20. Fahy C., Yang S., and Gongora M. Ant colony stream clustering: A fast density clustering algorithm for dynamic data streams. IEEE Transactions on Cybernetics. 2019; 49:2215–2228.
- 21. Fahy C. and Yang S. Finding and tracking multi-density clusters in online dynamic data streams. IEEE Transactions on Big Data. 2019; 1–15. https://doi.org/10.1109/TBDATA.2019.2922969
- 22. Huang L., Wang C.-D., Chao H.-Y., and Yu P.S. MVStream: Multiview data stream clustering. IEEE Transactions on Neural Networks and Learning Systems. 2020; 31:3482–3496.
- 23. Li H., Liu J., Yang Z., Liu R. W., Wu K., and Wan Y. Adaptively constrained dynamic time warping for time series classification and clustering. Information Sciences. 2020; 534:97–116.
- 24. Liang M., Liu R. W., Li S., Xiao Z., Liu X., and Lu F. An unsupervised learning method with convolutional auto-encoder for vessel trajectory similarity computation. Ocean Engineering. 2021; 225:108803.
- 25. Zhang Z., Huang K., and Tan T. Comparison of similarity measures for trajectory clustering in outdoor surveillance scenes. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR '06). Hong Kong, China: IEEE; 2006. p. 1135–1138.
- 26. Liu X., Guan J., and Hu P. Mining frequent closed itemsets from a landmark window over online data streams. Computers & Mathematics with Applications. 2009; 57(6):927–936.
- 27. Mansalis S., Ntoutsi E., Pelekis N., and Theodoridis Y. An evaluation of data stream clustering algorithms. Statistical Analysis and Data Mining: The ASA Data Science Journal. 2018; 11(4):167–187.
- 28. Hu Y. Optimal algorithm of data streams clustering on sliding window model. Journal of Computer Applications. 2008; 28(6):1414–1416.
- 29. Spiliopoulou M., Ntoutsi I., Theodoridis Y., and Schult R. MONIC: Modeling and monitoring cluster transitions. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06). 2006. https://doi.org/10.1145/1150402.1150491
- 30. Ntoutsi E., Spiliopoulou M., and Theodoridis Y. FINGERPRINT: Summarizing cluster evolution in dynamic environments. International Journal of Data Warehousing and Mining. 2012; 8(3):27–44.
- 31. Qiu W. and Joe H. Generation of random clusters with specified degree of separation. Journal of Classification. 2006; 23(2):315–334.
- 32. Qiu W. and Joe H. clusterGeneration: Random Cluster Generation (with Specified Degree of Separation). R package version 1.3.7. https://CRAN.R-project.org/package=clusterGeneration
- 33. Melnykov V., Chen W.-C., and Maitra R. MixSim: An R package for simulating data to study performance of clustering algorithms. Journal of Statistical Software. 2012; 51(12):1–25. https://doi.org/10.18637/jss.v051.i12
- 34. Leisch F. A toolbox for k-centroids cluster analysis. Computational Statistics and Data Analysis. 2006; 51(2):526–544.
- 35. European Social Survey Cumulative File, ESS 1-9 (2020). Data file edition 1.0. Sikt, Norwegian Agency for Shared Services in Education and Research, Norway. Data archive and distributor of ESS data for ESS ERIC. https://doi.org/10.21338/NSD-ESS-CUMULATIVE
- 36. Dua D. and Graff C. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
- 37. Atif M., Shafiq M., and Leisch F. Applications of monitoring and tracing the evolution of clustering solutions in dynamic datasets. Journal of Applied Statistics. 2021.