Figures
Abstract
The primary purpose of clustering analysis is to divide a dataset into a finite number of segments based on the similarities between items. In recent years, a significant amount of research has focused on the spatio-temporal aspects of clustering: clusters are no longer regarded as static objects, since they are influenced by changes in the underlying population. This paper describes an R package implementing the MONIC framework for tracing the evolution of clusters extracted from temporal datasets. The name of the package is clusTransition, which stands for Cluster Transition. The algorithm is based on re-clustering cumulative datasets that evolve at successive time-points and monitoring the transitions experienced by the clusters in these clustering solutions. This paper’s contribution is to demonstrate how the package clusTransition is developed in the R programming language; its workflow is discussed using hypothetical and real-life datasets.
Citation: Atif M, Leisch F (2022) clusTransition: An R package for monitoring transition in cluster solutions of temporal datasets. PLoS ONE 17(12): e0278146. https://doi.org/10.1371/journal.pone.0278146
Editor: Mohammad Mehdi Rashidi, Tongji University, CHINA
Received: July 16, 2022; Accepted: November 10, 2022; Published: December 15, 2022
Copyright: © 2022 Atif, Leisch. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper. All other data streams used in the manuscript are available in public repositories. DOI for Human Value Scale datasets: doi:10.21338/NSD-ESS-CUMULATIVE. Link for Human Value Scale datasets: https://ess-search.nsd.no/CDW/ConceptVariables Link for household Electric Power Consumption: https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption Link for Intel Lab sensor datasets: https://www.kaggle.com/datasets/divyansh22/intel-berkeley-research-lab-sensor-data.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
The prime goal of clustering analysis is the organization of a dataset into a finite number of segments according to the similarities between objects. Ideally, the objects in the same segment should be more similar to one another than to objects belonging to different partitions [1]. Each individual partition is known as a cluster, and the objects belonging to the same cluster are called its members [2, 3]. Clustering covers many real-world applications, ranging from business and economics, marketing, pattern recognition, medical sciences, and image processing to big data analysis [4]. For example, in market segmentation, better marketing strategies can be adopted by clustering customers with similar demographic or buying characteristics [5]. In a similar fashion, clustering can help in better understanding a disease and targeting appropriate treatment by sub-grouping patients into homogeneous sets based on psychological inventory scores [6]. Since the notion of clustering is not precisely defined, several algorithms/models have been proposed in the literature, and they may produce quite different clustering solutions [7, 8].
In recent years, a considerable amount of research has investigated the spatio-temporal properties of clustering. In these applications, clusters are no longer considered static objects, as they are affected by changes occurring in the underlying population [9, 15]. The inclusion of new data records in the original population over time may affect cluster memberships, and entirely different clustering solutions may be generated at later time-points. Such transitions in clustering solutions include the disappearance of specific clusters, the migration of elements from one cluster to another, the splitting of a cluster into several, the merging of several clusters into one, the survival of a cluster, and the emergence of new ones. Survived clusters can experience internal transitions, including changes in location, size, and density [10, 11]. Various topics such as spatio-temporal, evolutionary, stream, and incremental clustering address this issue by adapting to datasets that change over time. Tracing and understanding the phenomena behind these transitions is of practical importance for effective decision-making and can be helpful in fields such as marketing, fraud detection, networking, scientific publication, and health [12].
In many real-world applications, clustering of a data stream is performed continuously to identify changes occurring in the pattern of the underlying phenomena [13]. In a stream, new data items are continually generated and join the underlying population at regular intervals. Therefore, in order to control which part of the data contributes to the pattern being mined, the stream needs to be discretized into subsets based on some ordered attribute. This discretization into subsets is called the windowing approach and is mostly done based on time. Some of the most commonly used examples are the landmark, sliding, and damped window models [14]. These models are discussed in the next section.
The notion of evolutionary clustering for processing time-stamped datasets by producing a sequence of clustering solutions, one for each time-step of the temporal data, is introduced in [15]. The algorithm optimizes two competing criteria: each clustering in the sequence should be similar to the clustering at the previous time-step, while at the same time accurately reflecting the data arriving during the current time-step. This framework has been further extended to spectral clustering [16], density-based clustering [17], and the Hierarchical Dirichlet Process with the Hidden Markov model [18].
Using a fully online method, Hyde et al. [19] offer an algorithm that clusters evolving data streams into arbitrarily shaped clusters. The approach consists of two stages: the first stage finds micro-clusters in the datasets, and the second merges these micro-clusters into macro-clusters. In a similar vein, Fahy et al. [20] describe an Ant Colony Stream Clustering technique built on a density-based methodology that recognises clusters as collections of micro-clusters. To read a stream and create micro-clusters in the window pane, the method uses a tumbling window model. These clusters are then further refined by combining related clusters based on a similarity index. Fahy and Yang [21] further enhance this technique to address the multi-density issue in the density-based clustering strategy: it uses the local radius of each cluster to identify clusters and then tracks changes in the solutions. Multiple-view clustering challenges are addressed for the first time by Huang et al. [22] in the MVStream clustering method. To assign cluster labels to the data items, this technique creates support vectors, which include summary statistics, from various views of the data objects. Similarly, some studies have been conducted on measuring the similarities between trajectories in a dynamic environment [23–25].
2 Window models
In a landmark window model, all items that arrive after some specific time-point (the landmark time) are retained and cannot be discarded, irrespective of window size. The window size is uncontrolled and keeps increasing as time progresses [26, 27]. The data records arriving in the interval (t_{i-1}, t_i] are accumulated according to the equation given by:

D_i = \bigcup_{l=1}^{i} d_l, \quad i = 1, \ldots, n \qquad (1)

where n is the number of time-points and t_i is the current time-point. Implementation of the landmark window model will generate n window panes, where each pane contains the data items evolving from the starting time-point t1 to the current time-point ti.
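As an illustrative sketch of this accumulation (not the package's internal code), the landmark panes can be built by stacking all batches received up to the current time-point:

```r
# Landmark window: pane D_i accumulates every batch from d_1 up to d_i (Eq 1).
landmark_panes <- function(batches) {
  lapply(seq_along(batches), function(i) do.call(rbind, batches[seq_len(i)]))
}

# Example with three small batches of 2-d points
set.seed(1)
d <- lapply(1:3, function(i) matrix(rnorm(10), ncol = 2))
panes <- landmark_panes(d)
sapply(panes, nrow)  # 5 10 15 -- each pane grows by one batch
```

Because nothing is ever discarded, the memory footprint of the panes grows linearly with the stream.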
The sliding window model, on the other hand, is based on a fixed window size w that retains only those objects falling in the interval [t_{i-w+1}, t_i], while older cases are discarded. In such a model, as time progresses, the window slides forward while keeping its size w, including new data records and discarding older ones [27, 28]. The sliding window scenario can be described by the equations below:

D_1 = d_1 \qquad (2)

D_2 = \bigcup_{l=1}^{w} d_l \qquad (3)

D_3 = \bigcup_{l=2}^{w+1} d_l \qquad (4)

\vdots

D_m = \bigcup_{l=n-w+1}^{n} d_l \qquad (5)

where m is the number of window panes and is equal to n − w + 2, n is the number of time-points, and w is the sliding window size.
3 The change detection algorithm
In order to monitor and trace the evolution of clusters extracted by re-clustering cumulative datasets, [29] introduced a framework known as the ‘MONIC’ algorithm. The algorithm is based on clustering cumulative datasets arriving at discrete time-points t1, t2, …, tn. Initially, data are collected at time-point t1, and as time progresses new data records join the dataset at regular intervals. The initial datasets d1, d2, …, dn are accumulated and re-clustered at each time-point t1, t2, …, tn to monitor and detect cluster evolution over time.
The algorithm is mainly based on the idea of a non-symmetric overlap matrix between two clusterings extracted from cumulative datasets at two different time-points. Let ξ_i = {X_1, X_2, …, X_{k_1}} be the set of clusters extracted from dataset D_i at time point t_i, referred to as the first clustering. Similarly, let ξ_j = {Y_1, Y_2, …, Y_{k_2}} be the set of clusters extracted from dataset D_j at time point t_j (i < j), referred to as the second clustering. Then the overlap matrix can be defined as:

overlap(X_l, Y_m) = \frac{|X_l \cap Y_m|}{|X_l|}, \quad l = 1, \ldots, k_1; \; m = 1, \ldots, k_2 \qquad (6)

where k1 is the number of clusters in the first clustering ξi, and k2 is the number of clusters in the second clustering ξj. This generates a matrix of order k1 × k2, whose rows and columns correspond to the first and second clustering respectively. Each element of the matrix represents the similarity index between clusters Xl and Ym. The MONIC framework assumes hard clustering, where each observation belongs to one and only one cluster [30].
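The overlap matrix of Eq 6 can be sketched as follows for clusters stored as vectors of point identifiers; this is an illustration of the formula, not the package's internal implementation:

```r
# Overlap between two hard clusterings (Eq 6): entry (l, m) is the fraction
# of cluster X_l's members that reappear in cluster Y_m.
overlap_matrix <- function(first, second) {
  out <- matrix(0, length(first), length(second))
  for (l in seq_along(first))
    for (m in seq_along(second))
      out[l, m] <- length(intersect(first[[l]], second[[m]])) / length(first[[l]])
  out
}

X <- list(c("a", "b", "c"), c("d", "e"))  # first clustering,  k1 = 2
Y <- list(c("a", "b"), c("c", "d", "e"))  # second clustering, k2 = 2
overlap_matrix(X, Y)
```

Note the asymmetry: each row is normalised by the size of the corresponding cluster in the first clustering, so swapping the arguments generally changes the matrix.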
In the context of this algorithm, a transition is the change experienced by a cluster X ∈ ξi when it is perceived at the second clustering ξj. Such a change in the clustering solution is referred to as an external or internal transition. External transitions concern the relationship of the clusters found at clustering ξi to the clusters found at clustering ξj, whereas internal transitions are changes that occur in the structure of the survived clusters.
External transitions fall into five categories: survive, merge, split, disappear, and emerge. A cluster Xl ∈ ξi may survive into Ym ∈ ξj; several clusters {X1, X2, …} ⊆ ξi may merge to form Ym ∈ ξj; or a cluster Xl ∈ ξi may split into several daughter clusters {Y_{m_1}, Y_{m_2}, …} ⊆ ξj. If a cluster Xl ∈ ξi does not experience any of the above transitions, then it disappears. Similarly, if a cluster Ym ∈ ξj is not the result of any external transition from its ancestors, then it is a newly emerged candidate. The overlap between Xl ∈ ξi and Ym ∈ ξj serves as an indicator for identifying the external transition experienced by the clusters at clustering ξi. This value is compared with a minimum threshold, say τ ∈ [0.5, 1], to identify a match of X ∈ ξi in Y ∈ ξj. A cluster Xl ∈ ξi is said to survive in Ym ∈ ξj if Ym is the only cluster with an overlap greater than τsurvive. If at least two clusters from ξi (say X1 and X2) have an overlap greater than τsurvive with Ym ∈ ξj, then it is a case of merge, i.e. X1 and X2 merge to form Ym. Furthermore, a cluster Xl is said to split into daughter clusters Y_{m_1}, …, Y_{m_M} if its overlap with each daughter is greater than τsplit and their collective overlap is greater than τsurvive, i.e. for a split the following two conditions are required:

overlap(X_l, Y_{m_p}) \ge \tau_{split}, \quad p = 1, \ldots, M \qquad (7)

\sum_{p=1}^{M} overlap(X_l, Y_{m_p}) \ge \tau_{survive} \qquad (8)

where M is the number of daughter clusters from the second clustering.
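The external-transition rules above can be illustrated for a single row of the overlap matrix. This simplified sketch ignores the column-wise checks the full algorithm performs to separate survival from merge and to detect newly emerged clusters:

```r
# Classify the fate of one cluster from the first clustering, given its row
# of the overlap matrix (simplified sketch of the MONIC rules; tau values
# correspond to the user-supplied thresholds).
classify_row <- function(ov_row, tau_survive = 0.5, tau_split = 0.3) {
  if (max(ov_row) >= tau_survive) return("survived (or merged)")
  parts <- ov_row >= tau_split
  if (sum(parts) >= 2 && sum(ov_row[parts]) >= tau_survive) return("split")
  "disappeared"
}

classify_row(c(0.9, 0.05, 0.0))   # one strong match    -> survival candidate
classify_row(c(0.4, 0.45, 0.05))  # two partial matches -> split (Eqs 7 and 8)
classify_row(c(0.2, 0.1, 0.1))    # no match            -> disappeared
```

In the full algorithm, a survival candidate is demoted to a merge when another cluster from ξi also exceeds τsurvive for the same target cluster Ym.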
The overlap cannot, however, be used as an indicator for monitoring changes within the survived clusters themselves. The shift in location of a survived cluster (Xl → Ym) can be traced by calculating the Euclidean distance between their centroids, normalized by the minimum radius. This information can be summarized in the following formula:

location.difference = \frac{d(c_{X_l}, c_{Y_m})}{\min(r_{X_l}, r_{Y_m})} \qquad (9)

where c_{X_l} and c_{Y_m} are the centroids of clusters Xl and Ym respectively, and d(·, ·) is the Euclidean distance between them. The radius r of a cluster is computed as the maximum distance of an object from its cluster centroid. If the absolute value of location.difference is greater than τlocation, the algorithm detects a shift in location of the survived cluster.
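Eq 9 can be computed directly from the cluster members; a sketch assuming each cluster is a numeric matrix with one row per member:

```r
# Internal transition in location (Eq 9): Euclidean distance between the two
# centroids, normalised by the smaller of the two cluster radii.
location_difference <- function(X, Y) {
  cX <- colMeans(X); cY <- colMeans(Y)
  rX <- max(sqrt(rowSums(sweep(X, 2, cX)^2)))  # radius = farthest member
  rY <- max(sqrt(rowSums(sweep(Y, 2, cY)^2)))
  sqrt(sum((cX - cY)^2)) / min(rX, rY)
}

X <- rbind(c(0, 0), c(2, 0), c(0, 2), c(2, 2))  # centroid (1, 1), radius sqrt(2)
Y <- X + 1                                      # same shape, shifted to (2, 2)
location_difference(X, Y)                       # sqrt(2) / sqrt(2) = 1
```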
For a density transition, the average distance of objects from the cluster centroid is computed. The formula for the density of a cluster X with centroid c_X is given by:

density(X) = \frac{1}{|X|} \sum_{x \in X} d(x, c_X) \qquad (10)

The difference in density of a cluster Xl that survived into Ym is normalized by the minimum radius, i.e.

density.difference = \frac{density(X_l) - density(Y_m)}{\min(r_{X_l}, r_{Y_m})} \qquad (11)

If the absolute value of density.difference is less than τdensity, then there is no change in the density of the survived cluster. On the other hand, if the absolute value is greater than τdensity, a change in density is detected. If density.difference is positive, the cluster has become more compact than its ancestor; otherwise, it has become more diffuse.
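Eqs 10 and 11 can be sketched in the same way, using the sign convention stated above (a positive value means the survived cluster became more compact):

```r
# Density of a cluster (Eq 10): average member distance from the centroid.
cluster_density <- function(X) {
  cX <- colMeans(X)
  mean(sqrt(rowSums(sweep(X, 2, cX)^2)))
}

# Density transition (Eq 11): signed difference, normalised by the smaller
# radius; positive -> the survived cluster is more compact than its ancestor.
density_difference <- function(X, Y) {
  rX <- max(sqrt(rowSums(sweep(X, 2, colMeans(X))^2)))
  rY <- max(sqrt(rowSums(sweep(Y, 2, colMeans(Y))^2)))
  (cluster_density(X) - cluster_density(Y)) / min(rX, rY)
}

X <- rbind(c(-2, 0), c(2, 0), c(0, -2), c(0, 2))  # avg distance 2, radius 2
Y <- X / 2                                        # same shape, half the spread
density_difference(X, Y)                          # (2 - 1) / 1 = 1 -> more compact
```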
4 Package description
The state-of-the-art “MONIC” algorithm is implemented in R via the package clusTransition. The package can be used for tracing and monitoring the evolution of clustering solutions in cumulative datasets over time. In this section, we describe the functions, methods, and classes exported by the package. Fig 1 below demonstrates the workflow of the package.
The Transition function exported by the package offers three different options for importing datasets and then traces changes in the clustering solutions.
Table 1 below summarizes the functions, methods, and classes exported by the package along with its corresponding arguments and slots.
More details about these functions and classes are described below.
4.1 Function Transition()
The evolution of clusters can be traced using the primary function Transition(), which returns an object of class S4. A typical call to the Transition() function involves three essential pieces: the data input (listdata, listclus, or Overlap), the choice of window size swSize, and the threshold parameters. The swSize and k arguments need only be provided when datasets are imported via the listdata argument. The function has the following interface:
>Transition(listdata, listclus = NULL, Overlap = NULL, swSize = 1, typeind = 1,
+ survival_thresHold = 0.8, split_thresHold = 0.3, location_thresHold = 0.3,
+ density_thresHold = 0.3, k)
To keep the functions portable across many kinds of hard clustering algorithms, three different options, i.e. listdata, listclus, and Overlap, are provided for importing the data.
The listdata argument imports the raw data stream at discrete time points t1, t2, …, tn; each element of the list corresponds to the dataset at a single time point. A sequence of cluster solutions is generated from the stream using the k-means clustering algorithm. The number of clusters in each cumulative data matrix is specified by the argument k.
On the other hand, the listclus argument imports the clustering solutions at successive time-points to allow clustering algorithms other than k-means. Each element of listclus is a nested list that contains the clustering solution at the corresponding time point, i.e. ξ_i = {X_1, …, X_{k_i}}.
Overlap is a list of numeric matrices containing similarity measures between clusters extracted at consecutive time points. The similarity between clusters is computed using Eq 6. The Overlap method exported by the package can be used to compute these similarity matrices.
swSize indicates the size of the sliding window. The default value swSize = 1 implements the landmark window model and discretizes the stream according to Eq 1, whereas other numeric values discretize the stream in a sliding window scenario according to Eq 5. The sliding window size can only be provided if the listdata argument is chosen.
The survival_thresHold, split_thresHold, location_thresHold, and density_thresHold arguments are the minimum threshold values for, respectively, the survival of a cluster from X ∈ ξi to Y ∈ ξj, the split of a cluster X ∈ ξi into {Y_{m_1}, Y_{m_2}} ⊆ ξj, a shift in location, and a change in density of the survived clusters. These are user-defined parameters and belong to the interval (0, 1).
One of the most perplexing problems with most clustering algorithms is deciding the ideal number of partitions. This is a crucial parameter for partitioning, hierarchical, and model-based clustering algorithms: the number of clusters to be generated from a dataset has to be predefined. There are several ways of estimating the optimal number of clusters k, such as the silhouette, Gap, and elbow methods. The argument k is a numeric vector containing the relevant number of clusters at each time-point; its length is determined by swSize. This argument should only be provided if the listdata argument is chosen.
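For instance, the Gap statistic is available through clusGap() in the recommended cluster package. The call below is an illustrative way to pick k for a single window pane; the parameter values (K.max, B, nstart) are arbitrary choices for this sketch, not package defaults:

```r
# Estimate the number of clusters in one window pane via the Gap statistic.
library(cluster)

set.seed(100)
pane <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),   # cluster around (0, 0)
              matrix(rnorm(50, mean = 5), ncol = 2))   # cluster around (5, 5)

gap <- clusGap(pane, FUNcluster = kmeans, K.max = 6, B = 25, nstart = 10)
k_hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
k_hat  # estimated number of clusters for this pane
```

The resulting estimates, one per pane, form the vector passed to the k argument of Transition().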
Typing the name of the object holding the output of Transition() prints the external and internal transition results at each time point. External transitions include the number of clusters that survived, were absorbed by others, split into several, disappeared, or newly emerged at the second clustering. Internal transitions comprise changes in the location and density of the survived clusters.
Along with this information, the Monic object holds the cluster’s radius, membership, and distance between cluster centres.
4.2 OverLap class
An object of class OverLap contains summaries of the first and second clustering. It has eight slots that serve as input for tracking the evolution of clusters by the Transition() function. The slots include a numeric matrix containing the similarities between clusters of the first and second clustering (the overlap computed from Eq 6), the clusters’ membership vectors, radii, centres, and the average distance of items from the cluster centres (computed from Eq 10). An empty object can be created with the following interface:
>obj <- new("OverLap")
4.3 Overlap method
This method initializes the slots of an object of class OverLap by importing the clustering solutions ξ of the cumulative datasets D at two consecutive time points i and j. The clusters at each time point should be provided as a list of matrices, where each matrix contains the data belonging to one cluster. It has the following interface:
>Overlap <- Overlap(object, e1 = C1, e2 = C2)
where e1 is the set of clusters obtained at time point ti from cumulative dataset Di, e2 is the set of clusters obtained at time point tj from cumulative dataset Dj, and object is an object of class OverLap.
4.4 Function moplot()
This function plots three bar-plots and one line graph. The first, a stacked bar-plot, shows the SurvivalRatio and AbsorptionRatio; the second bar-plot shows the number of newly emerged clusters at each time stamp; and the third shows the number of disappearances at each time stamp. The line graph shows the passforward ratio and SurvivalRatio.
> plot(obj)
5 Simulation example
Let us assume that a data stream consists of datasets d1, d2, …, dn arriving at corresponding time-points t1, t2, …, tn respectively. To generate the initial dataset d1, we use a generator that takes into account the number of clusters (k), the size of each cluster, and the separation value between them [31]. The generator for the subsequent batches d2, d3, …, dn takes the centre of each cluster, the size of each cluster, and the covariance structure between them as input [32, 33].
As a working example, we generate a data stream sprouting at four consecutive time points. Fig 2 below demonstrates the scenario for generating datasets di, i = 1, 2, 3, 4 at the four time points. The new objects joining the underlying population are shown in red, whereas older records are displayed in black.
6 Pre-processing
Prior to implementing the change detection algorithm on cluster solutions over time, the user needs to pre-specify some relevant parameters. First, the user must choose a suitable windowing approach for the accumulation of the datasets evolving at successive time points. For this purpose, the package offers two windowing approaches, i.e. the landmark and sliding window models. Implementing the chosen windowing approach accumulates the datasets at the corresponding time points and generates window panes at successive time points. In the second phase, the optimal number of clusters in each window pane Di at the corresponding time point must be determined using an appropriate technique. For illustration purposes, we use worked examples based on the datasets simulated in section 5. The datasets are accumulated according to the landmark and sliding windowing approaches, and the optimal number of clusters is then estimated in each window pane Di.
The implementation of the landmark window model will produce four window panes. Each pane contains the datasets generated in the interval [t1, ti], where ti represents the current time point. Table 2 below shows the number of objects and the optimal number of clusters in each window pane Di, estimated from the Gap statistic at the corresponding time point ti.
Similarly, the implementation of a sliding window of size 3 will generate 3 window panes. Table 3 below demonstrates the number of objects and optimal number of clusters in each window pane Di.
7 Implementation of function Transition()
In this section, the implementation of the primary function Transition() is presented using working examples. The data stream simulated in section 5 is used for monitoring cluster evolution over time. The function provides three different options for importing the datasets, which are explained in the subsections below.
7.1 Looking at listdata argument
The argument listdata is a list of matrices or data frames containing the datasets d1, d2, …, dn evolving at corresponding time-points t1, t2, …, tn. The ith element of listdata comprises the set of data items di evolving at the corresponding time point ti. The Transition() function accumulates the datasets di according to the windowing approach specified via the swSize argument. The default value swSize = 1 implements the landmark window model, whereas other integer values implement the sliding window model. The accumulation of the datasets di generates window panes Di containing the cumulative datasets at successive time points. Each window pane Di is re-clustered using the cclust() function from the flexclust package [34]. The optimal number of clusters in each cumulative dataset Di should be decided by the user and imported via the argument k. Both the k and swSize arguments are used only if the listdata option is chosen for importing the datasets di. The argument typeind = 1 selects the listdata input. Monitoring and tracking the evolution of clusters using the landmark window model is shown in the example below.
7.1.1 Example (listdata argument with landmark window model).
The default value of swSize = 1 implements the landmark window model and generates n window panes of cumulative datasets Di according to Eq 1. In this working example, the datasets generated in section 5 are used. According to Table 2, the window panes D1, D2, D3, and D4 comprise 4, 4, 5, and 4 clusters respectively. Hence the Transition() function with arguments listdata = listdata, swSize = 1, typeind = 1, Survival_thrHold = 0.8, Split_thrHold = 0.3, and k = c(4,4,5,4) can be called as:
>library(clusTransition)
>listdata <- list(d1, d2, d3, d4)
>clusterTrace <- Transition(listdata = listdata, swSize = 1, typeind = 1,
+ Survival_thrHold = 0.8, Split_thrHold = 0.3, k = c(4,4,5,4))
This will generate two tables, displaying the number of clusters experiencing external and internal transitions at successive time points. The first table in the output comprises the number of clusters that experience external transitions at the corresponding time points tj. Similarly, the second table comprises the number of survived clusters that underwent internal transitions at the corresponding time points. The full summary of external and internal transitions is shown below.
The object clusterTrace returned by the Transition() function is an S4 object of class Monic. It contains the candidates that experience external and internal transitions at successive time points. The slots ending in x represent candidates that undergo external transitions from the first clustering ξi, whereas the slots ending in y represent the candidates that evolve as a result of the corresponding external transition at the second clustering ξj. For example, the candidates that experience external transitions at time point t3 can be retrieved as:
Let Cim ∈ ξi (first clustering) be a cluster that experiences some external transition and evolves as Cjn ∈ ξj (second clustering), where the first subscripts (i and j) represent the time point and the second subscripts (m and n) the cluster number. The Time Step [[3]] entry in the output represents the time point tj at the second clustering, and hence the time point ti (i = j − 1) at the first clustering ξi is one less. So in this particular example i = 2 and j = 3, and the above transition can be summarized as:
The algorithm detects that three clusters survive (C21→C31, C23→C34, and C24→C32) and one cluster splits (C22→{C33, C35}).
7.1.2 Example (listdata argument with sliding window model).
If one is interested in the sliding window model, where older records are discarded with the progression of time, this can be achieved through the swSize argument. In this synthetic example, swSize = 3 generates window panes containing the datasets that arrive in the interval [ti − 3 + 1, ti]. Table 3 shows that the number of clusters in window panes D1, D2, and D3 is 4, 5, and 6 respectively. Hence the Transition() function with arguments listdata = listdata, swSize = 3, typeind = 1, Survival_thrHold = 0.8, Split_thrHold = 0.3, and k = c(4,5,6) can be called as:
>clusterTrace <- Transition(listdata = listdata, swSize = 3, typeind = 1,
+ Survival_thrHold = 0.8, Split_thrHold = 0.3, k = c(4,5,6))
7.2 Looking at listclus argument
The listdata argument permits users to import the un-clustered datasets d1, d2, …, dn arriving at time-points t1, t2, …, tn. However, this restricts the package to only one clustering algorithm, i.e. k-means. To make the package more flexible for other types of hard clustering, the alternative argument listclus is provided. It imports the clustering solutions of each window pane as a list, i.e. listclus = {ξ1, ξ2, …, ξn}, and computes the similarity indices between them. The argument listclus is a list in which every element is itself a nested list of matrices or data frames; the ith element corresponds to the set of clusters extracted at time-point ti by applying an appropriate clustering algorithm to window pane Di. This is explained in the example below.
7.2.1 Example: Listclus argument.
Prior to applying the Transition() function, the user needs to extract clusters from each window pane Di. For this purpose, first accumulate the initially collected datasets d1, d2, …, dn according to a suitable window model, the landmark model in this example. This can be done by explicitly calling the merge() function from the base package. Running the R code given below generates four window panes.
>D1 <- d1
>D2 <- merge(d1, d2, all.x = TRUE, all.y = TRUE)
>D3 <- merge(D2, d3, all.x = TRUE, all.y = TRUE)
>D4 <- merge(D3, d4, all.x = TRUE, all.y = TRUE)
Fitting of clustering algorithm
Afterwards, choose the relevant number of clusters for each window pane Di and extract the clusters with an appropriate clustering algorithm. Save each clustering solution as a list of matrices or data frames. For illustration purposes, we obtain 4, 4, 5, and 4 clusters from the datasets D1, D2, D3, and D4 respectively.
>set.seed(100)
>fit1 <- kmeans(D1, 4)
>C1 <- list()
>for(i in 1:4)C1[[i]] <- D1[fit1$cluster == i,]
where C1 = {C11, C12, C13, C14} is a list of clusters extracted from D1 at time point t1. Similarly, extract clusters from all window panes at corresponding time point as:
>fit2 <- kmeans(D2, 4)
>C2 <- list()
>for(i in 1:4)C2[[i]] <- D2[fit2$cluster == i,]
>fit3 <- kmeans(D3, 5)
>C3 <- list()
>for(i in 1:5)C3[[i]] <- D3[fit3$cluster == i,]
>fit4 <- kmeans(D4, 4)
>C4 <- list()
>for(i in 1:4)C4[[i]] <- D4[fit4$cluster == i,]
Combine all these clustering solutions in a single list and apply the Transition() function with the arguments listclus = listclus, typeind = 3, Survival_thrHold = 0.8, and Split_thrHold = 0.3:
>listclus <- list(C1, C2, C3, C4)
>clusterTrace <- Transition(listclus = listclus, typeind = 3,
+ Survival_thrHold = 0.8, Split_thrHold = 0.3)
7.3 Looking at Overlap argument
The Overlap argument likewise permits the user to employ other clustering algorithms and trace the evolution of clusters over time. It imports a list of objects, as produced by the Overlap() method, containing the similarities between clusterings obtained at successive time points ti and tj (i < j) together with the summaries of these clusters. This option is selected by setting typeind = 2. The overlap matrices can be computed with the S4 method Overlap() exported by the clusTransition package. As with listclus, a clustering algorithm can be applied to the landmark- or sliding-window-modeled datasets to extract the cluster memberships at the corresponding time-points; the lists of clusters extracted from Di−1 and Di are then used to compute the overlap matrix between the clusterings. This is elaborated in the working example below.
7.3.1 Example: Overlap argument.
Let C1 = {C11, C12, C13, C14}, C2 = {C21, C22, C23, C24}, C3 = {C31, C32, C33, C34, C35}, and C4 = {C41, C42, C43, C44} be the set of clustering solutions obtained from corresponding datasets D1, D2, D3, and D4. These sets of clustering solutions are already obtained in the previous example. Then the objects of class OverLap can be created and initialized as:
>obj <- new("OverLap")
>Overlap1 <- Overlap(obj, e1 = C1, e2 = C2)
>Overlap2 <- Overlap(obj, e1 = C2, e2 = C3)
>Overlap3 <- Overlap(obj, e1 = C3, e2 = C4)
Combine all these objects in a list and apply Transition() function with arguments Overlap = Overlap, typeind = 2, Survival_thrHold = 0.8, Split_thrHold = 0.3 as:
>Overlap <- list(Overlap1, Overlap2, Overlap3)
>clusterTrace <- Transition(Overlap = Overlap, typeind = 2,
+ Survival_thrHold = 0.8, Split_thrHold = 0.3)
7.4 moplot() function
Fig 3 displays the graphical summary of the object of class Monic returned by the Transition() function. The stacked bar-plot in the top left corner displays the survival and absorption ratios at successive time points. The figure illustrates that all clusters survived at time point t1, hence the survival ratio is 1. At time point t2, 3 out of 4 clusters survived, giving a survival ratio of 0.75. Similarly, at time point t3, 3 out of 5 clusters survived while 2 merged, resulting in survival and absorption ratios of 0.60 and 0.40 respectively. Furthermore, no cluster disappeared and no newly emerged candidates were detected at any time point. This can also be seen from the pass-forward ratio, which is unity at all time points except t2, where one cluster splits into daughter candidates.
The new data items at each time stamp is shown by the red color, whereas the older data items are shown by black color.
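The summary plot described above is produced with a single call. A minimal sketch, assuming clusterTrace is the object of class Monic returned by Transition() in the earlier example:

```r
# Assumes the clusTransition package is installed and clusterTrace holds
# the Monic object returned by Transition() in the previous example.
library(clusTransition)

# Graphical summary: stacked bar-plots of the survival/absorption and
# pass-forward ratios, plus per-time-point scatter plots of the data
# (new items in red, older items in black).
moplot(clusterTrace)
```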
8 Real data example
To demonstrate the practicality of the package and to gain a deeper understanding of applications of cluster evolution, we investigate three real-life datasets. To capture the transformation in the social, political, and moral attitudes of European nations, the Human Values datasets were extracted from the European Social Survey [35]. The changes in the electricity consumption of inhabitants were traced using the Individual Household Electric Power Consumption dataset. Similarly, the Intel Lab sensor streaming dataset was used to illustrate further applications of the framework. The latter two data streams were obtained from the "UCI Machine Learning Repository".
8.1 Application to human values scale
As a case study, we extract eight datasets, each corresponding to a single round of the European Social Survey (ESS) conducted in the years 2002, 2004, 2006, 2008, 2010, 2012, 2014, and 2016, respectively. The dataset consists of 25024 individuals who responded to the Schwartz Value Survey (SVS) for computing basic human values and can be downloaded from the URL https://ess-search.nsd.no/CDW/ConceptVariables. The ten basic values are Benevolence, Universalism, Self-direction, Security, Conformity, Hedonism, Achievement, Tradition, Stimulation, and Power [35]. The k-means clustering algorithm was applied to the sliding-window-modeled datasets at each time point, and the number of clusters in the respective datasets was estimated with the well-known GAP statistic. Fig 4 describes the evolution of clusters at time points ti, i = 1, 2, 3, 4, 5, 6, 7, in the Human Value scale datasets and demonstrates that two clusters, C11 and C12, survived over time. The first imperative cluster was C11 (C11→C22→C32→C42), which emerged at t1 (2002) and survived until t4 (2006, 2010). Although the cluster survived until 2010, it experienced internal transitions, became more diffuse, and eventually disappeared at time point t5. The second vibrant cluster was C12 (C12→C24→C33→C41→C52→C63→C71), which survived through the entire time span. This was the most important cluster because it not only survived over time but also became denser; most of the new respondents of the SVS surveys over the years joined this cluster. A shift in location was observed for this cluster at time points t2 and t3, after which it remained stable. The first external transition was experienced by cluster C14, which split into two clusters and ultimately disappeared. The algorithm also detected a cluster C61 that emerged at t6 (2010, 2014) and passed forward while absorbing elements of cluster C62.
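The clustering step just described can be sketched as follows; a hypothetical data matrix D stands in for one window pane (names and sizes are illustrative, not taken from the ESS data), with clusGap() from the cluster package estimating k via the GAP statistic before k-means is run:

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(1)
D <- scale(matrix(rnorm(200 * 10), ncol = 10))  # stand-in for one window pane

# Estimate the number of clusters with the GAP statistic
gap <- clusGap(D, FUNcluster = kmeans, K.max = 8, B = 50, nstart = 10)
k   <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")

# k-means clustering with the estimated number of clusters
cl <- kmeans(D, centers = k, nstart = 10)
table(cl$cluster)
```

The resulting cluster memberships at each time point can then be collected, e.g. into the listclus or Overlap inputs of Transition().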
8.2 Application to Individual Household Electric Power Consumption
As a second example, the Individual Household Electric Power Consumption dataset for the years [2006, 2010] was used. This dataset comprises 2075259 records characterized by seven numerical attributes. The dataset is available at the UCI Machine Learning Repository [36] and can be downloaded from https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption. A sliding window model of size 2 was used to accumulate the stream at successive time points. In this section, we use the CLARA algorithm to extract clusters from the datasets at successive time points, while the average silhouette method was used to estimate the optimal value of k in each window pane. Fig 5 demonstrates the evolution of clusters at time points ti, i = 1, 2, 3, 4, 5, in the individual household electric power consumption datasets. The algorithm detects that all four clusters survived (C11→C21, C12→C22, C13→C23, and C14→C24), experiencing internal transitions and becoming diffuse during [2006, 2007]. A shift in location was detected for only one cluster, C13, whereas the other clusters were stable in location. Similarly, three clusters survived (C21→C31, C22→C33, and C24→C34), one cluster disappeared (C23→ ⊙), and one cluster emerged (⊙→C32) during [2007, 2008]. Two of the surviving clusters became diffuse, while one cluster became more compact than its predecessor. Likewise, one cluster survived (C33→C43), three disappeared (C31→ ⊙, C32→ ⊙, and C34→ ⊙), and three newly emerged clusters (⊙→C41, ⊙→C42, and ⊙→C44) were detected during [2008, 2009]. Afterwards, all four clusters disappeared (C41→ ⊙, C42→ ⊙, C43→ ⊙, and C44→ ⊙), and three new clusters emerged (⊙→C51, ⊙→C52, and ⊙→C53) during [2009, 2010].
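The clustering step for this example can be sketched analogously; a hypothetical window pane D is used for illustration, with the average silhouette width selecting k before CLARA extracts the clusters:

```r
library(cluster)  # provides clara()

set.seed(2)
D <- scale(matrix(rnorm(500 * 7), ncol = 7))  # stand-in for one window pane

# Average silhouette width for a range of candidate numbers of clusters
ks      <- 2:8
avg_sil <- sapply(ks, function(k) clara(D, k)$silinfo$avg.width)
k_opt   <- ks[which.max(avg_sil)]

# CLARA (Clustering LARge Applications) with the selected k
cl <- clara(D, k_opt)
table(cl$clustering)
```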
8.3 Intel Lab dataset
In this section, we use the publicly accessible dataset recorded from 54 sensors deployed at the Intel research laboratory between February 28th and April 5th, 2004. Each sensor records temperature, humidity, voltage, and light every thirty-one seconds. The dataset comprises 2.3 million readings collected from the 54 sensors. The sensors were designed to be energy-efficient and consume power only while sensing the environment and transmitting data. We select only a subset of measurements from this dataset, including readings from sensor 1 only. This subset consists of 43,047 readings and can be downloaded from the URL https://www.kaggle.com/datasets/divyansh22/intel-berkeley-research-lab-sensor-data.
We accumulate the dataset according to the landmark window model and, as the flow is uniform, consider 9000 records per time period. This implementation generates 5 window panes of cumulative datasets. The optimal number of clusters in the cumulative dataset at each time point was determined by the shadow statistic, and the Partitioning Around Medoids (PAM) algorithm was used to extract the clusters.
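The landmark accumulation and PAM step can be sketched as follows; a small simulated stream stands in for the sensor data so the example runs quickly, and k is fixed for illustration rather than estimated by the shadow statistic:

```r
library(cluster)  # provides pam()

set.seed(3)
stream    <- matrix(rnorm(1000 * 4), ncol = 4)  # small stand-in for the stream
pane_size <- 200                                # the paper uses 9000 per period

# Landmark window model: pane i holds all records from the start up to t_i
panes <- lapply(1:5, function(i) stream[1:(i * pane_size), , drop = FALSE])

# PAM clustering of each cumulative pane (k fixed at 3 for illustration)
clusterings <- lapply(panes, function(D) pam(D, k = 3))
sapply(clusterings, function(cl) nrow(cl$medoids))
```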
Fig 6 demonstrates the transitions of clusters at time points ti, i = 1, 2, 3, 4, 5, in the Intel Lab dataset. The algorithm detects that all six clusters survived (C11→C21, C12→C22, C13→C24, C14→C25, C15→C26, and C16→C23) while one new cluster emerged (⊙→C27) at time point t2. All surviving clusters experienced internal transitions and became more diffuse. Next, six clusters survived (C21→C31, C22→C32, C24→C33, C25→C34, C26→C35, and C27→C36) and one cluster disappeared (C23→ ⊙) at time point t3. Cluster C24 experienced a double internal transition, i.e. a shift in location and a change in density, while the other clusters only became diffuse. Likewise, five clusters survived (C31→C43, C32→C45, C34→C44, C35→C42, and C36→C47), one cluster disappeared (C33→ ⊙), and two clusters emerged (⊙→C41 and ⊙→C46) at time point t4. Similarly, five clusters survived (C42→C54, C43→C56, C44→C57, C45→C53, and C47→C55), two clusters merged ({C41, C46}→C51), and one cluster emerged (⊙→C52) at time point t5.
For further details on the significance and practical applications of monitoring changes in clustering solutions of streaming datasets, see Atif et al. [37].
9 Concluding remarks
In this paper, we introduced an R package, clusTransition, dedicated to tracing the evolution of cluster solutions in cumulative datasets. The package implements the state-of-the-art algorithm MONIC for modeling and tracing transitions of cluster solutions in dynamic datasets. The algorithm is based on re-clustering the cumulative datasets D1, D2, …, Dn arriving at the corresponding time points t1, t2, …, tn and monitoring the changes occurring in these cluster solutions. The changes comprise clusters that survive, split into several clusters, are absorbed by others, disappear, or newly emerge. Clusters that survive the external transitions may additionally experience a change in location or density, called an internal transition. We applied the clusTransition package to synthetic as well as real-life datasets to gain insight into the change detection framework.
10 Limitations of the package
The clusTransition package supports batch processing, where the stream is discretized and the gathered data are placed into a windowing model; the datasets are not clustered immediately upon arrival in real time. Moreover, the sliding and landmark window models either retain data items entirely or ignore them at subsequent time points. A damped window model, on the other hand, assigns each object an exponentially decreasing weight depending on its arrival time. Adding support for the damped window model to the package is planned for a future release.
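For illustration, a damped window model typically weights each object by an exponentially decaying function of its age, e.g. w(t) = 2^(-lambda (t - t0)), where t0 is the arrival time; a minimal sketch with hypothetical arrival times and decay rate:

```r
# Damped window model: exponentially decreasing weight by arrival time.
# lambda controls the decay rate; larger lambda forgets old items faster.
damped_weight <- function(t_now, t_arrival, lambda = 0.5) {
  2^(-lambda * (t_now - t_arrival))
}

arrival <- c(1, 2, 3, 4, 5)  # hypothetical arrival time points
round(damped_weight(t_now = 5, arrival), 3)
#> 0.250 0.354 0.500 0.707 1.000
```

The most recent item keeps full weight, while older items fade gradually instead of being dropped outright as in the sliding window model.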
The framework for monitoring cluster transitions presupposes hard clustering, which requires that each item be assigned to one and only one cluster. This assumption implies that the strategy cannot be applied to density-based or model-based clustering approaches, leaving the problem open for further investigation.
References
- 1. Wierzchoń S.T. and Kłopotek M. Modern Algorithms of Cluster Analysis. Studies in Big Data. Springer International Publishing; 2017. URL: https://books.google.com.pk/books?id=LeJEDwAAQBAJ
- 2. Rapkin B.D. and Luke D.A. Cluster analysis in community research: Epistemology and practice. Am J Commun Psychol. 1993; 21:247–277. https://doi.org/10.1007/BF00941623
- 3. Romesburg H.C. Cluster Analysis for Researchers. Morrisville, NC: Lulu.com (reprint of the 1984 edition, with minor revisions); 2004.
- 4. Fahad A., Alshatri N., Tari Z., Alamri A., Khalil I., Zomaya A. Y., et al. A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE Transactions on Emerging Topics in Computing. 2014; 2(3):267–279.
- 5. Montinaro M. and Sciascia I. Market segmentation models to obtain different kinds of customer loyalty. Journal of Applied Sciences. 2011; 11(4):655–662.
- 6. Borgen F. H. and Barnett D. C. Applying cluster analysis in counseling psychology research. Journal of Counseling Psychology. 1987; 34(4):456–468.
- 7. Punj G. and Stewart D. W. Cluster analysis in marketing research: Review and suggestions for application. Journal of Marketing Research. 1983; 20(2):134.
- 8. Zakharov K. Application of k-means clustering in psychological studies. The Quantitative Methods for Psychology. 2016; 12(2):87–100.
- 9. Landauer M., Wurzenberger M., Skopik F., Settanni G., and Filzmoser P. Dynamic log file analysis: An unsupervised cluster evolution approach for anomaly detection. Computers & Security. 2018; 79:94–116.
- 10. Oliveira M. and Gama J. MEC: Monitoring clusters' transitions. In Proceedings of STAIRS 2010: The Fifth Starting AI Researchers' Symposium. IOS Press; 2010. p. 212–224. ISBN 9781607506751.
- 11. Spiliopoulou M., Ntoutsi E., Theodoridis Y., and Schult R. MONIC and follow-ups on modeling and monitoring cluster transitions. Advanced Information Systems Engineering, Lecture Notes in Computer Science. 2013; 622–626.
- 12. Silva J. D. A., Hruschka E. R., and Gama J. An evolutionary algorithm for clustering data streams with a variable number of clusters. Expert Systems with Applications. 2017; 67:228–238.
- 13. Badiozamany S., Orsborn K., and Risch T. Framework for real-time clustering over sliding windows. In Proceedings of the 28th International Conference on Scientific and Statistical Database Management (SSDBM '16). 2016. https://doi.org/10.1145/2949689.2949696
- 14. Patroumpas K. and Sellis T. Window specification over data streams. Current Trends in Database Technology (EDBT 2006), Lecture Notes in Computer Science. 2006; 445–464.
- 15. Chakrabarti D., Kumar R., and Tomkins A. Evolutionary clustering. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06). New York, NY: Association for Computing Machinery; 2006. p. 554–560. https://doi.org/10.1145/1150402.1150467
- 16. Chi Y., Song X., Zhou D., Hino K., and Tseng B. L. On evolutionary spectral clustering. ACM Transactions on Knowledge Discovery from Data. 2009. https://doi.org/10.1145/1631162.1631165
- 17. Zhang Y., Liu H., and Deng B. Evolutionary clustering with DBSCAN. In Ninth International Conference on Natural Computation (ICNC). 2013; 923–928. https://doi.org/10.1109/ICNC.2013.6818108
- 18. Xu T., Zhang Z., Yu P. S., and Long B. Evolutionary clustering by hierarchical Dirichlet process with hidden Markov state. In Eighth IEEE International Conference on Data Mining. 2008; 658–667. https://doi.org/10.1109/ICDM.2008.24
- 19. Hyde R., Angelov P., and MacKenzie A.R. Fully online clustering of evolving data streams into arbitrarily shaped clusters. Information Sciences. 2017; 382:96–114.
- 20. Fahy C., Yang S., and Gongora M. Ant colony stream clustering: A fast density clustering algorithm for dynamic data streams. IEEE Transactions on Cybernetics. 2019; 49:2215–2228.
- 21. Fahy C. and Yang S. Finding and tracking multi-density clusters in online dynamic data streams. IEEE Transactions on Big Data. 2019; 1–15. https://doi.org/10.1109/TBDATA.2019.2922969
- 22. Huang L., Wang C.-D., Chao H.-Y., and Yu P.S. MVStream: Multiview data stream clustering. IEEE Transactions on Neural Networks and Learning Systems. 2020; 31:3482–3496.
- 23. Li H., Liu J., Yang Z., Liu R. W., Wu K., and Wan Y. Adaptively constrained dynamic time warping for time series classification and clustering. Information Sciences. 2020; 534:97–116.
- 24. Liang M., Liu R. W., Li S., Xiao Z., Liu X., and Lu F. An unsupervised learning method with convolutional auto-encoder for vessel trajectory similarity computation. Ocean Engineering. 2021; 225:108803.
- 25. Zhang Z., Huang K., and Tan T. Comparison of similarity measures for trajectory clustering in outdoor surveillance scenes. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR '06). Hong Kong, China: IEEE; 2006. p. 1135–1138.
- 26. Liu X., Guan J., and Hu P. Mining frequent closed itemsets from a landmark window over online data streams. Computers & Mathematics with Applications. 2009; 57(6):927–936.
- 27. Mansalis S., Ntoutsi E., Pelekis N., and Theodoridis Y. An evaluation of data stream clustering algorithms. Statistical Analysis and Data Mining: The ASA Data Science Journal. 2018; 11(4):167–187.
- 28. Hu Y. Optimal algorithm of data streams clustering on sliding window model. Journal of Computer Applications. 2008; 28(6):1414–1416.
- 29. Spiliopoulou M., Ntoutsi I., Theodoridis Y., and Schult R. MONIC: Modeling and monitoring cluster transitions. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06). 2006. https://doi.org/10.1145/1150402.1150491
- 30. Ntoutsi E., Spiliopoulou M., and Theodoridis Y. FINGERPRINT: Summarizing cluster evolution in dynamic environments. International Journal of Data Warehousing and Mining. 2012; 8(3):27–44.
- 31. Qiu W. and Joe H. Generation of random clusters with specified degree of separation. Journal of Classification. 2006; 23(2):315–334.
- 32. Qiu W. and Joe H. clusterGeneration: Random Cluster Generation (with Specified Degree of Separation). R package version 1.3.7. https://CRAN.R-project.org/package=clusterGeneration
- 33. Melnykov V., Chen W.-C., and Maitra R. MixSim: An R package for simulating data to study performance of clustering algorithms. Journal of Statistical Software. 2012; 51(12):1–25. https://doi.org/10.18637/jss.v051.i12
- 34. Leisch F. A toolbox for k-centroids cluster analysis. Computational Statistics and Data Analysis. 2006; 51(2):526–544.
- 35. European Social Survey Cumulative File, ESS 1-9 (2020). Data file edition 1.0. Sikt, Norwegian Agency for Shared Services in Education and Research, Norway. Data archive and distributor of ESS data for ESS ERIC. https://doi.org/10.21338/NSD-ESS-CUMULATIVE
- 36. Dua D. and Graff C. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
- 37. Atif M., Shafiq M., and Leisch F. Applications of monitoring and tracing the evolution of clustering solutions in dynamic datasets. Journal of Applied Statistics. 2021.