Abstract
We present H-NGPCA, a hierarchical clustering algorithm for data streams that integrates adaptive growth of the number of units with local dimensionality control. Unlike existing algorithms, H-NGPCA combines the characteristics of centroid-based, model-based and hierarchical clustering. H-NGPCA builds a hierarchical structure of local Principal Component Analysis (PCA) units, where each unit is a hyper-ellipsoid whose shape is updated by a neural network-based online PCA method. The re-positioning of each unit is handled by Neural Gas, a centroid-based clustering algorithm. In the hierarchical tree structure, new units are created in a branch if suggested by a splitting criterion. In addition, each unit determines its own dimensionality based on the data it represents. In extensive benchmarks, H-NGPCA not only surpasses all competing online algorithms with adaptive unit numbers but also achieves competitive performance with state-of-the-art offline methods, reaching an average NMI = 0.87 and CI = 0.26. This demonstrates that H-NGPCA achieves both online adaptability and offline-level accuracy.
Citation: Migenda N, Möller R, Schenck W (2026) H-NGPCA: Hierarchical clustering of data streams with adaptive number of clusters and adaptive dimensionality. PLoS One 21(1): e0339171. https://doi.org/10.1371/journal.pone.0339171
Editor: Muhammad Ahsan, Sepuluh Nopember Institute of Technology (Institut Teknologi Sepuluh Nopember), Indonesia
Received: August 12, 2025; Accepted: December 2, 2025; Published: January 5, 2026
Copyright: © 2026 Migenda et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data to reproduce the results (mean, std, graphics) are uploaded with the revision as supporting information files (variety of .mat files). They are all further available from the clustering benchmark database (https://doi.org/10.1007/s10489-018-1238-7).
Funding: This research was funded by the German Federal Ministry of Education and Research (BMBF) in the project VIP4PAPS (to W.S.).
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Cluster analysis is a family of methods from the field of exploratory data analysis. The goal is to sort different objects into groups in such a way that the similarity between two objects, as determined by a proximity measure, is high if they belong to the same group and low if they belong to different groups. Clustering algorithms have been developed since the 1970s [1]. During the early development phase it was assumed that the data set is static and permanently available for processing. However, in many current applications, data streams are received and processed continuously [2].
A data stream is defined as an infinitely long, chronologically sorted sequence that must be analyzed directly as it is received, under the constraints of limited memory and computational resources. Formally, a data stream is a sequence of N data points x(1), x(2), …, x(N), which is possibly unbounded (N → ∞). Each data point x(i) is described by an n-dimensional attribute vector.
Data streams unfold sequentially over time, with samples arriving in an unpredictable manner. The unknown size of the data stream requires data reduction techniques, as it is impractical to store the entire data set. The rapid arrival of samples requires real-time processing, emphasizing the need for immediate responses. The evolving nature of the content of data streams requires adaptability to account for changing characteristics [3]. The results derived from data stream processing are often approximations, as only one data point is considered at a time. Efficient memory utilization is critical given the limitations of stream processing. Concept drift, where the underlying patterns can change, requires adaptive algorithmic components [4]. Data streams often have unexpected characteristics, such as uncertainty, which highlights the need for robust and flexible processing methods. In summary, data streams present a complex and dynamic challenge for clustering algorithms, which can only be approached by developing algorithms with real-time adaptability [5].
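The constant-memory constraint above can be illustrated with a minimal sketch: a stream summary that sees each data point exactly once and forgets old samples exponentially, so that memory stays O(n) no matter how long the stream runs. The class name `StreamSummary` and the decay rate `alpha` are illustrative, not part of H-NGPCA.

```python
# Sketch: constant-memory summary of an unbounded stream via recursive updates.
# A low-pass (exponentially forgetting) estimate adapts to concept drift
# because old samples decay; `alpha` is a hypothetical hyperparameter.

class StreamSummary:
    def __init__(self, alpha=0.01):
        self.alpha = alpha      # forgetting factor: larger -> faster adaptation
        self.mean = None        # running estimate, O(n) memory regardless of N

    def update(self, x):
        # each data point is seen exactly once and then discarded
        if self.mean is None:
            self.mean = list(x)
        else:
            self.mean = [(1 - self.alpha) * m + self.alpha * xi
                         for m, xi in zip(self.mean, x)]
        return self.mean
```

The same low-pass pattern reappears throughout H-NGPCA (learning rates, assignment values, intra/inter-cluster distances), which is why a single recursive update per data point suffices.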
1.1 Problem statement
Clustering algorithms for data streams require the assignment of hyperparameters such as the number of clusters, thresholds for density and spacing, decay rates, window lengths, and many more. Such hyperparameters must be set according to the input data at hand and directly affect the quality of clustering. While setting such parameters is also difficult in traditional clustering, data streams undergo changes that may cause clusters to emerge, disappear, merge, or split. As such, setting a fixed value without prior knowledge would bias the clustering model. Specifically, determining the optimal number of clusters is critical. When cluster characteristics, such as data distribution, are known a priori, techniques such as cross-validation can determine the cluster count; however, in most cases, knowledge of the input data is not available prior to execution. Whenever two or more local regions (each represented by a cluster model) start drifting towards or apart from each other, the number of cluster models has to be adjusted accordingly.
In brief, clustering streams cannot be done by using traditional algorithms for batch processing of data sets. Intuitively, it is not possible to stop the stream to perform analysis, or to indefinitely postpone results in order to have the time to perform offline clustering. So, the naive solutions would be (i) to try and buffer the stream for later processing, which is not possible for endless streams of data, or (ii) to take random samples, which reduces the data size while attempting to retain representativeness but may fail to capture the correct data distribution. These naive solutions clearly do not make best use of the time available and of the information that the stream contains.
1.2 Related work
We provide a compact overview of state-of-the-art online clustering algorithms and highlight their properties with respect to data stream clustering; an in-depth review of the data stream clustering algorithms relevant to this work is given in Appendix A. Based on the conventional taxonomy for clustering algorithms [6], clustering algorithms can be classified into the following categories: (i) Hierarchical; (ii) Grid-based; (iii) Density-based; (iv) Partitioning; and (v) Distribution-based.
(i) Hierarchical clustering organizes objects into a hierarchical tree structure, which makes it possible to create different clusterings by exploring the tree at different levels. The advantage of this category is the tree structure, as it is possible to expand or prune each branch independently. However, the hierarchical structure is susceptible to noise and outliers [7,8]. (ii) In grid-based algorithms, the input data is divided into grid cells that group the objects of the data stream into bins. Regardless of how much data is presented, it is represented with a constant number of grid cells. The resolution of the grid is a trade-off between accuracy and computational cost; it is generally defined by user-supplied hyperparameters. The handling of the grid structure limits many algorithms of this type to low-dimensional data sets [9]. (iii) In density-based algorithms, clusters are generated as dense regions separated by less dense regions, so that they can represent arbitrarily shaped clusters. In density-based algorithms, the definition of a density region is crucial; it is determined by hyperparameters [10]. (iv) Partitioning algorithms create partitions so that similar objects are in the same partition and dissimilar objects are in different partitions. Partitions can be defined by centroids, representative points, or nodes. The online variant of the well-known K-means method from this class was compared to hierarchical methods and achieved better results [11]. However, partitioning algorithms are mostly limited to hyperspherical clusters, and the number of clusters is given as a fixed hyperparameter [12]. (v) Distribution-based clustering is based on the approach of learning a model of the data distribution that best fits the input data. One of the most important advantages of distribution-based algorithms is their noise robustness.
The quality of such methods strongly depends on the prior knowledge of the underlying data distribution, which is often unknown.
Some algorithms combine approaches from different groups, with the goal of minimizing the weaknesses of each group. For example, D-Stream [13] and MR-Stream [14] are classified in the literature as both grid [15] and density methods [16]. A two-step hierarchical K-means that first partitions the data into spherical groups and then merges these groups using hierarchical methods is proposed in [17]. Another hierarchical extension of K-means [18] uses the pairwise overlap between components to determine the optimal number of clusters, while preserving the fast computation property of K-means. Also, the NGPCA algorithm [19], which is extended in this work, naturally combines the strengths of Neural Gas (partitioning) and PCA (model-based) with the goal of achieving better clustering.
More recently, hybrid approaches were proposed that attempt to enhance classical clustering algorithms with deep learning [20]. Before these developments, several non-linear dimensionality reduction methods had already been introduced to overcome the limitations of linear techniques such as PCA. Examples include Kernel PCA, Autoencoders, and manifold learning techniques such as t-SNE and UMAP, which are able to capture complex nonlinear structures in high-dimensional data. Building on this, so-called deep clustering approaches [21] combine feature learning and clustering within a joint deep-learning framework. However, this comes with the typical deep learning drawbacks of significantly higher computational costs, lower interpretability, and sensitivity to hyperparameters.
1.3 Contributions
An example of the learning process of our hierarchical neural gas principal component analysis (H-NGPCA) algorithm is shown in Fig 1 with the corresponding unit tree in Fig 2 to highlight the adaptive splitting process. Our contributions compared with existing work are:
- Our algorithm combines the characteristics of three clustering categories: Centroid-based clustering (Neural Gas), model-based clustering (Principal Component Analysis) and a hierarchical structure.
- We provide an algorithm that adaptively decides when a unit needs to be split without the need for a termination criterion.
- We incorporate a local adaptive dimensionality control, so that each region is approximated with the correct dimensionality.
- All components are completely online without the need of historical data, pre-training, and batch or offline components.
- We provide a fully reproducible benchmark and online repository, which can be used as a basis for future benchmarks in the field of data stream clustering with an adaptive number of units.
The pictures show the learning progress of H-NGPCA and the continuous splitting of the units. Each unit is indexed, and the value within the brackets is the corresponding unit dimensionality. The corresponding unit tree is shown in Fig 2. After training step 50000, the number of units stays constant (tested up to step 75000). We recommend viewing this picture in color, as each color (evenly selected from the color spectrum) shows the assignment to the units.
The colors correspond to the units that are added at each of the six training snapshots (Fig 1(a)-(f)). After 2000 training steps the model consists of units 3, 4 and 5 (red color). At training step 10000 (green color) the units 8, 14 and 15 emerged and replace unit 4. At training step 20000 (blue color) the units 18 and 19 emerged and replace unit 15. At training step 30000 (orange color) the units 25, 26, 27, 32 and 33 emerged from unit 3. At training step 40000 (purple color) the units 10 and 11 emerge from unit 5, the units 28 and 29 from unit 6, and the units 34 and 35 from unit 25. In the final update after 50000 training steps (gray color), the units 12 and 13 emerge from unit 8 and the units 40 and 41 from unit 10. White units were created between two snapshots and split up again.
2 Local online PCA clustering
Classical offline Principal Component Analysis (PCA) preserves the maximal variance of a data set with a given set of linear descriptors. PCA performs an orthonormal transformation on an n-dimensional data set to obtain a smaller set of m linearly independent variables. As the focus of this work is on online PCA algorithms, we refer to [22] for a detailed description of offline PCA.
A streaming or online setting for PCA is characterized by data points arriving sequentially over a period of time [23]. In each iteration i, one data point is presented and the model parameters are updated. To maintain a good approximation, the set of parameters describing the subspace has to be updated continuously without access to the history of data. Different learning rules for online PCA were proposed [24], of which incremental and neural-network algorithms are the most popular. In the following we focus on neural-network-based PCA and refer to [25] for a review of incremental PCA.
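As a minimal illustration of neural-network-based online PCA, Oja's classic single-component rule updates an eigenvector estimate from one data point at a time. This is a generic sketch only; the coupled neural PCA rule of [26] used later estimates several eigenpairs plus the residual variance, which this single-component update does not cover.

```python
import random

def oja_update(w, x, eta):
    """One online PCA step (Oja's rule) toward the first principal component.

    w   : current eigenvector estimate (list of floats)
    x   : one data point, seen only once
    eta : learning rate (illustrative constant; the paper adapts it)
    """
    y = sum(wi * xi for wi, xi in zip(w, x))                 # projection y = w.x
    # Hebbian term eta*y*x grows w along the data; -eta*y*y*w keeps ||w|| ~ 1
    return [wi + eta * y * (xi - y * wi) for wi, xi in zip(w, x)]
```

Run on a stream whose variance is concentrated along one axis, the estimate converges to that axis with unit norm, without ever storing past samples.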
In a mixture of local PCA, dimensionality reduction is combined with vector quantization (VQ). This is achieved by extending the simple codebook vectors to local PCA units, where each unit competes for the presented data points. Neural Gas Principal Component Analysis (NGPCA) is an online local PCA algorithm [19,22]. An NGPCA network consists of M local units, where the parameter set of the jth unit comprises its center cj, its eigenvectors and eigenvalues, and its mean residual variance. The center points cj are updated according to the Neural Gas (NG) scheme

cj ← cj + εj exp(−(rj − 1)/ρ) (x − cj),    (1)

where the exponentially declining neighborhood term exp(−(rj − 1)/ρ) ensures that not only the winner is updated (soft-clustering). The rank rj of each unit is determined based on the distances ||x − cj||, with rank 1 indicating the closest and rank M the furthest Euclidean distance from a codebook vector to the presented data point x.
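The rank-based soft-clustering update above can be sketched as follows; `eps` and `rho` are illustrative stand-ins for the (per-unit adaptive) learning rate and the neighborhood range.

```python
import math

def neural_gas_step(centers, x, eps, rho):
    """One Neural Gas update: every center moves toward x, weighted by an
    exponentially declining function of its distance rank (soft-clustering).
    `eps` (learning rate) and `rho` (neighborhood range) are illustrative
    constants; NGPCA adapts them per unit / globally."""
    d = [sum((ci - xi) ** 2 for ci, xi in zip(c, x)) for c in centers]
    order = sorted(range(len(centers)), key=lambda j: d[j])
    new = [list(c) for c in centers]
    for rank, j in enumerate(order):          # rank 0 = winner (closest)
        h = math.exp(-rank / rho)             # neighborhood weight
        new[j] = [ci + eps * h * (xi - ci) for ci, xi in zip(new[j], x)]
    return new
```

Because every unit receives a (rank-discounted) share of each update, far-away units are slowly pulled into populated regions instead of dying.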
Typically, the learning rate decreases over time. To stabilize online learning on dynamic distributions, the learning rate must be able to increase again later in the training process, so that the units can adapt to new data. In the adaptive version [22], the adaptive learning rate εj of each unit is updated by rule (2), which contains an adaptive term that depends on how well the unit fits its represented data points and is approximated by a low-pass filter [22]. This adaptive term is calculated for each of the m dimensions and then averaged. The neighborhood range ρ, as typically used in Neural Gas, is updated globally for all units; it is determined from the average learning rate of all M units (3).
The eigenvalues, eigenvectors and the mean residual variance in the n−m minor eigendirections are recursively obtained by a coupled neural PCA learning rule [26].
From a geometric perspective, each local PCA unit is a hyper-ellipsoid located in the input data space (Fig 3). For each presented data point, it has to be determined to which local model(s) the data point belongs. A ranking based on a distance measure is performed, where the Mahalanobis distance is one possible choice [19]. The Mahalanobis distance is defined as (unit index j omitted)

d(x) = Σ_{i=1..m} yi²/λi + r²/σ̄²,    (4)

with yi = wiᵀ(x − c) the projection onto the ith eigenvector, λi the ith eigenvalue, r² = ||x − c||² − Σ_{i=1..m} yi² the residual reconstruction error, and σ̄² the mean residual variance; the second additive term only occurs for m < n, as r² = 0 for m = n. As only units with a good ranking are updated, it may happen that some units do not win any data points and are then considered “dead”.
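The unit distance can be sketched as below, under our reading of the two additive terms: a Mahalanobis part over the m stored eigendirections plus the residual reconstruction error normalized by the mean residual variance. The function name and argument layout are illustrative.

```python
def unit_distance(x, c, W, lam, sigma2):
    """Mahalanobis-type distance of x to one local PCA unit (sketch of Eq (4)).

    x      : data point
    c      : unit center
    W      : list of m orthonormal eigenvectors
    lam    : the m corresponding eigenvalues
    sigma2 : mean residual variance in the n-m minor directions
    The residual term is our reading of the 'second additive term' that
    vanishes for m = n."""
    dx = [xi - ci for xi, ci in zip(x, c)]
    y = [sum(wi * di for wi, di in zip(w, dx)) for w in W]   # projections yi
    within = sum(yi * yi / li for yi, li in zip(y, lam))     # major directions
    resid = sum(di * di for di in dx) - sum(yi * yi for yi in y)
    if sigma2 > 0:
        return within + resid / sigma2
    return within
```

Points on the ellipsoid surface (one standard deviation along an eigendirection) score the same distance regardless of how elongated the unit is, which is what makes the measure shape-aware.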
Each cluster is represented by a PCA unit whose axis lengths are determined by its eigenvalues, with a specified center (red dot). Black dots represent data points.
In [27], an alternative distance measure dH (the index H identifies the author’s name) is proposed to prevent dead units. The eigenvalue and eigenvector estimates are updated by the same PCA procedure, but the distance measure used in the ranking is made independent of the volume V of the hyper-ellipsoid corresponding to a unit by normalizing with V (5). This potential function essentially treats each unit as a hyper-ellipsoid of the same volume in the competition, regardless of its actual volume.
3 Hierarchical online local PCA clustering
From the viewpoint of non-hierarchical local PCA clustering, such as NGPCA (Sec 2), all units occupy the same horizontal position in a single layer (heterarchy) and have the same properties, such that each unit theoretically plays an equal role during the ranking process. In hierarchical NGPCA clustering (H-NGPCA) [28], units are placed on multiple horizontal layers, with units closer to the root having a larger share of data points. A hierarchical structure and the data flow of H-NGPCA are shown in Fig 4(a). The model consists of a binary tree of PCA units. Within the tree, we distinguish between two different kinds of units. Firstly, there are the “developed” units, which have two so-called child units directly connected to them in the binary tree structure. The child units are one hierarchical level lower in the branch and can in turn also have children, which gradually builds up the tree. This structure continues to the last hierarchical level of each branch until it reaches the units that no longer have children of their own. These units are referred to as unborn children or unborn units. Unborn children have the same properties as developed units and are fully involved in the learning process; the only difference is that they have no children themselves. This is shown in Fig 4(b), with the solid-drawn units representing the developed units and the dashed-drawn units the unborn units. The unborn units are trained normally, and their goal is to outperform their parent unit (from which they were generated), so that they themselves become developed units, which in turn each receive two unborn children of their own. For the later derivation of this competition of the unborn children against their parents, we also consider the set of outermost developed units, i.e. the lowest developed unit of each branch.
(a) On each data point presentation, the presented data point x is passed through the winning branch until an unborn child is reached. The flow of data point x is indicated by the solid line path. Respectively, the winner units have a solid shape and all loser units a dashed shape. (b) Units with a solid shape are developed units, which have children. The dashed units on the lowest level are unborn children. They are trained normally and compete against their respective parent unit to become a developed unit with its own unborn children.
Two example cases are visualized in Fig 5. In Fig 5(a), a root unit (solid line) is shown that lies between two clusters. The two unborn child units (dashed line), on the other hand, represent the two clusters well, which indicates a split. Another scenario is shown in Fig 5(b), where the model consists of the three clusters represented by the three outermost developed units. Then, the data points of the bottom left cluster (orange squares) start to drift apart, and the two children (dashed line) follow. If the newly created clusters drift apart a little further, a split is suggested. The two other higher-level units (solid line) also have two unborn child units each, which are not drawn for reasons of clarity.
(a): The two unborn units (dashed lines) are much better approximations of the distributions, suggesting a split operation. (b): The original data points (dots) form three clusters. With the presentation of new data points (rectangles) over time, one cluster splits into two. The unborn units (dashed lines) yield a better approximation of the new distribution.
3.1 Algorithm overview
To provide an intuitive description of H-NGPCA, we give a step-by-step explanation of the training procedure on each data point presentation. A full visual summary of all steps is shown in Fig 6, and the pseudocode is included in Appendix B.
The method is divided into two phases; this figure covers the update procedure of the model parameters, while Fig 7 covers the adjustment of the model size. Starting from the root unit, the current data point is used to calculate the potential function of both child units. This competition leads to a winner unit. While both units get their activities updated, only the winner unit has its model parameters updated. This includes the learning rate, centroid, shape and intra-cluster distance. If children exist, the process is repeated. Then, the inter-cluster distance of all loser units is updated for the given data point; the winner units already got their intra-cluster distance updated.
The H-NGPCA model is initialized with one root unit, which can be regarded as a global PCA, and two corresponding unborn children. On each data point presentation, the sets of winner and loser units are reset. In a top-down approach (lines 7-17 in algorithm 2), the data point is passed through the tree, starting with the two children of the root unit. A pairwise ranking is performed between the two child units: first, the potentials of both units with respect to the presented data point are calculated based on (5). The winner unit is then determined as the one with the lower potential, and it is added to the set of units that have their parameters updated. This procedure is recursively repeated as long as the winner unit cw has its own children. In this way, the data point is passed through one branch of the tree until it reaches an unborn unit. This winner path is indicated with the solid line in Fig 4(a).
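The top-down pass can be sketched as a simple tree traversal. Here `potential` stands in for the ranking measure (5), and the `Unit` class layout is illustrative, not the paper's data structure.

```python
class Unit:
    """Illustrative tree node: `children` is None for unborn units,
    otherwise a (left, right) pair of Units."""
    def __init__(self, name):
        self.name = name
        self.children = None

def winner_path(root, potential):
    """Pass one data point down the tree: at every developed unit the child
    with the lower potential wins, until an unborn unit is reached.
    Returns (winners, losers) along the path."""
    winners, losers = [], []
    node = root
    while node.children is not None:
        a, b = node.children
        w, l = (a, b) if potential(a) <= potential(b) else (b, a)
        winners.append(w)
        losers.append(l)
        node = w                  # descend into the winning branch only
    return winners, losers
```

Only one branch is visited per data point, so the per-sample cost grows with the tree depth, not with the total number of units.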
Once the winner branch is identified, all winner units are updated in a winner-takes-all setting (hard-clustering), which means that the loser units remain unchanged. This second part of the algorithm corresponds to lines 18-25 in algorithm 2. All winner units have their learning rate (2), centers, eigenvalues and eigenvectors [26], assignment value, as well as their activity and intra-cluster distance updated. The intra-cluster distance is a key component of the unit splitting algorithm, which is discussed in detail in Sec 3.2. The assignment value is always determined between two siblings. It describes the proportion in which the data points are divided between the two units. This is necessary in order to correctly incorporate the weighting when comparing two child units with their parent unit. The activity describes the weighting of the outermost developed units, which is necessary to weight the quality of the model. The update steps for the above parameters are described below.
The unit centers are updated according to cw ← cw + εw (x − cw), which is similar to the update in NGPCA (1), except that there is no ranking, as the winner takes all. The assignment value is a weighting factor between two sibling units that is updated for all units along the winner branch: the winner of the pairwise ranking receives an instantaneous assignment of 1 and its sibling an assignment of 0 (8). To approximate the overall assignment value over all data points, a low-pass filter is used (9), with a corresponding low-pass parameter. The assignment values of both siblings are then normalized (10) so that they sum to 1. The assignment value is important for the splitting procedure, as it defines in what proportion the two units replace their respective parent unit (next higher hierarchical level and directly connected). Once the winner units are updated, the inter-cluster distance is updated in all other units (Fig 4: dashed units). For high-dimensional data an additional component exists for the winner units, namely an update of the unit-specific dimensionality mj. Once the model parameters are updated, it is checked whether the tree should be extended or not (Sec 3.3). The detailed algorithms for both the dimensionality and unit-number adjustment are discussed separately below. This procedure is repeated on each data point presentation to continuously update the model parameters.
3.2 Quality measure for model selection
Determining the optimal number of clusters in a data set is a fundamental issue in partitioning clustering. Some algorithms, such as K-means clustering, require the user to specify the number of clusters k to be generated. Unfortunately, the optimal number of clusters is subjective and depends on the method used for measuring similarities and the parameters used for partitioning. So this task should not be left to the user.
In hierarchical clustering algorithms, a quality measure is derived based on a suitable distance and a linkage criterion that specifies the dissimilarity of the data points belonging to a unit. H-NGPCA is a deterministic and geometrical clustering approach that is based on a version of the Mahalanobis distance metric such as (4) or (5) to determine the similarity between data points and group them into clusters. Therefore, we will use the geometric properties to define a quality measure that determines whether a pair of unborn child units outperforms its respective parent unit, resulting in a split.
The model is represented by the outermost developed unit of each branch. Each of these outermost developed units has two unborn child units that compete to replace the respective parent unit. On each data point presentation, the distance of all these units to the presented data point is calculated (5). In order to approximate a continuous quality measure, and since it is impossible to calculate the sum over all data points N, the distances are approximated by a low-pass filter. As the classical Mahalanobis distance (4) is biased towards large units [22,27], we consider the volume-normalized Mahalanobis distance (5), tracked by a low-pass filter (11) with a corresponding low-pass parameter. The low-pass filtered distances of all unborn child units and their respective parent units are necessary for the split decision. Further, an activity between the units is necessary, so that the distances of the individual units can be weighted into a quality measure; equal weighting would not work with unbalanced data. For this purpose, an activity is introduced, which defines the weight between all units that currently represent the model. In analogy to the assignment value (see (8)-(10)), on each presentation only one outermost developed unit (the parent of the winner cw) obtains an instantaneous activity of 1 and all other units an activity of 0 (12). Then, for each unit the continuous activity is approximated by a low-pass filter (13) and normalized over all outermost developed units (14). It is now possible to create a quality measure using the low-pass filtered distances and activities.
In hierarchical clustering, a linkage criterion is commonly used to specify the similarity or, respectively, the dissimilarity between data points and clusters. In our method, each unit updates an intra-cluster and an inter-cluster distance. The similarity of the data points that belong to a unit is tracked by the intra-cluster distance. All winner units among the outermost developed units and the unborn child units have their intra-cluster distance updated in a low-pass filter (15) with a corresponding low-pass parameter. The intra-cluster distance represents the average distance to all data points won by a unit. It remains unchanged for all loser units.

The inter-cluster distance is a natural counterpart that represents the average distance to all data points which do not belong to the unit. It is updated in the same style as the intra-cluster distance, but for all loser units among the outermost developed units and the unborn child units (16), again with a low-pass parameter. It remains unchanged for all winner units.
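Both distance trackers can be sketched in one routine; the dictionary layout and the low-pass parameter name `beta` are illustrative.

```python
def update_cluster_distances(units, winners, d, beta):
    """Sketch of the low-pass intra/inter distance tracking (Eqs (15)-(16)):
    winner units fold the current distance d[u] into their intra-cluster
    distance, loser units into their inter-cluster distance; the respective
    other value stays unchanged.

    units   : dict mapping unit id -> {'intra': float, 'inter': float}
    winners : set of winner unit ids for this data point
    d       : dict mapping unit id -> distance to the presented point
    beta    : low-pass parameter (illustrative name)"""
    for u, state in units.items():
        if u in winners:
            state['intra'] = (1 - beta) * state['intra'] + beta * d[u]
        else:
            state['inter'] = (1 - beta) * state['inter'] + beta * d[u]
```

A well-placed unit thus accumulates a small intra-cluster distance (its own points are close) and a large inter-cluster distance (foreign points are far), without ever storing the data.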
Based on the continuously updated intra-cluster and inter-cluster distances and the unit activities, we can define an adaptive, parameterless quality measure (17), based on the sum of the activity-weighted ratios between intra-cluster (15) and inter-cluster (16) distance. Ideally, the intra-cluster distance is small when the represented data points are close to the respective unit. The inter-cluster distance, on the other hand, should be large, as data points that do not belong to a unit are ideally far away from it. The activity weighting is necessary for unbalanced data sets, such that units that only represent a small share of the presented data do not distort the quality measure.
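A sketch of such an activity-weighted quality measure, assuming (as the text implies) that lower values indicate a better model:

```python
def model_quality(units):
    """Sketch of the quality measure (17): activity-weighted sum of the
    ratio intra/inter over the outermost developed units. Lower is better:
    compact units (small intra) that are far from foreign data (large inter)
    reduce the measure.

    units : list of (activity, intra, inter) triples, activities summing to 1"""
    return sum(a * intra / inter for a, intra, inter in units)
```

Because the ratio is dimensionless and the activities sum to 1, the measure needs no data-dependent threshold, which is what makes the later split test parameterless.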
3.3 Splitting algorithm
The next step is to use the updated model parameters (Fig 6) to check whether the model structure should be extended or not. The splitting algorithm is shown in Fig 7. First, the quality of the current model is calculated with (17). All outermost developed units are taken into account, i.e. all units that have unborn children. Then, each of these units is successively substituted by its two unborn child units; only one unit is replaced per comparison. Now we have the quality QP of the current model and the set of qualities QC,j, in each of which a single unit j is substituted by its two children. To give a small example: for a model with ten outermost developed units, there are eleven quality measures. One is the original measure QP, and there are ten versions QC,j in which always one outermost developed unit is replaced by its two unborn children. In QC,j, the intra-cluster and inter-cluster distance of parent unit j are replaced by those of its two children, weighted by their assignment values. From all child variants the minimum QC,min is determined. Then, it is checked whether QC,min is smaller than QP. To prevent false splits due to statistical outliers, we introduce a hysteresis parameter that slightly penalizes the child models, so that a split requires the penalized QC,min to still be smaller than QP. Throughout the benchmark, a value close to 1 is chosen for this parameter. If the original measure QP remains the best, no splitting takes place. Otherwise the tree is developed in the best branch by replacing one unit with its children. After a split is performed, the model cannot split again for a fixed number of training steps, to allow the newly added units to reposition. To ensure enough training time even with a large model size, this split prevention parameter grows with the model size. As the splitting affects the entire model, it is global.

The quality measure is calculated for the current model and, in addition, by substituting one developed unit by its unborn children (QC,j). If the unborn children outperform their respective parent unit, a split is performed: the winning unborn units are upgraded to developed units and each obtains two new unborn units.
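The split check can be sketched as follows. The parameter names `theta` (hysteresis, close to 1) and `t_min` (split prevention) are illustrative, and the exact form of the penalty is our assumption; the paper only states that the child models are slightly penalized.

```python
def split_decision(QP, QC, theta=1.02, steps_since_split=0, t_min=100):
    """Sketch of the splitting check: QC[j] is the model quality with unit j
    replaced by its two unborn children (lower is better). A split at the
    best branch is suggested only if the penalized child quality still beats
    the parent model QP and enough steps have passed since the last split.

    Returns the index of the unit to split, or None."""
    if steps_since_split < t_min or not QC:
        return None                                    # split prevention active
    j = min(range(len(QC)), key=lambda k: QC[k])       # best child variant
    # hysteresis: inflate the child quality slightly to suppress splits
    # caused by statistical outliers
    return j if theta * QC[j] < QP else None
```

Only one unit is ever replaced per decision, so the tree grows by exactly one level in one branch at a time.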
If a split decision is made, the tree is developed further at the given location. This means that the two previously unborn children become independent, developed units and are added to the set of outermost developed units. Their parent unit is removed from this set, as it is no longer the lowest developed unit on its branch. The two new developed units each obtain two new unborn children, which are added to the set of unborn units. Due to the hierarchical structure, the newly initialized children are always located in the subspace of the parent unit. The dimensionality is set to m = 2; starting with a low dimensionality has a computational advantage, and if a unit needs a higher dimensionality during training, this is achieved by our adaptive dimensionality control algorithm. The eigenvalues are set to half of the parent unit's first two eigenvalues, the mean residual variance accordingly; the center and the first two principal eigenvectors are set to the values of the parent unit. This leads to two identical children after initialization, so that the first competition between them is random. The intra- and inter-cluster distances are set to 1. The learning rate is set to a high initial value so that the unit starts actively; if it were to inherit the learning rate of the parent unit, it would possibly start inactive and would first have to be woken up by the adaptive learning rate. The low-pass filters used to calculate the learning rate are inherited from the parent unit, but with a small offset of 10%; if the low-pass values were adopted exactly, the calculation would directly reproduce the learning rate of the parent unit.
4 Local adaptive dimensionality adjustment
Each unit covers a different part of the data distribution, while each cluster possibly has a different dimensionality which furthermore constantly changes due to the continuous presentation of new data points. This requires each PCA unit to adaptively adjust its own dimensionality whenever the dimensionality of the represented cluster changes (algorithm 1). The need for a unit-specific dimensionality is further motivated by the hierarchical structure. This effect can be seen in Fig 8, as a parent unit lies between two clusters while the two corresponding child units lie correctly on one cluster each. The data set has 32 dimensions, and for the two children to represent 50% of the cluster variance, 7 and 10 dimensions are required respectively. One might expect the parent unit to require a similar dimensionality, but this is not the case. The parent unit does not represent the variance of the two clusters, but mostly the distance between them, which requires only two dimensions to represent 99% of the variance. The shape even suggests that if the distance is large enough, one dimension would be sufficient. This implies the need to adjust the dimensionality at each unit individually for each data point presentation, so that the best possible fit is achieved.
The first number next to each unit center represents the unit index and the number in the bracket the corresponding unit dimensionality.
Algorithm 1 Local online PCA dimensionality adjustment procedure.
Input: current dimensionality mj, current eigenvectors, current eigenvalues, current residual variance, current unit-specific delay counter with an initial delay
Output: updated dimensionality mj, updated eigenvectors, updated eigenvalues, updated residual variance, updated delay counter
1: if the unit-specific delay has elapsed then
2: Transform the mj eigenvalues into the logarithmic scale [25] Eq (15)
3: Estimate the least-squares line through the log eigenvalues [25] Eq (16)
4: Approximate the remaining n–mj eigenvalues along the fitted line [25] Eq (17)
5: U ← Linear Transformation (back into the non-logarithmic range) [25] Eq (18)
6: Determine the updated dimensionality mj [25] Eq (20-23)
7: Reset the unit-specific delay
8: if mj increased then
9: Extend the eigenvalues by the number of added dimensions from U
10: Decrease residual variance by the added eigenvalues
11: for all newly added dimensions do
12: Orthonormalize a new random eigenvector against the existing ones (20), (21)
13: end for
14: else if mj decreased then
15: Remove dimensions from the eigenvector matrix
16: Remove dimensions from the eigenvalues
17: Increase residual variance by the removed eigenvalues
18: end if
19: else
20: Decrement the delay counter
21: end if
For this purpose, we adapt the method from [25], in which an adaptive dimension adjustment for a single PCA unit was presented. In the following, we describe the basic functionality of the dimensionality adjustment algorithm, focusing on the significant extensions introduced in this work, and refer to [25] for a detailed explanation of the basic version. The extended local adaptive dimensionality adjustment algorithm is shown in Algorithm 1.
The dimensionality adjustment algorithm [25] exploits several natural features of neural network-based PCA and properties of the data distribution: (i) the eigenvalues are naturally sorted in descending order, (ii) the components are trained in a hierarchical order, ensuring that the most relevant component is trained first, and (iii) the variance is not evenly distributed over all principal components: typically only a few components represent the majority of the data variance, while the variance of most principal components is minor.
Each unit is initialized with a dimensionality of m = 2 so that only the two most relevant principal components are trained, which greatly reduces the computational effort. A least-squares regression on a logarithmic scale is used to approximate the remaining n–m eigenvalues. For this purpose, the m eigenvalues are transformed into the logarithmic scale and the least-squares parameters of a fitted line are estimated. The original m eigenvalues are then supplemented with n–m estimated log eigenvalues along the fitted line and transformed back into the non-logarithmic range. This allows adding or removing multiple dimensions at once if necessary and is the main reason why we chose this method. In the following, we propose three adjustments to the algorithm:
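The log-scale extrapolation described above can be sketched as follows. This is a minimal NumPy illustration of the idea from [25]; the function and variable names are ours, and the exact regression equations in [25] may differ in detail.

```python
import numpy as np

def extrapolate_eigenvalues(eigvals, n):
    """Approximate the n - m trailing eigenvalues by a least-squares line
    fitted to the m known eigenvalues on a logarithmic scale (sketch)."""
    m = len(eigvals)
    x = np.arange(1, m + 1)
    # fit a line to the log-eigenvalues
    slope, intercept = np.polyfit(x, np.log(eigvals), 1)
    # extend the spectrum along the fitted line and transform back
    x_new = np.arange(m + 1, n + 1)
    est = np.exp(intercept + slope * x_new)
    return np.concatenate([eigvals, est])

# example: m = 4 known eigenvalues, extended to n = 8
full = extrapolate_eigenvalues(np.array([8.0, 4.0, 2.0, 1.0]), 8)
```

For a spectrum that decays geometrically, as above, the log-scale fit is exact and the estimated tail continues the decay (0.5, 0.25, ...).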
(i) In the original algorithm [25], the eigenvectors corresponding to the approximated eigenvalues are selected at random, which is not ideal, as the randomly generated eigenvectors can point into the space already covered by the first m principal components. Instead, we use the modified Gram-Schmidt algorithm to calculate an orthonormal vector for each added dimension, which is then used to extend the existing eigenvector matrix. As the existing eigenvectors in W are orthonormal to each other, it is only necessary to orthonormalize the newly added eigenvectors against the existing ones. Therefore, the Gram-Schmidt procedure is performed sequentially for each added dimension: first, a random new eigenvector is initialized; then its projections onto the existing eigenvectors are subtracted, the new vector is normalized to unit length, and the eigenvector matrix W is extended by the newly approximated dimension. This process is repeated until the added eigenvector is orthogonal with sufficient accuracy.
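A minimal sketch of this extension step, assuming the existing columns of W are already orthonormal (names are ours, not from the paper):

```python
import numpy as np

def extend_orthonormal(W, n_add, rng=None):
    """Extend an n x m matrix W with orthonormal columns by n_add new
    orthonormal columns via modified Gram-Schmidt."""
    rng = np.random.default_rng() if rng is None else rng
    n = W.shape[0]
    for _ in range(n_add):
        while True:
            v = rng.standard_normal(n)
            # modified Gram-Schmidt: subtract projections onto existing columns
            for j in range(W.shape[1]):
                v -= (W[:, j] @ v) * W[:, j]
            norm = np.linalg.norm(v)
            if norm > 1e-10:  # retry if v was (nearly) linearly dependent
                W = np.column_stack([W, v / norm])
                break
    return W

W = extend_orthonormal(np.eye(5)[:, :2], 2)  # extend a 5x2 basis to 5x4
```

The retry loop covers the unlikely case that the random vector falls (almost) entirely inside the already spanned subspace.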
(ii) The neural network-based PCA method approximates, at each data point presentation, the variance in the n–m minor eigendirections. When the dimensionality is adjusted, this residual variance approximation has to be adjusted as well: when the dimensionality is increased, the newly added variance (represented by the added eigenvalues) has to be subtracted from the residual variance, and when the dimensionality is decreased, the removed variance has to be added back. This is not done in the original version, which results in a distortion of the potential function (e.g. (4) and (5)).
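The bookkeeping in (ii) can be illustrated with a small sketch. We assume here that the stored quantity is the total residual variance of the minor directions; if a per-direction mean is stored instead, as the term "mean residual variance" suggests, the update would additionally rescale by the number of minor directions.

```python
import numpy as np

def adjust_residual_variance(sigma_res, eigvals, new_eigvals):
    """Keep the variance bookkeeping consistent after a dimensionality change.
    Sketch only: sigma_res is assumed to hold the *total* residual variance."""
    added = new_eigvals[len(eigvals):]    # eigenvalues of added dims (may be empty)
    removed = eigvals[len(new_eigvals):]  # eigenvalues of removed dims (may be empty)
    return sigma_res - added.sum() + removed.sum()

# increasing from m = 2 to m = 3 moves the new eigenvalue out of the residual
old = np.array([4.0, 2.0])
new = np.array([4.0, 2.0, 1.0])
sigma = adjust_residual_variance(3.0, old, new)  # 3.0 - 1.0 = 2.0
```

Either direction of the update conserves the total variance represented by the unit, which is exactly what keeps the potential function undistorted.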
(iii) In the original algorithm, an initial delay of training steps is defined to allow the PCA to adapt to the data after a change in dimensionality. As we now work with a mixture of local PCA units, we extend this concept: whenever a unit adjusts its own dimensionality, a unit-specific delay prevents further dimensionality changes for that unit. Likewise, whenever new units are generated, they start with this initial delay to prevent uncontrolled changes during the unit adjustment.
5 Experiments
Data sets: The experimental study conducted in this paper is based on the data sets provided in the clustering benchmark database [29], appendix C.
Baselines: We benchmark our algorithm against state-of-the-art clustering algorithms, which we all re-implemented using popular python libraries, appendix D.
Metrics: To evaluate the clustering performance we use the commonly used Cluster Index (CI) and Normalized Mutual Information (NMI) measures, appendix C.
5.1 Visual results
H-NGPCA was tested on all data sets (tab:datasets). To get a first impression of the performance, selected clustering results are visualized and discussed. To limit the scope to the essential results, the majority of the visual results are discussed in appendix E. The figures serve to visually validate the numerical results of the following chapters. For referencing between the visualization and the text, each unit is numbered, with the dimensionality indicated in brackets. In addition, the data points are colored based on the clustering results, so the images in this chapter should be viewed in color. The final cluster results are shown with the associated time courses of the quality Q(t), the learning rate, the number of units, and the dimensionality m(t) during the learning process. For the high-dimensional data sets, the PCA units are shown by their two most relevant principal components.
In Fig 9, the H-NGPCA algorithm is trained on the s1 data set. This initial data set is characterized by 15 two-dimensional Gaussians without overlap. Each cluster is covered by exactly one unit and the assignment of the data points works without errors. The learning process on that data set is shown in Fig 1.
All PCA units are represented with an axis length of
While the previous visualizations concerned 2d data sets, now the high-dimensional data set h1024 is considered. This example is particularly relevant as it can be used to analyze the influence of the adaptive dimensionality control. The question arises as to whether drastically changing the dimensionality of the individual units leads to undesirable splits. Fig 10 shows the 2d projections of the units on the h1024 data set. The axes of each unit are its two principal components with the highest variance. The numbers in brackets after the unit index indicate the dimensionality of the respective unit. It can be seen that each unit has its own dimensionality based on the data associated with it. In addition, each unit is located on exactly one cluster and the data points are correctly assigned to the units. Fig 11.a shows the dimensionality of the units over time. We decided not to show curves averaged over multiple runs, as it cannot be guaranteed that each unit represents the same cluster in every simulation. Newly added units tend to overshoot slightly in their dimensionality before they converge towards the correct dimensionality, which corresponds to the findings of [25]. It can be seen that the dimensionality converges well toward the actual value and that fluctuations in dimensionality have no influence on the quality measure and thus the splitting process. The ground truth dimensionality of all clusters, determined offline, is m = 15.5 on average, and the average PCA dimensionality at the end of the training procedure is m = 16.75. Fig 11.b shows the continuous growth of the number of units until the real number of clusters is reached.
All PCA units are represented with an axis length of
(a) The average ground truth dimensionality of all clusters is m = 15.5 and the average dimensionality of all PCA units is m = 16.7 at the end of the training procedure.
5.2 Quantitative and statistical comparison with competing algorithms
The literature search revealed that none of the online clustering algorithms with an adaptive number of units under consideration provided a functioning implementation. We therefore extended our benchmark to offline algorithms with an adaptive number of clusters and established popular offline clustering algorithms with a fixed number of units. The NMI and CI measures were used as quality measures for all data sets. In addition, the number of clusters was used as an additional measure for algorithms with an adaptive number of units.
The CI values averaged over 26 runs, with the number of runs obtained from a power analysis (22), are presented in tab:CI. The values are normalized by the number of clusters in each data set, with zero indicating a perfect CI value. The table contains two categories of algorithms. The first category comprises algorithms with a fixed number of units; these appear in bold font in the table. The second category comprises algorithms with an adaptive number of clusters; these appear in normal font. The table is sorted by the last column, which shows the CI average across all data sets. As expected, the algorithms with a fixed number of units (which in our tests coincides with the true number of clusters, although this number is usually unknown) perform noticeably better than algorithms with an adaptive number of units. Among the algorithms with an adaptive number of units, however, our algorithm is ahead of the competition. It should be noted that all other algorithms with adaptive unit numbers consider all data offline at once, whereas we update the model in an online setting, data point by data point. In addition, our adaptive dimensionality adjustment means that we can map each subspace with an optimal number of dimensions and thus save a lot of computational effort.
The results are further validated by looking at the NMI results in tab:NMI, with the best possible NMI being equal to one. While the algorithms with a fixed number of units are generally at the top, both our H-NGPCA algorithm and the Birch algorithm are able to outperform the Cure algorithm (with fixed number of units). This can certainly be seen as a success for both algorithms, as it means that the data is mapped better despite the adaptive number of units.
To corroborate our results, we show that the performance differences in NMI and CI between the 13 benchmarked methods are statistically significant. Because the preconditions for ANOVA are not met (no equal variances, no normally distributed residual values), we resort to the nonparametric Kruskal-Wallis test. For each of the measures NMI and CI, a single Kruskal-Wallis test is carried out. Overall, the sample size per method amounts to 416 (16 data sets with 26 random seeds/training runs each) and the degrees of freedom to 12. The null hypothesis can be rejected for all measures with p < .001 (two-sided) (H = 2031.9 for NMI, H = 2411.0 for CI; H values are adjusted for tied ranks).
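The omnibus test can be reproduced along the following lines with SciPy. The score samples below are synthetic stand-ins, not the published results; with the paper's data, one group of 416 values per method and measure would be passed in.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# illustrative NMI samples for three hypothetical methods
# (synthetic data, not the published benchmark values)
scores = [rng.normal(loc=mu, scale=0.05, size=50) for mu in (0.87, 0.80, 0.78)]

# scipy's Kruskal-Wallis H is adjusted for tied ranks, as in the paper
h, p = stats.kruskal(*scores)
print(f"H = {h:.1f}, p = {p:.3g}")
```

Rejecting the omnibus null hypothesis then licenses the pairwise post-hoc comparisons discussed next.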
Bold method names indicate algorithms with a fixed number of clusters; otherwise the algorithm adaptively determines the number of clusters. The last column shows the mean across all data sets. Results are sorted according to the last column. The corresponding standard deviations are shown in Table 9.
Bold method names indicate algorithms with a fixed number of clusters; otherwise the algorithm adaptively determines the number of clusters. The last column shows the mean across all data sets. Results are sorted according to the last column. The corresponding standard deviations are shown in Table 10.
Furthermore, for post-hoc analysis, all possible 78 pairwise combinations between methods are subjected to the Dunn test [30] with Bonferroni correction. Dunn's test statistic is a z-score. To achieve a significant result in a single pairwise comparison (two-tailed; with Bonferroni correction), the absolute value of the z-score has to be greater than 3.529. For NMI, this is the case for 62 out of 78 pairwise comparisons, for CI for 66 out of 78. Taking especially H-NGPCA into account, the performance of this algorithm is significantly different from that of all other algorithms, except for affinity propagation in both NMI and CI, and for BIRCH in CI. Nevertheless, it should be noted that both affinity propagation and BIRCH are offline algorithms which process all data at once; thus they are (i) not applicable to data streams and (ii) computationally heavy on large data sets. In contrast, H-NGPCA can be applied in an online manner, which constitutes a key advantage in handling dynamic or continuously evolving data.
Because we wanted to reject the null hypothesis for most pairwise comparisons in the post-hoc Dunn test, the power analysis is based on this test. Our goal was to keep the type II error below a fixed bound at the Bonferroni-corrected, two-tailed significance level and for a small-to-medium effect size. Since Dunn's test statistic is a z-score, the required sample size per method can be computed from the corresponding standard normal quantiles [31]. With a total of 16 data sets to be tested per method, this results after rounding in 26 training runs for each combination of the 16 data sets with the 13 methods.
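A reconstruction of such a sample-size computation for a two-group z-test is sketched below. The type II error bound and effect size are illustrative assumptions of ours (the critical z-value 3.529 is taken from the Dunn test above); with these values the result lands close to the paper's 416 samples per method, though the exact parameters used there may differ slightly.

```python
from math import ceil
from statistics import NormalDist

z_alpha = 3.529  # critical |z| after Bonferroni correction (two-tailed), see text
beta = 0.20      # assumed type II error bound (power 0.80)
d = 0.30         # assumed small-to-medium effect size

z_beta = NormalDist().inv_cdf(1 - beta)
# two-group z-test on means: n per group = 2 * ((z_alpha + z_beta) / d)^2
n = ceil(2 * ((z_alpha + z_beta) / d) ** 2)
runs = ceil(n / 16)  # spread the required samples over the 16 data sets
print(n, runs)
```

Only standard-library quantile functions are needed, since the Dunn statistic is a plain z-score.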
6 Discussion and conclusions
Clustering data streams is difficult because the effectiveness of clustering algorithms depends on the correct assignment of hyperparameters. These parameters, which include the number of clusters, density thresholds, decay rates and window lengths, have a significant impact on the quality of the clustering results. In comparison to conventional clustering scenarios, data streams have a dynamic nature and are subject to continuous changes. Consequently, setting fixed hyperparameter values without prior insight risks skewing the clustering model. Among these parameters, the optimal number of clusters is of particular importance, and at least that requires an adaptive component.
Previously published work on the topic of continuously finding the optimal number of units rarely provided repositories to reproduce the results, which makes these methods unsuitable for benchmarking. Moreover, most algorithms, when trying to remove the hyperparameter of the unit count, added several new hyperparameters in the process, which negates the added value of these methods from the outset.
Our algorithm presented in this work extends local PCA clustering by a hierarchical approach to continuously adapt the number of units based on a hyperparameter-free quality measure. We evaluated the performance of the presented algorithm in an experimental study on all data sets of the clustering benchmark database. The visual results showed successful training of the H-NGPCA algorithm on data sets with different characteristics, such as cluster overlap, unbalanced clusters and high dimensionality.
We propose a parameter-free splitting criterion based on intra- and inter-cluster distances that is weighted by each unit's activity. The quality measure rewards a unit if the data points that belong to it are close to it and if the data points that do not belong to it are far away.
For high-dimensional data we proposed an adaptive dimensionality control algorithm that adjusts each unit's dimensionality individually. This local dimensionality control is necessary because units require a different dimensionality depending on the data which they represent. For example, units at a higher hierarchical level often lie between several clusters rather than directly on one. This means that the variance of the unit no longer depends on the variance of the clusters but on the distances between them.
Further, we compared our algorithm with state-of-the-art clustering algorithms. First, we conducted an extensive literature review to identify the most popular online clustering algorithms. In the next step, we looked for the major potential competitors for our algorithm. Unfortunately, this ended with the realization that none of the direct competitors had provided code repositories which could be used to apply the algorithms in a comparison. We therefore also extended the comparison to traditional offline clustering methods with a static or adaptive number of units. Compared to these, we naturally have a strong handicap due to online learning and the adaptive number of units. Nevertheless, in this comparison we scored positively on the NMI and CI values: we are better than all other offline algorithms with an adaptive number of units and can in some cases keep up with offline algorithms with a static number of units. Our estimated number of units is also closer to the actual number of units than that of the competing algorithms.
A total of five hyperparameters are used in H-NGPCA: one low-pass parameter for updating the low-passes, and two parameters each for the dimensionality control and the number-of-units algorithm. For the dimensionality control, it is necessary to specify how much variance is to be retained for each unit and for how many training cycles the unit-specific dimensionality control pauses after a dimensionality change. For the split algorithm, there is a similar hyperparameter that prevents further splits after a recent split, and a hysteresis parameter that prevents splits caused by statistical outliers. All these parameters can be set intuitively and do not require any complex tuning. For comparison, other algorithms with an adaptive unit number use two hyperparameters on average.
While our algorithm shows very convincing results overall, it has two weak points which we want to address in future work. Firstly, the concept of parent units with two children has the disadvantage that if three clusters lie along an axis, there is no split. This has already been shown and discussed in Fig 12. One approach could be to replace the strict binary (two children) structure with one that allows more than two children in these situations. Secondly, a merge mechanism is currently missing, with which parts of the unit tree could be pruned again. It would be possible to implement this with relatively little effort, as it is only necessary to regularly check whether the higher hierarchical levels are still relevant. Nevertheless, testing this component would exceed the scope of this work. For future work, we aim to further explore the application of H-NGPCA in real-world scenarios, particularly in biological and industrial domains [32]. These fields often involve complex and high-dimensional data streams, providing an ideal environment to assess the performance of the proposed method.
List of symbols
‘+’ means that the property is present, ‘(+)’ partly present, ‘-’ not present, ‘N.I.’ no information available, and ‘req’ required. In the first three rows, the properties of our NGPCA versions are listed. Then, the best known algorithms from the five cluster categories are mentioned.
A Further details on related work
A complete literature review of data stream clustering algorithms is shown in ref:fig_lit. Particularly interesting for the benchmarking are the following algorithms: ODAC [33], Adaptive K-means [34], StrAP [35], FEAC-Stream [36], Incremental DBSCAN [37], LDBSCAN [38], GCHDS [39], GSCDS [40] and CluDistream [41]. Unfortunately, reviewing these algorithms revealed that each publication (i) uses completely different data sets, most of which are simple two-dimensional offline data sets; (ii) rarely specifies the sampling order of the data sets; (iii) does not specify hyperparameters or training time; (iv) if benchmarks are available, they are against algorithms with completely different properties; and (v) in most cases, no working GitHub or other code-versioning repositories are available. There is a repository for the algorithm ODAC [33], but only an offline implementation is available. For LDBSCAN [38] a repository exists, but there are no readme or docs, which are necessary for use. The only well-prepared repository is provided for the algorithm Inc. DBSCAN [37], whereby the owner of the repository describes that some parts of the algorithm are not described in the original paper and that the repository owner implemented their own solutions for those algorithmic holes. No other competing algorithm provides an implementation. It is therefore not possible to carry out meaningful benchmarks with these algorithms, which is why we also had to consider eleven offline algorithms for the benchmark, such as Gaussian Mixture Models, K-Means, Birch and DBSCAN.
B Pseudo-code of training process
Algorithm 2 shows the algorithmic steps necessary to re-implement the algorithm proposed in this work.
Algorithm 2 H-NGPCA training procedure. The dimensionality adjustment is presented in a separate algorithm (Algorithm 1) and the structure adaptation components in a separate figure (Fig 7).
1: Initialize root unit
2: Initialize root children
3: for all input vectors x from the data stream do
4: Initialize the set of loser units (empty)
5: Initialize the set of winner units (empty)
6: k = 0 ▷ Start at root unit
7: do
8: Retrieve the two children of unit k
9: d1 ← Potential function (x, child 1) (5)
10: d2 ← Potential function (x, child 2) (5)
11: cw ← winner child (6)
12: Update assignment value of the winner child (8)-(10)
13: Update assignment value of the loser child (8)-(10)
14: Add the winner child to the set of winner units
15: Add the loser child to the set of loser units
16: k ← winner child
17: while winner child cw has children ▷ Descend in tree
18: for all winner units j do
19: Adaptive learning rate
20: Update center
21: Online PCA
22: Update residual variance ▷ Only for m < n
23: Dimensionality adjustment (mj, …) ▷ Alg. 1
24: Update intra-cluster distance with dj ▷ dj set in lines 9 and 10
25: end for
26: for all loser units j do
27: Update inter-cluster distance with dj ▷ dj set in lines 9 and 10
28: end for
29: for all units do
30: Update activity (12)-(14)
31: end for
32: Unit tree ← Structure adaptation ▷ Fig 7
33: end for
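The competition and descent loop (steps 6-17 of Algorithm 2) can be sketched in Python as follows. The `Unit` structure and the Euclidean stand-in for the Mahalanobis potential of Eq (5) are simplifications of ours, not the paper's implementation.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Unit:
    center: np.ndarray
    children: list = field(default_factory=list)  # [] or two child Units

def potential(unit, x):
    # Euclidean stand-in for the Mahalanobis potential of Eq (5)
    return float(np.sum((x - unit.center) ** 2))

def descend(root, x):
    """Repeat the two-child competition from the root downwards,
    collecting the winner and loser units along the path."""
    winners, losers = [], []
    k = root
    while k.children:
        c1, c2 = k.children
        d1, d2 = potential(c1, x), potential(c2, x)
        win, lose = (c1, c2) if d1 <= d2 else (c2, c1)
        winners.append(win)
        losers.append(lose)
        k = win  # descend into the winning branch
    return winners, losers

# tiny two-level example
root = Unit(np.zeros(2), [Unit(np.array([-1.0, 0.0])), Unit(np.array([1.0, 0.0]))])
winners, losers = descend(root, np.array([0.9, 0.1]))
```

The winner list then receives the parameter updates of steps 18-25, the loser list the inter-cluster distance updates of steps 26-28.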
C Details of benchmark data sets and evaluation metrics
The data sets considered for the benchmark are shown in tab:datasets. They vary in the number of clusters, data points per cluster, cluster overlap, dimensionality, and the data point balance. The high-dimensional data sets are particularly interesting, since it is difficult to find the right number of clusters in high-dimensional spaces, as they are usually only sparsely filled.
The Centroid Index (CI) is a metric derived from the cluster centers to evaluate the cluster-level mismatch [69]. It compares the local PCA centers with the true cluster centers. For each cluster center, the nearest PCA unit is determined using the Mahalanobis distance (4) with respect to the cluster's eigenvalues and eigenvectors, which are computed for each cluster from its covariance matrix. PCA units without matches are labeled as orphans or "dead". The directional measure CI1 is then the number of orphan units. It is important to note that the CI is asymmetric [69]. Therefore, the opposite direction is calculated similarly, matching clusters to local PCA units using the eigenvalues and eigenvectors of the PCA units. The symmetric version CI2 [69] used in this work is obtained by taking the maximum of the two directional values. With the symmetric variant, the number of clusters does not matter, because the cluster index is not limited by the pairing like other set-based measures. Instead, it gives a value that is equal to the difference between the number of clusters and the number of units, or higher if other cluster-level mismatches are detected; if no orphans exist, each PCA unit is mapped to exactly one cluster, indicating that the structures are close to each other (CI2 = 0). In the following, CI stands for CI2 and is in addition normalized by the number of clusters in a data set.
The Normalized Mutual Information (NMI) is an external measure expressing how much information is shared between the real and the predicted clustering. Therefore, ground truth information about the real cluster assignment is required for this measure. For the benchmark data sets in [29] on which we rely, this information is provided. For each data point i, the ground truth cluster label li is set to the corresponding cluster index, varying between 1 and the number of clusters M. The predicted cluster labels are obtained by determining which of the local PCA units is closest to the respective data point; the distance measure is the Mahalanobis distance (4).
The NMI is a normalized form of the mutual information (MI). In the literature, several versions of the NMI are proposed; we use the definition of [70], in which the mutual information between the ground truth and predicted labelings is normalized by their entropies (computed from the cluster label probabilities pj). The NMI expresses the amount of information the overall local PCA model can extract from the ground truth distribution. A value of 1.0 is the maximum and indicates that real and predicted clustering are identical.
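A self-contained sketch of the NMI computation follows. We use the common normalization by the average of the two entropies; whether this matches the exact variant of [70] is an assumption of this sketch.

```python
import numpy as np

def nmi(labels_true, labels_pred):
    """NMI with the 2*I / (H_true + H_pred) normalization (sketch)."""
    lt = np.asarray(labels_true)
    lp = np.asarray(labels_pred)
    n = lt.size
    _, inv_t = np.unique(lt, return_inverse=True)
    _, inv_p = np.unique(lp, return_inverse=True)
    # joint contingency counts between the two labelings
    joint = np.zeros((inv_t.max() + 1, inv_p.max() + 1))
    np.add.at(joint, (inv_t, inv_p), 1.0)
    p_joint = joint / n
    p_t = p_joint.sum(axis=1)
    p_p = p_joint.sum(axis=0)
    nz = p_joint > 0
    mi = np.sum(p_joint[nz] * np.log(p_joint[nz] / np.outer(p_t, p_p)[nz]))
    h_t = -np.sum(p_t * np.log(p_t))
    h_p = -np.sum(p_p * np.log(p_p))
    return 2.0 * mi / (h_t + h_p)

score = nmi([0, 0, 1, 1], [1, 1, 0, 0])  # identical partition up to relabeling
```

Note that the NMI is invariant to a permutation of the predicted cluster indices, which is why the relabeled partition above still scores 1.0.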
D Algorithm parameterizations
The hyperparameters shown below for each algorithm are those used to generate the results of tab:CI, tab:NMI and tab:nunits. We limit the parameter overview to those that vary from the default values. For a full list of parameters, we refer to the corresponding scikit-learn and pyclustering documentation or to our benchmark script within our GitHub repository. The N-clusters parameter is set, where required, to the number of clusters in the respective ground truth data set.
tab:our_parameters shows the parameters set for the benchmark results of our algorithm. The dimensionality mj, not mentioned in the table, is initialized to two and adaptively adjusted for each unit individually using the adaptive dimensionality control. The low-pass filter parameter, the dimensionality threshold, and the unit-specific delay of the dimensionality adjustment are fixed values obtained from [22,25]. The hysteresis parameter is chosen close to 1, with slightly smaller values for data sets with high cluster overlap (e.g. S3, S4). The delay after each splitting operation varies depending on the number of samples in a data set.
E Additional experimental results and discussions
E.1 Extended visual results
In Fig 12, the H-NGPCA algorithm is trained on the s2 data set. In contrast to the s1 data set, this data set stands out due to the overlap of the clusters. Our algorithm shows no weakness when training on strongly overlapping clusters and can also determine cluster centers and shapes in overlap regions. Nevertheless, one weakness of the algorithm becomes clear. If three clusters are in a straight line, the splitting approach used here may fail. This is because the parent unit lies appropriately on the middle cluster and extends along an axis to the two others. The corresponding children each lie between the center cluster and an outer cluster, which means that the fit of the parent unit is better than that of the child units. Therefore, no split is performed. The associated learning rate curve (Fig 13a) shows that the learning rates of all units quickly converge towards zero. The number of units (Fig 13b) increases steadily until it reaches the point where all units but the one covering three clusters are correctly split and then the number stays constant.
Further, H-NGPCA was tested on the u1 data set. This data set is characterized by strongly unbalanced clusters (with many data points in the three clusters on the left). Previous versions of NGPCA [22] had major problems on this data set; the H-NGPCA version presented here learns it correctly without any problems. In Fig 14 the final clustering is shown, with each cluster being represented by exactly one unit and a correct clustering of the data points. In Fig 15 the corresponding time courses of the quality measure and the number of units are shown. The quality measure starts high while the units are still adapting; as soon as the units converge, so does the quality measure. New units that only become independent later in the training do not exhibit the same behavior, as the units' learning rates have already decreased considerably by this time. The progression of the number of units looks similar to the previous s2 data set: the correct number of units is reached quickly and then remains stable.
E.2 Extended quantitative results
We compared the real number of clusters with the number approximated by a clustering algorithm. As part of the benchmarked algorithms have a fixed number of units, we only performed this test for algorithms with an adaptive number. The results are shown in tab:nunits. The majority of competing algorithms tend to significantly underestimate the number of clusters in a data set. This effect is particularly pronounced on the b-series, as these data sets consist of 100 clusters. In general, most competing algorithms seem to stagnate somewhere around 15 clusters. We discuss this further in cha:discussion; it could be interesting to investigate this in a larger study. Our algorithm is always close to the real number of clusters, regardless of the amount of clusters. Visual results of all competing algorithms are available in a compact form within the supplementary material. A final cluster result is shown there as an example for each combination of data set and algorithm. Empty cells mean that the algorithm could not be applied to the data set.
The standard deviations for both the CI and NMI results (tab:CI-tab:NMI) are shown in tab:ci_std and tab:nmi_std. Some algorithms, in particular those with a static number of units, are fully deterministic and therefore have standard deviations of zero. On the other hand, some of the algorithms with a dynamic number of units are unsuitable for certain data sets. This also leads to a standard deviation of 0, as the algorithms are stuck with one single unit across all training runs.
In non-stationary and transient environments, data distributions may change over time. So far, only data distributions in which the number of clusters is constant have been considered. In the following, a situation is considered in which the distribution changes in the middle of training, so that the units have to readjust and the optimal number of units changes (Fig 16). The H-NGPCA model is initially trained on a square with a line attached (Fig 16(a)). After the model converged and all units’ learning rates cooled down, the distribution is abruptly extended by a ring (Fig 16(b)). All units wake up again to adjust to the new distribution and split up whenever necessary. As our algorithm does not currently have a merge function which reduces the number of units, there is a risk that incorrect splits will occur when readjusting the units shortly after changing the data distribution, leading to dead units. A weaker effect can be seen in our case, where units 25 and 27 are not dead, but overlap unnecessarily. Ideally, these two units should be merged. Nevertheless, this small experiment shows that the H-NGPCA is suitable for data stream clustering.
The already converged network wakes up and expands the tree to approximate the extended distribution. The corresponding learning rates spike when the data distribution is extended, but cool down again quickly.
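The wake-up behavior described above can be mimicked by a simple error-driven learning-rate schedule. The following is a minimal illustrative sketch, not the exact H-NGPCA rule: the function name `reheated_lr` and all parameters (`alpha`, `decay`, `boost`, `thresh`) are assumptions for illustration; the idea is only that a unit's rate decays as it converges and is boosted back up when its quantization error jumps well above its running average, as happens after a distribution change.

```python
import numpy as np

def reheated_lr(lr, error, err_avg, alpha=0.01, decay=0.999,
                boost=10.0, thresh=3.0):
    """Illustrative learning-rate schedule (not the exact H-NGPCA rule):
    decay while converged, 'reheat' when the current quantization error
    exceeds the running average by a factor `thresh` (assumed drift signal)."""
    err_avg = (1 - alpha) * err_avg + alpha * error   # running error average
    if err_avg > 0 and error > thresh * err_avg:
        lr = min(1.0, lr * boost)                     # wake up after drift
    else:
        lr *= decay                                   # cool down otherwise
    return lr, err_avg
```

Driving this with a stationary error keeps the rate shrinking; a sudden large error (the ring appearing) pushes it back up so the unit can move again.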
E.3 Per-data-point complexity analysis
Each data point traverses a binary tree of depth d from the root to a leaf. Given the data dimensionality n and the PCA unit dimensionality m, the following costs arise for each data point presentation:
Mahalanobis comparisons: At each of the d−1 levels (excluding the root level), the data point is compared to the two child nodes using the Mahalanobis distance (5). Each comparison requires projecting into PCA space using the stored eigenvectors, at cost O(nm), and computing the Mahalanobis distance with a diagonal covariance (from the eigenvalues), at cost O(m). Since two comparisons are made per level, the total cost across all levels is 2(d−1) · O(nm + m) = O(d nm).
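The per-level comparison can be sketched as follows. This is a minimal illustration, not the paper's implementation; the helper names are hypothetical, and a unit is assumed to store its center `mu`, an orthonormal eigenvector matrix `W` of shape (n, m), and eigenvalues `lam` of shape (m,):

```python
import numpy as np

def mahalanobis_pca(x, mu, W, lam):
    """Squared Mahalanobis distance of x to a PCA unit with diagonal
    covariance in PCA space: projection is O(nm), weighted norm O(m)."""
    y = W.T @ (x - mu)                    # project into PCA space: O(nm)
    return float(np.sum(y**2 / lam))      # diagonal Mahalanobis: O(m)

def choose_child(x, children):
    """Descend one tree level: compare x to both children, pick the closer."""
    return min(children, key=lambda c: mahalanobis_pca(x, c["mu"], c["W"], c["lam"]))
```

Repeating `choose_child` over d−1 levels gives the O(d nm) traversal cost stated above.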
PCA updates: After choosing the winning child at each level, the algorithm updates the PCA parameters of all d visited nodes (including the root). According to [26], each update costs O(nm²).
In a sequential setting, this cost is multiplied by the depth d. As this step is performed in parallel in our setup, the cost remains O(nm²).
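To make the O(nm²) per-update cost concrete, here is a generic online PCA step in the style of Oja's rule with re-orthonormalization. This is an illustrative stand-in, not RRLSA [26] itself (which interlocks learning and orthonormalization differently), but its cost profile is the same: Hebbian updates at O(nm), dominated by the orthonormalization at O(nm²).

```python
import numpy as np

def online_pca_step(x, mu, W, lam, lr=0.01):
    """One illustrative online PCA update (Oja-style + QR orthonormalization).
    mu: (n,) center, W: (n, m) eigenvector estimate, lam: (m,) eigenvalues."""
    mu = mu + lr * (x - mu)               # track the center: O(n)
    y = W.T @ (x - mu)                    # project the residual: O(nm)
    W = W + lr * np.outer(x - mu, y)      # Hebbian eigenvector update: O(nm)
    lam = (1 - lr) * lam + lr * y**2      # running eigenvalue estimate: O(m)
    Q, _ = np.linalg.qr(W)                # re-orthonormalize: O(nm^2)
    return mu, Q, lam
```

In the hierarchical setting, this step would run once per visited node; executed in parallel across the d nodes, the wall-clock cost stays O(nm²) as stated above.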
Unit split: The split decisions are based on the already calculated Mahalanobis distances. For the direct comparison we loop over all candidate units, which has a complexity of O(2^d), since a binary tree of depth d contains up to 2^d − 1 units.
Dimensionality adjustment: The dimensionality adjustment algorithm involves a linear eigenvalue regression and a Gram–Schmidt orthogonalization. The linear eigenvalue regression is only performed on the first m eigenvalues and thus scales linearly with O(m). The Gram–Schmidt orthogonalization is only performed when the dimensionality is increased. When the PCA unit dimensionality is increased from m to m′, we only orthogonalize the m′ − m newly added eigenvectors (Gram–Schmidt) against the m already existing ones. This leads to O(nm(m′ − m)), which is typically much smaller than a full recomputation O(nm²), especially when m′ − m ≪ m. Therefore, the complexity of the dimensionality adjustment algorithm is omitted. Combining all terms, the overall per-data-point complexity is O(d nm + nm² + 2^d).
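The incremental orthogonalization can be sketched as follows. This is a minimal sketch with a hypothetical helper name (`extend_basis`), assuming the existing basis `W` is already orthonormal; only the newly added vectors are Gram–Schmidt-orthogonalized against it:

```python
import numpy as np

def extend_basis(W, V_new):
    """Orthonormalize k new eigenvectors V_new (n, k) against an already
    orthonormal basis W (n, m) via Gram-Schmidt: O(n*m*k), cheaper than
    re-orthonormalizing all m+k vectors from scratch when k << m."""
    cols = [W[:, i] for i in range(W.shape[1])]
    for j in range(V_new.shape[1]):
        v = V_new[:, j].copy()
        for q in cols:                 # subtract projections onto current basis
            v -= (q @ v) * q
        v /= np.linalg.norm(v)         # normalize the residual
        cols.append(v)                 # new vector joins the basis
    return np.column_stack(cols)
```

Each new vector is also orthogonalized against previously added new vectors, so the returned matrix is orthonormal as a whole.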
The overall complexity grows linearly with n and quadratically with m, which is usually much smaller than n (m ≪ n), plus the exponential term in d. The algorithm therefore has a higher computational complexity on data with many high-dimensional clusters.
References
- 1. Jain AK, Dubes RC. Algorithms for clustering data. Englewood Cliffs (NJ): Prentice Hall; 1988.
- 2. Hou J, Yuan H, Pelillo M. Towards parameter-free clustering for real-world data. Pattern Recogn. 2023;134:109062.
- 3. Lukats D, Zielinski O, Hahn A, Stahl F. A benchmark and survey of fully unsupervised concept drift detectors on real-world data streams. Int J Data Sci Anal. 2024;19(1):1–31.
- 4. Gunasekara N, Pfahringer B, Murilo Gomes H, Bifet A, Koh YS. Recurrent concept drifts on data streams. In: Proceedings of the thirty-third international joint conference on artificial intelligence; 2024. p. 8029–37.
- 5. Wani AA. Comprehensive analysis of clustering algorithms: Exploring limitations and innovative solutions. PeerJ Comput Sci. 2024;10:e2286. pmid:39314716
- 6. Sushreeta T. A survey on partitioning and parallel clustering algorithms. In: International conference on computing and control engineering, data mining and knowledge engineering. 2013;4(7):343–8.
- 7. Balcan MF, Liang Y, Gupta P. Robust hierarchical clustering. J Mach Learn Res. 2014;15(118):4011–51.
- 8. Narasimhan M, Jojic N, Bilmes J. Q-clustering. In: Advances in neural information processing systems. MIT Press; 2006. p. 979–86.
- 9. Aggarwal C. A survey of stream clustering algorithms. CRC Press; 2013. p. 28.
- 10. Moulton RH, Viktor H, Japkowicz N, Gama J. Clustering in the presence of concept drift. In: Machine learning and knowledge discovery in databases; 2019. p. 339–55.
- 11. Ackermann MR, Märtens M, Raupach C, Swierkot K, Lammersen C, Sohler C. StreamKM++: A clustering algorithm for data streams. ACM J Exp Algorithmics. 2012;17.
- 12. Zubaroglu A, Atalay V. Data stream clustering: A review. Artif Intell Rev. 2020;54:1201–36.
- 13. Chen Y, Tu L. Density-based clustering for real-time stream data. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining; 2007. p. 133–42. https://doi.org/10.1145/1281192.1281210
- 14. Wan L, Ng WK, Dang XH, Yu PS, Zhang K. Density-based clustering of data streams at multiple resolutions. ACM Trans Knowl Discov Data. 2009;3(3):1–28.
- 15. Ghesmoune M, Lebbah M, Azzag H. State-of-the-art on clustering data streams. Big Data Analyt. 2016;1(1).
- 16. Mousavi M, Bakar A, Vakilian M. Data clustering algorithms: A review. Int J Adv Softw Comput Applic. 2015;7.
- 17. Peterson AD, Ghosh AP, Maitra R. Merging K-means with hierarchical clustering for identifying general-shaped groups. Stat (Int Stat Inst). 2018;7(1):e172. pmid:29736237
- 18. Melnykov V, Michael S. Clustering large datasets by merging K-means solutions. J Classif. 2019;37(1):97–123.
- 19. Möller R, Hoffmann H. An extension of neural gas to local PCA. Neurocomputing. 2004;62:305–26.
- 20. Ronen M, Finder SE, Freifeld O. DeepDPM: Deep clustering with an unknown number of clusters. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR); 2022. p. 9861–70.
- 21. Ren Y, Pu J, Yang Z, Xu J, Li G, Pu X, et al. Deep clustering: A comprehensive survey. IEEE Trans Neural Netw Learn Syst. 2025;36(4):5858–78. pmid:38963736
- 22. Migenda N, Möller R, Schenck W. Adaptive local principal component analysis improves the clustering of high-dimensional data. Pattern Recogn. 2024;146:110030.
- 23. Kong X, Hu C, Duan Z. Principal component analysis networks and algorithms. Singapore: Springer; 2017.
- 24. Cardot H, Degras D. Online principal component analysis in high dimension: Which algorithm to choose? Int Stat Rev. 2017;86(1):29–50.
- 25. Migenda N, Möller R, Schenck W. Adaptive dimensionality reduction for neural network-based online principal component analysis. PLoS One. 2021;16(3):e0248896. pmid:33784333
- 26. Möller R. Interlocking of learning and orthonormalization in RRLSA. Neurocomputing. 2002;49(1–4):429–33.
- 27. Hoffmann H. Unsupervised learning of visuomotor associations. Bielefeld University, Faculty of Technology; 2004.
- 28. Kaiser A. Implementierung und Test des Twin-Birth-Verfahrens für Neural Gas Principal Component Analysis; 2008.
- 29. Fränti P, Sieranoja S. K-means properties on six clustering benchmark datasets. Appl Intell. 2018;48(12):4743–59.
- 30. Dinno A. Nonparametric pairwise multiple comparisons in independent groups using Dunn’s test. Stata J. 2015;15(1):292–300.
- 31. Bortz J, Schuster C. Statistik für Human- und Sozialwissenschaftler. Berlin Heidelberg: Springer; 2010.
- 32. Cao B, Zhao S, Li X, Wang B. K-means multi-verse optimizer (KMVO) algorithm to construct DNA storage codes. IEEE Access. 2020;8:29547–56.
- 33. Rodrigues PP, Gama J, Pedroso JP. Hierarchical clustering of time-series data streams. IEEE Trans Knowl Data Eng. 2008;20(5):615–27.
- 34. Puschmann D, Barnaghi P, Tafazolli R. Adaptive clustering for dynamic IoT data streams. IEEE Internet Things J. 2017;4(1):64–74.
- 35. Zhang X, Furtlehner C, Germain-Renaud C, Sebag M. Data stream clustering with affinity propagation. IEEE Trans Knowl Data Eng. 2014;26(7):1644–56.
- 36. Andrade S, Hruschka E, Gama J. An evolutionary algorithm for clustering data streams with a variable number of clusters. J Netw Comput Appl. 2017;67(C):228–38.
- 37. Ester M, Kriegel HP, Sander J, Wimmer M, Xu X. Incremental clustering for mining in a data warehousing environment. In: Proceedings of the 24th international conference on very large data bases; 1998. p. 323–33.
- 38. Duan L, Xiong D, Lee J, Guo F. A local density based spatial clustering algorithm with noise. In: 2006 IEEE international conference on systems, man and cybernetics; 2006. p. 4061–6. https://doi.org/10.1109/icsmc.2006.384769
- 39. Lu Y, Sun Y, Xu G, Liu G. A grid-based clustering algorithm for high-dimensional data streams. In: Li X, Wang S, Dong Z, editors. Advanced data mining and applications. Springer; 2005. p. 824–31.
- 40. Sun Y, Lu Y. A grid-based subspace clustering algorithm for high-dimensional data streams. In: Web information systems – WISE 2006 workshops; 2006. p. 37–48.
- 41. Zhou A, Cao F, Yan Y, Sha C, He X. Distributed data stream clustering: A fast EM-based approach. In: 2007 IEEE 23rd international conference on data engineering; 2007. p. 736–45. https://doi.org/10.1109/icde.2007.367919
- 42. Shetty N, Shirwaikar R. A comparative study: BIRCH and CLIQUE. Int J Eng Res Technol. 2013;11(2).
- 43. Guha S, Rastogi R, Shim K. CURE: An efficient clustering algorithm for large databases. Inform Syst. 2001;26(1):35–58.
- 44. Karypis G, Han EH, Kumar V. Chameleon: Hierarchical clustering using dynamic modeling. Computer. 1999;32(8):68–75.
- 45. Udommanetanakit K, Rakthanmanon T, Waiyamai K. E-Stream: Evolution-based technique for stream clustering. Adv Data Min Applic. 2007;4632:605–15.
- 46. Kumar P. Data stream clustering in internet of things. SSRG Int J Comput Sci Eng. 2016;3(8):1–14.
- 47. O’Callaghan L, Meyerson A, Motwani R, Mishra N, Guha S. Streaming-data algorithms for high-quality clustering. In: Proceedings of the 18th international conference on data engineering; 2002. p. 685–94.
- 48. Ordonez C. Clustering binary data streams with K-means. In: Proceedings of the 8th ACM SIGMOD workshop on research issues in data mining and knowledge discovery; 2003. p. 12–9. https://doi.org/10.1145/882082.882087
- 49. Kaufman L, Rousseeuw P. Clustering large applications (Program CLARA). Wiley Series in Probability and Statistics; 1990.
- 50. Aggarwal C, Han J, Wang J, Yu P. A framework for clustering evolving data streams. In: Proceedings of the 29th international conference on very large data bases; 2003. p. 81–92.
- 51. Aggarwal CC, Han J, Wang J, Yu PS. A framework for projected clustering of high dimensional data streams. In: Proceedings 2004 VLDB conference. Elsevier; 2004. p. 852–63. https://doi.org/10.1016/b978-012088469-8.50075-9
- 52. Zhou A, Cao F, Qian W, Jin C. Tracking clusters in evolving data streams over sliding windows. Knowl Inf Syst. 2007;15(2):181–214.
- 53. Cao F, Ester M, Qian W, Zhou A. Density-based clustering over an evolving data stream with noise. In: Proceedings of the 2006 SIAM international conference on data mining; 2006. https://doi.org/10.1137/1.9781611972764.29
- 54. Liu L, Huang H, Guo Y, Chen F. rDenStream, a clustering algorithm over an evolving data stream. In: 2009 international conference on information engineering and computer science; 2009. p. 1–4. https://doi.org/10.1109/iciecs.2009.5363379
- 55. Isaksson C, Dunham M, Hahsler M. SOStream: Self-organizing density-based clustering over data stream. In: Machine learning and data mining in pattern recognition; 2012. p. 264–78.
- 56. Amini A, Saboohi H, Herawan T, Wah TY. MuDi-Stream: A multi density clustering algorithm for evolving data stream. J Netw Comput Appl. 2016;59:370–85.
- 57. Hyde R, Angelov P, MacKenzie AR. Fully online clustering of evolving data streams into arbitrarily shaped clusters. Inform Sci. 2017;382–383:96–114.
- 58. Hassani M, Spaus P, Cuzzocrea A, Seidl T. I-HASTREAM: Density-based hierarchical clustering of big data streams and its application to big graph analytics tools. In: 2016 16th IEEE/ACM international symposium on cluster, cloud and grid computing (CCGrid); 2016. p. 656–65. https://doi.org/10.1109/ccgrid.2016.102
- 59. Wang H, Yu Y, Wang Q, Wan Y. A density-based clustering structure mining algorithm for data streams. In: Proceedings of the 1st international workshop on big data, streams, and heterogeneous source mining: Algorithms, systems, programming models and applications; 2012. p. 69–76.
- 60. Tasoulis D, Ross G, Adams N. Advances in intelligent data analysis VII. Berlin Heidelberg: Springer; 2007. p. 81–92.
- 61. Namadchian A, Esfandani G. DSCLU: A new data stream clustering algorithm for multi density environments. In: 2012 13th ACIS international conference on software engineering, artificial intelligence, networking, and parallel/distributed computing; 2012. p. 83–8.
- 62. Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD Rec. 1998;27(2):94–105.
- 63. Sheikholeslami G, Chatterjee S, Zhang A. WaveCluster: A wavelet-based clustering approach for spatial data in very large databases. VLDB J. 2000;8:289–304.
- 64. Wang W, Yang J, Muntz R. STING: A statistical information grid approach to spatial data mining. In: Proceedings of the 23rd international conference on very large data bases. VLDB; 1997. p. 186–95.
- 65. Amini A, Wah T. DENGRIS-Stream: A density-grid based clustering algorithm for evolving data streams over sliding window. In: International conference on data mining and computer engineering (ICDMCE); 2012. p. 206–10.
- 66. Gama J, Rodrigues PP, Lopes L. Clustering distributed sensor data streams using local processing and reduced communication. Intell Data Anal. 2011;15(1):3–28.
- 67. Dang X, Lee V, Ng W, Ong K. Incremental and adaptive clustering stream data over sliding window. In: Bhowmick S, Küng J, Wagner R, editors. Database and expert systems applications. Springer; 2009. p. 660–74.
- 68. Fisher D. Iterative optimization and simplification of hierarchical clusterings. JAIR. 1996;4:147–78.
- 69. Fränti P, Rezaei M, Zhao Q. Centroid index: Cluster level similarity measure. Pattern Recogn. 2014;47(9):3034–45.
- 70. Strehl A, Ghosh J. Cluster ensembles—A knowledge reuse framework for combining multiple partitions. J Mach Learn Res. 2002;3:583–617.