A black-winged kite improved fuzzy clustering handling imbalanced uncertain data

  • Hung Tran-Nam ,

    Contributed equally to this work with: Hung Tran-Nam, Ha Che-Ngoc

    Roles Formal analysis, Funding acquisition, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Laboratory for Applied and Industrial Mathematics, Institute for Computational Science and Artificial Intelligence, Van Lang University, Ho Chi Minh City, Vietnam, Faculty of Fundamental Sciences, Van Lang University, Ho Chi Minh City, Vietnam

  • Ha Che-Ngoc

    Contributed equally to this work with: Hung Tran-Nam, Ha Che-Ngoc

    Roles Conceptualization, Methodology, Supervision, Validation

    chengocha@tdtu.edu.vn

    Affiliation Applied Analysis Research Group, Faculty of Mathematics and Statistics, Ton Duc Thang University, Ho Chi Minh City, Vietnam

Abstract

Clustering uncertain data is a fundamental problem in data mining. Imbalance among uncertain objects significantly degrades clustering performance, as minority clusters are repeatedly overshadowed by dominant ones. Consequently, existing clustering techniques often fail due to initialisation biases and inadequate similarity modelling. This paper proposes a novel algorithm, the Black-winged Kite Improved Fuzzy clustering for probability density Functions (BKIFF), which combines an optimisation-based initialisation strategy with an enhanced fuzzy clustering framework. Specifically, BKIFF incorporates the Hellinger distance into the clustering objective to more reliably capture similarities between probability density functions (pdfs), and introduces improved membership updating and prototype estimation mechanisms tailored for uncertain and imbalanced data formulated as Improved Fuzzy clustering for probability density Functions (IFF) while theoretical convergence is established. In addition, the algorithm employs Black-winged Kite Optimisation (BKO) to enhance prototype selection, improving clustering stability and convergence. As a result, comprehensive experiments with synthetic Gaussian probability distributions, skewed pdfs, and real-world image datasets demonstrate that BKIFF consistently outperforms baseline methods such as FCF, FCF-, KMEANS, and Self-Updating. Across all three examples, BKIFF achieves near-perfect ARI, improving from near-zero values in highly imbalanced cases {20,50,80,100} by approximately 30–35% in moderate settings, while increasing NMI by about 25–95%. Additionally, it reduces computational time by approximately 95–99% compared to baseline methods. In conclusion, BKIFF demonstrates superior performance and opens up new possibilities for applications in medical diagnostics, ecological analysis, and high-dimensional uncertain data mining, particularly in imbalanced environments.

1 Introduction

1.1 Clustering for uncertain data

Clustering probability density functions (pdfs) has recently emerged as an important direction for analysing uncertain data [1,2]. Nevertheless, its underdevelopment and complexity pose significant challenges as the uncertain data becomes imbalanced, raising big questions around whether clustering techniques could tolerate this type of data well. To understand these challenges, it is vital to first understand the characteristics and sources of uncertainty in modern data.

To begin with, data inherently contain uncertainty, a characteristic that is often overlooked but prevalent in modern data mining applications. Rather than being associated with a single value, each object is associated with an assumed level of uncertainty. In general, data uncertainty can be considered at the table, tuple, or attribute level, and is usually specified by fuzzy models, evidence-oriented models, or probabilistic models. Such data arise in massive amounts in sensor networks [3], noisy damage diagnostics experiments [4–6], population age distributions [7], and species distributions [8]. Uncertain data objects carry extra information due to repeated measurements and potential overlap with other densities. They can take forms that describe the probability of appearing at any position in a multidimensional space [6,9,10]; Fig 1(a) and 1(b) show uncertain objects in one and two dimensions, respectively. By contrast, although discrete vectors have previously yielded valuable insights, they fail to capture an object’s distribution accurately, misrepresent ambivalent information, and overlook the volatility and variation present in the data [2]. However, when uncertainty coincides with severe imbalance, the clustering task becomes inherently unstable. Large clusters dominate distance evaluations, prototypes drift toward majority densities, and minority structures are often absorbed or lost. As a result, clustering pdfs under imbalanced conditions remains an unresolved and critical practical problem.

Fig 1. Two uncertain objects represented as (a) one-dimensional and (b) two-dimensional Gaussian density functions.

https://doi.org/10.1371/journal.pone.0349753.g001

From a methodological standpoint, clustering is a branch of unsupervised learning that reveals the hidden structure of unlabelled data by partitioning it into non-empty subsets (a.k.a. clusters) that share typical characteristics [11–13]. Clustering works by minimising inter-partition similarity while maximising intra-partition similarity; that is, it aims to form compact, distinctly separated clusters. Therefore, clustering uncertain objects based on their pdfs is prevalent in numerous scenarios.

1.2 Related works

Over the past two decades, sporadic research has focused on developing clustering algorithms for uncertain data. These studies have built upon existing algorithms, incorporating suitable enhancements during this period. Noteworthy contributions include adaptations of partitional methods.

One of the earliest attempts to solve the problem of uncertain data clustering is Uncertain k-means by Chau Michael et al. [14–16]. This method incorporates data uncertainty directly into the clustering process through the concept of expected distance, computed as the average of all possible distances weighted by an assumed (typically uniform) pdf. Empirical evidence shows that it yields significantly better clusters than ordinary k-means, at only a modest extra CPU cost. However, this approach is computationally expensive because it requires pairwise distance calculations across all possible realizations of the uncertain objects. To tackle this limitation, Ngai et al. [17] introduced pruning techniques to eliminate redundant computations and improve efficiency. Building upon this line of research, Gullo et al. [10,18] proposed the Uncertain k-medoids (UK-medoids) algorithm, which defines uncertain distance measures for both univariate and multivariate uncertain objects and achieves high clustering accuracy under uniform and binomial pdfs. UK-medoids has been experimentally shown to outperform other existing methods in terms of accuracy, regardless of the choice of uncertainty pdfs. Nevertheless, this method still suffers from high computational cost, as it requires evaluating uncertain distances between every pair of objects. To better handle cases where uncertain objects are not linearly separable in the input space, Yang et al. [19] developed the Kernel Uncertain k-medoids approach, which represents expected distances through kernel functions under uniform and Gaussian pdfs and shows superior performance on several UCI datasets.

In parallel, density-based clustering algorithms have been developed, including Fuzzy Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Fuzzy Ordering Points to Identify the Clustering Structure (OPTICS) [20]. These approaches employ a fuzzy reachability-distance to estimate core-object membership and reachability probabilities. However, Zhang Xian-Chao et al. [21] showed that several issues are not addressed well in this DBSCAN variant: it provides no exact function for the required calculation and relies on sampling, which loses information; the computation is too time-consuming; and the approach is non-adaptive. To address these issues, the probabilistic density-based uncertain data clustering algorithm (PDBSCAN) introduces definitions of core object probability and direct reachability probability, thereby reducing the complexity and avoiding sampling [21]. Moreover, numerous studies employ hierarchical approaches [1,16], such as U-AHC [22], which uses the Bhattacharyya coefficient as a distance measure between prototypes for Uniform, Normal, and Gamma pdfs. Additionally, Thao Nguyen-Trang et al. [1] proposed Fuzzy c-means for density functions, introducing both hierarchical and non-hierarchical approaches for applications in image analysis and data mining. This method is well known for treating a continuous Gaussian density function as a clustering object.

Beyond these representative methods, more recent studies have focused on enhancing clustering performance through evolutionary and adaptive mechanisms, as summarised in Table 1. For example, Dinh Pham-Toan [23] and Tai Vo-Van et al. [24] integrated genetic algorithms and differential evolution into fuzzy clustering frameworks to improve optimisation capability and the compactness of the established clusters, as measured by the Calinski–Harabasz Index. These methods also determine a suitable number of clusters and the fuzzy membership of each object in the clusters. Similarly, Thao Nguyen-Trang et al. [2,25] investigated meta-heuristic optimisation strategies to stabilise clustering performance and enforce balanced cluster sizes, including a globally automatic DE-based method that uses Gaussian prototypes for compactness. In addition, the possibilistic model proposed by Hung Tran-Nam et al. [26] and the self-updating fuzzy scheme introduced by Dinh Pham-Toan [27,28] emphasise the importance of adaptive mechanisms, particularly when pdfs exhibit complex and dynamic variability.

Table 1. Innovation of clustering probability density functions.

https://doi.org/10.1371/journal.pone.0349753.t001

1.3 Research gap

Despite these research endeavours, some challenges remain. Even though optimisation methods might be sufficient for many practical applications, they rarely address imbalance as a specific case [2]. In addition, non-optimisation methods are sensitive to the choice of initial centres. Moreover, many of these studies do not address imbalance among clusters, a condition in which some clusters contain far more elements than others. Such imbalance occurs naturally in the real world, for example in uneven species densities across survey areas and in low-probability diseases. Cluster analysis of density functions must therefore offer deeper insight to recognise these imbalanced structures. Traditional methods tend to converge on the centres of the larger clusters for two reasons: the way the initial prototype density functions are chosen, and an objective function (OF) designed to make the resulting clusters equal in size. As the Imbalanced Ratio (IR) increases, the probability of initialising two density functions from two separate clusters, and hence of exploring the smaller cluster, becomes lower. This may lead to faults being misclassified, eventually resulting in a breakdown. Prior work on certain (non-uncertain) data mining, such as [31–33], has noted this tendency of prototypes to shift. Xiong Hui et al. [34] analysed the reasons for this phenomenon using the OF and showed that k-means tends to split the samples evenly, even when the true cluster sizes differ greatly. Liang Jiye et al. [35] explained the prototype-shifting problem for fuzzy c-means on imbalanced data. Pu Yue et al. [31] obtained several clustering improvements based on the siphon effect and proved the convergence of their method [2,25]. Although these works collectively show a clear research effort toward improving robustness, no research has considered a clustering approach for imbalanced uncertain data. Without mechanisms to handle imbalance, clustering results become unreliable and biased toward dominant groups. A dedicated approach is therefore required to recover meaningful cluster structures in such situations.

1.4 Contributions

Fortunately, these limitations can be addressed through adaptive initialisation methods [36] and clustering designed explicitly for this situation. Given the aforementioned gaps, this study introduces a novel approach called Black-winged Kite optimisation Improved Fuzzy for probability density Functions (BKIFF). The study’s innovations and contributions are fourfold.

  • First, we establish a two-phase cluster analysis algorithm called Black-winged Kite optimisation Improved Fuzzy for probability density Functions (BKIFF). Existing works aim to automate the optimisation of the OF entirely and return the labels with the optimal OF value. We target a slightly different problem: we automate only the initialisation step, instead of choosing the initial prototypes at random, and then combine it with an improved fuzzy algorithm that continues to move the initial prototypes toward the prototype of the cluster containing them. In this way, we can customise both the OF and its starting condition. The initialisation phase can also be used independently of the core clustering algorithm.
  • Second, we use the Hellinger metric as the similarity measure to capture distribution differences between objects. Its sensitivity to the overlap between two density functions makes it effective both in theory and in practice.
  • Third, we provide a theoretical foundation for the development of Improved Fuzzy for probability density Functions (so-called IFF) by combining Lemmas 1 and 2 with the Zangwill theorem. This clarifies the algorithm’s convergence and supports the analysis of its effectiveness while addressing the associated computational challenges.
  • Fourth, we conduct extensive experiments on synthetic and image datasets and compare our results with those of baseline algorithms. Specifically, we examine the effect of the BKIFF algorithm by analysing the relationship between the imbalance rate and membership values, and we explore the factors contributing to a polarisation phenomenon as IR rises: objects in larger classes tend to have significantly higher membership values, which can cause clustering algorithms to converge incorrectly and produce errors.

These contributions are contextualised within the comparative analysis presented in Table 1, thereby demonstrating their timeliness. The reestablishment of theory is a critical and novel step that highlights differences in initialisation, distance metrics, uncertainty handling, and automation across existing methods and our proposal. Last but not least, applications are novel to the literature on uncertain/imbalanced clustering, offering practical pathways in the ecology and remote sensing domains.

1.5 Organisation

The rest of this paper is organised as follows. The Preliminaries section introduces the type of uncertain data and the related concepts of fuzzy clustering. The Methodologies and materials section presents the imbalanced clustering method and its convergence. The Numerical examples section describes the experiments, conducted on three datasets and an image application. The Main Results section demonstrates the effectiveness of the proposed method, the Discussion section presents its advantages and limitations, and the Conclusion section gives a summary.

2 Preliminaries

First, we summarise the notations commonly utilised in Table 2.

2.1 Uncertain objects

Definition 1 (multivariate uncertain object). A multivariate uncertain object o is defined as a pair (R, f), where R is a d-dimensional region in which o is defined, and f is the joint pdf representing the uncertainty of o over R [22].

Definition 2 (univariate uncertain object). A univariate uncertain object o is defined as a pair (R, f), where R is a one-dimensional region in which o is defined, and f is the pdf of o [22].

This paper uses continuous pdfs of this kind to represent uncertain data. We work in the space of Borel probability measures on a common measurable space and assume that the dataset is a family of n probability distributions that are absolutely continuous with respect to the Lebesgue measure, each represented by its pdf. Fig 1 shows uncertain objects in the univariate and multivariate cases under Gaussian density functions.

2.2 Similarities of density function

Many papers have investigated the format of measures of density functions as vague and uncertain data. However, the conversion is problematic, as only [16] and the Hellinger metric affect the convergence task. Although satisfying all three distance properties, is usually computed by numerical methods, and the complexity can quickly go beyond control with the increased dimension. We propose the Hellinger metric as a distance measure to overcome the unwarranted limitations amid that chaos.

Definition 3 (Hellinger distance). Let F1 and F2 be two probability distributions on a measurable space X, and let f1 and f2 denote their Radon–Nikodym derivatives with respect to a common dominating measure on the same σ-field. The Hellinger distance is defined as

(1)

When two pdfs have high overlap, the Hellinger distance declines and vice versa. -distance has domain in [0,2] [37].
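
As a rough illustration of how such a distance can be evaluated in practice, the sketch below approximates the squared Hellinger distance between two univariate pdfs on a finite grid and compares it with the known closed form for two Gaussians. It assumes the common normalisation H² = ½ ∫ (√f1 − √f2)² dx, which lies in [0, 1]; under the alternative convention without the factor ½ the squared distance lies in [0, 2]. The function names and the midpoint rule are illustrative choices, not taken from the paper.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def squared_hellinger(f1, f2, lo, hi, h=0.01):
    """Approximate 0.5 * integral of (sqrt(f1) - sqrt(f2))^2 over [lo, hi].

    Uses a midpoint rule with grid step h (assumed convention: factor 0.5).
    """
    steps = int(round((hi - lo) / h))
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        diff = math.sqrt(f1(x)) - math.sqrt(f2(x))
        total += diff * diff * h
    return 0.5 * total


def squared_hellinger_gaussian(mu1, s1, mu2, s2):
    """Closed form for two univariate Gaussians under the same convention."""
    var_sum = s1 * s1 + s2 * s2
    bc = math.sqrt(2.0 * s1 * s2 / var_sum) * math.exp(
        -((mu1 - mu2) ** 2) / (4.0 * var_sum)
    )
    return 1.0 - bc  # one minus the Bhattacharyya coefficient


if __name__ == "__main__":
    f = lambda x: gaussian_pdf(x, 0.0, 1.0)
    g = lambda x: gaussian_pdf(x, 1.0, 1.5)
    print(squared_hellinger(f, g, -15.0, 15.0))
    print(squared_hellinger_gaussian(0.0, 1.0, 1.0, 1.5))
```

The two printed values should agree closely, which gives a quick sanity check for any grid-based implementation before it is applied to non-Gaussian pdfs.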

3 Methodologies and materials

3.1 Methodologies

3.1.1 The fuzzy partition space.

Let the dataset be a family of n-unlabelled pdfs (n ≥ 2) with distribution F and be a sequence of c-representative density functions (2 ≤ c ≤ n) at the t iteration. With c-cluster given such that each subgroup represents natural substructure in ,

(C1)(C2)(C3)

Condition (C1) defines the membership degree of the j-th density function in a cluster relative to all other clusters. Condition (C2) requires that no cluster is empty. Condition (C3) states that the partition contains no fewer than c clusters.

3.1.2 Improved fuzzy clustering for density function.

The OF of the IFF clustering is designed to balance two competing goals, including minimising the Hellinger distance between each density function and its associated cluster prototype, and preventing large clusters from dominating the optimisation process via the cardinal weight factors. The OF of the IFF is defined as follows

(2)

where the distance is the square of the Hellinger distance between the j-th datum pdf f and the i-th prototype . The weights (so-called cluster size) are non-negative, computed by fuzzy intra-cluster distance.

(3)

where n_i is the number of points in the i-th cluster after defuzzification. The OF is thus jointly determined by the distance to each cluster prototype and the cluster’s size.

Without such weighting, large clusters would disproportionately affect the objective function and suppress smaller ones. The larger the weight, the smaller the size of the i-th cluster. Dividing by the weight gives every cluster a balanced influence in the computation and adjusts each cluster’s contribution to the OF, preventing clusters with more elements from dominating the optimisation process. Fig 11 illustrates the difference between two clusters of different sizes. In ideal balanced cases, equal cluster weights lead to symmetric membership curves. As the imbalance becomes more severe, IFF adaptively reduces the influence of the dominant cluster (e.g., for IR = 10), requiring objects to be much closer to the larger cluster centre to obtain high membership. In other words, only objects extremely close to a large cluster have a chance of being assigned to it.
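
The effect of dividing by a cluster weight can be illustrated with a small sketch. It does not reproduce Eqs. (2) and (3); instead it assumes a hypothetical size-weighted objective of the form Σ_i Σ_j u_ij^m d_ij / ρ_i, whose minimiser under the constraint Σ_i u_ij = 1 is u_ij ∝ (ρ_i / d_ij)^{1/(m−1)}. The weights `rho` are passed in as plain numbers, and the function name is illustrative.

```python
def weighted_memberships(sq_dists, rho, m=2.0):
    """Memberships of one object, given its squared distances to each prototype.

    Hypothetical objective (an assumption, not the paper's Eq. (2)):
        sum_i sum_j u_ij^m * d_ij / rho_i,  with  sum_i u_ij = 1.
    Its minimiser is u_i proportional to (rho_i / d_i)^(1 / (m - 1)).
    """
    # An object that coincides with a prototype belongs fully to that cluster.
    for i, d in enumerate(sq_dists):
        if d == 0.0:
            return [1.0 if k == i else 0.0 for k in range(len(sq_dists))]
    power = 1.0 / (m - 1.0)
    scores = [(r / d) ** power for r, d in zip(rho, sq_dists)]
    total = sum(scores)
    return [s / total for s in scores]


if __name__ == "__main__":
    # Same distance to both prototypes; cluster 0 is the large one (small weight).
    print(weighted_memberships([1.0, 1.0], rho=[1.0, 1.0]))
    print(weighted_memberships([1.0, 1.0], rho=[1.0, 10.0]))
    # A small fuzzifier sharpens the assignment, a large one softens it.
    print(weighted_memberships([1.0, 2.0], rho=[1.0, 1.0], m=1.1))
    print(weighted_memberships([1.0, 2.0], rho=[1.0, 1.0], m=10.0))
```

Under this assumed form, an object equidistant from both prototypes is pulled toward the cluster with the larger weight, which matches the qualitative behaviour described above: the dominant cluster only keeps objects that are very close to its centre.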

3.1.3 Initialisation of the prototypes.

According to the findings of Celebi et al. [36], random methods are straightforward to understand and implement; however, they tend to be ineffective and unreliable. Additionally, although these methods have low overhead, they do not provide considerable time savings, as they frequently result in slower convergence.

Therefore, this paper adopts the Black-winged Kite (Elanus caeruleus) optimisation algorithm published by Wang Jun et al. [40], which draws on the bird’s distinctive biological traits of adaptability, intelligent behaviour, and predation [38,39]. The initialisation step aims to identify density functions that are maximally separated in the Hellinger metric, encouraging the algorithm to start from representatives that capture the data’s global structure rather than relying on random or redundant choices. We propose the following OF to locate the positions i in the data whose distances to the others are optimal.

(OF)

where, is the squared Hellinger distance by Eq. (1) between any two density functions with two indices .

The concept is new but straightforward: we aim to identify c candidate prototypes with the maximum pairwise distance. The search algorithm identifies objects in the data that are ideal candidates for selection as initial prototypes. This approach allows us to forgo calculating the full pairwise distance matrix and instead rely on the optimisation algorithm to search. The proposed method is similar in spirit to k-means++ [41], but the selection is optimised rather than sampled. Algorithm 1 implements the BKO process.
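
To make the selection criterion concrete, the following sketch scores a candidate set of c indices by the sum of its pairwise squared Hellinger distances and, for a tiny toy dataset, finds the best set by exhaustive search. Both choices are assumptions for illustration: the exact form of the OF may differ, and the exhaustive search merely stands in for BKO, which is what makes the approach practical for larger n. The toy data are unit-variance Gaussians, for which the squared Hellinger distance has the closed form 1 − exp(−Δ²/8) under the ½-normalised convention.

```python
import math
from itertools import combinations


def separation_fitness(indices, sq_dist):
    """Sum of pairwise squared distances among the selected objects.

    sq_dist(a, b) is any callable returning a squared distance between
    objects a and b (assumed fitness, to be maximised).
    """
    return sum(sq_dist(a, b) for a, b in combinations(indices, 2))


def best_initial_prototypes(n, c, sq_dist):
    """Exhaustive search over all c-subsets of n objects (small n only)."""
    return max(
        combinations(range(n), c),
        key=lambda idx: separation_fitness(idx, sq_dist),
    )


# Toy dataset: unit-variance Gaussian pdfs identified by their means.
means = [0.0, 0.1, 0.2, 5.0]


def sq_hellinger_unit_gauss(a, b):
    """Squared Hellinger distance between N(means[a], 1) and N(means[b], 1)."""
    return 1.0 - math.exp(-((means[a] - means[b]) ** 2) / 8.0)


if __name__ == "__main__":
    print(best_initial_prototypes(len(means), 2, sq_hellinger_unit_gauss))
```

With three objects crowded near 0 and a single object near 5, random initialisation would often pick two prototypes from the crowded group, whereas the separation criterion selects one object from each group.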

The application of BKO is still in the initial exploration stage [42]. This study applies BKO to fully utilise its advantages in exploration and exploitation to identify optimal design parameters by effectively exploring the design space and determining the best solution. This innovative application expands BKO’s scope and provides a new optimisation method for the initial pdf clustering.

Algorithm 1 Black-winged kite Optimisation for Initialising Prototypes

Require: The potential solutions pop, maximum number of iterations T, and variable dimension d.

Ensure: the best quasi-optimal prototypes

1:  Initialisation of the position of Black-winged kites and evaluation of the optimal OF

2:  Calculate the fitness value of each Black-winged kite

3:  while t < T do

4:   if p < r then    ▷ Attacking behaviour

5:   

6:   else

7:   

8:   end if

9:   if Fi < Fri then    ▷ Migration behaviour

10:  

11:  else

12:  

13:  end if

14:  if then    ▷ Select the best individual

15:  

16:  else

17:  

18:  end if

19: end while

20: return and Fbest

At the end of the algorithm, we obtain the initialised prototypes inferred from the immanent data. The best individuals are selected as the initial prototypes serving as the encoded input for IFF.

Remark 1 (Initialisation phase). In the Initialisation step, the matrix BKpop,dim represents the location of every Black-winged kite. Then, we distribute the position of each Black-winged kite as uniform, i.e., . BKlb and BKub are the lower and upper bounds of i-th Black-winged kites in the j-th dimension.

Remark 2 (Attacking behaviour). In step 1, and represent the position of the i-th Black-winged kites in the j-th dimension in the iteration t and (t + 1)-th, respectively; r is a random number that ranges from 0 to 1, and p is a constant value of 0.9 for achieving better results [40]; T is the entire number of iterations, and t is the number of iterations that have been completed so far; and .

Remark 3 (Migration behaviour). In step 2, denotes the leading scorer of the Black-winged kites in the j-th dimension of the t-th iteration; define the current position and the fitness value of the random position obtained by any Black-winged kite in the t-th; C(0,1) represents the Cauchy mutation [43]; and .
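
For orientation, the two position updates described in Remarks 2 and 3 can be sketched as below. The formulas follow the form in which the Black-winged Kite algorithm of [40] is commonly stated; they are an assumption here and should be checked against that reference. In particular, drawing one random number r per individual (rather than per dimension) and the helper names are illustrative choices.

```python
import math
import random


def attack_step(y, t, T, p=0.9, rng=random):
    """Attacking behaviour for one position vector y (list of floats).

    Assumed form: n = 0.05 * exp(-2 * (t / T)^2), and
    y_new = y + n * (1 + sin(r)) * y   if p < r,
    y_new = y + n * (2r - 1) * y       otherwise.
    """
    n = 0.05 * math.exp(-2.0 * (t / T) ** 2)
    r = rng.random()
    if p < r:
        return [v + n * (1.0 + math.sin(r)) * v for v in y]
    return [v + n * (2.0 * r - 1.0) * v for v in y]


def cauchy(rng=random):
    """Standard Cauchy variate C(0, 1) via the inverse-CDF method."""
    return math.tan(math.pi * (rng.random() - 0.5))


def migration_step(y, leader, fit_i, fit_rand, rng=random):
    """Migration behaviour relative to the current leader position.

    Assumed form: m = 2 * sin(r + pi / 2), and
    y_new = y + C(0,1) * (y - leader)       if fit_i < fit_rand,
    y_new = y + C(0,1) * (leader - m * y)   otherwise.
    """
    r = rng.random()
    m = 2.0 * math.sin(r + math.pi / 2.0)
    if fit_i < fit_rand:
        return [v + cauchy(rng) * (v - l) for v, l in zip(y, leader)]
    return [v + cauchy(rng) * (l - m * v) for v, l in zip(y, leader)]


if __name__ == "__main__":
    rng = random.Random(0)
    position = [1.0, -2.0, 3.0]
    position = attack_step(position, t=0, T=100, rng=rng)
    position = migration_step(position, [0.5, 0.5, 0.5], 0.3, 0.7, rng=rng)
    print(position)
```

In a full implementation these steps would be followed by clipping to the bounds BKlb and BKub and by the greedy selection of Algorithm 1; both are omitted from this sketch.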

3.2 Assumptions

The assumptions underlying the input data and the proposed algorithm are as follows:

  • The input pdfs do not contain excessive noise and exhibit overlapping properties, ensuring the feasibility of the clustering process.
  • The input uncertainty must be represented as continuous density functions within a common measurable space; densities are evaluated on finite grids with step h = 0.01, so that the integrals in the Hellinger distance can be computed in a controlled way. Empirical datasets are transformed into pdfs using kernel density estimation.
  • Clusters may be imbalanced, but the IR is finite and no cluster is empty or vanishing.
  • The number of clusters c must still be specified in advance.

3.3 The stage of the proposed algorithm

Algorithm 2 below outlines the process of the proposed two-stage method.

Algorithm 2 The Black-winged Kite Improved Fuzzy clustering for pdfs algorithm (BKIFF)

Require: The converted density functions data , c number of cluster.

Ensure: The membership degree U and prototypes .

 Initialisation of the c-prototypes using the Alg. (1).     ▷ Initialisation

2: while t < MaxIter and do

  Update the partition matrix U(t+1).     ▷ Fix the membership degree For

  • In case of , then(4)

  • In case of , then ,

  • In case of , then .

4: Update the set c-prototypes     ▷ Fix the prototype (5)

  Update the cluster size     ▷ Fix the cluster size (6)

6: end while

At the end of Algorithm 2, each uncertain object receives the identifier of the cluster to which it belongs. In Step 2, we obtain the fuzzy-partition matrix U resulting from the last iteration and the prototype of each i-th cluster.

Remark 4. An adequate stopping criterion for the algorithm is that the maximum absolute difference between elements of the partition matrix in two consecutive iterations be lower than a given positive threshold . A standard setting adopted in this paper is . The maximum number of iterations, MaxIter, is set to 200. The defuzzification of this BKIFF is a final crisp assignment from membership for label reporting, i.e.,
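
The stopping rule and the final crisp assignment in Remark 4 can be sketched as follows. The layout of U as a list of c rows with n entries each, the function names, and the threshold value used in the example are illustrative assumptions; ties in the defuzzification are broken in favour of the lowest cluster index.

```python
def has_converged(u_new, u_old, eps):
    """True if the largest absolute change in any membership is below eps."""
    largest = max(
        abs(a - b)
        for row_new, row_old in zip(u_new, u_old)
        for a, b in zip(row_new, row_old)
    )
    return largest < eps


def defuzzify(u):
    """Assign each object j to the cluster i with the highest membership u[i][j]."""
    n_objects = len(u[0])
    return [
        max(range(len(u)), key=lambda i: u[i][j])
        for j in range(n_objects)
    ]


if __name__ == "__main__":
    u_prev = [[0.90, 0.20, 0.50], [0.10, 0.80, 0.50]]
    u_next = [[0.91, 0.19, 0.50], [0.09, 0.81, 0.50]]
    print(has_converged(u_next, u_prev, eps=1e-5))  # hypothetical threshold
    print(defuzzify(u_next))
```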

Remark 5. The parameter m, the so-called fuzzifier, controls the extent of membership sharing between fuzzy clusters. As m approaches 1.0 from above, the partition matrix tends to be crisp (memberships close to 0 or 1). Conversely, the larger m is, the fuzzier the resulting partition. Usually, m = 2.0 is chosen, following [1,24,44]. In this paper, the values m ∈ {1.1, 1.5, 2.0, 2.5, 3.0, 5.0, 10} are investigated in all the experiments presented.

Fig 2 illustrates the influence of the fuzzifier parameter m and the cluster IR on the membership distribution in IFF using normalised -distance. The three subplots correspond to increasing IR values of 1, 5, and 10. Varying m from 1.1 to 10 reveals its role in controlling fuzziness, indicating lower values sharpen boundaries, whereas higher values yield softer transitions.

Fig 2. The variation of fuzzifiers with different simulations of .

The lighter the blue, the greater the representation, the larger the value of m.

https://doi.org/10.1371/journal.pone.0349753.g002

In particular, the cluster weight plays a key role in balancing the influence of the clusters. As IR increases, it automatically reduces the influence of the large cluster, so that only points very close to the centre of this cluster can obtain high membership in it. This phenomenon causes the membership curves to shift to the right on the distance axis, i.e., the large cluster is restrained and can no longer absorb all the points far away. As a result, points near the small cluster still maintain significant membership. The flowchart of the proposed BKIFF is shown in Fig 3. Random seeds are fixed in each repetition so that the runs are reproducible.

3.4 Convergence of IFF

For clarity, the convergence analysis is structured as a sequence of lemmas and theorems that verify the conditions of Zangwill’s theorem. The convergence proof shows that the OF J_IFF decreases monotonically with each iteration, and that the sequence of solutions converges to a stationary point. Specifically, Lemma 1 and Lemma 2 formalize the update operators T1 for the membership matrix and T2 for the prototype functions, providing closed-form minimizers for each variable block. Theorem 1 ensures the compactness of the iterates, Theorem 2 establishes that each complete iteration results in a strict decrease of the OF, and Theorem 3 demonstrates the continuity of the updates. Collectively, these results culminate in Theorem 4, which confirms that the sequence converges to a stationary point of J_IFF.

Thus, the IFF method incrementally refines the cluster prototypes and the membership matrix until no further improvement can be achieved. Complete analytic proofs are available in the Supplementary Information A. Fig 4 provides a summary of the dependencies and an overview of the convergence mechanism.

Fig 4. The logical flow of the convergence analysis of the proposed IFF algorithm.

https://doi.org/10.1371/journal.pone.0349753.g004

3.5 Time complexity analysis of the BKIFF

We first discuss the runtime complexity of the BKO. The maximum number of iterations (TBKO), population size, and dimension (d) are strongly connected to the position initialisation, fitness value computation, and position update components that determine the computational complexity of the time component of BKO. Consequently, its temporal computational complexity is as follows. It takes time to initialise the population, it takes to compute the fitness value, and to update the location. As a result, the BKO’s overall time complexity is based on Wang Jun et al. [40].

Next, IFF complexity is investigated. The update of the fuzzy membership matrix requires computing (n × c) Hellinger distances, yielding a complexity of . The prototype update step aggregates across n objects in each of c clusters, each in d-dimensional space, resulting in . The recalculation of cluster sizes also takes . Therefore, the computational cost per iteration is . Over TIFF iterations, the overall time complexity of IFF is .

In conclusion, the total time complexity of the BKIFF algorithm is the combination of both phases and can be expressed as .

3.6 Some validity measure indexes for clustering solution

3.6.1 Adjusted Rand Index.

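The Adjusted Rand Index (ARI), introduced by Hubert Lawrence and Arabie Phipps [45], extends the Rand Index by correcting for chance agreement. It is widely recommended as the index of choice for assessing the agreement between two partitions in clustering analysis, even when the number of clusters differs. The ARI between the resulting partition M and the ground-truth classes C is computed as in Eq. (7).

$$\mathrm{ARI}(M, C) = \frac{\sum_{i,j}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_i}{2} + \sum_{j}\binom{b_j}{2}\right] - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{n}{2}} \tag{7}$$

In this context, $n_{ij}$ is the number of elements shared by cluster i of M and class j of C, so $\binom{n_{ij}}{2}$ represents the number of pairs of elements that are grouped together in both partitions. The expression $\sum_{i,j}\binom{n_{ij}}{2}$ quantifies the total agreement between the clusters in both partitions. Meanwhile, with $a_i$ the size of cluster i in M and $b_j$ the size of class j in C, $\binom{a_i}{2}$ indicates the number of pairs within each cluster of partition M, regardless of how those elements are grouped in partition C. Finally, $\binom{n}{2}$ represents the total number of possible pairs in the dataset, where n is the total number of elements.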
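
A minimal sketch of this computation from two label vectors is given below, using the standard contingency-table form of the ARI. The function name and the convention of returning 1.0 when both partitions are trivial are illustrative choices; the input is assumed to contain at least two objects.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_m, labels_c):
    """ARI between a clustering result and reference classes (n >= 2)."""
    n = len(labels_m)
    cells = Counter(zip(labels_m, labels_c))  # contingency counts n_ij
    rows = Counter(labels_m)                  # cluster sizes a_i
    cols = Counter(labels_c)                  # class sizes b_j

    sum_cells = sum(comb(v, 2) for v in cells.values())
    sum_rows = sum(comb(v, 2) for v in rows.values())
    sum_cols = sum(comb(v, 2) for v in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:
        # Degenerate case, e.g. both partitions consist of a single cluster.
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


if __name__ == "__main__":
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
    print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The first call illustrates that the index is invariant to relabelling of clusters; the second shows that it can be negative when the agreement is worse than expected by chance.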

3.6.2 Normalized mutual information.

Mutual information reveals the reduction in entropy of class labels when the cluster labels are known [46].

(8)

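where $n_{kh}$ is the number of objects in cluster k belonging to class h, and $H(\cdot)$ represents entropy, i.e., $H(M) = -\sum_{k}\frac{n_k}{n}\log\frac{n_k}{n}$, with $n_k$ the number of objects in cluster k.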
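
The following sketch computes a normalised mutual information from two label vectors. Several normalisations are in use; this one divides the mutual information by the geometric mean of the two entropies, which is an assumption and may differ from Eq. (8). The handling of the degenerate case in which one partition has zero entropy is likewise an illustrative convention.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (natural logarithm) of a label vector."""
    n = len(labels)
    return -sum((v / n) * log(v / n) for v in Counter(labels).values())


def normalized_mutual_information(labels_m, labels_c):
    """NMI with geometric-mean normalisation (assumed variant)."""
    n = len(labels_m)
    joint = Counter(zip(labels_m, labels_c))  # counts n_kh
    size_m = Counter(labels_m)                # cluster sizes n_k
    size_c = Counter(labels_c)                # class sizes n_h

    mutual_info = sum(
        (v / n) * log(n * v / (size_m[k] * size_c[h]))
        for (k, h), v in joint.items()
    )
    denom = sqrt(entropy(labels_m) * entropy(labels_c))
    if denom == 0.0:
        # One partition is trivial; agree only if both are.
        return 1.0 if len(size_m) == len(size_c) == 1 else 0.0
    return mutual_info / denom


if __name__ == "__main__":
    print(normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0]))
    print(normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
```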

3.6.3 Dunn index.

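Given a partition $Z = \{Z_1, \dots, Z_c\}$ of the dataset, the Dunn Index is defined as [47]

$$\mathrm{DI}(Z) = \frac{\min_{i \neq j}\, \delta(Z_i, Z_j)}{\max_{k}\, \Delta(Z_k)} \tag{9}$$

where $\delta(Z_i, Z_j)$ denotes the inter-cluster distance between clusters $Z_i$ and $Z_j$, and $\Delta(Z_k)$ denotes the intra-cluster diameter of cluster $Z_k$. Common choices are the single-link distance for $\delta$ and the complete-link diameter for $\Delta$.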
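
With the single-link and complete-link choices mentioned above, the Dunn Index reduces to the smallest between-cluster distance divided by the largest within-cluster distance. The sketch below computes it from a precomputed symmetric distance matrix, which for pdf objects could hold Hellinger distances; the function name and the treatment of the all-singleton case are illustrative assumptions.

```python
def dunn_index(dist, labels):
    """Dunn Index with single-link separation and complete-link diameter.

    dist[a][b] is a symmetric distance matrix and labels[a] is the cluster
    of object a. At least two clusters are assumed.
    """
    n = len(labels)
    min_between = float("inf")
    max_within = 0.0
    for a in range(n):
        for b in range(a + 1, n):
            if labels[a] == labels[b]:
                max_within = max(max_within, dist[a][b])
            else:
                min_between = min(min_between, dist[a][b])
    if max_within == 0.0:
        # Every cluster is a single point (or duplicates): perfectly compact.
        return float("inf")
    return min_between / max_within


if __name__ == "__main__":
    points = [0.0, 1.0, 10.0, 11.0]
    dist = [[abs(p - q) for q in points] for p in points]
    print(dunn_index(dist, [0, 0, 1, 1]))
```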

3.6.4 Silhouette coefficient.

The Silhouette Coefficient captures both intra-cluster compactness and inter-cluster separation. For every pdf object $f_j$ assigned to cluster $Z_i$, the silhouette value $s(f_j)$ is defined as [48]

$$s(f_j) = \frac{b(f_j) - a(f_j)}{\max\{a(f_j),\, b(f_j)\}}$$

where $a(f_j)$ is the average distance from $f_j$ to all other objects within its own cluster, and $b(f_j)$ is the smallest average distance from $f_j$ to the objects of any other cluster. The global Silhouette Coefficient of a partition Z is the mean of all individual silhouette values

$$\mathrm{SC}(Z) = \frac{1}{n}\sum_{j=1}^{n} s(f_j) \tag{10}$$
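
As a small worked sketch, the global Silhouette Coefficient can be computed from a precomputed distance matrix as follows. Assigning a silhouette of 0 to objects in singleton clusters is a common convention rather than something stated in the text, and the function name is illustrative; at least two clusters are assumed.

```python
def silhouette_coefficient(dist, labels):
    """Mean silhouette value from a symmetric distance matrix."""
    n = len(labels)
    clusters = sorted(set(labels))
    scores = []
    for j in range(n):
        own = [dist[j][k] for k in range(n) if k != j and labels[k] == labels[j]]
        if not own:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(own) / len(own)
        # Smallest average distance to the members of any other cluster.
        b = min(
            sum(dist[j][k] for k in range(n) if labels[k] == z) / labels.count(z)
            for z in clusters
            if z != labels[j]
        )
        largest = max(a, b)
        scores.append((b - a) / largest if largest > 0.0 else 0.0)
    return sum(scores) / n


if __name__ == "__main__":
    points = [0.0, 1.0, 10.0, 11.0]
    dist = [[abs(p - q) for q in points] for p in points]
    print(silhouette_coefficient(dist, [0, 0, 1, 1]))
```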

In summary, the ARI equals 1 for perfect agreement between M and C and is close to 0 for random labelling, while NMI is a symmetric measure in the range [0,1], where 1.0 indicates perfect agreement between M and C and 0 indicates no shared information. The Dunn Index is non-negative and unbounded above, with larger values indicating better clustering quality. The Silhouette score lies in [−1, +1], with values close to +1 indicating that points are well clustered, values near 0 suggesting overlapping or ambiguous assignments, and negative values revealing possible misclassifications.

3.6.5 Computational time.

Moreover, to assess computational cost, we report the median execution time (in seconds) and the total number of IFF iterations (update steps). For BKIFF, the runtime is measured from the start of the BKO initialisation to the completion of the IFF run. The baseline algorithms are timed in the same way, starting once the density function data have been fed in. In other words, the extraction of the input pdfs, including image-related pipelines, is not counted in the algorithm runtime.

4 Numerical examples

In this work, we tackle the subject from an academic research standpoint. To explore the behaviour of the proposed algorithm on standard problems, synthetic data are first generated (Examples 1 and 2). BKIFF is then extended to benchmark image clustering (Example 3). Furthermore, we apply the proposed method to segment a real-world Landsat image, with its grey-level pixels converted to pdfs.

Example 1. We simulate a set of Gaussian pdfs on the real line with two a priori clusters. The role of this artificial dataset is to investigate the stability of the proposed algorithm under highly imbalanced scenarios, with IRs ranging from 10 to 100. The larger the IR, the greater the difference between the sizes of clusters 1 and 2. Specifically, the generated data are means and standard deviations of the general formula in each cluster, respectively.

Fig 5 clearly illustrates the impact of the IR on the data distribution. As the IRs increase, such as 1:10, 1:50, and 1:100, the blue cluster becomes denser, while the red cluster remains the same size.

thumbnail
Fig 5. The two clusters of density functions in Example 1.

https://doi.org/10.1371/journal.pone.0349753.g005

Example 2. In this example, we extend Example 1 to three clusters, incorporating a more complex skew-normal distribution (see http://azzalini.stat.unipd.it/SN/), which introduces greater disparity among cluster distributions. The pdf of a skew-normal distribution is defined as

$$f(x; \mu, \sigma, \alpha) = \frac{2}{\sigma}\,\phi\!\left(\frac{x-\mu}{\sigma}\right)\Phi\!\left(\alpha\,\frac{x-\mu}{\sigma}\right),$$

where $\phi$ and $\Phi$ represent the standard normal pdf and distribution function, respectively. Here $\mu$ is the location parameter (mean), $\sigma$ is the scale parameter (standard deviation), and $\alpha$ controls the skewness of the distribution. The pdfs of these clusters are illustrated in Fig 6, where different IRs are examined: (a) 1:10:10, (b) 1:50:50, and (c) 1:100:100.
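For readers who wish to reproduce pdfs of this kind, the sketch below evaluates the standard skew-normal density, i.e. twice the scaled standard normal pdf multiplied by the standard normal distribution function of the skewed argument. The parameter names `loc`, `scale`, and `shape` are assumptions chosen for this example; the specific parameter values used in the paper are not reproduced here.

```python
from math import erf, exp, pi, sqrt


def std_normal_pdf(z):
    """Standard normal density."""
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)


def std_normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def skew_normal_pdf(x, loc, scale, shape):
    """Skew-normal density with location, scale, and skewness parameters."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    z = (x - loc) / scale
    return 2.0 / scale * std_normal_pdf(z) * std_normal_cdf(shape * z)
```

Setting `shape = 0` recovers the ordinary Gaussian density, so Example 1 can be viewed as a special case of this family.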

thumbnail
Fig 6. The three clusters of density functions in Example 2.

https://doi.org/10.1371/journal.pone.0349753.g006

In our framework, the location parameters are sampled from normal distributions , and with cluster 1, 2 and 3, respectively. The scale parameters are set as while the skewness parameters are defined as introducing asymmetric distributions across clusters.

Example 3. We simulate an imbalanced image dataset whose two a priori clusters correspond to two texture images with different grey-level appearance. The two original images, D83 and D102 (200 × 200 pixels, 256 grey levels), have different background intensities. They were sourced from the Brodatz Texture database [49] at https://multibandtexture.recherche.usherbrooke.ca/original_brodatz.html and were randomly cropped, at uniformly distributed positions, into 64 × 64 images. The images are then assembled into an imbalanced dataset with IRs ranging from 1 to 100.

The density function of each image is extracted using a non-parametric density estimation method based on the grey-level distribution, as demonstrated in [25,26]. Specifically, the mapping ℑ from an image to its density function is given by

where is the flattened vector of pixel intensities, . In this paper, we use a Gaussian kernel function implemented via ksdensity(), and the bandwidth is selected according to Scott’s rule [50], , with denoting the standard deviation of I.
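The image-to-pdf mapping can be sketched as a one-dimensional Gaussian kernel density estimate. This is a plain-Python illustration rather than a reproduction of `ksdensity()`: the bandwidth below uses one common one-dimensional form of Scott's rule, the sample standard deviation multiplied by n^(-1/5), which is an assumption because the exact constant in the paper is not visible in the extracted text. The function names are hypothetical.

```python
from math import exp, pi, sqrt
from statistics import stdev


def scott_bandwidth(samples):
    """Assumed Scott-type bandwidth: sample std times n^(-1/5)."""
    return stdev(samples) * len(samples) ** (-1.0 / 5.0)


def gaussian_kde(samples, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate on the grid points."""
    h = scott_bandwidth(samples) if bandwidth is None else bandwidth
    if h <= 0:
        raise ValueError("bandwidth must be positive")
    norm = 1.0 / (len(samples) * h * sqrt(2.0 * pi))
    return [
        norm * sum(exp(-0.5 * ((x - s) / h) ** 2) for s in samples)
        for x in grid
    ]


# Usage: flatten an image (list of rows) into a vector of grey levels.
image = [[10, 12, 11, 13], [50, 52, 49, 51]]
pixels = [float(v) for row in image for v in row]
grid = [-100.0 + 0.5 * i for i in range(801)]
density = gaussian_kde(pixels, grid)
```

A fixed evaluation grid shared by all images is what makes the resulting pdfs directly comparable under a distance between density functions.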

Fig 7 illustrates cropped images, and Fig 8 shows their extracted density functions. Upon reviewing Fig 8, we notice a lack of discernible differences and a significant risk of imbalance within the dataset. We divide the image dataset into two clusters based on their a priori origin (D83 and D102). The challenge is to assign the images to their respective clusters quickly.

thumbnail
Fig 7. The cropped images from the Brodatz dataset.

https://doi.org/10.1371/journal.pone.0349753.g007

thumbnail
Fig 8. The two clusters of density functions in Example 3.

https://doi.org/10.1371/journal.pone.0349753.g008

The three examples provided are low-dimensional and focus on the behaviour of BKIFF with imbalanced data; thus, the number of BKIFF clusters is set to match the number of simulated clusters.

Application. The Landsat-8 scene (512 × 256 pixels) covering Yam Island, sourced from https://oceancolor.gsfc.nasa.gov/, presents an extreme class imbalance dominated by marine pixels. Shallow reef flats, sandy cays, and island vegetation patches are heavily outnumbered by deep-water pixels. This severe skew demands a clustering algorithm that is robust to such imbalanced regimes. The segmentation method is also distinctive. We first divide the grey-scale image into non-overlapping (p × p) patches in raster-scan order, from left to right and top to bottom. Each patch is then flattened into a (1 × p²)-pixel vector, its intensity density function is estimated via ksdensity(), and the resulting 3-D cube is stored for patch clustering. This patch-wise strategy retains locally homogeneous spectral signatures and prevents global averaging from blurring colour boundaries. Here, we set the patch size to p = 4. In addition, BKIFF uses m = 2.0, , and h = 0.01 (uniform divided of ) for this application. Fig 9 describes the process of creating a patch pdf, and Fig 10 illustrates the original image and its extracted pdf data.
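The raster-scan patch extraction described above can be sketched as follows. The function name `extract_patches` and the list-of-rows image layout are assumptions for illustration; density estimation for each flattened patch would follow as a separate step.

```python
def extract_patches(image, p):
    """Split a 2-D grey-scale image into non-overlapping p x p patches.

    Patches are visited left to right, then top to bottom, and each one
    is flattened row by row into a vector of p * p intensities.
    """
    rows, cols = len(image), len(image[0])
    if rows % p or cols % p:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, rows, p):
        for left in range(0, cols, p):
            patches.append([
                image[r][c]
                for r in range(top, top + p)
                for c in range(left, left + p)
            ])
    return patches


# Usage on a toy 4 x 4 image with patch size 2.
toy = [[4 * r + c for c in range(4)] for r in range(4)]
toy_patches = extract_patches(toy, 2)
```

With p = 4, a 512 × 256 scene yields 128 × 64 = 8192 patches, each contributing one pdf object to the clustering.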

thumbnail
Fig 9. Extract patch sliding for image into pdf.

https://doi.org/10.1371/journal.pone.0349753.g009

thumbnail
Fig 10. The real application image and its patches of pdfs.

https://doi.org/10.1371/journal.pone.0349753.g010

Table 3 summarises the input configurations for the three numerical examples.

thumbnail
Table 3. Configuration of numerical examples.

https://doi.org/10.1371/journal.pone.0349753.t003

4.1 Experiment setup

For a fair comparison with the proposed algorithm, we introduce baseline algorithms designed specifically for analysing density functions, including k-means (KMEAN [30]), fuzzy clustering (FCM-CWD [51], FCM- [1]), the self-updating process (SUP [29]), and Dinh Pham-Toan’s method (2025) [28]. These baselines were selected because they suit the specific datasets used, are recent, and have been reported to be effective. Table 4 describes the parameters of the proposed method and the baselines.

thumbnail
Table 4. Configuration of clustering algorithms for comparative analysis.

https://doi.org/10.1371/journal.pone.0349753.t004

Our analysis examines the efficiency of the BKO initialisation, the role of the Hellinger distance, the performance of the BKIFF algorithm, and the behaviour of its fuzziness component. When evaluating the BKO initialisation specifically, performance is compared using the nonparametric Wilcoxon signed-rank test. For all other comparative analyses involving multiple methods, the Friedman test is applied. All tests use a significance level of 0.1%.

We report the experiments using multiple internal and external criteria, namely ARI, NMI, Silhouette score, and Dunn Index (see the subsection "Some validity measure indexes for clustering solution"). Moreover, the computational time consumption is also presented.
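To make the multi-method comparison concrete, the sketch below computes the Friedman chi-square statistic in plain Python. The data layout (one row per replication, one column per method) is an assumption, no tie correction is applied, and the p-value, which would come from a chi-square distribution with k − 1 degrees of freedom, is deliberately left out.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        shared = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def friedman_statistic(table):
    """Friedman statistic for n blocks (rows) and k methods (columns)."""
    n, k = len(table), len(table[0])
    if n < 1 or k < 2:
        raise ValueError("need at least one block and two methods")
    rank_sums = [0.0] * k
    for row in table:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    total = sum(r * r for r in rank_sums)
    return 12.0 / (n * k * (k + 1)) * total - 3.0 * n * (k + 1)
```

A large statistic indicates that at least one method is ranked consistently differently from the others across replications.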

4.2 Simulation strategy

It is important to note that all algorithms are executed 10 times independently in a Monte Carlo fashion, with random seeds fixed from 1 to 10 for reproducibility, and the median and interquartile range (IQR) are calculated. Furthermore, all computations are carried out using Octave (or MATLAB) on an Intel Core™ i5-11400H @ 2.70 GHz with 16.0 GB main memory. The performance results are then presented, along with the main conclusions drawn from them. All algorithms, including the proposed method and the baselines, are implemented as Octave code, which is publicly available at https://doi.org/10.6084/m9.figshare.30600539.v4.
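The reporting protocol, ten seeded runs summarised by median and IQR, can be sketched as below. The callable `run_once` is a hypothetical stand-in for one clustering run returning a score, and the inclusive quartile method is an assumption, since quartile conventions differ between software packages.

```python
import random
from statistics import median, quantiles


def summarise_runs(run_once, seeds=range(1, 11)):
    """Run an experiment once per seed and return (median, IQR)."""
    scores = [run_once(seed) for seed in seeds]
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return median(scores), q3 - q1


def toy_run(seed):
    """Hypothetical experiment: a seeded pseudo-random score in [0.9, 1]."""
    rng = random.Random(seed)
    return 0.9 + 0.1 * rng.random()


med, iqr = summarise_runs(toy_run)
```

Fixing the seeds makes the summary repeatable: calling `summarise_runs(toy_run)` again returns exactly the same pair.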

4.3 Sensitivity analysis setup

We conduct a sensitivity analysis to evaluate the stability and reliability of the BKIFF algorithm across various parameter settings for its two phases, BKO and IFF. On the one hand, we conduct a sensitivity analysis for the BKO-based initialisation across three parameter groups: IR = {20,50,80,100}, population sizes pop = {20,30,50,80,100}, and iteration counts . The search space was bounded by the sample indices, and the objective function maximised the pairwise separation among the selected initial prototypes, computed via -distance. We recorded both the Fbest value and the execution time (in seconds). On the other hand, we test four groups of factors of the IFF clustering algorithm: IR = {20,50,80,100}, fuzziness coefficient m = {1.1,1.5,2.0,2.5,3.0,5.0,10}, number of clusters c = {2,3,4,5}, and the type of distance metric . Evaluation metrics included ARI and NMI for external clustering fit, as well as Silhouette and Dunn indices for internal clustering quality. Moreover, we record the number of iterations required for convergence and the execution time. Subsequently, the results were analysed using a multi-factor ANOVA and the Morris [52] screening method.
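A fitness function of the kind described above, which rewards well-separated initial prototypes, can be sketched as follows. Several choices here are assumptions made only for illustration: the Hellinger distance is used with a factor of one half inside the square root, pdfs are discretised on a uniform grid with spacing `step`, the separation is aggregated as a sum over pairs, and repeated indices receive a fitness of zero. None of these details is confirmed by the extracted text.

```python
from itertools import combinations
from math import sqrt


def hellinger(f, g, step):
    """Hellinger distance between two pdfs sampled on a uniform grid."""
    total = sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(f, g))
    return sqrt(0.5 * step * total)


def separation_fitness(indices, pdfs, step):
    """Sum of pairwise distances among the candidate prototypes.

    `indices` selects which sample pdfs act as initial prototypes; an
    optimiser such as BKO would search over these indices to maximise
    the returned value.
    """
    if len(set(indices)) < len(indices):
        return 0.0  # assumed penalty: the same sample picked twice
    return sum(
        hellinger(pdfs[i], pdfs[j], step)
        for i, j in combinations(indices, 2)
    )


# Toy data: two identical pdfs and one with disjoint support.
toy_pdfs = [[2.0, 2.0, 0.0, 0.0], [2.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 2.0]]
```

On the toy data, choosing one prototype from each support scores higher than choosing the two identical pdfs, which is the behaviour an imbalance-aware initialisation needs in order to place a prototype in the minority cluster.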

5 Main results

We present the results along four factors: the intrinsic efficiency of the different initialisation methods, the distance measures, the fuzzifier, and the extrinsic efficiency of the proposed method. The sensitivity analyses are supported by Supplementary Information B.

5.1 Effectiveness of initialisation method

This section compares BKO-based initialisation (BKIFF) with random initialisation (IFF). Overall, the choice of initialisation method significantly impacts computational efficiency and convergence speed. The experimental results demonstrate that integrating the BKO mechanism into the initialisation stage does not degrade clustering accuracy; moreover, it yields substantial and statistically significant gains in computational efficiency and robustness, particularly under severe class imbalance.

Tables 5, 6, and 7 report the results for Examples 1, 2, and 3, respectively. Example 1 exhibits complete invariance in clustering quality across all IRs from 20 to 100. For both IFF and BKIFF, the median values of ARI and NMI remain at 1.0 (IQR, 0), while the Dunn index decreases identically from 5.1 at IR = 20 to 3.9 at IR = 100. Similarly, Silhouette values remain stable at approximately 0.99 for both methods. Consistently, no statistically significant differences were observed between IFF and BKIFF for any clustering metric (all p = 1.0).

thumbnail
Table 5. Comparison of with/without BKO in varying IR (Example 1).

https://doi.org/10.1371/journal.pone.0349753.t005

thumbnail
Table 6. Comparison of with/without BKO in varying IR (Example 2).

https://doi.org/10.1371/journal.pone.0349753.t006

thumbnail
Table 7. Comparison of with/without BKO in varying IR (Example 3).

https://doi.org/10.1371/journal.pone.0349753.t007

Pronounced differences emerge, however, when computational behaviour is considered. Empirical studies in clustering research [30,31] have demonstrated that poor initialisation can cause fuzzy clustering algorithms to become trapped in local optima, thereby increasing iteration counts. In Example 1, the median runtime of IFF increases from 0.01 s at IR = 20 to 0.13 s at IR = 100, accompanied by a widening IQR that reaches 0.06 s at the highest imbalance level. BKIFF, by comparison, exhibits markedly slower growth in runtime, increasing only from 0.01 s to 0.03 s over the same range, corresponding to a reduction of approximately 50–70%. A similar trend is observed in iteration counts. While IFF requires a median of 11 iterations at IR = 20, rising sharply to 48 iterations at IR = 100 (IQR, 19), BKIFF converges consistently within iterations (IQR, ). These efficiency gains are statistically significant for all IRs (p < 0.001).

The stabilising effect of BKO becomes more evident in Examples 2 and 3. In Example 2, both methods achieve perfect clustering at IR = 20 and IR = 50; however, IFF performance deteriorates substantially at higher IRs, with median ARI values decreasing to 0.8 at IR = 80 and IR = 100. In contrast, BKIFF preserves an ARI of 1.0 (IQR, 0) across all replications. This pattern is consistently reflected in the NMI, Dunn, and Silhouette indices, and the corresponding Wilcoxon statistics yield p < 0.001.

An even more pronounced divergence is observed in Example 3. At IR = 100, IFF records a median ARI of 0.004 and requires 263 iterations (IQR, 477), indicating volatile convergence behaviour. Conversely, BKIFF achieves perfect clustering (ARI = 1.0) and converges within a median of 11 iterations (IQR, 1). In terms of runtime, IFF reaches a median of 2.04 s, whereas BKIFF completes the optimisation in approximately 0.09 s. All observed differences in computational metrics are statistically significant (p < 0.001).

5.2 Effectiveness of Hellinger distance

The comparative analysis across distance measures shows that the -distance consistently outperforms the remaining metrics, followed by the -distance, across all three experimental settings. As reported in Tables 8, 9, and 10, the -based BKIFF achieves perfect clustering quality in all examples, maintaining ARI = NMI = 1.0 (IQR, 0). In addition, it yields high cluster separation, as reflected by Dunn indices exceeding 2.4 in Examples 1 and 2 and remaining above 5.8 in Example 3, along with Silhouette values consistently close to 1.0.

thumbnail
Table 8. Comparison of different distances with varying IR (Example 1).

https://doi.org/10.1371/journal.pone.0349753.t008

thumbnail
Table 9. Comparison of different distances with varying IR (Example 2).

https://doi.org/10.1371/journal.pone.0349753.t009

thumbnail
Table 10. Comparison of different distances with varying IR (Example 3).

https://doi.org/10.1371/journal.pone.0349753.t010

By contrast, both the -distance and the CWD metric exhibit systematic performance degradation across all imbalance levels. In Example 1, these distances produce ARI and NMI values below 0.01 at IR = 20 and IR = 50, with Dunn indices collapsing to 0 and Silhouette values remaining below 0.30. Similar behaviour is observed in Examples 2 and 3, where clustering quality metrics for and CWD approach or equal 0 irrespective of imbalance severity, indicating an inability to recover meaningful cluster structures. The -distance shows competitive performance in Example 1 and moderate robustness in Example 2; however, its effectiveness deteriorates under the more challenging density configurations of Example 3, where increased variability is observed in both quality and convergence behaviour. In all cases, the observed differences among distance measures are statistically significant (Friedman tests, p < 0.001).

The superiority of the -distance is further reinforced by its computational efficiency and stability. Across all three examples, and consistently deliver the lowest runtimes and iteration counts. Specifically, their execution times remain within a few milliseconds in Example 1, below 0.13 s in Example 2, and between 0.01 and 0.10 s in Example 3. Correspondingly, convergence is achieved within 6–14 iterations in Example 1, 9–19 iterations in Example 2, and 5–11 iterations in Example 3, with minimal IQR values indicating stable optimisation behaviour. In sharp contrast, the - and CWD distances require substantially higher computational effort. In particular, under high imbalance, these metrics demand tens to hundreds of optimisation steps and frequently reach the maximum iteration cap of 500. Their runtimes exceed 0.5 s in moderate cases and escalate to approximately 2–3 s in the most challenging scenarios, accompanied by large dispersion across replications. The Friedman test results for runtime and iteration counts consistently report p < 0.001, confirming statistically significant differences in computational cost across distance measures.

In conclusion, this comparison highlights that inappropriate distance measures not only compromise clustering quality but also lead to extremely long and unstable iterations. The -distance performed best among the compared measures and keeps optimal quality once IR is over 80, while , previously robust in Example 1, begins to waver, and the -distance and CWD collapse completely. This empirical demonstration provides a practical guideline for selecting the distance based on IR rather than relying on a default.

5.3 Effectiveness of Fuzziness

Fig 11(a), 11(b), and 11(c) plot clustering quality (vertical axis) against the fuzziness parameter m (horizontal axis) for the three examples. In Fig 11(a), the four overlapping lines are horizontal at ; in Fig 11(b), two curves remain high, while the curves for IR = 80 and IR = 100 plummet from m = 5.0; and in Fig 11(c), only the IR = 20 curve remains high, while the IR > 50 curves fall close to 0.0 even at low m.

thumbnail
Fig 11. The investigation of fuzziness of BKIFF on three examples.

https://doi.org/10.1371/journal.pone.0349753.g011

The experimental results reveal a pronounced dependency of clustering performance on the fuzziness parameter m under varying IR. In the proposed algorithm, high-quality clustering is maintained for small m values, regardless of IR. As m increases, performance degradation occurs more abruptly for higher IR, yet the proposed method exhibits markedly slower deterioration compared to the typical FCM behaviour reported in earlier studies [1]. Specifically, for low IR (20 or 50), both ARI and Silhouette remain near-optimal until m = 5.0, after which the scores drop sharply. For higher IR > 80, conventional fuzzy clustering, as documented in prior works [30,53], tends to lose cluster separability at a much smaller m = 2.0, often leading to near-random assignments. In contrast, the proposed method delays this collapse, maintaining competitive clustering quality over a wider range of m.

5.4 Effectiveness of BKIFF

The comparative evaluation across all three examples (Tables 11–13) shows that BKIFF consistently maintains good clustering performance across the examined range of imbalance ratios. In all settings, BKIFF attains ceiling-level scores with ARI = NMI = 1.0, maintains high Dunn indices, and achieves Silhouette values close to or equal to 1.0, regardless of imbalance severity. This behaviour contrasts sharply with that of the remaining competitors, whose performance deteriorates as the imbalance increases.

thumbnail
Table 11. Comparison of state-of-the-art methods with varying IR (Example 1).

https://doi.org/10.1371/journal.pone.0349753.t011

thumbnail
Table 12. Comparison of state-of-the-art methods with varying IR (Example 2).

https://doi.org/10.1371/journal.pone.0349753.t012

thumbnail
Table 13. Comparison of state-of-the-art methods with varying IR (Example 3).

https://doi.org/10.1371/journal.pone.0349753.t013

Specifically, FCM-CWD and FCM- exhibit an immediate decline in clustering quality as imbalance intensifies. Across the three examples, their ARI, NMI, and Dunn indices decline rapidly toward zero, while their Silhouette scores stagnate at low levels, indicating poor cluster separation. The KMEAN algorithm remains competitive only under mild imbalance, with IR = 20 and IR = 50. However, its performance degrades substantially for IR values over 80, where both accuracy and separation metrics deteriorate. The Self-Updating method preserves ARI = 1.0 across several configurations; however, it produces extremely large and unstable Dunn values, reflecting pathological behaviour rather than meaningful cluster structure. The method proposed by Dinh Pham-Toan (2025) [28] exhibits the weakest performance among all evaluated approaches. In Examples 2 and 3, it yields ARI = NMI = 0.000 across all imbalance levels, indicating a complete failure to recover any meaningful clustering structure. In Example 1, where several methods already achieve perfect clustering, Dinh’s method does not offer benefits when dealing with imbalanced data. Across all examples and evaluation metrics, Friedman tests return p-values below 0.001, confirming that the observed performance differences among methods are statistically significant.

From a computational perspective, BKIFF also demonstrates the most favourable and stable cost profile, which is comparable to the baseline. While FCM-CWD and FCM- incur relatively modest runtimes, they require a large number of iterations. The KMEAN algorithm remains computationally inexpensive. Although the Self-Updating method converges in as few as iterations, its wall-clock time grows rapidly, exceeding 10 s in Example 2 and approaching 50 s in Example 3, highlighting that a low iteration count does not necessarily translate into computational efficiency. In contrast, BKIFF consistently converges within approximately iterations, with runtimes ranging from s even under extreme imbalance conditions. By comparison, the method proposed by Dinh Pham-Toan (2025) [28] repeatedly reaches the maximum iteration limit of 500. It exhibits the longest runtimes across all examples, often extending to tens of seconds in Examples 2 and 3, which demonstrates pronounced scalability limitations. Friedman tests for both runtime and iteration counts uniformly yield p < 0.001, confirming that BKIFF’s computational advantage over all competing methods is statistically significant.

For a qualitative view of the memberships, the panels of Fig 12(a)–Fig 12(d) display the per-sample memberships produced by each clustering algorithm (except FCM-) on the representative Example 3 at IR = 80. Fig 12 shows that BKIFF maintains perfect crispness in this setting, pointing to a previously underexplored imbalance-immunity threshold beyond which centroid-based and self-updating methods may struggle to maintain reliable partitions.

thumbnail
Fig 12. Membership matrix of methods at IR = 80.

https://doi.org/10.1371/journal.pone.0349753.g012

5.5 Application results

Fig 13 shows the results of the BKIFF cluster analysis with an increasing number of clusters c from applied to segment the Landsat image of Yam Island and its surrounding sea. It shows that the level of detail of land cover separation increases with c.

thumbnail
Fig 13. The proposed method BKIFF applied to the Landsat image.

https://doi.org/10.1371/journal.pone.0349753.g013

As a result, above , the BKIFF segmentation progresses monotonically without producing salt-and-pepper artefacts. At c = 2, the algorithm separates the scene into deep water and the surrounding coastal areas. As c increases to 3, the spectral contrast between the coral-reef flats and the island becomes clearer, revealing more defined outlines. Between c = 4 and c = 6, the island becomes more distinct, showcasing areas of light sand, vegetation, and various marine regions. Additionally, a new layer is introduced to highlight differences in the seawater. Next, micro-layer features, such as thin sandbars occupying less than 0.5% of the total pixels, remain clustered, demonstrating the algorithm’s capacity to preserve minority spectral signatures despite extreme class imbalance. Finally, at c = 10, no clusters are fragmented; each habitat remains a spatially contiguous blob, demonstrating the algorithm’s intrinsic spatial consistency.

In conclusion, a moderate number of clusters, with c between 4 and 6, is most appropriate for balancing coverage, discrimination, and generalisability. In contrast, higher values of c are better suited for fine-scale studies that require deeper investigation and fine-grained decomposition.

6 Discussion

The experimental findings provide several important insights when positioning BKIFF within the broader landscape of clustering methods for uncertain and imbalanced data. First, unlike classic partition-based algorithms such as k-means or standard FCM, whose performance depends heavily on centroid initialisation and tends to deteriorate under skewed distributions, BKIFF demonstrates that an informed initialisation strategy can fundamentally alter the optimisation landscape. Another notable insight is the interaction between IR, the fuzziness parameter m, and the distance metric. Traditional FCM-based methods tend to collapse at relatively low m under high IR, whereas BKIFF extends the range of m over which clustering quality is maintained. This indicates that the proposed framework implicitly regularises the membership distribution, making it less sensitive to the over-smoothing effects induced by large fuzziness values. In conclusion, from a methodological perspective, BKIFF’s consistent performance across all experimental scenarios suggests that the combination of adaptive initialisation and an appropriate probabilistic distance constitutes a more effective strategy than introducing additional model complexity.

6.1 Advantages of the current study

One of the significant advantages of the BKIFF algorithm is its capability to eliminate reliance on randomness, a limitation often found in previous clustering algorithms that rely on random initialisation. Traditional methods may require multiple trials to achieve an optimal initialisation. Randomness causes instability in partition clustering, leading to slower convergence rates and potentially incorrect solutions. BKIFF, by contrast, enhances clustering accuracy from the first iterations. The biologically inspired BKO step provides high-quality initial prototypes, allowing the clustering process to converge much faster and behave deterministically, thereby delivering superior temporal stability and reliability in high-demand, large-scale applications.

Moreover, conventional clustering methods often assign higher membership values to samples belonging to large clusters. In comparison, minority clusters exhibit lower membership values. This imbalance in membership assignment can result in small clusters being misrepresented or absorbed by dominant clusters, leading to inaccurate results. BKIFF presents a two-phase clustering approach that addresses this issue by incorporating a more refined membership assignment mechanism, ensuring a fairer representation of all clusters.

Beyond its methodological refinements, the theory of BKIFF has been formally established through Zangwill’s convergence theory, i.e., it has been proven to fulfil the conditions for convergence as in Theorem 4. Likewise, traditional clustering methods based on the - or -distance consistently struggle with imbalanced datasets. Consequently, blending an initialisation strategy, a theoretically sound fuzzy clustering model, and an advanced similarity measure makes BKIFF reliable for uncertain and imbalanced data.

Finally, applying pdfs to the analysis of the Yam Island image based on its colour distribution is of particular interest. More sophisticated techniques are not necessarily needed to recognise each object on the island, as the scene already has extremely high colour correlation, which allows for a less resource-intensive algorithm. Therefore, this application is a novel and promising approach in image processing.

6.2 Limitations

The improved BKIFF focuses on observing invariant behaviour under increasing imbalance. Innovations in determining a legitimate and trustworthy number of clusters c, in handling the complexities of overlapping, high-dimensional data, and in extending to multi-view clustering [54,55] must be further developed. Furthermore, the indices related to internal quality should also be studied more deeply because of the ambiguity about their underlying distances.

Moreover, real-world uncertain data are challenging to collect and analyse due to their complexity and the conditions under which they are collected. Therefore, the uncertainty assessment for natural structure analysis and imbalance detection based on random image cropping (i.e., Example 3) is an exciting application. Nevertheless, caution is needed in cluster analysis of density functions. Admittedly, the density functions of colour are generally invariant when the image is rotated, scaled, or contains a small amount of noise [56]. Colour density functions can capture the similarity between some features in images, but not all of them. This raises considerable challenges because, in many tasks such as detection or classification, colour is not always a more informative feature than the objects within the image. In summary, the density functions have advantages over the intensity matrix, but this feature can describe image similarities only for specific problems and needs further improvement.

7 Conclusion

This study proposed the BKIFF algorithm, a two-phase framework designed to address the challenge of clustering pdfs under severe class imbalance. The method integrates an enhanced initialisation process with imbalance-aware fuzzy clustering, thereby addressing the shortcomings of existing approaches while ensuring provable convergence. Moreover, extensive experiments demonstrated that BKIFF consistently achieved high clustering accuracy and computational efficiency across a wide range of imbalance ratios. In addition, its effectiveness was validated through Landsat image segmentation, where it successfully preserved minority spectral structures even under highly skewed conditions. Taken together, these results show that BKIFF offers a robust and practical solution for clustering uncertain data. Future research may focus on the automatic determination of the number of clusters and explore extensions toward deep learning or ensemble-based clustering strategies.

Once these shortcomings are addressed, several potential research directions can be considered. These include the integration of deep clustering or ensemble clustering to increase the exploration of the pdf object, and further extensions toward contrastive learning for images and videos [57]. In addition, greater automation can be applied to the optimal clustering phase [58], and a wider range of data types can be accommodated [59]. Furthermore, the proposed density function framework shows strong potential to provide more interpretable insights and theoretical support across various application domains, such as daily solar radiation analysis [60] and species distribution modelling. Finally, handling imbalanced data remains a promising direction, where advanced resampling techniques can be further explored and integrated [61,62].

A Proofs of convergence

By this procedure, an iterative sequence is generated, and the theoretical task is to resolve whether or not the sequence converges. Consider F and G as two functions . Using F and G, we must modify $T_{\mathrm{IFF}}$ so that it generates Picard sequences in both and simultaneously. From here, we define the operator

(11)

where, the mapping , and the mapping .

In summary, the proof must establish whether or not the sequence converges. The following two lemmas establish the two update formulas of IFF, one for the partition matrix and one for the prototypes, and show that the descent constraint of Zangwill’s theorem [63] holds for $J_{\mathrm{IFF}}$. The fuzzifier parameter in the proposed algorithm is typically set to m > 1.

Lemma 1 (Improved Fuzzy membership degree). Let and the OF , where is fixed. Then is a strict local minimum solution of if and only if , and

(12)

Proof 1 Minimisation of over Uf is a constrained optimisation problem, treated with the Karush-Kuhn-Tucker (KKT) conditions, with (cn+n) linear constraints. The original optimisation problem is rewritten as

(13)(14)

Suppose that is a minimiser of the above objective function . Then, it must satisfy the following KKT conditions [44,63]

  • The partition is feasible, i.e., and ;
  • There exist multipliers for the equality constraints and for the inequality constraints such that complementary slackness holds;
  • Stationarity:(15)
    Since , we have . Similarly, so that . Therefore, , so that . Recalling the constant , we have (12).
  • For sufficiency, we examine , the (cn × cn) Hessian matrix of evaluated at . It is easy to deduce that
    where . Since we assume m > 1 and the distance is non-zero in this section, is a positive definite matrix with all diagonal elements positive. Therefore, , calculated by (12), is the solution to the relaxed optimisation problem under consideration.
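For orientation, the classical fuzzy c-means membership update can be sketched as follows. It satisfies the same KKT structure (non-negativity and one sum-to-one constraint per object), but it is not the improved update (12), whose imbalance-specific constant is not reproduced here. The function name and the toy distances are hypothetical.

```python
def fcm_memberships(sq_dists, m=2.0):
    """Classical FCM memberships of one object with respect to c prototypes.

    sq_dists: squared distances from the object to each prototype
              (assumed strictly positive, as in the proof above).
    m:        fuzzifier, required to be > 1.
    NOTE: this is the textbook update, not the paper's Eq. (12).
    """
    if m <= 1:
        raise ValueError("fuzzifier m must be greater than 1")
    if any(d <= 0 for d in sq_dists):
        raise ValueError("distances must be strictly positive")
    power = 1.0 / (m - 1.0)
    inverse = [(1.0 / d) ** power for d in sq_dists]
    total = sum(inverse)
    # Normalising enforces the equality constraint sum_i u_ik = 1.
    return [v / total for v in inverse]


# Toy squared distances to three prototypes (hypothetical values).
u = fcm_memberships([0.1, 0.4, 0.9], m=2.0)
```

The closer prototype receives the larger membership, and every membership stays strictly between 0 and 1, so the inequality constraints are inactive at the solution.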

Next, we fix and consider minimisation of $J_{\mathrm{IFF}}$ with respect to .

Lemma 2. Let be the Hellinger distance around prototype , calculated when is constant (). Then is a strict local minimum solution of if and only if is given by

(16)

Proof 2 For fixed membership matrix , the objective function is separable in the prototype functions . Using the squared Hellinger distance, we obtain

For each i, the integrand is a quadratic polynomial in . Differentiating twice with respect to gives

Since implies , we also have for all i. Therefore the Hessian is diagonal with strictly positive diagonal entries, hence positive definite. Thus is strictly convex on .

Strict convexity ensures that minimises if and only if it satisfies

Computing the derivative gives

Hence, the optimality condition becomes . Next, solving for yields and therefore, we have

Since the objective is strictly convex, this solution is the unique strict global minimum.
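The argument of Proof 2 can be checked numerically on a grid. The sketch below assumes a squared Hellinger-type distance of the form ∫(√f − √v)² dx (a constant factor such as 1/2 would not change the minimiser) and generic non-negative weights standing in for the membership terms of the paper. The grid, the Gaussian test pdfs and all function names are hypothetical, and the prototype is not renormalised to integrate to one.

```python
import math


def gaussian(x, mu, sigma):
    """Normal density, used only to build toy pdfs."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def sq_hellinger(f, g, dx):
    """Riemann-sum approximation of the integral of (sqrt f - sqrt g)^2."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(f, g)) * dx


def prototype(pdfs, weights):
    """Pointwise minimiser: sqrt(v) is the weighted mean of the sqrt(f_k)."""
    total = sum(weights)
    root = [sum(w * math.sqrt(f[j]) for w, f in zip(weights, pdfs)) / total
            for j in range(len(pdfs[0]))]
    return [r * r for r in root]


def cost(pdfs, weights, v, dx):
    """Weighted sum of squared distances from the pdfs to a candidate v."""
    return sum(w * sq_hellinger(f, v, dx) for w, f in zip(weights, pdfs))


dx = 0.05
grid = [-6.0 + i * dx for i in range(241)]
pdfs = [[gaussian(x, mu, 1.0) for x in grid] for mu in (-1.0, 0.0, 1.5)]
weights = [0.7 ** 2, 0.2 ** 2, 0.1 ** 2]  # assumed weights, e.g. u^m with m = 2

v = prototype(pdfs, weights)
best = cost(pdfs, weights, v, dx)
perturbed = [1.05 * val for val in v]  # any other candidate should cost more
```

Because the cost is a strictly convex quadratic in √v at every grid point, any candidate other than `v` (for example `perturbed`, or one of the input pdfs) gives a strictly larger cost.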

The above two lemmas give only necessary conditions for to be a global minimum of the function $J_{\mathrm{IFF}}$. The next question is whether the iterate sequence converges to .

The final condition required for Zangwill’s theorem is the compactness of a subset of that contains all possible iterative sequences generated by $T_{\mathrm{IFF}}$. Theorems 1, 2, and 3 below follow [31,44,64]. Together they complete the proof of convergence of IFF via Zangwill’s theorem and directly support Theorem 4.

Theorem 1 (Compactness constraint). Let be the c-fold Cartesian product of the convex hull of , and be the starting point of iteration with $J_{\mathrm{IFF}}$ with and . Then

and is compact in .

Proof 3 Let be chosen; then is calculated by (12), so that

Let . In view of constraints (C1) and (C3), it must be that , . So , we rewrite , with

Thus, , and hence . Continuing recursively, we know that by (12), and then by the same argument as above. Thus, every iterative sequence of $J_{\mathrm{IFF}}$ belongs to for any t ≥ 1. Although we may choose at initialisation, , so that . Furthermore, it is clear that is a compact set in finite [44].

Theorem 2. Let be the solution set. Then $J_{\mathrm{IFF}}$ is a descent function for .

Proof 4 First, since , , and are continuous, and $J_{\mathrm{IFF}}$ is a sum of products of such functions, $J_{\mathrm{IFF}}$ is continuous on . Next, suppose . Then it follows from (11) that

Finally, Lemma 1 and Lemma 2 imply that .

Theorem 3 (Continuity constraint). $T_{\mathrm{IFF}}$ is continuous on .

Proof 5 Since and the composition of continuous functions is again continuous, it suffices to show that $T_1$ and $T_2$ are each continuous. Since , $T_1$ is continuous if G is. To see that G is continuous in the cn variables , note that G is a vector field resolved into (cd) scalar fields as . Now is continuous, and is continuous. The sum of continuous functions is again continuous. Thus, $G_{ik}$ is the quotient of two continuous functions. In view of constraint (C3), the denominator never vanishes, so $G_{ik}$ is continuous for all (i, k). Therefore, G and $T_1$ are continuous on their entire domains.

Similarly, since , it suffices to show that F is a continuous function in the variable . F is a vector field resolved into (cd) scalar fields . Since is continuous for all i, and is continuous, and the sum of continuous functions is again continuous, $F_{ij}$ is the quotient of two continuous functions. Given our general hypothesis that , F and $T_2$ are continuous on their entire domains. Finally, is continuous on .

We now assemble the assumptions and results of the above lemmas and theorems into a formal theorem regarding the convergence of IFF.

Theorem 4 (Convergence theorem for $J_{\mathrm{IFF}}$). Consider the set , given the OF of the form (2), where U satisfies (C1)-(C3) and . If is the (Picard) iterative operator of $J_{\mathrm{IFF}}$, and for every t such that then for any , either

  1. terminates at a local minimum of $J_{\mathrm{IFF}}$; or
  2. contains a subsequence such that a local minimum of $J_{\mathrm{IFF}}$ as .

Proof 6 Because the OF $J_{\mathrm{IFF}}$ is continuous on , Theorem 2 shows that $J_{\mathrm{IFF}}$ is a Zangwill descent function for the solution set , where $S_{\mathrm{IFF}}$ is the set of strict local minima of $J_{\mathrm{IFF}}$. Theorem 3 asserts that the iterative operator $T_{\mathrm{IFF}}$ is continuous on , and by Theorem 1 the iterate sequences of $J_{\mathrm{IFF}}$ always lie in a compact subset of the domain of $J_{\mathrm{IFF}}$. The result follows immediately from Zangwill’s theorem.
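The descent property behind this argument can be observed on a toy problem. The sketch below alternates a membership step and a prototype step (a Picard iteration) and records the objective after each pass. It uses the classical fuzzy c-means membership rule and a distance of the form ∫(√f − √v)² dx as stand-ins, so it illustrates the structure of the proof rather than the IFF updates themselves; the data, grid and function names are hypothetical.

```python
import math


def gaussian(x, mu, sigma):
    """Normal density, used only to build toy pdfs on a grid."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def sq_hellinger(f, g, dx):
    """Riemann-sum approximation of the integral of (sqrt f - sqrt g)^2."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(f, g)) * dx


def update_memberships(pdfs, protos, m, dx):
    """Classical FCM memberships U[k][i]; assumes every distance is non-zero."""
    U = []
    for f in pdfs:
        inv = [(1.0 / sq_hellinger(f, v, dx)) ** (1.0 / (m - 1.0)) for v in protos]
        total = sum(inv)
        U.append([x / total for x in inv])
    return U


def update_prototypes(pdfs, U, m):
    """sqrt(v_i) is the mean of sqrt(f_k) weighted by u_ik^m."""
    protos = []
    for i in range(len(U[0])):
        w = [row[i] ** m for row in U]
        total = sum(w)
        root = [sum(wk * math.sqrt(f[j]) for wk, f in zip(w, pdfs)) / total
                for j in range(len(pdfs[0]))]
        protos.append([r * r for r in root])
    return protos


def objective(pdfs, protos, U, m, dx):
    return sum(U[k][i] ** m * sq_hellinger(pdfs[k], protos[i], dx)
               for k in range(len(pdfs)) for i in range(len(protos)))


dx = 0.1
grid = [-5.0 + j * dx for j in range(151)]
means = [0.0, 0.3, -0.2, 0.1, 0.4, 5.0]  # five similar pdfs and one outlying pdf
pdfs = [[gaussian(x, mu, 1.0) for x in grid] for mu in means]
m = 2.0
protos = [[gaussian(x, mu, 1.0) for x in grid] for mu in (0.2, 1.0)]  # arbitrary start

history = []
for _ in range(30):
    U = update_memberships(pdfs, protos, m, dx)
    protos = update_prototypes(pdfs, U, m)
    history.append(objective(pdfs, protos, U, m, dx))
```

Each half-step minimises the objective exactly with the other block fixed, so `history` is non-increasing up to rounding error. This is the descent condition of Zangwill's theorem in miniature; it does not by itself show that the limit is a global minimum.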

B Hypothesis test results

B.1 Sensitivity analysis of BKIFF

B.1.1 BKO-based initialisation.

In all three Examples, varying the population size (pop) and the iteration number () makes almost no difference to the OF value. Fig 14, which corresponds to the different IR levels, exhibits a relatively stable trend with only minor fluctuations within the error range, suggesting that the performance of BKO is robust to changes in the basic configuration parameters.

Fig 14. Sensitivity of population and on BKO of each IR.

https://doi.org/10.1371/journal.pone.0349753.g014

B.1.2 IFF clustering algorithm.

Table 14 presents the results of the Morris and ANOVA sensitivity analyses for ARI across the three Examples. Each configuration was evaluated 10 times. Overall, the agreement between the Morris and ANOVA analyses highlights the dominant role of parameter c, the secondary but still meaningful influence of IR and m, and the weak and stable impact of the distance factor.

Table 14. Sensitivity metrics and ANOVA indices for ARI of three examples.

https://doi.org/10.1371/journal.pone.0349753.t014

First, the Morris indices indicate that parameter c consistently exerts the strongest overall influence, showing the highest absolute effects and substantial variability, particularly in Example 2 (Table 14). By contrast, the factors IR and m exhibit moderate effect sizes, while the distance parameter shows the smallest and most stable impacts, reflected in comparatively low values across all three Examples. The ANOVA results reinforce these observations. In Example 2, c dominates the variance contribution, indicating markedly higher explanatory power than the remaining factors. Example 3 shows a similar pattern, with c accounting for nearly one quarter of the variance, followed by IR with a moderate contribution. In Example 1, the effects of c, IR, and m are more balanced, although all remain statistically significant at p < 0.001. Across all Examples, the distance factor consistently yields the smallest values, confirming its limited influence on ARI compared with the other parameters.
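As a reminder of how Morris-type indices are read, the sketch below computes the mean absolute elementary effect and the standard deviation of the elementary effects for a toy model. It uses a simplified one-at-a-time design from random base points rather than Morris's original trajectory design, and it is not the configuration behind Table 14; the model, the step size and the function name are hypothetical.

```python
import random
import statistics


def morris_indices(model, n_factors, n_base_points=20, delta=0.25, seed=0):
    """Screening indices from one-at-a-time elementary effects on [0, 1]^d.

    Returns (mu_star, sigma): the mean absolute elementary effect and the
    sample standard deviation of the elementary effects, per factor.
    """
    rng = random.Random(seed)
    effects = [[] for _ in range(n_factors)]
    for _ in range(n_base_points):
        base = [rng.uniform(0.0, 1.0 - delta) for _ in range(n_factors)]
        y0 = model(base)
        for i in range(n_factors):
            shifted = list(base)
            shifted[i] += delta  # perturb one factor at a time
            effects[i].append((model(shifted) - y0) / delta)
    mu_star = [statistics.fmean([abs(e) for e in ee]) for ee in effects]
    sigma = [statistics.stdev(ee) for ee in effects]
    return mu_star, sigma


def toy_model(x):
    """Hypothetical response: linear in x0, nonlinear in x1, unaffected by x2."""
    return 3.0 * x[0] + x[1] ** 2


mu_star, sigma = morris_indices(toy_model, n_factors=3)
```

A large mean absolute effect marks an influential factor, a large standard deviation signals nonlinearity or interaction, and both being near zero marks a factor with little influence, which is the pattern the text above reports for the distance factor.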

The full parameter set is simulated, including IR, the sets of m and c values, and the distance measures. Each configuration is repeated 10 times. Tables 15–17 display the Sum of Squares (SS), Mean Square (MS), F-statistic, p-value, and influence coefficient for each factor in the three independent examples, respectively. Table 18 reports the same quantities for the application and excludes ARI because true labels are unavailable. These values are calculated using one-way ANOVA. The larger the F-statistic and the influence coefficient, the greater the factor’s influence on clustering quality at the statistical significance level p.
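The quantities listed above can be sketched for a single factor as follows. The example computes the between-group and within-group sums of squares, the mean squares and the F-statistic of a one-way ANOVA, and reports eta squared (SS between divided by SS total) as a stand-in for the influence coefficient, which is an assumption about how that coefficient is defined. The p-value is omitted because it requires the F distribution, which the standard library does not provide. The sample values are made up for illustration.

```python
def one_way_anova(groups):
    """One-way ANOVA table entries for a list of groups of observations."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (mu - grand_mean) ** 2 for g, mu in zip(groups, means))
    ss_within = sum((x - mu) ** 2 for g, mu in zip(groups, means) for x in g)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within

    return {
        "SS_between": ss_between,
        "SS_within": ss_within,
        "MS_between": ms_between,
        "MS_within": ms_within,
        "F": ms_between / ms_within,
        # Assumed definition of the influence coefficient (eta squared).
        "eta_squared": ss_between / (ss_between + ss_within),
    }


# Hypothetical scores of one criterion at three levels of one factor.
table = one_way_anova([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
```

For these made-up groups the between-group sum of squares is 26 and the within-group sum is 6, giving F = 13 and eta squared = 0.8125.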

Table 15. Post-hoc ANOVA for each factor (Example 1).

https://doi.org/10.1371/journal.pone.0349753.t015

Table 16. Post-hoc ANOVA for each factor (Example 2).

https://doi.org/10.1371/journal.pone.0349753.t016

Table 17. Post-hoc ANOVA for each factor (Example 3).

https://doi.org/10.1371/journal.pone.0349753.t017

Table 18. Post-hoc ANOVA for each factor (Application).

https://doi.org/10.1371/journal.pone.0349753.t018

B.2 Pair configurations

Figs 15–17 show the correlation between each pair of BKIFF parameters (IR, c, m, and distance) for the three Examples, respectively. Each coloured box shows the degree of correlation between the two factors across the four criteria: ARI, Silhouette, number of iterations, and computation time. Because true labels are missing, Fig 18 excludes ARI. Light colours indicate strong correlation, while dark blue indicates weak or almost nonexistent effects. The results show that the number of clusters c and the choice of distance have the greatest influence, while IR typically contributes very little to the variation in the indices.

Fig 15. Correlation of pair factors of BKIFF (Example 1).

https://doi.org/10.1371/journal.pone.0349753.g015

Fig 16. Correlation of pair factors of BKIFF (Example 2).

https://doi.org/10.1371/journal.pone.0349753.g016

Fig 17. Correlation of pair factors of BKIFF (Example 3).

https://doi.org/10.1371/journal.pone.0349753.g017

Fig 18. Correlation of pair factors of BKIFF (Application).

https://doi.org/10.1371/journal.pone.0349753.g018

Acknowledgments

The authors sincerely thank the Associate Editors and anonymous reviewers for their constructive comments, which greatly improved the quality of this paper. The authors also express their gratitude to Dr. Thao Nguyen-Trang for his valuable advice and support during the preparation of this work.

References

  1. Nguyentrang T, Vovan T. Fuzzy clustering of probability density functions. J Appl Stat. 2016;44(4):583–601.
  2. Nguyen-Trang T, Nguyen-Thoi T, Nguyen-Thi K-N, Vo-Van T. Balance-driven automatic clustering for probability density functions using metaheuristic optimization. Int J Mach Learn & Cyber. 2022;14(4):1063–78.
  3. Deshpande A, Guestrin C, Madden SR, Hellerstein JM, Hong W. Model-based approximate querying in sensor networks. The VLDB Journal. 2005;14(4):417–43.
  4. Tavakkol B, Jeong MK, Albin SL. Validity indices for clusters of uncertain data objects. Annals Operat Res. 2021;303:321–57.
  5. Qin B, Xia Y, Li F. DTU: a decision tree for uncertain data. In: Lecture notes in computer science. Springer Berlin Heidelberg; 2009. 4–15. https://doi.org/10.1007/978-3-642-01307-2_4
  6. Kriegel H-P, Pfeifle M. Density-based clustering of uncertain data. In: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, 2005. 672–7. https://doi.org/10.1145/1081870.1081955
  7. Hron K, Menafoglio A, Templ M, Hrůzová K, Filzmoser P. Simplicial principal component analysis for density functions in Bayes spaces. Computational Statistics & Data Analysis. 2016;94:330–50.
  8. Minami M, Lennert-Cody CE. Regression tree and clustering for distributions, and homogeneous structure of population characteristics. J Agricul Biol Environ Stat. 2024;30(4):1019–38.
  9. Gullo F, Tagarelli A. Uncertain centroid based partitional clustering of uncertain data. arXiv preprint. 2012.
  10. Gullo F, Ponti G, Tagarelli A. Clustering uncertain data via K-medoids. In: Lecture notes in computer science. Springer Berlin Heidelberg; 2008. 229–42. https://doi.org/10.1007/978-3-540-87993-0_19
  11. Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis. John Wiley & Sons; 2009.
  12. Duda RO, Hart PE. Pattern classification and scene analysis. New York: Wiley. 1973.
  13. Jain AK, Duin PW, Jianchang Mao. Statistical pattern recognition: a review. IEEE Trans Pattern Anal Machine Intell. 2000;22(1):4–37.
  14. Chau M, Cheng R, Kao B, Ng J. Uncertain data mining: an example in clustering location data. In: Lecture notes in computer science. Springer Berlin Heidelberg; 2006. 199–204. https://doi.org/10.1007/11731139_24
  15. Lee SD, Kao B, Cheng R. Reducing UK-means to K-means. In: Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007), 2007. 483–8. https://doi.org/10.1109/icdmw.2007.40
  16. Vo Van T, Pham-Gia T. Clustering probability distributions. J Appl Stat. 2010;37(11):1891–910.
  17. Ngai W, Kao B, Chui C, Cheng R, Chau M, Yip K. Efficient Clustering of Uncertain Data. In: Sixth International Conference on Data Mining (ICDM’06), 2006. 436–45. https://doi.org/10.1109/icdm.2006.63
  18. Diem HK, Trung VD, Trung NT, Van Tai V, Thao NT. A differential evolution-based clustering for probability density functions. IEEE Access. 2018;6:41325–36.
  19. Yang B, Zhang Y. Kernel based K-medoids for clustering data with uncertainty. In: Lecture notes in computer science. Springer Berlin Heidelberg; 2010. 246–53. https://doi.org/10.1007/978-3-642-17316-5_23
  20. Kriegel H, Pfeifle M. Hierarchical density-based clustering of uncertain data. In: Fifth IEEE International Conference on Data Mining (ICDM’05). IEEE; 689–92. https://doi.org/10.1109/icdm.2005.75
  21. Zhang X, Liu H, Zhang X. Novel density-based and hierarchical density-based clustering algorithms for uncertain data. Neural Netw. 2017;93:240–55. pmid:28686946
  22. Gullo F, Ponti G, Tagarelli A, Greco S. A hierarchical algorithm for clustering uncertain data via an information-theoretic approach. In: 2008 Eighth IEEE International Conference on Data Mining, 2008. 821–6.
  23. Phamtoan D, Vovan T. Automatic fuzzy clustering for probability density functions using the genetic algorithm. Neural Comput Applic. 2022;34(17):14609–25.
  24. Vo-Van T, Nguyen-Thoi T, Vo-Duy T, Ho-Huu V, Nguyen-Trang T. Modified genetic algorithm-based clustering for probability density functions. J Stat Comp Simul. 2017;87(10):1964–79.
  25. Nguyen-Trang T, Nguyen-Thoi T, Vo-Van T. Globally automatic fuzzy clustering for probability density functions and its application for image data. Appl Intell. 2023;53(15):18381–97.
  26. Tran-Nam H, Nguyen-Trang T, Che-Ngoc H. A new possibilistic-based clustering method for probability density functions and its application to detecting abnormal elements. Sci Rep. 2024;14(1):17871. pmid:39090197
  27. Phamtoan D, Vovan T. Improving fuzzy clustering algorithm for probability density functions and applying in image recognition. MAS. 2020;15(3):249–61.
  28. PhamToan D. A enhanced fuzzy clustering algorithm for probability density functions and image clustering using Inception Resnet-v2 features. Data Min Knowl Disc. 2025;39(5).
  29. Chen JH, Chang YC, Hung WL. A robust automatic clustering algorithm for probability density functions with application to categorizing color images. Commun Statistics - Simulation and Comp. 2017;47(7):2152–68.
  30. Nguyen-Trang T, Nguyen-Hoang Y, Vo-Van T. A new semi-supervised clustering algorithm for probability density functions and applications. Neural Comput Applic. 2024;36(11):5965–80.
  31. Pu Y, Yao W, Li X. EM-IFCM: fuzzy c-means clustering algorithm based on edge modification for imbalanced data. Inform Sci. 2024;659:120029.
  32. Zhou K, Yang S. Exploring the uniform effect of FCM clustering: a data distribution perspective. Knowledge-Based Syst. 2016;96:76–83.
  33. Zhou K, Yang S. Effect of cluster size distribution on clustering: a comparative study of k-means and fuzzy c-means clustering. Pattern Anal Applic. 2019;23(1):455–66.
  34. Xiong H, Wu J, Chen J. K-means clustering versus validation measures: a data distribution perspective. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 2006. 779–84.
  35. Liang J, Bai L, Dang C, Cao F. The $K$-means-type algorithms versus imbalanced data distributions. IEEE Trans Fuzzy Syst. 2012;20(4):728–45.
  36. Celebi ME, Kingravi HA, Vela PA. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst Appl. 2013;40(1):200–10.
  37. Hellinger E. Neue begründung der theorie quadratischer formen von unendlichvielen veränderlichen. J für die reine und angewandte Mathematik. 1909;1909(136):210–71.
  38. Wu C-F, Lai J-H, Chen S-H, Trac LVT. Key factors promoting the niche establishment of black-winged kite Elanus caeruleus in farmland ecosystems. Ecol Indicators. 2023;149:110162.
  39. Ramli R, Fauzi A. Nesting biology of Black-shouldered Kite (Elanus caeruleus) in oil palm landscape in Carey Island, Peninsular Malaysia. Saudi J Biol Sci. 2018;25(3):513–9. pmid:29686514
  40. Wang J, Wang W, Hu X, Qiu L, Zang H. Black-winged kite algorithm: a nature-inspired meta-heuristic for solving benchmark functions and engineering problems. Artif Intell Rev. 2024;57(4).
  41. Arthur D, Vassilvitskii S. k-means++: the advantages of careful seeding. Stanford; 2006.
  42. Haohao M, As’arry A, Yanwei F, Lulu C, Delgoshaei A, Ismail MIS, et al. Improved black-winged kite algorithm and finite element analysis for robot parallel gripper design. Advances in Mechanical Engineering. 2024;16(10).
  43. Jiang M, Feng X, Wang C, Fan X, Zhang H. Robust color image watermarking algorithm based on synchronization correction with multi-layer perceptron and Cauchy distribution model. Appl Soft Comp. 2023;140:110271.
  44. Bezdek JC. A convergence theorem for the fuzzy ISODATA clustering algorithms. IEEE Trans Pattern Anal Mach Intell. 1980;2(1):1–8. pmid:22499617
  45. Hubert L, Arabie P. Comparing partitions. J Classification. 1985;2:193–218.
  46. Bradley PS, Fayyad UM. Refining initial points for k-means clustering. In: ICML. Citeseer; 1998. 91–9.
  47. Dunn JC. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. 1973.
  48. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987;20:53–65.
  49. Brodatz P. Textures: a photographic album for artists and designers. New York: Dover Publications; 1966.
  50. Scott DW. On optimal and data-based histograms. Biometrika. 1979;66(3):605–10.
  51. Vovan T. Cluster width of probability density functions. Intell Data Anal. 2019;23(2):385–405.
  52. Tsvetkova O, Ouarda TBMJ. A review of sensitivity analysis practices in wind resource assessment. Energy Conversion and Management. 2021;238:114112.
  53. Nguyen‐Trang T, Vo‐Van T, Che‐Ngoc H. An efficient automatic clustering algorithm for probability density functions and its applications in surface material classification. Statistica Neerlandica. 2023;78(1):244–60.
  54. Gan Y, You Y, Huang J, Xiang S, Tang C, Hu W, et al. Multi-view clustering via multi-stage fusion. IEEE Trans Multimedia. 2025;27:4571–83.
  55. Yang C, Yue H. Consensus partition guided incomplete multi-view clustering. IEEE Access. 2025;13:40198–209.
  56. Che-Ngoc H, Nguyen-Trang T, Nguyen-Bao T, Nguyen-Thoi T, Vo-Van T. A new approach for face detection using the maximum function of probability density functions. Ann Oper Res. 2020;312(1):99–119.
  57. Chen Y-J, Lin S-S, Shi Y, Ho T-Y, Xu X. MCC: multi-cluster contrastive semi-supervised segmentation framework for echocardiogram videos. IEEE Access. 2025;13:30543–54.
  58. Hassan E, Malik F, Khan QW, Ahmad N, Sardaraz M, Karim FK, et al. A hybrid K-Means++ and particle swarm optimization approach for enhanced document clustering. IEEE Access. 2025;13:48818–40.
  59. Xue F, Knight S, Connolly E, Shirsath MA, Newman L, Duggan E, et al. Functional clustering of systolic blood pressure and frontal brain oxygenation during an active stand test in the irish longitudinal study on aging (TILDA): a comparison of tissue saturation index versus absolute oxygenated hemoglobin concentration approaches. IEEE Sensors J. 2025;25(1):871–80.
  60. Gastón-Romeo M, Leon T, Mallor F, Ramírez-Santigosa L. A morphological clustering method for daily solar radiation curves. Solar Energy. 2011;85(9):1824–36.
  61. Abuzeid A, Jolkver E. Rare event detection by progressive clustering undersampling. PLoS One. 2026;21(1):e0340758. pmid:41616015
  62. Hemmatian J, Hajizadeh R, Nazari F. Addressing imbalanced data classification with cluster-based reduced noise SMOTE. PLoS One. 2025;20(2):e0317396. pmid:39928607
  63. Zangwill WI. Nonlinear programming: a unified approach. 1969.
  64. Qiao K, Zhang J, Chen J. Two effective heuristic methods of determining the numbers of fuzzy clustering centers based on bilevel programming. Applied Soft Computing. 2023;132:109718.