
Distributed K-Means algorithm based on a Spark optimization sample

  • Yongan Feng ,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Software, Writing – original draft

    fengyongan@lntu.edu.cn

    Affiliation Liaoning Technical University, Huludao, China

  • Jiapeng Zou,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Writing – original draft

    Affiliation Liaoning Technical University, Huludao, China

  • Wanjun Liu,

    Roles Conceptualization, Data curation, Writing – original draft

    Affiliation Liaoning Technical University, Huludao, China

  • Fu Lv

    Roles Conceptualization, Data curation, Methodology, Writing – original draft

    Affiliation Liaoning Technical University, Huludao, China

Abstract

To address the instability and performance issues of the classical K-Means algorithm when dealing with massive datasets, we propose SOSK-Means, an improved K-Means algorithm based on Spark optimization. SOSK-Means incorporates several key modifications to enhance the clustering process. First, a weighted jump reservoir approach is introduced to enable efficient random sampling and preclustering. By incorporating weights and jump pointers, this approach improves the quality of the initial centers and reduces sensitivity to their selection. Second, we use a weighted max-min distance with variance to calculate distances, considering both weight and variance information. This enables SOSK-Means to identify clusters that are farther apart and denser, enhancing clustering accuracy. The best initial centers are selected using the mean square error criterion, ensuring that they represent the distribution and structure of the dataset and thereby improving clustering performance. During the iteration process, a novel distance comparison method reduces computation time, optimizing the overall efficiency of the algorithm. Additionally, SOSK-Means uses a Directed Acyclic Graph (DAG) to optimize performance through distributed strategies, leveraging the capabilities of the Spark framework. Experimental results show that SOSK-Means significantly improves computational speed while maintaining high computational accuracy.

1 Introduction

Clustering is an unsupervised learning algorithm that can partition data into multiple classes without prior training. Common clustering algorithms can be classified into the following categories [1]: hierarchical, partitioning, density-based, grid-based, and model-based methods. The K-Means algorithm is one of the partitioning methods and has been widely used in scientific and industrial fields due to its simplicity, low time consumption, and low space complexity. However, prior to its application, two parameters must be determined: the number of clusters and the initial clustering centers. These two parameters directly affect the accuracy of the clustering result. The classical K-Means algorithm is especially sensitive to the initial clustering centers; that is, the clustering results change according to the choice of initial centers, and random centers easily cause the algorithm to fall into a local optimum. Direct application of the K-Means method may therefore yield poor results. In addition, when traditional serial K-Means clustering is used to handle massive data, performance degrades rapidly because memory and time are limited.

Therefore, to address the above two problems in the traditional K-Means algorithm, this paper makes the following improvements:

  1. We utilize a weighted jump reservoir approach to perform random sampling and preclustering, incorporating the concepts of weights and jump pointers.
  2. We employ a weighted max-min distance with variance technique, which considers both the weight and variance information during distance calculation.
  3. We determine the best initial centers using the mean square error approach, ensuring that the initial centers accurately represent the distribution and structure of the dataset.
  4. A novel distance comparison method is employed to optimize the iterative process and reduce computation time.
  5. We describe the algorithm's DAG, which is used to optimize performance based on a distributed strategy.

The organization of the remaining paper is as follows: Section 2 introduces the improvement of the K-Means algorithm, including related distributed work; Section 3 introduces the basic knowledge of this paper; Section 4 introduces the SOSK-Means algorithm; Section 5 shows a comparison of experiments and analysis; and we conclude the paper in Section 6.

2 Related work

Researchers have developed various methods to address the sensitivity of the K-Means algorithm to initial centroids and outliers and to enhance its stability and scalability on large datasets. Oliveira et al. [2] proposed two scalable metaheuristic algorithms for clustering large datasets in MapReduce: the first iteratively enhances k-means clustering using evolutionary operators, while the second applies evolutionary k-means to the distributed parts of the dataset and merges the results. Al-Kababchee et al. [3] proposed an enhanced K-means clustering method using balanced optimization that adjusts the number of clusters and selects attributes for the best results. Chen et al. [4] proposed a method that calculates the maximum distance between two data objects to determine initial centroids. Pun et al. [5] suggested using entropy and kurtosis coefficient values to select the distance metric for clustering. Liao et al. [6] combined density-based selection with distance measures. Thamer et al. [7] proposed an enhanced kernel K-means clustering method that effectively tunes the hyperparameters of the kernel function and the number of clusters. Arthur et al. [8] introduced the sampling-based K-Means++ algorithm, which prioritizes points farther away from the current centroids. Bahmani et al. [9] improved K-Means++ with the K-Means|| algorithm, suitable for massive data on MapReduce. Cui et al. [10] proposed a weighted and distribution-based merge strategy. Sarch et al. [11] proposed an improved nature-inspired algorithm for penalized regression-based clustering, which enhances its estimation capabilities in genetic analysis and data mining applications. Fahim et al. [12] improved the center allocation process by considering the distances from the new and old centroids. Zhao et al. [13] developed a parallel K-Means algorithm based on MapReduce, optimizing network communication. Moertini et al. [14] enhanced the algorithm to handle noise and outliers through data preprocessing. Yin et al. [15] introduced a random sample-based K-Means algorithm on MapReduce. Cai et al. improved centroid selection using sample neighborhood and intra-size variance. Lei et al. [16] proposed an efficient clustering algorithm based on local optimality and graph search. These algorithmic improvements have mostly been implemented on MapReduce, although Spark has also been used for optimization due to its in-memory computing capabilities. Zakariya et al. [17] proposed an improved nature-inspired algorithm for penalized regression-based clustering that enhances its estimation capabilities in genetic analysis and data mining applications. Kusuma et al. [5] introduced an intelligent K-Means algorithm on Spark that clusters data close to outliers separately, removes objects in the outlier-formed cluster, and iteratively finds new outliers as initial centroids. Wang et al. [18] emphasized the flexibility of the Spark-based K-Means algorithm, which allows the selection of different distance functions and computational methods for distance and centroid updates. Lydia et al. [19] demonstrated that Spark-based K-Means outperforms MapReduce in terms of execution time, scheduling delay, acceleration ratio, and resource consumption. Liu et al. [20] developed a parallel K-Means algorithm for massive texts on Spark using resilient distributed datasets (RDDs) for improved computational efficiency. Santhi et al. [21] proposed an optimized K-Means clustering technique incorporating the Bat algorithm and the Firefly algorithm to determine optimal initial centroids on Spark. Amal et al. [22] proposed an adaptive firefly optimization algorithm that improves the performance of K-means clustering by addressing the challenges of determining the number of clusters and the optimal centroids. Sinha et al. [23] addressed the issue of determining the number of clusters and proposed a novel K-Means-based algorithm that dynamically decides the number of clusters by increasing k and setting a threshold for optimal cluster generation.

3 Basic knowledge

3.1 K-Means algorithm

K-Means is one of the partition-based clustering algorithms. For the dataset containing n data objects, the K-Means algorithm clusters these objects into k clusters. Data objects are assigned to the cluster centres, which are chosen randomly, and at the end of each iteration, the cluster centres are updated until there is no change in the centres.

The pseudocode of the K-Means algorithm is shown in Algorithm 1:

Algorithm 1 K-Means Clustering

1: Input: dataset D, the number of clusters K, maximum number of iterations itr

2: Output: K clusters and K cluster centroids

3: randomly choose K objects from D as the initial centroids C

4: while t ≤ itr and the centroids change do

5:  for each object xi in D do

6:   for each centroid Cj in C do

7:    calculate the distance from xi to Cj

8:   end for

9:   assign xi to the nearest centroid

10:  end for

11:  calculate new cluster centres

12: end while
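A minimal plain-Python sketch of Algorithm 1 follows (an illustration only, not the distributed implementation described later; the dataset is assumed to be a list of numeric tuples):

```python
import random

def kmeans(data, k, itr=100):
    """Minimal K-Means: random initial centroids, iterate until stable or itr reached."""
    centroids = random.sample(data, k)
    for _ in range(itr):
        # Assign each object to its nearest centroid (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for x in data:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centroids[i])))
            clusters[j].append(x)
        # Recompute each centre as the mean of its cluster.
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no change in the centres: converged
            break
        centroids = new_centroids
    return centroids, clusters
```

As the paper notes, the result depends heavily on which k objects the random initialization happens to pick.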

3.2 Random sampling

Random sampling selects a random sample of size n from a set of size N (where n ≪ N), ensuring that every record has a chance of being selected. Although the method is simple, it may be unsuitable when the dataset is uncertain and records must be numbered during sampling. In data mining, however, the overall dataset is known, so this uncertainty does not affect sampling, and good results can be obtained. Random sampling is particularly suitable for large-scale data processing, since a larger sample better represents the entire dataset.

3.3 Reservoir sampling

Reservoir sampling is a simple and unbiased random sampling technique proposed in the literature [24], which selects without replacement a random sample of size n from a set of size N. The term “reservoir” denotes a storage area holding the candidates for the sample. The first step of the reservoir algorithm is to put the first n records of the file into the reservoir. The remaining records are then examined in sequence to evaluate whether each should replace an existing record in the reservoir; if a record passes the evaluation, a record is randomly selected from the reservoir and replaced. When all records have been traversed, a random sample of size n has been generated. Reservoir sampling ensures that each record is drawn with equal probability.
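For contrast with the weighted variant introduced in Section 4.2, the classical reservoir scheme (the textbook Algorithm R) can be sketched as:

```python
import random

def reservoir_sample(stream, n):
    """Select n records uniformly at random from a stream of unknown size."""
    reservoir = []
    for t, record in enumerate(stream):
        if t < n:
            reservoir.append(record)      # fill the reservoir with the first n records
        else:
            j = random.randrange(t + 1)   # record t+1 survives with probability n/(t+1)
            if j < n:
                reservoir[j] = record     # replace a random existing record
    return reservoir
```

Every record is kept with equal probability n/N, which is exactly the uniformity property the next section relaxes by introducing weights.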

4 SOSK-Means algorithm

4.1 Concept of the algorithm

The classical K-Means algorithm begins by randomly selecting initial centroid points for clustering. However, this approach is sensitive to initialization: the clustering results are influenced by the initial centroids, and random selection can lead to local optima [1]. Therefore, it is crucial to choose suitable initial centroids for the K-Means algorithm. Sampling, a commonly used statistical method, has been introduced into clustering. Arthur et al. [8] select one sample at a time based on probabilities, with objects that are farther away having a higher probability of being chosen. Other works select multiple samples at once, considering both the probability and the mutual distance between data objects. However, the limitation of this method is that the selection of the next sample depends on the previously selected sample, and only distance is considered for selecting outliers.

In a method that considers overall features and determines the appropriate number of samples, initial centroids are selected through pre-clustering. Yin et al. [15] shows that increasing the number of samples can significantly improve the clustering results’ quality. If the number of pre-clusters k′ ≫ k, it increases the likelihood of obtaining the global optimum.

In this paper, the SOSK-Means algorithm utilizes the weighted jump reservoir method to perform random sampling i times on the dataset, obtaining n′ samples for pre-clustering and forming k′ clusters and k′ initial cluster centroids at each sampling. This approach accounts for the fact that intra-cluster data distributions, cluster sizes, and inter-cluster distances differ. The algorithm calculates the weighted intra-cluster variance by evaluating the radius and variance of each cluster, where variance reflects the degree of dispersion of the data objects around the cluster centroid. The SOSK-Means algorithm employs the weighted max-min distance with variance to select k clusters, reducing the chance of selecting sparse clusters with outliers. The selected k clusters have dense intra-cluster data and well-separated inter-cluster data. By calculating the centroid values of the k clusters in each of the i samplings, i × k candidate centroid values are obtained. The best initial centroids are then selected based on the mean square error, representing the clustering centroids of the original data. During the iteration, a novel distance comparison method is used to reduce computation time.

4.2 Weighted jump reservoir sampling

From the previous section, reservoir sampling can ensure the randomness of sampling when dealing with large data of unknown size. Since each data point has the same probability of being extracted, this method can be regarded as uniform random sampling. However, if the distribution of the dataset is extremely uneven, the sample taken may not be sufficient to represent the entire dataset. If the sample is selected by iterating all the data, it increases the computational cost.

Therefore, based on the literature [25], this paper improves reservoir sampling and proposes weighted jump reservoir sampling. vi is assumed to be a data point, and the weight of vi is wi (wi > 0). The random function random(L, H) generates a uniform random number in (L, H). Given a random variable X, FX(x) is its distribution function.

Definition 1: Assume U1 and U2 are random variables that obey a uniform distribution on [0, 1], X1 = U1^(1/w1), X2 = U2^(1/w2), w1, w2 > 0; then:

P(X1 > X2) = w1 / (w1 + w2)  (1)

Therefore, the probability that a data point is selected for sampling is proportional to its weight wi, and the probability of each data point being selected need not be known before sampling. If the weight of a data point is greater than that of other data points, the value of Xi is closer to 1, and the data point is more likely to be retained. In the following, Xi is written as the key value ki.

Definition 2: It can be seen by mathematical induction that for α ∈ [0, 1], wi > 0, ki = Ui^(1/wi), Ui ∼ U(0, 1):

P(ki ≤ α, i = 1, 2, …, n) = α^(w1 + w2 + … + wn)  (2)

The smallest key in R is the current threshold T. The following random variable Xw can be used to generate the appropriate exponential “jumps”:

Xw = log(random(0, 1)) / log(T)  (3)

Let Wi,j = wi + wi+1 + … + wj (i ≤ j); when Xw satisfies the following formula, the data point vi is read:

Wc,i−1 < Xw ≤ Wc,i  (4)

The probability of jumping to read vi follows from Eq (2):

Pjump(vi) = T^(Wc,i−1) − T^(Wc,i)  (5)

The eigenvalue ki of vi is compared with T to decide whether vi enters the reservoir for replacement. By Eq (2), the probability of entering the reservoir is:

P(ki > T) = 1 − T^(wi)  (6)

Definition 3: Assume that the algorithm has processed c − 1 items (c > m) and that T is the current threshold in the reservoir. The next item to be examined is Vc. According to Eqs (2) and (4), for each i = c, c + 1, …, n, the probability Pnext(Vi) that Vi is the next item to enter the reservoir is:

Pnext(Vi) = T^(Wc,i−1) · (1 − T^(wi)) = T^(Wc,i−1) − T^(Wc,i)  (7)

The two main operations of weighted jump reservoir sampling are generating weights for the data and reading data by jumping, using a reservoir of size m to store the final sample candidates. Comparing key values ensures that data with greater weight are more likely to be retained, and jumping reduces the number of random variables that must be generated. All operations can be completed in one pass over the dataset. Since generating a large number of random variables can be computationally costly, weighted jump reservoir sampling reduces the complexity of classical reservoir sampling.

The pseudocode of weighted jump reservoir sampling is shown in Algorithm 2.

Algorithm 2 Weighted Jump Reservoir Sampling

1: Input: A population V of n weighted data

2: Output: A reservoir R of size m

3: Insert the first m items of V into R

4: for each item xi ∈ R do

5:  Calculate a key ki = ui^(1/wi), where ui = random(0, 1)

6: end for

7: The smallest key ki in R is the current threshold T

8: repeat

9:  Generate Xw according to Eq (3)

10:  Jump from the current data Vc to Vi satisfying inequality (4); then, make Vi become Vc

11:  Let tw = T^(wi), r = random(tw, 1), calculate ki = r^(1/wi)

12:  if ki > T then

13:   Replace the data item with the minimum key T in R with item Vi

14:   Calculate the new T in R

15:  end if

16: until the data set is processed
4.3 Sample-based initial cluster centroid selection scheme

The original dataset is assumed to be {x1, x2, …, xn}, where n is the size of the dataset, and each data point has w-dimensional features, i.e., xi = (xi1, xi2, …, xiw). The dataset is divided into k clusters, and the centres of the data clusters are c1, c2, …, ck.

After weighted jump reservoir sampling obtains the sample dataset, pre-clustering the sample is equivalent to performing one clustering. The pre-clustering number k′ is much larger than the real clustering number k because the K-Means algorithm is a stochastic hill-climbing algorithm on the log-likelihood function space; making k′ ≫ k is similar to developing more climbing paths, which increases the possibility of reaching the global optimum.

The literature [16] recommends a value of k′ expressed in terms of μmin, where μmin is the amount of data contained in the smallest cluster, and n is the overall amount of data. However, it is sometimes difficult to know the value of n for a large dataset, so the algorithm in this paper automatically counts n and then uses it to calculate k′.

Definition 4: The selection of the pre-clustering number k′ is: (8) where n/k is replaced by μmin because if the value of n cannot be determined, the value cannot be accurately obtained.

At the same time, to ensure that k′ centroids can be formed, the number of samples must be greater than k′; otherwise, clustering cannot be performed. It therefore suffices for the number of samples to exceed max(αk, n^(5/8)), since max(αk, n^(5/8)) ≥ min(αk, n^(5/8)).

4.4 Selecting initial cluster centroids

k′ clusters are formed by pre-clustering the sample, k clusters are selected by the weighted max-min distance with intra-cluster variance, and the initial centres are then obtained.

(1) Cluster Radius

The cluster radius is the maximum distance between all the points in a cluster and its centroid:

rk = max{d(xi, ck)}, xi ∈ Ck  (9)

where xi is an object in cluster k, and d(xi, ck) is the Euclidean distance between the data point xi and the centroid ck.

(2) Intra-cluster Variance

The variance measures the degree of dispersion between the data and the expected value (i.e., the mean). The intra-cluster variance is calculated as follows:

vark = (1/|Ck|) Σ (d(xi, ck) − Mk)², xi ∈ Ck  (10)

where Mk is the average distance of the objects xi to the centroid ck, and |Ck| is the number of data points in cluster k. The intra-cluster variance reflects the degree of deviation of the distances d(xi, ck) from Mk; the smaller the intra-cluster variance, the denser the objects in the cluster.

After obtaining the radius and variance of each cluster, the weighted intra-cluster variance is:

σk = (rk / max{R}) · vark  (11)

where R is the set of all cluster radii, and the fraction in Eq (11) normalizes rk to a number in [0, 1].

(3) Weighted Max-Min Distance with Variance

After pre-clustering into k′ clusters, the cluster centres constitute the candidate set T = {t1, t2, …, tk′} of initial cluster centres, and the final set of initial cluster centroids is C = {c1, c2, …, ck}. For a candidate tk in T, the weighted max-min distance with variance is:

D(tk) = min{d(ci, tk)} / σk, ci ∈ C  (12)

where min{d(ci, tk)} is the minimum distance between the candidate tk and the set C of selected initial centroids.

The larger min{d(ci, tk)} is, the greater the distance between the candidate and the set of selected initial centroids; the smaller the weighted intra-cluster variance of cluster k, the denser its data distribution. When the weighted max-min distance is larger, the intra-cluster data are dense and the inter-cluster data are far apart, and the corresponding cluster centroid is taken as the next initial cluster centroid.

Therefore, the steps to select the initial cluster centre after pre-clustering are as follows:

  1. Randomly choose a candidate ta from T as the first initial clustering centroid c1, c1 = ta, and delete ta from T.
  2. Choose a candidate tb as the next initial clustering centroid c2, c2 = tb, where tb maximizes the weighted max-min distance with variance of Eq (12) over all tk ∈ T, and delete tb from T.
  3. Repeat Step 2 until a total of k centres are chosen from the remaining candidates in T.
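The selection steps above can be sketched as follows. Since Eq (12) rewards a large minimum distance to the already-chosen centres and a small weighted intra-cluster variance, this sketch scores each candidate by the ratio of the two; the exact functional form and the deterministic choice of the first centre are assumptions:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_initial_centres(candidates, variances, k):
    """candidates: pre-cluster centroids; variances: weighted intra-cluster
    variance of each pre-cluster (same order, all > 0). Returns k initial centres."""
    # Start from the candidate whose cluster has the smallest weighted variance.
    first = min(range(len(candidates)), key=lambda i: variances[i])
    chosen = [first]
    # Repeatedly take the candidate maximizing min-distance / weighted variance.
    while len(chosen) < k:
        best, best_score = None, -1.0
        for i in range(len(candidates)):
            if i in chosen:
                continue
            d = min(euclidean(candidates[i], candidates[j]) for j in chosen)
            score = d / variances[i]   # far from chosen centres, dense cluster
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return [candidates[i] for i in chosen]
```

With equal variances this degenerates to the classical max-min (farthest-first) rule; unequal variances penalize sparse clusters, which is the intended effect of the weighting.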

In the K-Means algorithm, the Euclidean distance is used to calculate the distance between data, and the data are divided into clusters. In the execution of the algorithm, it is only necessary to calculate which centre is closest to a certain point, and it is not necessary to calculate the exact Euclidean distance value from each data point to each centroid, so a quick comparison method to calculate distance is used in this paper.

Definition 5: The distance formula newDist is as follows:

newDist(c, x) = | ‖c‖2 − ‖x‖2 |  (13)

The Euclidean distance euclDist of the K-Means algorithm is calculated as:

euclDist(c, x) = sqrt(Σ (c1j − x1j)², j = 1, …, w)  (14)

where w is the number of features, the centre point c is (c11, c12, …, c1w), and the data point x is (x11, x12, …, x1w). Comparing the two distance formulas, it is easy to show that newDist ≤ euclDist.

When comparing the distance between a data point and each centre point, let bestDist be the minimum distance found so far; newDist can be obtained from the L2 norms of c and x. If newDist > bestDist, then euclDist > bestDist, so there is no need to calculate euclDist, which saves a large amount of computation. If newDist < bestDist, the Euclidean distance euclDist is calculated and compared with bestDist, directly reusing the L2 norms of c and x obtained when calculating newDist.
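A sketch of this comparison scheme, assuming newDist is the norm-difference lower bound |‖c‖2 − ‖x‖2| ≤ euclDist guaranteed by the triangle inequality:

```python
import math

def assign_nearest(x, centres):
    """Assign x to the nearest centre, skipping the exact distance computation
    whenever the lower bound |norm(c) - norm(x)| already exceeds the best distance."""
    norm_x = math.sqrt(sum(v * v for v in x))
    best_j, best_dist = -1, float("inf")
    for j, c in enumerate(centres):
        norm_c = math.sqrt(sum(v * v for v in c))  # in practice, precomputed once per centre
        new_dist = abs(norm_c - norm_x)            # newDist <= euclDist always holds
        if new_dist >= best_dist:
            continue                               # euclDist >= bestDist too: skip it
        eucl_dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(c, x)))
        if eucl_dist < best_dist:
            best_j, best_dist = j, eucl_dist
    return best_j, best_dist
```

Because the norm of each centre is fixed within an iteration, the bound costs one subtraction per comparison, while a full Euclidean distance costs w multiplications.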

4.5 Algorithm distributed strategy

It is found in the literature that traditional clustering algorithms become ineffective when clustering large datasets [26]. MapReduce can solve this problem, but MapReduce performs poorly on iterative calculations: each round of jobs must read data from disk and write results back to the disk file system, which greatly increases system overheads such as I/O, and iterative calculations are frequent in clustering. Therefore, the K-Means algorithm needs to be optimized for distributed computing. Spark is a new generation of large-scale data processing and computing frameworks. In terms of computing efficiency, its memory-based execution is much faster than MapReduce, which reads and writes to disk multiple times. Lydia et al. [27] pointed out that experiments show that when the data are in memory, the execution speed of Spark is up to 100 times faster than MapReduce, and even when accessing data from disk it is about 10 times faster.

According to the sample dataset obtained by sampling in Section 4.2, the subsequent processing of the SOSK-Means algorithm only processes the sample dataset instead of all the data.

(a) pre-clustering

Input: the RDD sample and the pre-clustering number k′. Output: an array of pre-clustering results preResults. The steps are as follows:

  1. Step 1: Sort the RDD samples in descending order, select the first k’ data points as the initial cluster centroids of the pre-clustering, and then broadcast the k’ centroids to all worker nodes.
  2. Step 2: Define the internal function merge, which is used to accumulate the value of two key-value sequences with the same key.
  3. Step 3: Perform the mapPartitions operation on the RDD example, calculate which cluster the data in each partition belongs to, and generate a key-value sequence.
  4. Step 4: Perform a reduceByKey operation on each key-value sequence to summarize, use the merge function to merge the values of the same key, and calculate the global data attributes of each cluster.
  5. Step 5: Calculate the sum of squares of the difference between the data in each cluster and the mean sqSums(j) and obtain key-value pairs of the form (j, sqSums(j)).
  6. Step 6: Combine the same key j to obtain the global key-value array tempList1.
  7. Step 7: According to tempList and tempList1, the key-value array preResults is generated to obtain k’ clusters.

(b) selecting cluster centroids

Input: the RDD sample and the pre-clustering results preResults. Output: the array of initial cluster centroids.

  1. Step 1: Add the cluster centroid of the cluster with the smallest variance in preResults to centersList as the first initial centroid and delete this centre from preResults.
  2. Step 2: Calculate the weighted variance of each cluster in preResults.
  3. Step 3: Repeat the following steps until the number of elements in centersList reaches k:
    1. (1) Perform a foreach operation on preResults and calculate the weighted max-min distance from the centre of each cluster to the centre of centersList.
    2. (2) Select the cluster centre corresponding to the largest distance to add to centersList and delete it from preResults.
  4. Step 4: Calculate the sum of squared errors from the group's centroids to the RDD sample.

(c) parallel K-Means clustering

Input: the dataset RDD, the maximum number of iterations iter, a threshold, and the initial centroid array centerList. Output: the updated centre array centerList.

  1. Step 1: Broadcast centerList to Worker node.
  2. Step 2: Repeat the following steps until the number of iterations exceeds iter or the change in the cluster centres is less than threshold:
    1. (1) Perform the mapPartitions operation on the RDD input, compare the distance from the centroid to the data in each partition, record the index of the centroid to which it belongs, and form a key-value sequence.
    2. (2) Perform the reduceByKey operation on the key-value sequence, merge the key-value sequence, remove the farthest and nearest data in each key, calculate the new centre point and update the centerList.
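The dataflow of steps (1)-(2) can be simulated locally in plain Python. This is a sketch of the pattern only: the trimming of the farthest and nearest data per key is omitted, and real code would apply mapPartitions and reduceByKey to a Spark RDD rather than to Python lists:

```python
def map_partition(partition, centres):
    """mapPartitions stage: emit (centre index, (sum vector, count)) per record."""
    pairs = []
    for x in partition:
        j = min(range(len(centres)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centres[i])))
        pairs.append((j, (list(x), 1)))
    return pairs

def reduce_by_key(pairs):
    """reduceByKey stage: merge (sum, count) values that share the same key."""
    acc = {}
    for j, (vec, cnt) in pairs:
        if j in acc:
            s, c = acc[j]
            acc[j] = ([a + b for a, b in zip(s, vec)], c + cnt)
        else:
            acc[j] = (vec, cnt)
    return acc

def update_centres(partitions, centres):
    """One clustering iteration over all partitions."""
    all_pairs = [p for part in partitions for p in map_partition(part, centres)]
    acc = reduce_by_key(all_pairs)
    new_centres = []
    for j, old in enumerate(centres):
        if j in acc:
            s, cnt = acc[j]
            new_centres.append(tuple(v / cnt for v in s))
        else:
            new_centres.append(old)  # an empty cluster keeps its previous centre
    return new_centres
```

Because each record is reduced to a (sum, count) pair keyed by its centre index, the per-cluster means can be recomputed from the merged values alone, which is what makes the step shuffle-friendly.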

4.6 Analysis of the processing time

Assuming that the number of rows of the original dataset is n0, the number of sampling times is m, ns rows of data are extracted each time, and the feature number of each row is C. F1 shuffle operations are generated for n0, and F2 shuffle operations are generated for ns. The number of cluster nodes is S, the communication bandwidth between the machines in the cluster is B, the i-th machine is Si, the total time for processing data is T(Si), and other time is t.

Assuming that the data of each machine are shuffled to other nodes and the amount of data remains unchanged, the total time spent T is: (15)

From the above Eq (15), the total time is mainly determined by the amount of data, the number of samples, the number of shuffles, and the execution efficiency of each machine, where ns is determined by n0. When the amount of data is constant and m is determined, reducing the number of shuffles and optimizing the time overhead of shuffles are the keys to solving the algorithm performance bottleneck.

4.7 Algorithm performance optimization

According to the theoretical analysis in the previous section, shuffling is the most performance-consuming part of the algorithm in Spark because this part contains a large number of disk IOs, serialization, network transmission, and other operations. If it is not properly tuned, the execution speed of the algorithm is very slow. Therefore, it is optimized according to the logical DAG diagram of the SOSK-Means algorithm. As shown in Fig 1, the application of the SOSK-Means algorithm generates at least four jobs:

  • job_0: Triggered by the count operator; it has only one stage and is used to read the dataset and compute its statistics.
  • job_1: Triggered by the take operator and divided into two stages: use the accumulator variable to summarize the calculation results and use the shuffle operation to extract sample data from the original data set.
  • job_i: Triggered by the pre-clustering collectAsMap operator, where i ≥ 1, and the value of i depends on the number of samples. Two stages are generated: the broadcast operation of the pre-clustering centre and the shuffle operation. Used for pre-clustering to select the initial centre.
  • job_j: Triggered by the collectAsMap operator of iterative clustering, where j ≥ 1, and the value of j depends on the number of iterations. It contains two stages: the broadcast operation of the optimal initial cluster centre and the shuffle operation.

According to the algorithm's DAG graph, the algorithm is optimized in the following aspects:

(1) Shuffle operator is avoided

No shuffle is generated in job_0. This job reads the original data set to form an RDD. Since this RDD is used multiple times, it is cached and persists in the memory to avoid cumbersome calculation from the source every time the operator is executed. Because the RDD data are directly extracted from the memory and then the shuffle operation is performed, the cost of reading and writing to the disk is reduced, and it is suitable for the iterative calculation of the algorithm.

(2) Shuffle operator is used to process partition

In job_1, the foreachPartition operator calculates the sum of all samples in each partition with a single call. Compared with the foreach operator, which is called once per record, this greatly improves performance. Similarly, the mapPartitions operator that performs local clustering in each partition in job_i improves performance in the same way, processing one partition per call instead of one record per call as the ordinary map operator does.

(3) Shuffle operators are used for pre-aggregation

Pre-aggregation means that when the tasks of the current stage execute shuffle write, each node performs an aggregation operation on records with the same key locally. Since multiple identical keys are merged, each node holds only one record per key locally. When the tasks of the next stage execute shuffle read and pull the same key from all nodes, the amount of data that needs to be pulled is greatly reduced, thereby reducing disk IO and network transmission overhead.

The reduceByKey operator is used in both job_i and job_j to replace the groupByKey operator: reduceByKey pre-aggregates records with the same key locally according to a user-defined function, whereas groupByKey performs no pre-aggregation, so its performance is relatively poor.

(4) Broadcast external variables

Both the pre-clustering and iterative clustering of the algorithm use external variables that hold the initial centres. By default, Spark replicates this variable and transfers a copy to each task in the Executor via the network. Especially in pre-clustering, the number of initial centres is much larger than in iterative clustering. If this variable is large, a large number of copies are transmitted over the network, occupying the memory of the Executor on each node and affecting performance.

Therefore, in this situation, the variable is broadcast. A broadcast variable guarantees that only one copy resides in the memory of each Executor, shared by all of that Executor's tasks. This not only reduces the number of copies and the Executor memory overhead but also avoids the load generated by transmitting the variable between nodes multiple times, improving execution efficiency.
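The saving can be expressed with a simple counting model (an illustration, not a measurement from the paper): by default one copy of the variable ships with every task, while a broadcast variable ships once per Executor.

```python
def shipped_copies_default(num_executors, tasks_per_executor):
    """Default closure shipping: one copy of the variable per task."""
    return num_executors * tasks_per_executor

def shipped_copies_broadcast(num_executors, tasks_per_executor):
    """Broadcast variable: one copy per Executor, shared by all its tasks."""
    return num_executors

# e.g. 8 Executors each running 4 tasks over the job's lifetime:
print(shipped_copies_default(8, 4))    # 32 copies cross the network
print(shipped_copies_broadcast(8, 4))  # 8 copies cross the network
```

With a large initial-center array, cutting the copy count from tasks to Executors is exactly the memory and network saving described above.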

(5) Merge temporary files

During shuffle write, by default each task of the current stage creates a temporary disk file for every task of the next stage. When the next batch of tasks runs, new temporary files are created again, so a large number of temporary files accumulates, which degrades performance.

To address this, the consolidation mechanism is enabled. When an Executor finishes one batch of tasks and starts the next, the new batch does not create fresh files but reuses the files created by the previous batch, appending its data to the existing temporary files. In this way, files created by multiple tasks are merged to a considerable extent, greatly reducing the number of temporary files and improving shuffle write performance.

(6) Dual aggregation

In clustering, if some clusters contain far more data than others, that is, if the data volume for certain keys in the program is excessive, data skew occurs when shuffle operators such as reduceByKey are executed, and task execution becomes particularly slow. During shuffle read, the affected tasks are assigned much more data and run more slowly than the others, and the running time of the entire Spark job is determined by the slowest task.

A dual-aggregation scheme is adopted to solve this problem. The first pass is a partial aggregation: after the keys causing the skew are identified, a random-number prefix is attached to each key, so that one key becomes several distinct keys. The prefixed data are then aggregated, distributing the large volume of data originally handled by a single task across multiple tasks. After this partial aggregation, the many prefixed keys collapse back into a small number of distinct keys whose data have already been aggregated once. Finally, the prefix is removed from each key and a second aggregation produces the global result, eliminating the excessive data processing by a single task.
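The two passes can be sketched in plain Python. The salting scheme below (prefix format, bucket count, and function names are illustrative assumptions) shows that the final result is identical to a direct aggregation while the hot key's records are spread over several salted keys in the first pass:

```python
import random
from collections import defaultdict

def reduce_by_key(pairs):
    """Sequential stand-in for Spark's reduceByKey with addition."""
    out = defaultdict(int)
    for k, v in pairs:
        out[k] += v
    return list(out.items())

def dual_aggregate(pairs, salt_buckets=4, seed=0):
    rng = random.Random(seed)
    # Pass 1: salt each key with a random prefix, so one skewed key becomes
    # several keys whose records can land on different reduce tasks.
    salted = [(f"{rng.randrange(salt_buckets)}_{k}", v) for k, v in pairs]
    partial = reduce_by_key(salted)
    # Pass 2: strip the salt and aggregate again for the global result.
    unsalted = [(k.split("_", 1)[1], v) for k, v in partial]
    return dict(reduce_by_key(unsalted))

# A skewed dataset: key "hot" dominates the volume.
pairs = [("hot", 1)] * 10 + [("cold", 2)] * 2
print(dual_aggregate(pairs))  # {'hot': 10, 'cold': 4}
```

The prefix must be removable unambiguously (here the salt never contains the separator), and the salt count trades skew mitigation against the extra keys created in the first pass.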

(7) Kryo is used for serialization

When shuffling, when caching the RDD described above, and when using broadcast variables, the data must be serialized before they can be stored. Spark's default serialization is slow, and the serialized data occupy a large amount of memory. Therefore, the Kryo serialization library is set in the configuration file as a final optimization to ensure high serialization efficiency.
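As a concrete illustration (the deployment details are an assumption, not taken from the paper), Kryo is typically enabled through the standard Spark properties, e.g. in `spark-defaults.conf`:

```properties
# Switch from Java serialization to Kryo, and raise the Kryo buffer cap
# for large records (the 64m value is an illustrative choice).
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max  64m
```

Registering frequently serialized classes with Kryo can further shrink the serialized output, since unregistered classes are stored with their full class names.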

5 Experiment and analysis

5.1 Experimental environment

The Spark cluster consists of 1 master node and 8 slave nodes in a Linux environment. Each node of the Spark cluster runs the CentOS 6.5 operating system and is equipped with dual Intel Core(TM) CPUs, 4 GB of RAM, and a 500 GB hard disk. The cluster is configured with Hadoop 2.6.4, Spark 3.2.3, Scala 2.10.5, and JDK 1.8.5.2.

5.2 Description of the datasets

To validate the performance of the proposed algorithm, this paper selects several commonly used synthetic datasets for clustering analysis as well as some real-world large datasets, as detailed below:

The Iris flower (Iris), banknote (Bank), seeds, Wifi_Localization (Wifi), Planning Relax (Plrx), and wine datasets were selected from the UCI Machine Learning Repository. Additionally, the R15 and D31 datasets were obtained from the University of Eastern Finland.

Regarding the real-world datasets, we selected the credit card fraud detection dataset (CCFD) [28] and the KDD Cup 1999 Data (KDDc99) [29]. The CCFD dataset contains 284,807 instances with 30 features, including Time, Amount, and 28 anonymized PCA-transformed features (V1 to V28), and is highly imbalanced with only 0.172% fraudulent transactions. The KDDc99 dataset includes 4,898,431 instances with 41 features categorized into basic, content, and traffic features, suitable for clustering and classification tasks in network intrusion detection, with multiple classes representing different types of intrusions and normal connections.

All the datasets along with their details are shown in Table 1.

5.3 Experimental design

We conducted a comparative experiment and a performance experiment. We compared SOSK-Means with the following baseline methods:

  • K-means: Spark MLlib implements a scalable and efficient K-means clustering algorithm, utilizing distributed memory computing to handle large-scale datasets.
  • Bisecting K-means: Spark MLlib provides a Bisecting K-means clustering algorithm, leveraging hierarchical clustering and distributed memory computing for large-scale data processing.
  • EOAK-means [3]: EOAK-means enhances the traditional K-means clustering algorithm by integrating the Equilibrium Optimization Algorithm (EOA) to select the optimal number of clusters dynamically and improve clustering quality.
  • K-Plus Anticlustering [30]: This method extends k-means by considering distribution moments (means, variance, higher-order moments) to maximize between-group similarity. It partitions elements into disjoint groups to maximize between-group similarity and within-group heterogeneity, reversing the logic of cluster analysis.
  • LBKC [31]: This method utilizes the lower bound of Euclidean distance to safely avoid a large number of unnecessary distance calculations, thereby accelerating the k-means process.

The comparative experiment's performance was evaluated using five standard metrics: accuracy, recall, Jaccard index, Rand index, and F1-score.

5.4 Experiment of the clustering effect

The accuracy, recall, Jaccard index, Rand index, F1-score, and time values of each algorithm on each dataset are shown in Tables 2 to 7.

Table 2. Comparison of the ACCURACY of the datasets between each algorithm.

https://doi.org/10.1371/journal.pone.0308993.t002

Table 3. Comparison of the RECALL of the datasets between each algorithm.

https://doi.org/10.1371/journal.pone.0308993.t003

Table 4. Comparison of the JACCARD INDEX of the datasets between each algorithm.

https://doi.org/10.1371/journal.pone.0308993.t004

Table 5. Comparison of the RAND INDEX of the datasets between each algorithm.

https://doi.org/10.1371/journal.pone.0308993.t005

Table 6. Comparison of the F1-score of the datasets between each algorithm.

https://doi.org/10.1371/journal.pone.0308993.t006

Table 7. Comparison of the TIME of the datasets between each algorithm (unit: s).

https://doi.org/10.1371/journal.pone.0308993.t007

According to the experimental results, the clustering indices of the SOSK-Means algorithm are close to those of the baseline algorithms, indicating that it achieves high computational accuracy while maintaining high computational speed. Overall, SOSK-Means performs well on most datasets. However, because multiple sets of initial centers are generated randomly, this residual randomness occasionally yields only average clustering performance.

5.5 Performance experiment

To measure the computational speed of the algorithm, the SOSK-Means algorithm is run on synthetic datasets of 5 million rows (689 MB), 10 million rows (1.34 GB), 15 million rows (2.01 GB), 20 million rows (2.69 GB), 25 million rows (3.36 GB), and 30 million rows (4.03 GB). The results are compared with those of the K-Means algorithm.

The acceleration ratio (speedup) is the ratio of the running time consumed by the same task on a uniprocessor system to that on a parallel system; it measures the effect of parallelizing the algorithm. Figs 2–5 show how the acceleration ratio of the four algorithms changes under different numbers of cores and different data volumes.
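The definition above amounts to a one-line formula; the helper below (an illustrative sketch with assumed example timings, not figures from the paper) makes it explicit:

```python
def speedup(serial_time_s: float, parallel_time_s: float) -> float:
    """Acceleration ratio = serial running time / parallel running time."""
    return serial_time_s / parallel_time_s

# A hypothetical job taking 120 s on one core and 20 s on a cluster:
print(speedup(120.0, 20.0))  # 6.0
```

A ratio close to the number of cores indicates near-linear scaling; values well below it reflect fixed overheads such as cluster startup and inter-node communication.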

Fig 2. Different algorithms’ performance across 4 cores.

https://doi.org/10.1371/journal.pone.0308993.g002

Fig 3. Different algorithms’ performance across 8 cores.

https://doi.org/10.1371/journal.pone.0308993.g003

Fig 4. Different algorithms’ performance across 12 cores.

https://doi.org/10.1371/journal.pone.0308993.g004

Fig 5. Different algorithms’ performance across 16 cores.

https://doi.org/10.1371/journal.pone.0308993.g005

Overall, when the number of cores is fixed, the acceleration ratio of the SOSK-Means algorithm increases with the amount of data; with 4 cores, the ratio grows only slowly. When the data volume is below 15 million rows, the SOSK-Means acceleration ratio grows slowly because cluster startup and inter-node communication consume considerable time and overhead. When the data volume exceeds 15 million rows, the acceleration ratio increases approximately linearly, since the time spent processing the data outweighs the cluster deployment time. When the amount of data is fixed, the acceleration ratio increases with the number of cores, because more cores mean more nodes and thus a greater benefit from parallel computing. In general, the larger the data volume and the core count, the greater the acceleration ratio of the SOSK-Means algorithm.

5.6 Ablation experiment design

5.6.1 Components.

As outlined above, the proposed method improves upon the basic k-means algorithm through five distinct modules. The codes and descriptions of these modules are detailed in Table 8. We evaluated the F1 score and execution time for each selected module and conducted experiments on the CCFD and KDD99 datasets.

5.6.2 Combination ablation experiments.

The results are shown in Tables 9 and 10, which indicate that when the A, B, or C components are not used, clustering accuracy decreases. The clustering accuracy gradually improves with the sequential addition of these components. When the D or E components are not used, the efficiency in handling large-scale data significantly decreases. When only the A and B modules are added, clustering accuracy improves noticeably but does not reach optimal levels. When the A and E modules are not added, the efficiency in handling large-scale data significantly decreases.

6 Conclusions and future works

This paper proposes an improved K-Means algorithm based on a Spark optimization sample. The algorithm uses the weighted max-min distance with variance, which can find distant and dense clusters, and the best initial centers are then selected by the mean square error criterion. During iteration, a novel distance comparison method reduces computation time. We also describe the algorithm's DAG, which enables performance optimization through distributed strategies. Experimental results show that SOSK-Means significantly improves computational speed while maintaining high computational accuracy.

In response to the current algorithm’s limitations in handling complex shapes and uneven densities in clustering, future research will explore several key directions. First, we plan to enhance the algorithm’s adaptability to complex data structures by integrating advanced clustering techniques, such as density-based methods or neural network approaches. Additionally, we will investigate the application of dynamic weighting schemes to adjust the significance of max-min distance and variance based on real-time data characteristics, thereby improving the algorithm’s flexibility and accuracy.

References

  1. Sun J, Liu J. Clustering algorithms research. Journal of Software. 2008;19(1):48–61.
  2. Oliveira GV, Coutinho FP, Campello RJGB, Naldi MC. Improving k-means through distributed scalable metaheuristics. Neurocomputing. 2017;246:45–57.
  3. Al-Kababchee SGM, Algamal ZY, Qasim OS. Enhancement of K-means clustering in big data based on equilibrium optimizer algorithm. Journal of Intelligent Systems. 2023;32(1):20220230.
  4. Guang-ping C, Wen-peng W. An improved K-means algorithm with meliorated initial center. In: 2012 7th International Conference on Computer Science & Education (ICCSE). IEEE; 2012. p. 150–153.
  5. Kusuma I, Ma’Sum MA, Habibie N, Jatmiko W, Suhartanto H. Design of intelligent k-means based on spark for big data clustering. In: 2016 International Workshop on Big Data and Information Security (IWBIS). IEEE; 2016. p. 89–96.
  6. Liao Q, Yang F, Zhao J. An improved parallel K-means clustering algorithm with MapReduce. In: 2013 15th IEEE International Conference on Communication Technology. IEEE; 2013. p. 764–768.
  7. Thamer MK, Algamal ZY, Zine R. Enhancement of Kernel Clustering Based on Pigeon Optimization Algorithm. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. 2023;31(Supp01):121–133.
  8. Arthur D, Vassilvitskii S. k-means++: The advantages of careful seeding. In: SODA. vol. 7; 2007. p. 1027–1035.
  9. Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S. Scalable k-means++. arXiv preprint arXiv:1203.6402. 2012.
  10. Cui X, Zhu P, Yang X, Li K, Ji C. Optimized big data K-means clustering using MapReduce. The Journal of Supercomputing. 2014;70:1249–1259.
  11. Al-Kababchee SG, Algamal ZY, Qasim OS. Improving Penalized-Based Clustering Model in Big Fusion Data by Hybrid Black Hole Algorithm. Fusion: Practice and Applications. 2023;11(1):70–76.
  12. Huang Q. Model-based or model-free, a review of approaches in reinforcement learning. In: 2020 International Conference on Computing and Data Science (CDS). IEEE; 2020. p. 219–221.
  13. Zhao W, Ma H, He Q. Parallel k-means clustering based on MapReduce. In: Cloud Computing: First International Conference, CloudCom 2009, Beijing, China, December 1–4, 2009. Proceedings 1. Springer; 2009. p. 674–679.
  14. Moertini VS, Venica L. Enhancing parallel k-means using map reduce for discovering knowledge from big data. In: 2016 IEEE International Conference on Cloud Computing and Big Data Analysis (ICCCBDA). IEEE; 2016. p. 81–87.
  15. Yin A, Wu Y, Zhu M, Zhang Y. Improved algorithm based on K-means in MapReduce framework. Computer Applications Research. 2018;35(8):2295–2298.
  16. Lei X, Xie K, Lin F, Xia Z. An Efficient Clustering Algorithm Based on Local Optimality of K-Means. Journal of Software. 2008;19(7):1683–1692.
  17. Al-Kababchee SGM, Qasim OS, Algamal ZY. Improving penalized regression-based clustering model in big data. In: Journal of Physics: Conference Series. vol. 1897. IOP Publishing; 2021. p. 012036.
  18. Wang B, Yin J, Hua Q, Wu Z, Cao J. Parallelizing k-means-based clustering on spark. In: 2016 International Conference on Advanced Cloud and Big Data (CBD). IEEE; 2016. p. 31–36.
  19. Lydia EL, Pradesh A, Mohan AK, Swarup MB. Implementing K-Means for Achievement Study between Apache Spark and Map Reduce. 2016.
  20. Liu P, Teng J, Ding E, Meng L. Parallel K-means algorithm for massive texts on Spark. In: The 2nd CCF Big Data Conference; 2014.
  21. Santhi V, Jose R. Performance analysis of parallel k-means with optimization algorithms for clustering on spark. In: Distributed Computing and Internet Technology: 14th International Conference, ICDCIT 2018, Bhubaneswar, India, January 11–13, 2018, Proceedings 14. Springer; 2018. p. 158–162.
  22. Al Radhwani AMN, Algamal ZY. Improving K-means clustering based on firefly algorithm. In: Journal of Physics: Conference Series. vol. 1897. IOP Publishing; 2021. p. 012004.
  23. Sinha A, Jana PK. A novel K-means based clustering algorithm for big data. In: 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE; 2016. p. 1875–1879.
  24. Vitter JS. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS). 1985;11(1):37–57.
  25. Tan Z, Karakose M. Optimized deep reinforcement learning approach for dynamic system. In: 2020 IEEE International Symposium on Systems Engineering (ISSE). IEEE; 2020. p. 1–4.
  26. Sardar TH, Ansari Z, Khatun A. An evaluation of Hadoop cluster efficiency in document clustering using parallel K-means. In: 2017 IEEE International Conference on Circuits and Systems (ICCS). IEEE; 2017. p. 17–20.
  27. Shi J, Qiu Y, Minhas UF, Jiao L, Wang C, Reinwald B, et al. Clash of the titans: MapReduce vs. Spark for large scale data analytics. Proceedings of the VLDB Endowment. 2015;8(13):2110–2121.
  28. Sreekala K, Sridivya R, Rao NKK, Mandal RK, Moses GJ, Lakshmanarao A. A hybrid Kmeans and ML Classification Approach for Credit Card Fraud Detection. In: 2024 3rd International Conference for Innovation in Technology (INOCON). 2024.
  29. Stolfo S, Fan W, Lee W, Prodromidis A, Chan P. KDD Cup 1999 Data; 1999. UCI Machine Learning Repository.
  30. Papenberg M. K-Plus anticlustering: An improved k-means criterion for maximizing between-group similarity. British Journal of Mathematical and Statistical Psychology. 2024;77(1):80–102. pmid:37431687
  31. Zhang H, Li J, Zhang J, Dong Y. Speeding up k-means clustering in high dimensions by pruning unnecessary distance computations. Knowledge-Based Systems. 2024;284:111262.