Mining algorithm of accumulation sequence of unbalanced data based on probability matrix decomposition

Shaoxia Mou; Heming Zhang

doi:10.1371/journal.pone.0288140

Abstract

Due to the inherent characteristics of accumulation sequence of unbalanced data, the mining results of this kind of data are often affected by a large number of categories, resulting in the decline of mining performance. To solve the above problems, the performance of data cumulative sequence mining is optimized. The algorithm for mining cumulative sequence of unbalanced data based on probability matrix decomposition is studied. The natural nearest neighbor of a few samples in the unbalanced data cumulative sequence is determined, and the few samples in the unbalanced data cumulative sequence are clustered according to the natural nearest neighbor relationship. In the same cluster, new samples are generated from the core points of dense regions and non core points of sparse regions, and then new samples are added to the original data accumulation sequence to balance the data accumulation sequence. The probability matrix decomposition method is used to generate two random number matrices with Gaussian distribution in the cumulative sequence of balanced data, and the linear combination of low dimensional eigenvectors is used to explain the preference of specific users for the data sequence; At the same time, from a global perspective, the AdaBoost idea is used to adaptively adjust the sample weight and optimize the probability matrix decomposition algorithm. Experimental results show that the algorithm can effectively generate new samples, improve the imbalance of data accumulation sequence, and obtain more accurate mining results. Optimizing global errors as well as more efficient single-sample errors. When the decomposition dimension is 5, the minimum RMSE is obtained. The proposed algorithm has good classification performance for the cumulative sequence of balanced data, and the average ranking of index F value, G mean and AUC is the best.

Citation: Mou S, Zhang H (2023) Mining algorithm of accumulation sequence of unbalanced data based on probability matrix decomposition. PLoS ONE 18(7): e0288140. https://doi.org/10.1371/journal.pone.0288140

Editor: Praveen Kumar Donta, TU Wien: Technische Universitat Wien, AUSTRIA

Received: January 12, 2023; Accepted: June 20, 2023; Published: July 7, 2023

Copyright: © 2023 Mou, Zhang. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

In modern technology and business applications, people are often faced with various types of data sets. Many of these data sets are lopsided, with some categories having very small sample sizes and others having very large sample sizes. This situation is very common in machine learning and data mining, which will not only affect the prediction results of the model, but also lead to the failure of the data mining algorithm to explore the potential rules and information. Therefore, processing unbalanced data sets is a very important problem in the field of data mining and machine learning.

The methods to deal with unbalanced data sets include but are not limited to: using different evaluation indicators, oversampling or under sampling, integrated learning, cost-sensitive learning, transfer learning, etc., and the popular meta-learning methods in recent years can also be applied to unbalanced data sets. Processing unbalanced data sets can improve the performance and generalization ability of the model and find more valuable information, so it has high practical application value. In the past decades, the rapid development of global information technology has led to the emergence of powerful computers, data collection devices and storage devices. With these devices, a large amount of data information can be collected for transaction management, information retrieval and data analysis [1]. Although the amount of data collected is very large, the data useful to people is often very limited, usually only a small part of the total data. This kind of data accumulation sequence in which the number of samples of a certain class is significantly less than that of other classes is called accumulation sequence of unbalanced data [2]. In unbalanced data, there is a large difference in the number of samples in each category. Non balanced data classification, that is, the number of one type of samples in the data set is far greater than that of other types. The type of samples that account for the majority is called the majority type, while the other type that only accounts for a small part is called the minority type. The cumulative sequence of unbalanced data is widely used in credit risk assessment, fraud detection, disease diagnosis, weather forecast and other application fields. The classification of such unbalanced data has one thing in common, that is, the focus is on a few types of information. For non-equilibrium data sets, an intuitive feature is data scarcity, which is obviously one of the reasons for the sharp decline in classification performance. If it is misclassified, it will bring significant losses. The classification problem of accumulation sequence of unbalanced data exists in people’s real life and industrial production [3]. For example, the number of fleeing customers looking for telecom operators is generally far less than that of non fleeing customers; Using detection data to diagnose patients’ diseases, such as cancer, the probability of people suffering from cancer is very low, so cancer patients are far less than healthy people; Others include locating oil wells from satellite images, learning the pronunciation of words, automatic text classification, and distinguishing malicious harassing calls. In these applications, people are mainly concerned about the minority classes in the data accumulation sequence, and the cost of misclassification of these minority classes is very large [4]. Taking the credit evaluation problem as an example, there are relatively few samples of customers with poor credit. When mining them, the traditional mining algorithm focuses on the overall mining accuracy, which is easy to cause miscalculations of a few categories, resulting in a high error rate that a few customers with poor credit are misclassified as good customers. If loans are allocated to them, it will cause great capital losses to the enterprise. In cumulative sequence mining, the length of accumulated sequences of different categories may vary greatly. More common categories may form longer accumulation sequences, and correspondingly, less common categories may form shorter accumulation sequences. As a result, when the algorithm is mining, the more common patterns will be more easily excavated, while the less common patterns will be easily ignored. Moreover, the length of the excavated sequence may be much larger than the length of the real sequence. All these problems will affect the accuracy of the mining results. If the scale of raw materials is large, the data accumulation sequence mining algorithm may need to support large-scale data processing, and the efficiency and scalability of the algorithm become challenges. However, if the scale of the original material is too small, the cumulative sequence mined may not represent the whole data set, which will lead to the inaccuracy of the mining results. Therefore, in practical applications, the mining of accumulation sequence of unbalanced data has been paid more and more attention by the academic circles of data mining and machine learning, and has become one of the hot issues in the field of data mining and machine learning [5]. In 2000, the American Association of Artificial Intelligence (AAAI) and the 2003 International Conference on Machine Learning (ICML) held a symposium on learning unbalanced data. In 2004, the American Computer Society (ACM) published a newsletter on this topic. Therefore, the mining of accumulation sequence of unbalanced data is very important for data application, and it is urgent to study an effective mining method for accumulation sequence of unbalanced data.

In order to solve the problem of sparsity in accumulation sequence of unbalanced data and effectively realize the mining of accumulation sequence of unbalanced data, domestic and foreign scholars have done a lot of research. In reference [6], El-Saeiti and Khalil et al. had made in-depth research on the H-likelihood estimation method of accumulation sequence of unbalanced data. However, although random replication of a few samples in this method could quickly and directly increase the number of samples, the actual effect was not ideal and it was easy to produce over fitting. In reference [7], Tang and Cai et al. conducted in-depth research on mixed feature selection for diagnosis of unbalanced cancer data. However, although this method improved the classification performance of a few samples to a certain extent, it did not take into account the imbalance within a few samples and was easy to introduce noise points. In reference [8], Carrillo-Alarcón and Morales-Rosales conducted in-depth research on parameter estimation in unbalanced data classification. The time complexity and spatial complexity of the research methods were particularly high.

Based on the above problems, this paper studies the mining algorithm for accumulation sequence of unbalanced data under probability matrix decomposition to improve the mining performance of accumulation sequence of unbalanced data.

Materials and methods

Accumulation sequence of unbalanced data preprocessing

In order to solve the problems of imbalance in the minority samples and noise and overlap in the newly generated samples in the accumulation sequence of unbalanced data [9], an unbalanced data processing method is designed, which combines the natural nearest neighbor relationship of the minority samples and the internal cluster to which they belong. The proposed method first calculates the natural nearest neighbor of minority samples according to the definition of natural nearest neighbor, obtains the core points in the densely distributed area of samples, then clusters the minority samples based on the natural nearest neighbor relationship, and finally uses the core points and non core points in the same category to synthesize new samples, and adds them to the original accumulation sequence of unbalanced data, so as to balance the accumulation sequence of unbalanced data.

Minority clustering in accumulation sequence of unbalanced data based on natural nearest neighbor.

The density clustering of minority samples in the accumulation sequence of unbalanced data based on the natural nearest neighbor relationship mainly includes three steps [10]: firstly, the core point is determined according to the natural nearest neighbor relationship of all minority samples in the accumulation sequence of unbalanced data; the self-neighborhood radius of the core point is calculated, to determine the core point directly connected with its density, and cluster the core points according to the process of density clustering [11], so as to classify the two core points within their own neighborhood radius values into the same category; According to the nearest neighbor relationship between non core points and core points, non core points are classified into the existing classification.

The implementation process of clustering algorithm based on natural nearest neighbor for minority classes in accumulation sequence of unbalanced data is as follows.

Input: natural eigenvalue s_k of a few samples in the accumulation sequence of unbalanced data, natural nearest neighbor N_N(x_i) and natural nearest neighbor N_B(x_i) of sample point x_i.

Output: core point set C_s of minority samples, non core point set N_Cs, number of clusters G and sample set C_G of various clusters.

Step 1: determine C_s and N_Cs from s_k, N_N(x_i) and N_B(x_i) according to definition clustering algorithm.

Step 2: cluster x_i, x_i∈C_s.

Step 2.1: calculate the neighborhood radius ϕ_i of x_i and the core point set D_i of which the direct density can be reached.

Step 2.2: using ϕ_i and D_i and referring to the density clustering process, classify all density connected core points into the same cluster to obtain C_G and G.

Step 3: divide x_j into G classes, x_j∈N_Cs.

Step 3.1: calculate ϕ_j for x_j.

Step 3.2: get all core points that are natural neighbors to x_j and within the radius of each other’s neighborhood.

Step 3.3: determine the core point x_i closest to x_j, and add x_j to the cluster where x_i is located; If there is no such core point, mark x_j as a noise point.

In the above process, new samples are generated by core points and non core points belonging to the same cluster. Noise points do not participate in the synthesis of sample points, which can avoid the introduction of new noise points.

Oversampling based on natural nearest neighbor.

After completing the clustering operation for a few samples in the accumulation sequence of unbalanced data, it can determine the number of samples to be synthesized for each class according to the proportion of non core points in each class. The more the non core points in the class are, the more new samples to be synthesized are. The reason is that the non core points are the samples whose number of natural nearest neighbors is less than the natural eigenvalue [12]. The distribution of samples around them is relatively sparse. More new samples are synthesized inside them, which can avoid the excessive concentration of new samples in densely distributed areas and reduce the overlap of composite samples. The specific steps of oversampling based on natural nearest neighbors are as follows.

Input: accumulation sequence of unbalanced data S.

Output: minority class sample set S_g generated by oversampling.

Step 1: search the natural nearest neighbor of a few samples [13] to obtain s_k, N_N(x_i) and N_B(x_i).

Step 2: cluster a few samples to get G, C_K, C_s and N_Cs.

Step 3: number of newly synthesized samples N = number of samples of most classes—number of samples of few classes.

Step 4: calculate the proportion p_i of non core points in class C_i, and determine the number of samples to be synthesized for class C_i according to the proportion.

Step 5: randomly select non core point x and core point y in class C_i to synthesize new samples, i = 1,2,⋯,G. z = x+a×(y−x), a is a random number in the range of [0,1], add z to the set S_g and repeat this step until the required number of samples is obtained for each category.

S_g is the composite sample set of a few classes.

Through the above series of processes, new minority samples are synthesized and added to the original data accumulation sequence, so as to balance the data accumulation sequence.

Mining of accumulation sequence of unbalanced data

Mining of accumulation sequence of unbalanced data based on probability matrix decomposition.

Probabilistic Matrix Factorization (PMF) is a common recommendation system algorithm that uses a user-item rating matrix to recommend items that users might be interested in. In PMF, the original score matrix is decomposed to map both users and items into a potentially low-dimensional space, and then the score matrix is recalculated until its error from the original score matrix is minimal. Probability matrix decomposition is to map some potential information of users and projects (accumulation sequence of unbalanced data after balanced processing, hereinafter referred to as data sequence) to low dimensional feature space [14] from the perspective of probability, and then use the linear combination of low dimensional feature vectors to explain the preference of specific users for data sequences. Given the scoring matrix R_m×n of user data series, MATLAB is used to generate a random number matrix H_f×m with mean value of 0, variance of and a random number matrix Q_f×n with variance of , where f is the decomposition dimension, H_f×m is defined as the user’s f-dimensional characteristic matrix, Q_f×n is the f-dimensional characteristic matrix of data series, and its column vectors H_i and Q_j represent the corresponding potential characteristic vectors respectively, generally . After learning the training set, H_f×m and Q_f×n are constantly updated, to make .

Assuming that the error between the real score R_ij and the predicted score conforms to the Gaussian distribution [15] with a mean value of 0 and a variance of , there is a probability distribution: , and there is through translation, then the conditional probability of the scoring matrix R is as shown in Eq (1): (1)

Where, I_ij is the indicator function, I_ij value of 1 indicates that user i has scored the data sequence j, and I_ij value of 0 indicates that user i has not scored the data sequence j. N is the actual error, δ_R is the forecast error.

Since H and Q are independent of each other and satisfy the Gaussian distribution with mean value of 0 and variance of and respectively, there are Eqs (2) and (3): (2) (3)

Where, I is the function value. The joint probability distribution of H and Q can be obtained from Eqs (1), (2) and (3): (4)

Taking logarithm of probability distribution of U and V can get: (5)

The maximum value of Eq (5) can be equivalently replaced by the minimum value of the error function with regularization parameters [16], as shown in Eq (6): (6)

Where, , , and , the objective function can be described by Eq (7): (7)

The relationship between the regularization parameter λ and , , is obtained.

In order to solve the objective function, the random gradient descent method is used. This algorithm finds the direction in which the parameters of the objective function decline fastest by deriving the parameters [17], and makes the variables move along this direction until they move to the minimum point.

By deriving from Eq (7), it can be found that at each iteration, the updated equations of H_i and Q_j become: (8) (9) (10)

Where α is the learning rate of random gradient descent.

In addition, in order to improve the recommendation efficiency of the algorithm, the batch processing module [18] is added. For 90000 training data in the experiment, it is divided into 9 batches and 10000 data are processed each time, which greatly reduces the amount of calculation of model training and the instability of model convergence caused by the calculation of each training data.

The data sequence mining algorithm based on probability matrix decomposition is as follows:

Input: training set train_vce, test set probe_vce.

Output: mining result pred_out, square root error REST.

Set the regularization parameters, the maximum number of iterations maxepoch, and decompose the dimension feat;
Generate two standard normal distribution matrices: data sequence (1682) feat and number of users (943) feat;
Iteration times epoch < maxepoch;
Batch processing is adopted, which is divided into 9 batches, 10000 scoring records are processed each time, and the batch processing times are batch < 9;
The loss function p is calculated, and the two matrices of (2) are continuously updated according to the negative gradient direction;
End.

Through the above process of accumulation sequence of unbalanced data preprocessing and probability matrix decomposition, accumulation sequence of unbalanced data mining can be realized. The unbalanced data accumulation sequence is partially coded as follows:

#include <iostream>#include <cstring>#include <map>using namespace std;const int N = 1e6+10;#define int long longint n, maxn, minn, top1, top2, ans;int a[N], Lb[N], Rb[N], Ls[N], Rs[N];int st1[N], st2[N];signed main(){

cin >> n;

for (int i = 1; i < = n; i++) cin >> a[i];

top1 = 0, top2 = 0;

for (int i = 1; i < = n; i++)

{

while (top1 && a[st1[top1]] > a[i]) top1—;

while (top2 && a[st2[top2]] < a[i]) top2—;

if (!top1) Ls[i] = 0;

else Ls[i] = st1[top1];

if (!top2) Lb[i] = 0;

else Lb[i] = st2[top2];

st1[++top1] = i;

st2[++top2] = i;

}

memset(st1, 0, sizeof(st1));

memset(st2, 0, sizeof(st2));

top1 = 0, top2 = 0;

for (int i = n; i > = 1; i—)

{

while (top1 && a[st1[top1]] > = a[i]) top1—;

while (top2 && a[st2[top2]] < = a[i]) top2—;

if (!top1) Rs[i] = n+1;

else Rs[i] = st1[top1];

if (!top2) Rb[i] = n+1;

else Rb[i] = st2[top2];

st1[++top1] = i;

st2[++top2] = i;

}

for (int i = 1; i < = n; i++)

{

ans + = (i—Lb[i]) * (Rb[i]—i) * a[i];

ans - = (i—Ls[i]) * (Rs[i]—i) * a[i];

}

cout << ans;

return 0;

}

Optimization of probability matrix decomposition.

In the probability matrix decomposition algorithm, the random gradient descent method is used to learn the characteristic matrix of users and data sequences. However, each prediction error in the learning process is only scored for a single user data sequence, resulting in the random gradient descent method without considering the overall prediction error effect. No matter how to optimize the feature vectors of users and data sequences, all predictions cannot be accurate, which leads to the phenomenon that although the algorithm has good prediction ability in the training set, it has poor prediction ability in the test set. To solve this problem, an adaptive error Perr is introduced, which is calculated by Eq (11): (11)

Where: w is the sample weight, and the initial value conforms to the uniform distribution; numRate is the number of scores in the training set, which results in the minimum initial value of w. Therefore, in Eq (11), w is multiplied by numRate to amplify the influence of w. In the optimization process, when the prediction error err is large, the sample weight w will become larger, and the adaptive error Perr will be larger; Similarly, the smaller the values of err and w are, the smaller the value of Perr is. The variation of w makes Perr associated with the global error. The random gradient descent method finds the appropriate eigenvector by minimizing the test error err of the training set [14], which is modified to minimize the adaptive error Perr of the training set to find the appropriate eigenvector. Therefore, the core idea of this paper is to optimize the global error as well as the single sample error more effectively.

Here, it is worth considering how to define the judgment threshold of prediction error. When Perr is small to a certain extent, if it continues to learn, it will lead to over fitting of the algorithm. Similar to the classification threshold of ADA Boost, it needs to meet the global needs. It can be known that standard deviation is the most commonly used quantitative form to reflect the degree of data dispersion. If the mean errMean of the error is used to represent the mean of the absolute prediction error of all samples, then the standard deviation errStd represents the dispersion degree of the overall error of the samples, that is, the greater the value of errStd is, the lower the prediction accuracy is, in turn, the higher the prediction accuracy is. Suppose φ represents the number of scores in the training set, then errStd is calculated with Eq (12): (12)

In Ada Boost, the weight of the weak classifier represents the impact of the weak classifier on the final strong classification, which is derived by optimizing an index loss [19–21]. Similarly, in order to improve the optimization speed of the algorithm, this weight is also added. If the threshold value of prediction error is η, then: (13)

Where: a represents the weight of the learner, and the weight calculation method is the same as that of ADA boost weak classifier. In Ada Boost, the error rate represents the proportion of samples with classification errors in the total samples [22, 23]. However, in the mining problem, the error rate C is used to represent the global error in the current learner, and φ is set to represent the number of scores in the training set, then Eq (14) calculates C: (14)

Therefore, according to ε, the sample weight w can be adjusted automatically to prevent over fitting to a certain extent [24, 25]. The method of adjusting the sample weight w is shown in Eq (15): (15) (16)

In Eq (15): D_t represents the weight distribution of the sample at the t-th learning; D_t(i) represents the i-th sample weight at the t-th learning; Z_t represents the normalization factor of the t-th learning. In Eq (16), Z_t+1 represents the calculation method of normalization factor at the t+1-th learning. C_t represents the feature weights of the sample at the t-th learning.

From the implementation process of the optimized probability matrix decomposition algorithm, it can be seen that in the t-th learning process, the characteristic matrices H_t and Q_t are mainly learned according to Perr, and then Perr is updated according to the value of the characteristic matrix. ADA Boost only adjusts the sample weight within each random gradient descent learning. After learning convergence, ADA Boost algorithm also ends; However, the idea of ADA Boost integration is to combine multiple probability matrix decomposition algorithms with different parameters, but does not optimize a single probability matrix decomposition model itself. This is the essential difference between ADA Boost adaptive lifting idea and ADA Boost integration idea. At the same time, according to the needs of real data, the parameters ξ₁ and ξ₂ are introduced: in order to ensure that the algorithm does not converge at the beginning, ξ₁ is used to reduce the influence of w; ξ₂ is to restore w in the learning process and strengthen the intervention of w in the prediction error.

Eq (17) is used to initialize the sample weight: (17)

In Eq (17), μ represents the error correction coefficient.

Meanwhile, improved Eq (11) is Eq (18) to calculate Perr. The values of parameters ξ₁ and ξ₂ can be obtained from cross validation experiments.

(18)

Results

In order to verify the mining algorithm of accumulation sequence of unbalanced data under the probability matrix decomposition studied in this paper, the accumulation sequence of unbalanced data provided by MovieLens25M is used as the experimental data accumulation sequence. The accumulation sequence contains 9 groups of data. Among them, KDD-Cup99 Intrusion dataset has 5 samples, Mammography dataset has 2 samples, and Satimage dataset has 2 samples. The algorithm in this paper is used to mine and test them. Two DELL high-density servers are used in the experiment, and each computing node is configured as follows: CPU is 16Core; Memory is 128GB; Disk is 3TB and bandwidth is 1000Mb/s. Run KVM virtual machine. Configuration: CPU is 2core; Memory is 4GB; Disk is 50GB; Bandwidth is 1000Mb/s. The server adopts Debian 6.0.5, and the virtual machine adopts Windowsver2008 operating system.

Preprocessing test of accumulation sequence of unbalanced data

In order to verify the effectiveness of the proposed method, the first set of data in the experimental data accumulation sequence is used to compare the distribution of new samples generated by different oversampling methods. Suppose the data sample point is (x_i,y_i), where x_i is a two-dimensional feature, which is uniformly distributed in both dimensions, and y_i is a class label. The algorithm in this paper is used to process the accumulation sequence of experimental data to generate a few samples. The results are shown in Fig 1. In Fig 1, most classes are represented by a hollow circle, a few classes are hollow triangles, and the newly generated minority samples are represented by a hollow square. The number is the difference between the majority samples and the minority samples. The solid circle is most of the classes in the initial dataset, and the hollow circle is most of the classes in this method.

Download:

Fig 1. Test results of composite sample distribution.

(a) Initial data set, (b) Paper method.

https://doi.org/10.1371/journal.pone.0288140.g001

Fig 1(A) shows the original distribution of the data. Most classes contain a small number of noise points in minority samples; Fig 1(B) shows the distribution of new samples synthesized by the algorithm in this paper. By comparing Fig 1(A) and 1(B), it can be seen that the algorithm in this paper is used to process the accumulation sequence of experimental data, less noise points are introduced in the process of generating minority samples, and the generated new samples are more evenly distributed in the two clusters of minority classes, with less overlapping samples.

In order to further verify the effectiveness of the algorithm in this paper, the algorithm in reference [6] proposed adequacy of h-likelihood estimation method for unbalanced clustered counting data models, the algorithm in reference [7] proposed A novel hybrid feature selection and ensemble learning framework for unbalanced cancer data diagnosis with transcriptome and functional proteomic, the algorithm in reference [8] proposed A metaheuristic optimization approach for parameter estimation in arrhythmia classification from unbalanced data are taken as the comparison algorithm, which are named as comparison method 1, comparison method 2 and comparison method 3 respectively. The algorithm in this paper and the three comparison algorithms are used to cluster the eight groups of data in the accumulation sequence of the experimental data after the samples are generated. The data information of each group is shown in Table 1, in which the imbalance rate is the ratio of the number of samples in a few categories to the number of samples in a majority category. For multi-class data, one class is set as a minority class and the rest as a majority class. The eigenvalues of all sample points are scaled to [0,1]. All the data are divided into training set and test set by the method of 50% cross validation, and the average value is taken as the experimental result.

Download:

Table 1. Description of data sets.

https://doi.org/10.1371/journal.pone.0288140.t001

F-value, G-mean and AUC, the commonly used evaluation indicators for unbalanced data classification, are selected as the classification results. Where AUC is the sum of the areas of each part under the ROC curve, indicating the probability that the classifier will rank the positive instances of the random test higher than the negative instances of the random test. The larger the value is, the better the classification performance is. The calculation process of F-value and G-mean is constructed from the confusion matrix. The definition of confusion matrix is shown in Table 2.

Download:

Table 2. Confusion matrix.

https://doi.org/10.1371/journal.pone.0288140.t002

In each index, the larger the F-value is, the better the classification performance of a few classes is. G-mean considers the classification accuracy of both majority and minority classes, and can be used to measure the overall classification effect.

Table 3 shows the F-value, G-mean and AUC values obtained by clustering with different methods. The best data are marked in black bold. For each evaluation index, the values obtained by the four unbalanced treatment methods are sorted from the best to the worst. The optimal sequence number is 1, increasing in turn. The average ranking of the four methods in all data accumulation sequences is given in the last row of the table, and the best ranking is marked in black bold.

Download:

Table 3. Comparison of clustering effects.

https://doi.org/10.1371/journal.pone.0288140.t003

The comparison results in Table 3 show that:

Average ranking. Compared with the algorithms in reference [6], reference [7] and reference [8], the algorithm in this paper has better classification performance on the accumulated sequence of balanced data processed by the algorithm, and the average ranking of indicators F-value, G-mean and AUC are the best, which proves that the algorithm in this paper has obvious advantages in processing unbalanced data.
F-value. The algorithm in this paper achieves the optimal value on all data sets. The three comparison methods achieve the optimal value in a few data sets. Overall, this algorithm effectively improves the classification accuracy of a few samples.
G-mean and AUC values. In the classification results, the proposed algorithm achieves the optimal value in most data accumulation sequences, which proves that the proposed method improves the overall classification performance of unbalanced data.
On data group No. 2, No. 4, No. 6 and No. 7, the algorithm in this paper obtains the optimal F-value, G-mean and AUC values at the same time.
The reason is that this algorithm takes into account the imbalance problem in a few samples, while other methods do not take into account the internal distribution of samples.

Analysis of excavation performance

The accuracy of mining performance is used to measure the consistency between the mining results calculated by the algorithm in this paper and the real results given by users. The error is used to indicate whether the mining results meet the needs of users. In this paper, RMSE is selected as the evaluation index to analyze the impact of different conditions on the mining performance of this method.

Influence of parameter setting on mining performance of the proposed algorithm.

According to Eq (7), and , and , where is the variance of Gaussian observation noise = . The cross validation method and empirical parameters of can be obtained according to the experimental data. In the experiment, the influence of λ = 0 = 0.01, 0.01, 0.05, 0.1, 0.5 on RMSE is dynamically considered. During the experiment, the decomposition dimension is taken as 5 and the result is iterated for 50 times. The experimental results are shown in Fig 2.

Download:

Fig 2. Influence of parameter setting on mining performance.

https://doi.org/10.1371/journal.pone.0288140.g002

It can be seen from Fig 2 that when λ∈[0.05,0.1], the square root error of the model is obviously lower. In the following experiments, λ = 0.1 is taken as the optimal value.

Impact of decomposition dimension and iteration times on mining performance.

In the experiment, the influence of decomposition dimension and iteration times is dynamically considered. The iteration times are 10, 20, 30, 40 and 50 respectively, and the decomposition dimensions are 3, 6, 10 and 20. The experimental results are shown in Fig 3.

Download:

Fig 3. Impact of decomposition dimension and iteration times on mining performance.

https://doi.org/10.1371/journal.pone.0288140.g003

As can be seen from Fig 3, when the decomposition dimension is unchanged, RMSE decreases slightly with the increase of iteration times; When the number of iterations does not change, the decomposition dimension first decreases and then increases. When the decomposition dimension is, 20, the minimum RMSE is obtained. Therefore, the number of iterations is 50, and when the decomposition dimension is 5, the minimum RMSE is obtained.

Analysis of mining accuracy.

In order to further verify the mining effect of the proposed method, it is compared with the method in literature [6] and the method in literature [7] to test the mining accuracy, and the comparison results are shown in Table 4.

Download:

Table 4. Comparison of mining accuracy.

https://doi.org/10.1371/journal.pone.0288140.t004

According to the analysis of Table 4, the mining accuracy of the method in this paper is above 0.92, and that of the two literature methods is below 0.84. Compared with the two literature methods, the mining accuracy of the method in this paper is higher.

Conclusion

With the advent of the information age, the generation of data is increasing. Although such a huge data system can provide enough information for decision-making, it also poses new challenges for the application of these large-scale data. The application of accumulation sequence of unbalanced data is one of the difficult problems in data application, which is mainly caused by the characteristics of accumulation sequence of unbalanced data. To solve these problems, this paper studies the mining algorithm of accumulation sequence of unbalanced data based on probability matrix decomposition. The accumulation sequence of unbalanced data is balanced, and then the accumulation sequence of balanced data is mined by probability matrix decomposition. Aiming at the problem that the overall prediction error effect is not considered in the traditional probability matrix decomposition process, the adaptive error is introduced to optimize it, and the mining results are obtained.

Only by deeply understanding the essence of data accumulation sequence can this paper design a more suitable algorithm to deal with this problem. At present, the research on the mining of accumulation sequence of unbalanced data is still in its infancy, and there is still a lot of room for development, which also puts forward many open challenges for future research.

References

1. Han M. H., Wu Y., Huang Y. and Wang Y., “A fault diagnosis method based on improved synthetic minority oversampling technique and svm for unbalanced data.” IOP Conference Series Materials Science and Engineering, 1043(5), 052034, 2021.
- View Article
- Google Scholar
2. Lopez-Nava I. H., M Valentín-Coronado L., Garcia-Constantino M. and Favela J., “Gait activity classification on unbalanced data from inertial sensors using shallow and deep learning.” Sensors, 20(17), 4756, 2020. pmid:32842459
- View Article
- PubMed/NCBI
- Google Scholar
3. Wu Y., Zhang W., Zhang L., Qiao Y., Yang J. et al, “A multi-clustering algorithm to solve driving cycle prediction problems based on unbalanced data sets: a chinese case study.” Sensors (Basel, Switzerland), 20(9), 2020. pmid:32344855
- View Article
- PubMed/NCBI
- Google Scholar
4. Zhou F., Yang S., Fujita H., Chen D. and Wen C., “Deep learning fault diagnosis method based on global optimization gan for unbalanced data.” Knowledge-Based Systems, 187(Jan.), 104837.1–104837.19, 2020.
- View Article
- Google Scholar
5. Hang Q., Yang J. and Xing L., “Diagnosis of rolling bearing based on classification for high dimensional unbalanced data.” IEEE Access, 7(99), 79159–79172, 2020.
- View Article
- Google Scholar
6. El-Saeiti I., and Khali M. A., Intesar N. E. S.and Gebriel M. S., “Adequacy of h-likelihood estimation method for unbalanced clustered counting data models.” International Journal of Mathematics, 3(12), 18–28, 2021.
- View Article
- Google Scholar
7. Tang X., Cai L., Meng Y., Gu C. and Yang J., “A novel hybrid feature selection and ensemble learning framework for unbalanced cancer data diagnosis with transcriptome and functional proteomic.” IEEE Access, PP(99), 1–1, 2021.
- View Article
- Google Scholar
8. Carrillo-Alarcón J. C., Morales-Rosales L. A.,, Rodríguez-Rángel H., Lobato-Báez M., Muoz and I A. et al, “A metaheuristic optimization approach for parameter estimation in arrhythmia classification from unbalanced data.” Sensors (Basel, Switzerland), 20(11), 2020. pmid:32498271
- View Article
- PubMed/NCBI
- Google Scholar
9. Liang X. L., “Multi-Segment Support Data Mining Simulation Based on Artificial Bee Colony Optimization.” Computer Simulation, 36(7), 273–276, 2019.
- View Article
- Google Scholar
10. Duan M. and Cheng X., “Hierarchical culling algorithm of unbalanced big data under asynchronous transmission.” International Journal of Performability Engineering, 15(12), 3312, 2019.
- View Article
- Google Scholar
11. Zhang X. and Wang B., “Design of estimation algorithm of island intelligent tourist volume based on data mining.” Journal of Coastal Research, 95(sp1), 985, 2020.
- View Article
- Google Scholar
12. Chen L., Gao S., Liu B., Lu Z., Jiang Z., “FEW-NNN: A fuzzy entropy weighted natural nearest neighbor method for flow-based network traffic attack detection.” China Communications, 17(5):151–167,2020.
- View Article
- Google Scholar
13. Wang X. L., Xie W. X. and Li L. Q., “Interacting t-s fuzzy particle filter algorithm for transfer probability matrix of adaptive online estimation model.” Digital Signal Processing, 110(5), 102944, 2020.
- View Article
- Google Scholar
14. Yao Z. and Gao K., “User recommendation method based on joint probability matrix decomposition in cps networks.” Computer Communications, 157(4), 2020.
- View Article
- Google Scholar
15. pan H., wang J. and zhang Z., A movie recommendation model combining time information and probability matrix factorisation. International Journal of Embedded Systems, 14(3), 239, 2021.
- View Article
- Google Scholar
16. kassem M. A., khoiry M. A. and hamzah N., Using probability impact matrix (pim) in analyzing risk factors affecting the success of oil and gas construction projects in yemen (abstract and conclusion). International Journal of Energy Sector Management, 14(3), 527–546, 2019.
- View Article
- Google Scholar
17. Saranya K. and Premalatha K., “Privacy-preserving data publishing based on sanitized probability matrix using transactional graph for improving the security in medical environment.” The Journal of Supercomputing, 76(8), 5971–5980, 2020.
- View Article
- Google Scholar
18. Lai L. Y., Ye F., Wang Y. and Zhang L., “Interference probability matrix for disassembly sequence planning under uncertain interference.” Journal of Manufacturing Systems, 60(3), 214–225, 2021.
- View Article
- Google Scholar
19. Hasegawa S. F. and Takada T., “Probability of deriving a yearly transition probability matrix for land-use dynamics.” Sustainability, 11(22), 6355, 2019.
- View Article
- Google Scholar
20. Georgios D., Fernando B., and Felix L. “Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE.” Informationences, 2018, 465:1–20.
- View Article
- Google Scholar
21. Wang Y. W.,Feng L. Z., Zhu J. M.,Li Y.,F, Chen,”Improved AdaBoost algorithm using misclassified samples oriented feature selection and weighted non-negative matrix factorization.” Neurocomputing, 508(7),153–169, 2022.
- View Article
- Google Scholar
22. SuryaNarayana G., Kolli K., Ansari M.D., Gunjan V.K. (2021). A Traditional Analysis for Efficient Data Mining with Integrated Association Mining into Regression Techniques. In: Kumar A., Mozar S. (eds) ICCCE 2020. Lecture Notes in Electrical Engineering, vol 698. Springer, Singapore. https://doi.org/10.1007/978-981-15-7961-5_127.
23. Prasanna K., & Seetha M. (2015). A doubleton pattern mining approach for discovering colossal patterns from biological dataset. International Journal of Computer Applications, 119(21).
- View Article
- Google Scholar
24. Rudra Kumar M., Pathak R., & Gunjan V. K. (2022). Machine Learning-Based Project Resource Allocation Fitment Analysis System (ML-PRAFS). In Computational Intelligence in Machine Learning: Select Proceedings of ICCIML 2021 (pp. 1–14). Singapore: Springer Nature Singapore.
25. Kumar, S., Gunjan, V. K., Ansari, M. D., & Pathak, R. (2022). Credit Card Fraud Detection Using Support Vector Machine. In Proceedings of the 2nd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications: ICMISC 2021 (pp. 27–37). Springer Singapore.

[ref1] 1. Han M. H., Wu Y., Huang Y. and Wang Y., “A fault diagnosis method based on improved synthetic minority oversampling technique and svm for unbalanced data.” IOP Conference Series Materials Science and Engineering, 1043(5), 052034, 2021.
View Article
Google Scholar

[2] View Article

[3] Google Scholar

[ref2] 2. Lopez-Nava I. H., M Valentín-Coronado L., Garcia-Constantino M. and Favela J., “Gait activity classification on unbalanced data from inertial sensors using shallow and deep learning.” Sensors, 20(17), 4756, 2020. pmid:32842459
View Article
PubMed/NCBI
Google Scholar

[5] View Article

[6] PubMed/NCBI

[7] Google Scholar

[ref3] 3. Wu Y., Zhang W., Zhang L., Qiao Y., Yang J. et al, “A multi-clustering algorithm to solve driving cycle prediction problems based on unbalanced data sets: a chinese case study.” Sensors (Basel, Switzerland), 20(9), 2020. pmid:32344855
View Article
PubMed/NCBI
Google Scholar

[9] View Article

[10] PubMed/NCBI

[11] Google Scholar

[ref4] 4. Zhou F., Yang S., Fujita H., Chen D. and Wen C., “Deep learning fault diagnosis method based on global optimization gan for unbalanced data.” Knowledge-Based Systems, 187(Jan.), 104837.1–104837.19, 2020.
View Article
Google Scholar

[13] View Article

[14] Google Scholar

[ref5] 5. Hang Q., Yang J. and Xing L., “Diagnosis of rolling bearing based on classification for high dimensional unbalanced data.” IEEE Access, 7(99), 79159–79172, 2020.
View Article
Google Scholar

[16] View Article

[17] Google Scholar

[ref6] 6. El-Saeiti I., and Khali M. A., Intesar N. E. S.and Gebriel M. S., “Adequacy of h-likelihood estimation method for unbalanced clustered counting data models.” International Journal of Mathematics, 3(12), 18–28, 2021.
View Article
Google Scholar

[19] View Article

[20] Google Scholar

[ref7] 7. Tang X., Cai L., Meng Y., Gu C. and Yang J., “A novel hybrid feature selection and ensemble learning framework for unbalanced cancer data diagnosis with transcriptome and functional proteomic.” IEEE Access, PP(99), 1–1, 2021.
View Article
Google Scholar

[22] View Article

[23] Google Scholar

[ref8] 8. Carrillo-Alarcón J. C., Morales-Rosales L. A.,, Rodríguez-Rángel H., Lobato-Báez M., Muoz and I A. et al, “A metaheuristic optimization approach for parameter estimation in arrhythmia classification from unbalanced data.” Sensors (Basel, Switzerland), 20(11), 2020. pmid:32498271
View Article
PubMed/NCBI
Google Scholar

[25] View Article

[26] PubMed/NCBI

[27] Google Scholar

[ref9] 9. Liang X. L., “Multi-Segment Support Data Mining Simulation Based on Artificial Bee Colony Optimization.” Computer Simulation, 36(7), 273–276, 2019.
View Article
Google Scholar

[29] View Article

[30] Google Scholar

[ref10] 10. Duan M. and Cheng X., “Hierarchical culling algorithm of unbalanced big data under asynchronous transmission.” International Journal of Performability Engineering, 15(12), 3312, 2019.
View Article
Google Scholar

[32] View Article

[33] Google Scholar

[ref11] 11. Zhang X. and Wang B., “Design of estimation algorithm of island intelligent tourist volume based on data mining.” Journal of Coastal Research, 95(sp1), 985, 2020.
View Article
Google Scholar

[35] View Article

[36] Google Scholar

[ref12] 12. Chen L., Gao S., Liu B., Lu Z., Jiang Z., “FEW-NNN: A fuzzy entropy weighted natural nearest neighbor method for flow-based network traffic attack detection.” China Communications, 17(5):151–167,2020.
View Article
Google Scholar

[38] View Article

[39] Google Scholar

[ref13] 13. Wang X. L., Xie W. X. and Li L. Q., “Interacting t-s fuzzy particle filter algorithm for transfer probability matrix of adaptive online estimation model.” Digital Signal Processing, 110(5), 102944, 2020.
View Article
Google Scholar

[41] View Article

[42] Google Scholar

[ref14] 14. Yao Z. and Gao K., “User recommendation method based on joint probability matrix decomposition in cps networks.” Computer Communications, 157(4), 2020.
View Article
Google Scholar

[44] View Article

[45] Google Scholar

[ref15] 15. pan H., wang J. and zhang Z., A movie recommendation model combining time information and probability matrix factorisation. International Journal of Embedded Systems, 14(3), 239, 2021.
View Article
Google Scholar

[47] View Article

[48] Google Scholar

[ref16] 16. kassem M. A., khoiry M. A. and hamzah N., Using probability impact matrix (pim) in analyzing risk factors affecting the success of oil and gas construction projects in yemen (abstract and conclusion). International Journal of Energy Sector Management, 14(3), 527–546, 2019.
View Article
Google Scholar

[50] View Article

[51] Google Scholar

[ref17] 17. Saranya K. and Premalatha K., “Privacy-preserving data publishing based on sanitized probability matrix using transactional graph for improving the security in medical environment.” The Journal of Supercomputing, 76(8), 5971–5980, 2020.
View Article
Google Scholar

[53] View Article

[54] Google Scholar

[ref18] 18. Lai L. Y., Ye F., Wang Y. and Zhang L., “Interference probability matrix for disassembly sequence planning under uncertain interference.” Journal of Manufacturing Systems, 60(3), 214–225, 2021.
View Article
Google Scholar

[56] View Article

[57] Google Scholar

[ref19] 19. Hasegawa S. F. and Takada T., “Probability of deriving a yearly transition probability matrix for land-use dynamics.” Sustainability, 11(22), 6355, 2019.
View Article
Google Scholar

[59] View Article

[60] Google Scholar

[ref20] 20. Georgios D., Fernando B., and Felix L. “Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE.” Informationences, 2018, 465:1–20.
View Article
Google Scholar

[62] View Article

[63] Google Scholar

[ref21] 21. Wang Y. W.,Feng L. Z., Zhu J. M.,Li Y.,F, Chen,”Improved AdaBoost algorithm using misclassified samples oriented feature selection and weighted non-negative matrix factorization.” Neurocomputing, 508(7),153–169, 2022.
View Article
Google Scholar

[65] View Article

[66] Google Scholar

[ref22] 22. SuryaNarayana G., Kolli K., Ansari M.D., Gunjan V.K. (2021). A Traditional Analysis for Efficient Data Mining with Integrated Association Mining into Regression Techniques. In: Kumar A., Mozar S. (eds) ICCCE 2020. Lecture Notes in Electrical Engineering, vol 698. Springer, Singapore. https://doi.org/10.1007/978-981-15-7961-5_127.

[ref23] 23. Prasanna K., & Seetha M. (2015). A doubleton pattern mining approach for discovering colossal patterns from biological dataset. International Journal of Computer Applications, 119(21).
View Article
Google Scholar

[69] View Article

[70] Google Scholar

[ref24] 24. Rudra Kumar M., Pathak R., & Gunjan V. K. (2022). Machine Learning-Based Project Resource Allocation Fitment Analysis System (ML-PRAFS). In Computational Intelligence in Machine Learning: Select Proceedings of ICCIML 2021 (pp. 1–14). Singapore: Springer Nature Singapore.

[ref25] 25. Kumar, S., Gunjan, V. K., Ansari, M. D., & Pathak, R. (2022). Credit Card Fraud Detection Using Support Vector Machine. In Proceedings of the 2nd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications: ICMISC 2021 (pp. 27–37). Springer Singapore.

Figures

Abstract

Introduction

Materials and methods

Accumulation sequence of unbalanced data preprocessing

Minority clustering in accumulation sequence of unbalanced data based on natural nearest neighbor.

Oversampling based on natural nearest neighbor.

Mining of accumulation sequence of unbalanced data

Mining of accumulation sequence of unbalanced data based on probability matrix decomposition.

Optimization of probability matrix decomposition.

Results

Preprocessing test of accumulation sequence of unbalanced data

Analysis of excavation performance

Influence of parameter setting on mining performance of the proposed algorithm.

Impact of decomposition dimension and iteration times on mining performance.

Analysis of mining accuracy.

Conclusion

References