Detecting representative data and generating synthetic samples to improve learning accuracy with imbalanced data sets

It is difficult for learning models to achieve high classification performance with imbalanced data sets: when one class is much larger than the others, most machine learning and data mining classifiers are overly influenced by the larger classes and ignore the smaller ones. As a result, the classification algorithms often have poor learning performance due to slow convergence in the smaller classes. To balance such data sets, this paper presents a strategy that reduces the size of the majority data and generates synthetic samples for the minority data. In the reducing operation, we use the box-and-whisker plot approach to exclude outliers and the Mega-Trend-Diffusion method to find representative data from the majority data. To generate the synthetic samples, we propose a counterintuitive hypothesis to find the distributed shape of the minority data, and then produce samples according to this distribution. Four real data sets were used to examine the performance of the proposed approach. We used paired t-tests to compare the Accuracy, G-mean, and F-measure scores of the proposed data pre-processing (PPDP) method combined with the D3C method (PPDP+D3C) with those of one-sided selection (OSS), the well-known SMOTEBoost (SB) approach, the normal distribution-based oversampling (NDO) approach, and the PPDP method alone. The results indicate that the classification performance of the proposed approach is better than that of the above-mentioned methods.


Introduction
Imbalanced data sets are a common problem in the real world and present challenges to both academics and practitioners. They are particularly common in medical fields because of the imbalance of the class labels. In addition, high-risk or target patients tend to appear in the minority class of a medical data set, and the risk or cost of misclassification in the minority class is much higher than that in the majority class. Most existing classification methods do not perform adequately, especially when the data set is extremely imbalanced. For example, Murphey et al. [1], Cohen et al. [2], Sun et al. [3], Sun et al. [4], Li et al. [5,6], Song et al. [7], Wang et al. [8], and Zou et al. [9] have shown that when limited training data are available, the small size of the minority data will significantly affect the accuracy of medical diagnoses. With imbalanced data sets, when some classes are much larger than the others, most machine learning and data mining classifiers are overly influenced by the larger classes and ignore the smaller ones. As a result, the classification algorithms often exhibit poor learning performance due to slow convergence in the minority classes [3,4,10,11]. A number of solutions for dealing with class imbalance problems have been proposed to handle classification problems in various fields. These approaches can be divided into two types. One creates new algorithms or modifies existing algorithms; examples of this type can be found in Hong et al. [11], Peng and King [12], Nguwi and Cho [13], and Lo et al. [14]. For certain types of data sets, this approach can be highly effective for specific classifiers, but the performance of those classifiers is still less than optimal with data sets that have varied characteristics, because it is usually difficult to transfer the modification procedures from one algorithm to another.
The other type of approach in the literature utilizes sampling techniques; these include undersampling and oversampling to adjust the sizes of data to balance the data sets [2][3][4][5][15][16][17][18]. The undersampling method reduces the size of data by eliminating samples from the majority class, thus decreasing its degree of influence. However, eliminating data raises the risk of losing part of the characteristics that are represented in the majority class samples. Researchers have discussed various undersampling methods, such as the random and directed approaches. For example, Kubat and Matwin [19] presented a method called one-sided selection (OSS) that randomly eliminates examples from majority class data sets until the amount of data for the majority class is equal to that of the minority class. Yen and Lee [20] proposed a cluster-based undersampling approach to select representative examples from the majority data to avoid the loss of crucial information.
As for undersampling, this study differs from approaches that randomly draw data from the majority data, which raises the probability of imprecisely characterizing the majority data because of the increased influence of noise or outliers in the sample set [21,22]. Therefore, we propose a systematic procedure that uses the box-and-whisker plot approach to exclude outliers and the Mega-Trend-Diffusion (MTD) method proposed by Li et al. [23] to construct the distribution of the majority data. MTD, a data expansion method, is used in this study to reasonably estimate the domain range of the observed data. The estimated domain range includes both the reasonable (fitting) data and the outliers. MTD is used to construct the membership function of the observed data and to calculate their membership degrees. The smaller the membership degree of a datum, the more likely it is to be an outlier. This study uses an α-cut based on the MTD method to keep the suitable data and to eliminate the outliers. Further, under the estimated distribution of the data, this paper takes representative samples from the majority data by setting the α-cut value; a suitable α-cut value determines an appropriate amount of the majority data.
With regard to oversampling, direct resampling is a widely used strategy to balance a class distribution by duplicating minority class examples. Many researchers have adopted oversampling techniques such as those described in Piras and Giacinto [24], Xie and Qiu [21], Tahir [22], and Fernández-Navarro et al. [25]. However, these approaches may suffer from the overfitting problem. In Chawla et al. [26], rather than duplicate examples from a data set, the authors proposed the synthetic minority oversampling technique (SMOTE) to generate synthetic samples in a feature space. Many subsequent studies such as AdaBoost [27] and SMOTEBoost (SB) [28] have adopted this method, and all have confirmed the effectiveness of this approach with regard to enhancing the classification accuracy of minority class data. Unfortunately, these oversampling methods focus on resampling from rare minority class data. Therefore, when the ratio of the minority data to the overall samples is decreasing, the resampling will be too conservative to behave realistically with imbalanced data sets.
Other oversampling methods consider the underlying minority class data distributions. For instance, working in a feature space, Zhang and Wang [29] proposed a normal distribution-based oversampling (NDO) approach to generate normal-synthetic samples with characteristics that are close to those of the raw minority class data with regard to the expected mean and variance. However, with imbalanced data sets, when there are very few data in the minority class, it is difficult to know whether the data follow a normal distribution. Therefore, in this paper, based on a two-parameter Weibull distribution, we propose a new oversampling method for generating representative synthetic samples to extend the minority class data. One reason for this choice is that the Weibull distribution can appropriately characterize the shape of a data set through various shape parameters of the density function [30][31][32]. Consequently, the method presented in this work is more flexible with regard to the shape of small data sets. Moreover, in our approach, a uniquely counterintuitive hypothesis-testing procedure is constructed to evaluate the shape parameter of the Weibull distribution by choosing the maximal p-value for a small data size.
This paper uses four real data sets to illustrate the performance of the proposed method: Wisconsin Diagnostic Breast Cancer (WDBC), Parkinson's Disease (PD), Vertebral Column (VC) with two categories (normal and abnormal), and Haberman's Survival (HS). Although accuracy is a common criterion for measuring classification performance, it is not adequate on its own for imbalanced data sets because it does not properly reflect performance on the minority class. As a result, three criteria, Accuracy (ACC), Geometric Mean (G-mean), and F-measure (F1), are recommended to measure the performance of learning with imbalanced data sets [33]. For the learning tool, we tested the support vector machine (SVM) with a linear kernel function (SVM-linear), Naïve Bayes (NB), k-nearest neighbor (KNN), and an SVM with a polynomial kernel function (SVM-poly). The experiments show that SVM-poly has the best classification performance for raw imbalanced data sets; thus, it is chosen as the learning tool in the subsequent performance comparison among the OSS method, the SB method, the NDO method, the proposed data pre-processing (PPDP) method, the D3C method, and the PPDP+D3C method. D3C is a hybrid model that combines ensemble pruning based on k-means clustering with dynamic selection and circulating combination. It was proposed by Lin et al. [34] to improve learning from imbalanced data sets. Our proposed method mainly focuses on data pre-processing, whereas D3C is an ensemble method. Hence, this study proposes combining PPDP with LibD3C (PPDP+D3C); that is, the imbalanced data sets are pre-processed by PPDP and then trained by the D3C method. The four classifiers set in D3C are NB, KNN (K = 3), SVM-linear, and SVM-poly. The results show that the PPDP+D3C method has the best classification performance for imbalanced data sets.
The remainder of this paper is organized as follows: Section 2 reviews the literature on the related criteria for evaluating classification performance, the box-and-whisker plot method, and the MTD method. Section 3 introduces the detailed procedure of the proposed method. In Section 4, we present the four real data sets and the detailed experiment methodology, and then compare the results derived from the OSS, SB, NDO, PPDP, D3C, and PPDP+D3C methods. Finally, we present conclusions in Section 5.

Related techniques
In this section, we review the literature on the evaluation criteria for classification performance, the box-and-whisker plot method, and the MTD method.

Evaluation criteria
By convention, the minority class data is the positive class label, and the majority class data is the negative class label. For imbalanced class distributions, the accuracy rate for the minority class is frequently close to zero, which means that evaluations of learning results are not appropriate for use with minority class data. Consequently, the accuracy rate measure is not used to consider the classification performance in this work; instead, other criteria are described in this section. Table 1 shows a confusion matrix, which is used in this work to construct the relevant criteria for a two-class classification problem.
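The three criteria can be illustrated with a minimal sketch. It assumes the standard textbook definitions (ACC as the fraction of correct predictions, G-mean as the square root of sensitivity times specificity, and F1 as the harmonic mean of precision and recall); the function name `imbalance_metrics` and its arguments are hypothetical and are not taken from the paper.

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Return (ACC, G-mean, F1) from two-class confusion-matrix counts.

    The minority class is treated as the positive class.
    Standard textbook definitions are assumed here.
    """
    total = tp + fn + fp + tn
    acc = (tp + tn) / total if total else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity (recall)
    tnr = tn / (tn + fp) if (tn + fp) else 0.0        # specificity
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    g_mean = math.sqrt(tpr * tnr)
    denom = precision + tpr
    f1 = 2.0 * precision * tpr / denom if denom else 0.0
    return acc, g_mean, f1

# A classifier that labels every sample as majority (negative):
# accuracy looks high, while G-mean and F1 expose the failure.
acc, g_mean, f1 = imbalance_metrics(tp=0, fn=5, fp=0, tn=95)
```

In this example the accuracy is 0.95 although no minority sample is recognized, which is why G-mean and F1 are reported alongside ACC.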

Review of the box-and-whisker plot
The box-and-whisker technique was first proposed by Tukey [35] to show the distribution of data, examine its symmetry, and indicate outliers. In this work, box-and-whisker plots are used to exclude outliers. The box's lower boundary is the lower quartile (Q1) of the data and its upper boundary is the upper quartile (Q3). The length of the box is the interquartile range (IQR), which is calculated by
$$\mathrm{IQR} = Q_3 - Q_1, \tag{1}$$
where Q3 and Q1 are the 75th and 25th percentiles of the samples, respectively. In addition, Q2 is the median of the data set. There are two inner fences in a box plot: the lower inner fence (LIF) and the upper inner fence (UIF). Data outside [LIF, UIF] are considered suspected outliers. Following Tukey's convention, the fences are calculated as follows:
$$\mathrm{LIF} = Q_1 - 1.5 \times \mathrm{IQR}, \tag{2}$$
$$\mathrm{UIF} = Q_3 + 1.5 \times \mathrm{IQR}. \tag{3}$$
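As an illustration, the fence computation and the outlier filter can be sketched as follows. The sketch assumes Tukey's conventional multiplier of 1.5, NumPy's default percentile interpolation for the quartiles, and that a sample is dropped when any one of its features falls outside the fences; the function names are hypothetical, and the paper's exact quartile convention is not stated in this text.

```python
import numpy as np

def inner_fences(values, k=1.5):
    """Return (LIF, UIF) for a one-dimensional sample; k = 1.5 is assumed."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_suspected_outliers(X, k=1.5):
    """Keep only rows whose every feature lies inside [LIF, UIF]."""
    X = np.asarray(X, dtype=float)
    keep = np.ones(len(X), dtype=bool)
    for j in range(X.shape[1]):
        lif, uif = inner_fences(X[:, j], k)
        keep &= (X[:, j] >= lif) & (X[:, j] <= uif)
    return X[keep]

# One feature, five samples; the value 100 lies far outside the fences.
majority = [[1.0], [2.0], [3.0], [4.0], [100.0]]
cleaned = drop_suspected_outliers(majority)
```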

The MTD method
Li et al. [23] proposed the MTD method to construct the distribution of manufacturing data. The MTD method, which combines mega diffusion and data trend estimation, is used to generate virtual samples; it provides a strategy for learning from small data sets and obtaining a higher degree of classification accuracy. As shown in Fig 1, a triangular membership function $\mu_A(x)$ is constructed by the MTD method to calculate the domain range of the observed/collected data x, which is the interval [a, b] given by Eqs (4) and (5). In these equations, $N_L$ and $N_U$ indicate the number of data less than and greater than $u_{set}$, respectively, where $u_{set}$ equals (min + max)/2, and "min" and "max" are the actual minimum and maximum values in the observed/collected data set. From Eqs (4) and (5)
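A minimal sketch of the domain-range estimation is given below. Because Eqs (4) and (5) are not reproduced in this text, the sketch assumes the commonly cited MTD form, in which the bounds are diffused from the center $u_{set}$ by a skewness-weighted term based on the sample variance and the constant $\ln(10^{-20})$, and the bounds are widened to the observed minimum and maximum when necessary. The function name and this exact form are assumptions, not a transcription of the paper's equations.

```python
import math
import numpy as np

def mtd_domain_range(x):
    """Return (a, u_set, b) for a one-dimensional sample.

    Assumed form (commonly cited for MTD, not taken from this text):
      a = u_set - skew_L * sqrt(-2 * var / N_L * ln(1e-20))
      b = u_set + skew_U * sqrt(-2 * var / N_U * ln(1e-20))
    """
    x = np.asarray(x, dtype=float)
    lo, hi = float(x.min()), float(x.max())
    u_set = (lo + hi) / 2.0
    n_l = int(np.sum(x < u_set))       # number of data below the center
    n_u = int(np.sum(x > u_set))       # number of data above the center
    var = float(x.var(ddof=1)) if len(x) > 1 else 0.0
    if var == 0.0 or n_l == 0 or n_u == 0:
        return lo, u_set, hi           # degenerate sample: no diffusion
    skew_l = n_l / (n_l + n_u)
    skew_u = n_u / (n_l + n_u)
    spread = -2.0 * math.log(1e-20)
    a = u_set - skew_l * math.sqrt(spread * var / n_l)
    b = u_set + skew_u * math.sqrt(spread * var / n_u)
    # The estimated range should never be narrower than the observed data.
    return min(a, lo), u_set, max(b, hi)

a, u_set, b = mtd_domain_range([1, 2, 3, 4, 5])
```

For a symmetric sample such as this one, the estimated interval extends equally on both sides of the center and covers all observed values.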

The model structure
This section describes the proposed procedure for dealing with imbalanced data set classification problems. It describes the undersampling process and explains the oversampling technique, which finds the shape of the data distribution from limited samples in order to generate synthetic samples for learning the skewed class distribution. In Step 1, the imbalanced data set is separated into two sets by class, where the majority class has M data and the minority class has m data. In Step 2, following the undersampling strategy, we use the box-and-whisker plot to determine, for each feature, whether data are outliers, and we delete the outliers in the majority class. The MTD method is then applied to draw representative observations from the majority class. Regarding the oversampling strategy, because the number of samples in the minority class is small and may follow an arbitrary probability distribution, we use the two-parameter Weibull distribution recommended by Little [36] to fit the data in the minority class; it can form various shapes of density functions, including skewed and mound-shaped curves, and thus offers greater flexibility. Therefore, assuming that the minority class data follow a two-parameter Weibull distribution, we propose a method to estimate the two parameters of the Weibull distribution and generate synthetic samples from the estimated distribution. In Step 3, once these parameters have been found and the data size in the majority class has been reduced from M to M', the number of synthetic samples becomes M'−m, and we can then form the learning model by inputting the new balanced data set.

The undersampling method
The following method is proposed to rebuild the model of the data in the majority class. First, we employ the box-and-whisker plot to detect outliers and eliminate them from the majority data. Second, we use the remaining data to compute the range of the data, that is, the interval [a, b], as explained in Section 2.3. As shown in Fig 1, the triangular membership function $\mu_A(x)$ is formed based on the interval [a, b] (Eq (6)), where X is assumed to be a universal set and x is an element of X. A is a fuzzy set of X, and $\mu_A(x)$ is the membership degree of each x, with values in [0, 1].
Here, we apply the α-cut to draw the valuable data from the corresponding $\mu_A(x)$ in X. The α-cut of A is a crisp set that contains all x in X whose values of $\mu_A(x)$ are greater than or equal to α, denoted as follows:
$$A_\alpha = \{x \in X \mid \mu_A(x) \ge \alpha\},$$
where an explicit form of $A_\alpha$ can be derived from Eq (6). We then use the data set in which all the data belong to $A_\alpha$ as the learning data for the majority class. By setting the value of the α-cut in the majority class, we can implement this undersampling process to find the representative majority data.
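The α-cut selection can be sketched for a single feature as follows. The sketch assumes that the triangular membership function rises linearly from a to a peak and falls linearly from the peak to b, with the peak at the center of the observed range; the location of the peak and the function names are assumptions, since Eq (6) is not reproduced in this text. How the selection is combined across several features is also not specified here.

```python
import numpy as np

def triangular_membership(x, a, peak, b):
    """Membership degree of each x under a triangle on [a, b] (assumed shape)."""
    if not (a < peak < b):
        raise ValueError("expected a < peak < b")
    x = np.asarray(x, dtype=float)
    mu = np.zeros_like(x)
    left = (x >= a) & (x <= peak)
    right = (x > peak) & (x <= b)
    mu[left] = (x[left] - a) / (peak - a)
    mu[right] = (b - x[right]) / (b - peak)
    return mu

def alpha_cut_select(x, a, peak, b, alpha=0.5):
    """Return the values whose membership degree is at least alpha."""
    x = np.asarray(x, dtype=float)
    return x[triangular_membership(x, a, peak, b) >= alpha]

# Values near the peak are kept; values near the ends of [a, b] are dropped.
kept = alpha_cut_select([0.0, 2.5, 5.0, 7.5, 10.0], a=0.0, peak=5.0, b=10.0)
```

A larger α keeps a narrower band around the peak, so the α-cut directly controls how many majority samples remain.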

The oversampling method
In this section, we first describe some basic properties of a two-parameter Weibull distribution, and then present the proposed method for oversampling in detail.

Preparation for a two-parameter Weibull distribution.
Given a data set $x = \{x_i\}$, $i = 1, 2, \ldots, N$, that can be described by a two-parameter Weibull distribution, the probability density function and cumulative distribution function of the Weibull distribution are respectively expressed as follows:
$$f(x; \lambda, \beta) = \frac{\beta}{\lambda}\left(\frac{x}{\lambda}\right)^{\beta - 1} \exp\left[-\left(\frac{x}{\lambda}\right)^{\beta}\right], \tag{8}$$
$$F(x; \lambda, \beta) = 1 - \exp\left[-\left(\frac{x}{\lambda}\right)^{\beta}\right], \tag{9}$$
where λ is the scale parameter and β is the shape parameter. With regard to the shape parameter, Nelson et al. [37] demonstrated that the Weibull distribution has some special expressions. For example, when the value of β is one or two, the Weibull distributions are identical to the Exponential and Rayleigh distributions, respectively, and the shape of the Weibull density function is close to a normal distribution when the value of β is within [3,4]. The least square estimation (LSE) is widely used by researchers to estimate the β and λ of Eq (8). The sum of squares error (SSE) can be derived from Eq (9) and is given in Eq (10), where $x_{(i)}$ is the observed data, $i = 1, \ldots, N$, N is the sample size, and Bernard's median rank estimator is $\hat{F}_i(x) = (i - 0.3)/(N + 0.4)$. This study uses the shape-first method to fit the optimal value of β. Then the different values of $\hat{\beta}$ are used to estimate λ by minimizing the SSE in Eq (10).
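The scale estimation for a fixed shape can be sketched as below. Since Eq (10) is not reproduced in this text, the sketch assumes that the SSE measures the squared distance between the ordered observations and the Weibull quantiles at Bernard's median ranks, $\sum_i (x_{(i)} - \lambda t_i)^2$ with $t_i = [-\ln(1 - \hat{F}_i)]^{1/\beta}$, which has a closed-form minimizer in λ. This form and the function names are assumptions.

```python
import numpy as np

def median_ranks(n):
    """Bernard's median rank estimator (i - 0.3) / (n + 0.4), i = 1..n."""
    i = np.arange(1, n + 1)
    return (i - 0.3) / (n + 0.4)

def fit_scale_given_shape(x, beta):
    """Least-squares scale for a fixed shape under the assumed SSE.

    Returns (lambda_hat, sse).
    """
    xs = np.sort(np.asarray(x, dtype=float))
    # Unit-scale Weibull quantiles at the median ranks.
    t = (-np.log1p(-median_ranks(len(xs)))) ** (1.0 / beta)
    lam = float(np.dot(xs, t) / np.dot(t, t))
    sse = float(np.sum((xs - lam * t) ** 2))
    return lam, sse

# Data lying exactly on Weibull quantiles with scale 3 and shape 1.5
# should be recovered with a scale of 3 and an SSE of (almost) zero.
t = (-np.log1p(-median_ranks(10))) ** (1.0 / 1.5)
lam_hat, sse = fit_scale_given_shape(3.0 * t, beta=1.5)
```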

The estimation of the two parameters.
The proposed method uses the Gini statistic [38] in counterintuitive hypothesis testing to find the best-fitting shape parameter β of the Weibull distribution. With a given level of significance α and a data size of N, the proposed testing procedure is constructed as follows:
Step 1. The null hypothesis is set to $H_0: \beta = \beta_0$.
Step 2. The alternative hypothesis is set to $H_1: \beta \neq \beta_0$.
Step 3. The test statistic is the Gini statistic $G_N$ [38].
Step 4. The rejection region depends on the sample size.
I. For a sample size N between 3 and 20, the rejection region is determined by the critical value $\xi_{\alpha/2}$, which is the 100(α/2) percentile of the $G_N$ statistic. Moreover, the p-value is $P\{|G_N| > |g_N| \mid \beta = \beta_0\}$, where $g_N$ indicates the estimated value of $G_N$; in its computation, $c_j = (N-j)/(N-1)$, and m is the largest index such that $x < c_m$. Note that the corresponding two-tailed percentiles $\xi_{\alpha/2}$ of the Gini statistic $G_N$ are described in Gail and Gastwirth [38].
II. For a sample size N greater than 20, the rejection region is based on $g_N$, the observed value of $[12(N-1)]^{1/2}(G_N - 0.5)$, which approximately follows a standard normal distribution (normal(0,1)).
Step 5. The decision rule of the statistical test is designed as follows: the candidate value $\beta_0$ for which the p-value is maximal is the one with the strongest evidence that the null hypothesis $H_0$ should be accepted.
The best-fitting shape parameter β can be found based on this testing procedure. After β is estimated, we compute the scale parameter λ by least squares, where Bernard's median rank estimator is $\hat{F}_i(x) = (i - 0.3)/(N + 0.4)$, $i = 1, \ldots, N$.
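The search for the shape parameter with the maximal p-value can be sketched as follows. The sketch relies on two facts: if X follows a Weibull distribution with shape $\beta_0$, then $X^{\beta_0}$ is exponentially distributed, and the Gini statistic of Gail and Gastwirth tests exponentiality. As a simplification, and as an assumption that departs from the procedure described above, the sketch uses the normal approximation for every sample size instead of the exact percentiles for N between 3 and 20. The grid of candidate values and the function names are also hypothetical.

```python
import math
import numpy as np

def gini_statistic(y):
    """Gini statistic: sum_i i (n - i) (y_(i+1) - y_(i)) / ((n - 1) sum(y))."""
    ys = np.sort(np.asarray(y, dtype=float))
    n = len(ys)
    i = np.arange(1, n)                       # i = 1, ..., n - 1
    gaps = np.diff(ys)                        # y_(i+1) - y_(i)
    return float(np.sum(i * (n - i) * gaps) / ((n - 1) * ys.sum()))

def p_value_normal_approx(x, beta0):
    """Two-sided p-value of H0: beta = beta0 (normal approximation)."""
    y = np.asarray(x, dtype=float) ** beta0   # exponential under H0
    z = math.sqrt(12.0 * (len(y) - 1)) * (gini_statistic(y) - 0.5)
    return math.erfc(abs(z) / math.sqrt(2.0))

def best_shape(x, grid=None):
    """Return (beta, p-value) for the candidate with the maximal p-value."""
    if grid is None:
        grid = np.arange(0.1, 10.0 + 1e-9, 0.1)   # hypothetical search grid
    pvals = [p_value_normal_approx(x, b) for b in grid]
    k = int(np.argmax(pvals))
    return float(grid[k]), float(pvals[k])

# Positive data are required, because the data are raised to the power beta0.
minority_feature = [0.8, 1.1, 1.3, 1.9, 2.4, 2.6, 3.0]
beta_hat, p_max = best_shape(minority_feature)
```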

Synthetic sample generation.
As mentioned above, the minority class data are assumed to fit a two-parameter Weibull distribution. For a given data set, we employ the inversion method to derive the Weibull variate, which is the approach used here to create synthetic samples. In the inversion method, a random variable X follows a Weibull distribution with a scale parameter λ and a shape parameter β (i.e., X ~ Weibull(λ, β)). Given that F(x; λ, β) is the CDF shown in Eq (9), it can be inverted to derive the formula of the Weibull variate as follows:
$$x = \lambda\left[-\ln\left(1 - F(x; \lambda, \beta)\right)\right]^{1/\beta}, \tag{16}$$
where x ≥ 0, λ > 0, and β > 0. Subsequently, to generate the synthetic samples $\hat{x}_i$, Eq (16) is modified to
$$\hat{x}_i = \hat{\lambda}\left[-\ln\left(1 - \hat{F}_i(x)\right)\right]^{1/\hat{\beta}}, \quad i = 1, \ldots, N',$$
where Bernard's median rank estimator is $\hat{F}_i(x) = (i - 0.3)/(N' + 0.4)$, N' is the desired number of synthetic samples, and the two estimators $\hat{\lambda}$ and $\hat{\beta}$ are calculated by the proposed approach.
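The inversion step can be sketched for one feature as follows. The sketch evaluates the inverse Weibull CDF at Bernard's median ranks, so the generated values are deterministic quantiles rather than random draws; the function name and the example parameter values are hypothetical.

```python
import numpy as np

def weibull_synthetic(lam, beta, n_new):
    """Return n_new synthetic values from the inverse Weibull CDF.

    The CDF levels are the median ranks (i - 0.3) / (n_new + 0.4).
    """
    i = np.arange(1, n_new + 1)
    f_hat = (i - 0.3) / (n_new + 0.4)
    # Inverse of F(x) = 1 - exp(-(x / lam) ** beta).
    return lam * (-np.log1p(-f_hat)) ** (1.0 / beta)

# Five synthetic values for an estimated scale of 2.0 and shape of 1.5.
samples = weibull_synthetic(lam=2.0, beta=1.5, n_new=5)
```

Substituting the generated values back into the CDF returns the median ranks, which is a quick way to check the inversion.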

The detailed procedure
Assume that a training data set has N samples with P mutually independent features, denoted as $T = \{(X_1, y_1), (X_2, y_2), \ldots, (X_N, y_N)\}$. This is a two-class data set in which each sample is $X_i = (x_{i1}, x_{i2}, \ldots, x_{iP})$ and $y_i \in \{+, -\}$ is the target value of $X_i$. Note that the class label of the minority class is positive (+), and the negative (−) label is for the majority class. To explain the proposed procedure in detail, we provide the following steps:
Step 1. Separate the data set T into minority and majority data by the corresponding target value, denoted as $T = \{t_+, t_-\}$.
Step 2. Use the box-and-whisker plot and the MTD method as the undersampling methods to exclude outliers and select the valuable data, reducing the majority class to $t^*_-$ and its size from M to M'. The number of items M' is calculated as $M - S_{box} - S_{mtd}$, where $S_{box}$ and $S_{mtd}$ are the numbers of outliers and valueless samples, respectively. $S_{box}$ is the number of samples that lie outside [LIF, UIF] and are therefore considered suspected outliers; LIF and UIF are given by Eqs (2) and (3). $S_{mtd}$ is the number of samples whose membership degree does not reach the given α-cut; that is, the data that fall outside the range of $A_\alpha$ are removed.
Step 3. Utilize the oversampling method to increase the data size in the minority class $t_+$ from m to m', giving the extended set $t^*_+$.
Step 4. The reduced $t^*_-$ and extended $t^*_+$ sets are merged into a new training data set to establish a learning model.
For every data set, we can implement the above steps to balance the raw data set from (M + m) × P into (M' + m') × P dimensions. The remainder of the raw data set serves as the testing data set. In the testing procedure, this study uses the testing data and iterates the experiment 50 times over all of the scenarios for a given α-cut, comparing the results with those of the OSS, SB, NDO, PPDP, and D3C methods. The testing procedure is shown in Fig 3.

Experiments
To demonstrate the classification performance of the PPDP+D3C method, we used four real data sets and compared the results with those of the OSS, SB, NDO, PPDP, and D3C methods. Furthermore, paired t-tests were used in these comparisons to examine the significance of the results with various sets of imbalanced data.

Four real data sets and classifier selection
In this section, we employ four real data sets (WDBC (available at: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29), PD (available at: https://archive.ics.uci.edu/ml/datasets/parkinsons), VC (available at: http://archive.ics.uci.edu/ml/datasets/vertebral+column), and HS (available at: https://archive.ics.uci.edu/ml/datasets/Haberman's+Survival), downloaded from the UCI Machine Learning Repository database [39]) to demonstrate the performance of the PPDP+D3C with regard to the imbalanced two-class classification problems. The details of these four data sets are summarized in Table 2, where "r" indicates the percentage of minority classes in the samples.
This study applied four different classifiers, NB, KNN, and two types of SVM, to the raw data of these four data sets. In KNN, the parameter k was set to 3. The kernel functions in the two SVMs were linear and polynomial (denoted SVM-linear and SVM-poly, respectively); the cost parameter was set to 1, and the degree of the polynomial kernel was set to 2. The NB and KNN classifiers were implemented in Matlab using the Statistics Toolbox. The SVM-linear and SVM-poly classifiers use LIBSVM [40] as the analysis tool. We selected the best of the four classifiers in the imbalanced scenario (r = 5), using G-mean and F1 as the criteria for assessing classification performance with an imbalanced data set. Using the four raw data sets, we ran the experiment 50 times with the training data size (N) set to 60 and the percentage of the minority class set to 5%. The results, including ACC, G-mean, and F1, are shown in Table 3. They show that SVM-poly has a greater G-mean and F1 than NB, 3-NN, and SVM-linear.
The bold values indicate that SVM-poly achieved the best classification performance on all four data sets (WDBC, PD, VC, and HS); it has the best G-mean and F1 scores.

The suggested value of α-cut
In the majority class, the value of the α-cut is important because it creates a region that controls the amount of representative data. To find an appropriate setting for the α-cut, we examined the classification performance of various α-cut settings for all four data sets (WDBC, PD, VC, and HS). According to the classifier selection results in Section 4.1, we used the SVM-poly classifier to analyze the data with the same parameter settings. Considering the imbalanced data set at (r,N) = (5,60) and performing the experiment 50 times, the classifier's ACC, G-mean, and F1 results for different α-cut values are given in Table 4. As shown in Table 4, we achieve better classification performance when the α-cut value is 0.4 or 0.5. In our opinion, with a smaller α-cut, the data in the created region do not effectively represent the majority class, and the nature of the minority class gradually becomes fuzzy because of the corresponding increase in the number of synthetic samples M'−m. For higher α-cut values, the learning model may experience overfitting because the total amount of data (M' + m') becomes smaller. For this reason, we suggest that the value of the α-cut be set to 0.5.

Experiment design
To create imbalanced scenarios, this experiment drew samples from a raw data set according to the percentage of the minority class, which was variously set to 5%, 10%, 15%, and 20% (r = {5,10,15,20}). The training data size, N, was set to 60, 80, 100 and 150 (a total of 16 scenarios). The remainder of the raw data set functioned as the testing data set. Note that the minority class size must be at least three due to the limitation with regard to sample size described in Step 4 in Section 3.3.2. To comply with this restriction, any value of r(%)×N less than three was changed to three. Using the four data sets (WDBC, PD, VC and HS), we iterated this experiment 50 times (16 scenarios at one time) at α-cut = 0.5. The results in Tables 5, 6, 7 and 8 are the averages of the values of ACC, G-mean and F1 for the imbalanced data sets taken from the WDBC, PD, VC and HS data sets. We used the paired t-test to examine whether the PPDP+D3C achieved statistically significant superiority compared with those methods such as

Experiment results
The experiment results are listed in Tables 5, 6
2. For a fixed N, as r increases, the values of G-mean and F1 increase for all six methods, while the improvement in ACC is not significant.
3. For a fixed N, when r is large, the PPDP+D3C consistently achieves better G-mean and F1 scores than the other five methods, although this superiority is not statistically significant with regard to D3C in a few scenarios.
4. For a fixed r, as N increases, the PPDP+D3C achieves higher G-mean and F1 scores than D3C in most scenarios.
5. For small r and N, the results of the paired t-tests between PPDP+D3C and D3C are significant with regard to G-mean and F1 in some scenarios.
As shown in Tables 5, 6, 7 and 8, on the WDBC data set the PPDP+D3C performs better than the other methods in ACC, G-mean and F1. On the other data sets, the D3C can achieve better ACC, but with fewer statistically significant differences, whereas the PPDP+D3C method achieves excellent G-mean and F1 performance compared to the other methods, and most of these differences are statistically significant. These results show that the PPDP+D3C method achieves higher classification performance on imbalanced data sets than the other five tested methods. In other words, the results show that the proposed method can effectively achieve better classification performance with small values of r and N. Thus, when a data set includes imbalanced classes, the classification performance can be significantly improved by using the PPDP+D3C method. In addition, for a fixed N, as r increases, the number of generated synthetic samples M'−m becomes smaller, as shown in Table 9.

Summary and discussion
Four data sets, the WDBC, PD, VC, and HS, were used in this research to show the performance of the PPDP+D3C method with regard to learning with imbalanced data sets. Based on the results of the experiments, as shown in Tables 5, 6, 7 and 8, the findings can be summarized as follows. The combined PPDP+D3C method achieves better classification performance than the other five methods, and this superiority is statistically significant. In a few scenarios, when the value of r is 15 or 20 with a larger N, the PPDP+D3C has better G-mean and F1 scores than those of the D3C method, although some of the paired t-tests between the PPDP+D3C and the D3C showed no significant differences with regard to the G-mean and F1 scores. This may be because the ratio of minority data to the overall samples is rather large, and the amount of data in the minority part is thus sufficient for learning to occur based on the minority class. For instance, in the VC data set, the p-value of the paired t-test for F1 is 0.08, which is greater than 0.05, at (r,N) = (20,150). In fact, the results in Tables 5, 6, 7 and 8 show that when the value of r(%)×N is smaller, the PPDP+D3C method improves significantly in classification performance with regard to the G-mean and F1 measures.

Conclusion
Imbalanced data classification problems are common in the field of data mining and often lead to low classification performance, because the existing learning algorithms are more suitable for the majority class data. In this work, we combined undersampling and oversampling to balance the training data sets; the undersampling method uses the box-and-whisker plot and the MTD method to reduce the size of the majority class data, while the oversampling method extends the minority class data set by adding generated synthetic samples. Experiments were carried out on four imbalanced data sets. In particular, imbalanced data for a given disease may differ by region, era, and medical environment, which leads to diverse distributions for that disease. When the imbalance of the data is not severe, a good diagnostic model can be obtained using a general analysis method. Otherwise, the method proposed in this study can help to obtain a more accurate diagnostic model. The results showed that our approach achieves better classification performance than the other methods. Thus, our approach can be considered an effective way to enhance the analytical performance for learning imbalanced class distributions. Our plans for future research include exploring how to find better density functions to generate useful synthetic samples to enhance classification performance for specific applications.