A novel dimension reduction algorithm based on weighted kernel principal component analysis for gene expression data

Gene expression data has the characteristics of high dimensionality and a small sample size and contains a large number of redundant genes unrelated to a disease. The direct application of machine learning to classify this type of data not only incurs a great time cost but also sometimes fails to improve classification performance. To counter this problem, this paper proposes a dimension reduction algorithm based on weighted kernel principal component analysis (WKPCA), constructs kernel function weights according to kernel matrix eigenvalues, and combines multiple kernel functions to reduce the feature dimensions. To further improve the dimension reduction efficiency of WKPCA, t-class kernel functions are constructed, and corresponding theoretical proofs are given. Moreover, the cumulative optimal performance rate is constructed to measure the overall performance of WKPCA combined with machine learning algorithms. Naive Bayes, k-nearest neighbour, random forest, iterative random forest and support vector machine approaches are used as classifiers to analyse 6 real gene expression data sets. Compared with the all-variable model, linear principal component dimension reduction and single kernel function dimension reduction, the results show that the classification performance of the 5 machine learning methods mentioned above can be improved effectively by WKPCA dimension reduction.


Introduction
DNA is organized structurally into chromosomes and functionally into genes, which are essentially pieces of DNA containing genetic information [1]. In humans, genes carry the genetic information that determines hair and eye colour, among many other traits, as well as information about when the body's cells grow, divide and die. When a gene is turned on, this is called gene expression. Genetic mutations in normal cells of the human body are closely related to environmental stimuli, age, smoking, diet and other external factors, and can lead to the uncontrolled reproduction of normal cells and, ultimately, to cancer (malignant tumours) [2]. In February 2018, the National Cancer Center of China released the registration data of the pulmonary nodule recognition method based on a multiple kernel learning support vector machine with particle swarm optimization, which obtained better recognition efficiency [23]. Although the kernel methods above have achieved remarkable practical results in many fields, they are all single kernel methods based on a single feature space. Different kernel functions have different characteristics, so their performance differs greatly across applications, and there is no complete theoretical basis for constructing or selecting a kernel function. In addition, when the sample features contain heterogeneous information, the sample size is large, the multi-dimensional data are unnormalised, or the data are non-flat in the high-dimensional feature space, it is not reasonable to process all the samples with a single simple kernel mapping. In view of these problems, many studies have examined kernel combination methods, namely multiple kernel learning. Multiple kernel models are a kind of kernel-based learning model with stronger flexibility.
Recent theory and applications have shown that using multiple kernels instead of a single kernel can enhance the interpretability of the decision function, exploit the feature mapping ability of each basic kernel, and achieve better performance than a single kernel model or a single-kernel machine combination model [24].
In view of the advantages of multiple kernel learning, this paper proposes a novel dimension reduction algorithm based on weighted kernel principal component analysis (WKPCA). Its basic idea is to compute the kernel matrix with a vectorization method, construct the kernel function weights according to the eigenvalues of the kernel matrix, and combine multiple kernel functions; theoretical proofs for the weighted kernel functions are given. Moreover, the t-class kernel function is constructed as a component of the weighted kernel function. Extensive comparison experiments on 6 real data sets show that, compared with the all-variable model, linear principal component dimension reduction and single kernel function dimension reduction, the WKPCA algorithm proposed in this paper can effectively improve the classification performance of current mainstream machine learning methods.

Kernel principal component analysis
Traditional dimension reduction methods assume that the mapping from the high-dimensional feature space to the low-dimensional feature space is linear. However, in many practical tasks, a nonlinear mapping may be needed to find an appropriate low-dimensional embedding [25]. To compensate for the limitations of linear dimension reduction, nonlinear dimension reduction methods based on kernel functions were proposed, among which kernel principal component analysis (KPCA) is the most commonly used. The basic idea is to map the original data set to an appropriate high-dimensional feature space with a nonlinear function; by introducing a kernel function whose form is known, the concrete expression of the nonlinear mapping need not be known. Then, the kernel matrix and its eigenvectors are calculated, giving the projection of the data set in the high-dimensional space based on the eigenvectors.
Suppose that the original data set is D = {x_1, x_2, . . ., x_m}, where x_i = (x_{i1}, x_{i2}, . . ., x_{ip})^T, m is the sample size, i is the sample number, and p is the data dimension. In the high-dimensional feature space, the mapping of x_i is z_i = ϕ(x_i), and the mapped data set is D_0 = {z_1, z_2, . . ., z_m}, whose covariance matrix is C = (1/m) Σ_{i=1}^m z_i z_i^T.

PLOS ONE
The solving goal of KPCA is the eigenvalue problem Cw_j = λ_j w_j. Introducing the kernel function κ(x_i, x_j) = ϕ(x_i)^T ϕ(x_j) (common kernel functions can be found in the literature [26]) and substituting Eqs (4) and (5) into Eq (2) gives the equivalent problem Kα^j = λ_j α^j, where α^j = (α^j_1, α^j_2, . . ., α^j_m)^T is the eigenvector corresponding to the j-th largest eigenvalue λ_j of the kernel matrix K = (κ(x_i, x_j))_{m×m}. After projection, the j-th coordinate component of a sample x is z_j(x) = Σ_{i=1}^m α^j_i κ(x_i, x), where α^j is the normalized eigenvector. It can be seen from Eq (7) that obtaining the projection of a new sample requires summing over all the original data, so the calculation cost is large. However, in the algorithm designed in Section 3.2, vectorization programming specific to the R language can be adopted to improve the calculation efficiency [27].
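The projection step above can be sketched in code. The paper's implementation is in R; the following NumPy sketch (function names and the Gaussian kernel choice are illustrative assumptions, not from the paper) computes the top-d kernel principal component coordinates via the eigendecomposition of the centred kernel matrix:

```python
import numpy as np

def kpca(X, kernel, d):
    """Kernel PCA sketch: project X onto the top-d kernel principal
    components. kernel(X, X) must return an (m, m) kernel matrix."""
    m = X.shape[0]
    K = kernel(X, X)
    # Centre the kernel matrix in feature space.
    one = np.full((m, m), 1.0 / m)
    Kc = K - one @ K - K @ one + one @ K @ one
    # eigh returns eigenvalues in ascending order; flip to descending.
    vals, vecs = np.linalg.eigh(Kc)
    vals, vecs = vals[::-1], vecs[:, ::-1]
    # Normalise eigenvectors so the projected coordinates are consistent.
    alphas = vecs[:, :d] / np.sqrt(np.maximum(vals[:d], 1e-12))
    # z_j(x_i) = sum_k alpha_k^j kappa(x_k, x_i), i.e. Kc @ alphas.
    return Kc @ alphas

def gaussian_kernel(X, Y, gamma=0.1):
    # kappa(x, y) = exp(-gamma * ||x - y||^2), computed without loops.
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
Z = kpca(X, gaussian_kernel, d=3)
print(Z.shape)  # (50, 3)
```

The summation over all training samples noted in the text appears here as the single matrix product `Kc @ alphas`, which is why vectorized computation pays off.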

Weighted kernel function
To further improve the low-dimensional embedding ability of a single kernel function and make the selection of kernel functions more flexible, this paper proposes a weighted kernel function method to reduce the dimensionality of gene expression data with extremely high dimensionality. Its principle is given in the form of the following theorem.
Theorem 1 [28] Let X be the input space, and let κ(·,·) be a symmetric function defined on X × X. Then, κ(·,·) is a kernel function if and only if, for any data set D = {x_1, x_2, . . ., x_m}, the "kernel matrix" K is positive semi-definite.
Theorem 1 shows that as long as the kernel matrix of a symmetric function is positive semi-definite, the function can be used as a kernel function.
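Theorem 1's criterion can be checked numerically for any candidate kernel. A small NumPy sketch (the Gaussian kernel and the sample sizes are arbitrary choices for illustration, not from the paper) that builds a kernel matrix from a random data set and confirms symmetry and the absence of negative eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))

# Gaussian kernel matrix: kappa(x, y) = exp(-||x - y||^2 / 2).
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)

# Theorem 1: K must be symmetric and positive semi-definite.
symmetric = np.allclose(K, K.T)
eigvals = np.linalg.eigvalsh(K)
print(symmetric, eigvals.min() >= -1e-8)  # True True
```

The small tolerance on the eigenvalue check only absorbs floating-point round-off; a genuinely negative eigenvalue would rule the function out as a kernel.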
Let the weighted kernel function be κ(x, y) = Σ_{i=1}^n w_i κ_i(x, y) with weights w_i ≥ 0 (Eq (8)). The kernel matrix of κ_i(x, y) is K_i = (κ_i(x_i, x_j))_{m×m}; thus, the kernel matrix of κ(x, y) is K = Σ_{i=1}^n w_i K_i. According to Theorem 1, κ(x, y) is a kernel function if the kernel matrix K is positive semi-definite.
Let Kx = λx, where x and λ are an eigenvector and the corresponding eigenvalue of K. Then Eq (10) can be expanded as λ = x^T K x / (x^T x) = Σ_{i=1}^n w_i x^T K_i x / (x^T x), where each quadratic form x^T K_i x is controlled by the eigenvalues λ_i of the matrix K_i.

Since the kernel matrices K_1, K_2, . . ., K_n are positive semi-definite, their eigenvalues λ_1, λ_2, . . ., λ_n are non-negative, and each quadratic form x^T K_i x ≥ 0. According to Eq (11), all the eigenvalues of K are therefore non-negative, so the matrix K is positive semi-definite. Because κ(x_i, y_j) = κ(y_j, x_i), κ(x, y) is symmetric, and hence κ(x, y) is a kernel function. □ When the weighted kernel function of Eq (8) is used to reduce the dimensionality of the original data, the weight values must be determined. The basic criterion of weight construction is the ratio of the eigenvalues of each K_i in the weighted kernel to the sum over all of them. The detailed construction process is as shown below.
Assume that the eigenvalues of the kernel matrix K_i are l_{i1} ≥ l_{i2} ≥ . . . ≥ l_{im}, where i = 1, 2, . . ., n, p is the dimension of the original data set, and d is the dimension retained after kernel dimension reduction; generally, d < p or d ≪ p. The weight of the i-th kernel function is w_i = (Σ_{j=1}^d l_{ij}) / (Σ_{k=1}^n Σ_{j=1}^d l_{kj}) (Eq (12)). Through the concept of "weighted kernel function dimension reduction efficiency", the value range of the final number d of extracted features is determined.
Definition Suppose that the eigenvalues of the kernel matrix K = (κ(x_i, x_j))_{m×m} are λ_1 ≥ λ_2 ≥ . . . ≥ λ_m ≥ 0. Then r_j = λ_j / Σ_{i=1}^m λ_i is the dimension reduction efficiency of the kernel discriminant function z_j, and R_d = Σ_{j=1}^d λ_j / Σ_{i=1}^m λ_i is the cumulative dimension reduction efficiency of the first d (d ≤ m) kernel discriminant functions z_1, z_2, . . ., z_d. By analogy with the cumulative contribution rate of principal component analysis [29], the number of features d after dimension reduction should make R_d reach 0.8~0.9.
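The weight construction and the R_d rule above can be sketched together. The weight formula below follows our reading of "the ratio of the eigenvalues of each K_i to the sum of all of them" (restricted to the top d eigenvalues); the exact equation in the paper is not fully recoverable, so treat it as an assumption:

```python
import numpy as np

def kernel_weights(kernel_mats, d):
    """Weight of each base kernel: the share of its top-d eigenvalues
    in the top-d eigenvalues of all base kernel matrices (our reading
    of the construction described above)."""
    tops = [np.sort(np.linalg.eigvalsh(K))[::-1][:d].sum()
            for K in kernel_mats]
    total = sum(tops)
    return [t / total for t in tops]

def choose_d(K, target=0.85):
    """Smallest d whose cumulative dimension reduction efficiency
    R_d = (lambda_1 + ... + lambda_d) / (lambda_1 + ... + lambda_m)
    reaches `target`, per the 0.8~0.9 rule above."""
    vals = np.sort(np.linalg.eigvalsh(K))[::-1]
    ratios = np.cumsum(vals) / vals.sum()
    return int(np.searchsorted(ratios, target) + 1)

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 8))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K1, K2 = np.exp(-0.1 * sq), 1.0 / (1.0 + sq)   # Gaussian and pseudo-t
w = kernel_weights([K1, K2], d=5)
d = choose_d(K1)
print(abs(sum(w) - 1.0) < 1e-12, 1 <= d <= 40)  # True True
```

By construction the weights form a partition of 1, so the combined kernel matrix stays on the same scale as its components.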

T-class kernel function
The weighted kernel function is a combination of multiple single kernel functions, so the choice of each single kernel directly affects the dimension reduction effect of the weighted kernel. Therefore, we construct new kernel functions to improve the ability of the weighted kernel function to reduce the dimensionality of high-dimensional data and thereby improve the classification performance of subsequent machine learning algorithms. According to the following Theorem 3 and the probability density function of the t distribution, the t-class kernel function can be constructed.
Theorem 3 [30] Suppose that f: X → R is a bounded continuous integrable function. Then,

is a kernel function if and only if its Fourier transform is non-negative everywhere.
Theorem 4 When n → +∞, the probability density function of the t distribution is a kernel function.
Proof: First, f(0) = Γ((n+1)/2) / (√(nπ) Γ(n/2)) > 0. We just have to prove that the Fourier transform is non-negative as n → +∞.

Upon substituting Eq (17) into Eq (16), the limiting Fourier transform is proportional to e^{−ω²/2} > 0, since the t density tends to the standard normal density as n → +∞. According to Theorem 3, the probability density function of the t distribution is a kernel function. □ In practice, generally, n ≥ 30.
Corollary 1 When n = 1, the density function of the t distribution is f(x) = 1/(π(1 + x²)) (Eq (18)). Then, Eq (18) is a kernel function.
Proof: Eq (18) is the Cauchy distribution density function, whose Fourier transform is proportional to e^{−|ω|} > 0 [31]. Therefore, Eq (18) is a kernel function. □
Theorem 5 When n → +∞, the function given in Eq (19) is a kernel function.

Proof: The function of Eq (19) is bounded, continuous and integrable, and its limit as n → +∞ exists. According to Theorem 3, since its Fourier transform over (−∞, +∞) is non-negative, it is a kernel function. We call Eq (19) the pseudo t function.
Corollary 2 When n = 1, the pseudo t function is a kernel function.
Proof: According to Theorem 1, we just have to prove that the corresponding kernel matrix is positive semi-definite.

Now, the key problem is whether the pseudo t function with a scale parameter is still a kernel function. We have the following corollary.
Corollary 3 When c > 0, the scaled pseudo t function is a kernel function.
The constant c can be regarded as a scale parameter, so the kernel function in Corollary 3 is a multi-scale kernel function: when the scale parameter is small, it can adapt to samples with drastic changes, and when the scale parameter is large, it can adapt to samples with gentle changes. Multi-scale t kernel functions with different parameters are plotted in Fig 1.
As seen from Fig 1, the kernel function gradually flattens as the scale parameter increases. The multi-scale t-class kernel function constructed in Corollary 3 offers rich scale choices, giving it better adaptability when processing complex data.
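The flattening behaviour described for Fig 1 is easy to reproduce. The exact formula of Corollary 3 is not recoverable from the text, so the sketch below assumes the Cauchy-type form κ(x, y) = 1 / (1 + ‖x − y‖²/c); the flattening with growing c holds for this assumed form:

```python
import numpy as np

def t_kernel(X, Y, c=1.0):
    """Multi-scale t-class (Cauchy-type) kernel. The exact formula of
    Corollary 3 is not recoverable from the text; the form
    kappa(x, y) = 1 / (1 + ||x - y||^2 / c) is assumed here."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return 1.0 / (1.0 + sq / c)

# Larger c flattens the kernel: at a fixed distance, the kernel value
# grows with c, matching the behaviour described for Fig 1.
x = np.zeros((1, 1))
y = np.ones((1, 1))                   # distance 1 from x
small = t_kernel(x, y, c=0.1)[0, 0]   # sharp kernel, ~0.09
large = t_kernel(x, y, c=10.0)[0, 0]  # flat kernel, ~0.91
print(small < large)  # True
```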
If only traditional kernel functions such as the polynomial kernel and the hyperbolic tangent kernel are combined linearly, there is no principled basis for selecting and combining kernel function parameters, and the uneven distribution of samples still cannot be handled satisfactorily, which limits the expressive ability of the decision function. The t-class kernel functions constructed here can ultimately be generalized to multi-scale functions. With the gradual maturation of wavelet theory and multi-scale analysis, the multi-scale kernel method gains a solid theoretical background by introducing scale space.
Some t-class kernel functions are constructed in this section, and they can serve as components of the weighted kernel function. The experimental analysis in Section 4 shows that the t-class kernel function can reduce the dimensionality of gene expression data effectively and improve the classification performance of subsequent machine learning methods.

WKPCA dimension reduction algorithm
According to the theory of kernel principal component analysis and weighted kernel function construction, the basic framework of the WKPCA dimension reduction algorithm is shown in Fig 2. 3.3.1 WKPCA dimension reduction algorithm design. Obviously, the kernel principal components depend on the selection of the kernel function. When constructing the weighted kernel function for dimension reduction, kernel functions such as the Gaussian, Laplace, hyperbolic tangent and polynomial kernels are generally

selected. We can also choose the t-class kernel function constructed in Section 3.2. Since weighted kernel principal component analysis requires the eigenvalues and eigenvectors of the weighted kernel matrix, the corresponding weighted kernel matrix should first be computed from the training samples.
According to Eq (22), if the sample size is only a few hundred, for example, m = 400, then the kernel matrix will contain 160,000 entries. As the sample size increases, the time cost of calculating the weighted kernel matrix grows rapidly.
To improve the operational efficiency of the algorithm, the following method can be adopted. The Gaussian kernel and t-class kernel functions can be regarded as functions of the distance between any two samples, while the hyperbolic tangent and polynomial kernel functions can be regarded as functions of the inner product of any two samples. Take the pseudo t kernel function with n = 1 as an example: its kernel matrix is K = (1/(1 + dist_ij²))_{m×m}, where dist_ij = ‖x_i − x_j‖ is the Euclidean distance between any two samples and Dist = (dist_ij)_{m×m} is the distance matrix of the sample set.

Let M = (x_ij)_{m×n}, and define the matrix function elementwise. According to Eqs (24) and (25), the kernel matrix based on the pseudo t kernel can be regarded as a function of the distance matrix, K = f(Dist). Similarly, the kernel matrix based on the hyperbolic tangent or polynomial kernel can be regarded as a function of the inner product matrix, K = g(MM^T). Therefore, the distance matrix and the inner product matrix can be substituted into the kernel function as a whole to obtain the corresponding "kernel matrix". This process is called the vectorized computation method. In terms of algorithm design, vectorization is faster and more efficient than multiple loop statements.
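The vectorized computation just described can be sketched in a few lines. The paper works in R; this NumPy version (the degree-2 polynomial kernel is an illustrative choice) builds the distance and inner-product matrices once and applies the kernel formulas to them as a whole:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 20))

# Squared-distance matrix computed as a whole, with no loops:
# ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j>.
G = X @ X.T                                   # inner-product matrix
sq_norms = np.diag(G)
Dist2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * G

# Pseudo-t kernel matrix (n = 1): an elementwise function of Dist2.
K_t = 1.0 / (1.0 + Dist2)

# Inner-product kernels reuse G the same way.
K_poly = (G + 1.0) ** 2                       # polynomial kernel, degree 2
print(K_t.shape)  # (400, 400)
```

For m = 400 this fills the 160,000-entry kernel matrices mentioned above in a handful of BLAS calls rather than a double loop.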
The dimensional reduction algorithm flow of WKPCA is given as shown in Table 1.
First, the input of the WKPCA algorithm includes 3 to 4 parts: the original data matrix D = (x_ij)_{m×p}, the number p of features contained in the data, and the distance matrix and/or inner product matrix corresponding to the original data set D. If both a t-class kernel (or Laplace kernel) and a hyperbolic tangent kernel (or polynomial kernel) are defined in step 1 of the algorithm, both the distance matrix and the inner product matrix are needed; otherwise, only one type of matrix is input.
In the first line of the algorithm, to keep the algorithm simple, two or three kernel functions are generally defined. Based on the distance matrix and inner product matrix, each kernel matrix and its eigenvalues are computed in Lines 2 to 4. In Line 5, d is the selected dimension after feature reduction, where d < p. The weight of each kernel function is determined in Lines 6 to 8. The weighted kernel matrix and its eigenvalues and eigenvectors are calculated in Lines 9 to 11. The d-dimensional coordinates of all samples in the new feature space are calculated in Line 12.
Time complexity analysis: owing to vectorized computation, the time used to calculate the distance and inner product matrices in Line 2 is O(1), and the time used in Lines 1 to 4 is O(q). The time consumption of the WKPCA algorithm mainly occurs in Lines 5 to 13, whose time complexity is O((m + q)p). Since the number q of kernel functions is much smaller than the sample size m, the total time complexity of the algorithm is O(mp). It is important to point out that in general m > p or m ≫ p, but for some data sets, such as gene expression data, m ≪ p. The experimental analysis in Section 4 shows that after WKPCA dimension reduction, the number d of retained features only needs to be a few percent of the total number of variables to achieve a better classification prediction effect, with moderate time cost.
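The algorithm flow of Lines 1 to 12 can be condensed into one function. The paper's implementation is in R; this NumPy sketch (kernel choices and the eigenvalue-share weighting are our reading of the text, not the paper's exact code) runs the whole pipeline: base kernel matrices, weights, combination, centring, and projection:

```python
import numpy as np

def gauss(A, B, g=0.05):
    # Gaussian kernel, computed on whole matrices (vectorised).
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-g * sq)

def poly(A, B):
    # Degree-2 polynomial kernel of the inner-product matrix.
    return (A @ B.T + 1.0) ** 2

def wkpca(X, kernels, d):
    """End-to-end WKPCA sketch following the algorithm flow above:
    compute base kernel matrices, weight them by top-d eigenvalue
    share, combine, centre, and project onto d components."""
    mats = [k(X, X) for k in kernels]
    tops = [np.sort(np.linalg.eigvalsh(K))[::-1][:d].sum() for K in mats]
    weights = np.array(tops) / sum(tops)
    K = sum(w * Km for w, Km in zip(weights, mats))
    m = K.shape[0]
    one = np.full((m, m), 1.0 / m)
    Kc = K - one @ K - K @ one + one @ K @ one   # centre in feature space
    vals, vecs = np.linalg.eigh(Kc)
    vals, vecs = vals[::-1], vecs[:, ::-1]       # descending order
    alphas = vecs[:, :d] / np.sqrt(np.maximum(vals[:d], 1e-12))
    return Kc @ alphas, weights

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 30))
Z, w = wkpca(X, [gauss, poly], d=5)
print(Z.shape, round(float(w.sum()), 6))  # (60, 5) 1.0
```

The dominant costs are the m×m kernel matrices and their eigendecompositions, consistent with the complexity discussion above.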

Experimental results and analysis
In this section, the t-class kernel functions constructed in Section 3.2 are weighted and combined. WKPCA dimension reduction based on the t-class weighted kernel function is performed on 6 real gene expression data sets to obtain uncorrelated principal components. According to Eq (14), the number d of retained principal components is determined. Then, current mainstream machine learning methods, including naive Bayes (NB) [32], support vector machines (SVM) [33], k-nearest neighbour (KNN) [34], random forest (RF) [35] and iterative random forest (IRF) [36][37][38], are used to make classification predictions on the reduced subsets. The same machine learning algorithms are also applied to the all-variable (AV) data set and to the data subsets obtained by linear principal component analysis (PCA), single kernel principal component analysis (SKPCA) and weighted kernel principal component analysis (WKPCA) dimension reduction.

Experimental design
The experiments were conducted on a machine running the Windows 10 64-bit operating system with an Intel i7-10510U 2.3 GHz CPU and 16 GB of memory. The algorithm was implemented in the R language (R 3.6.3). The 6 real data sets used in this paper are from the Broad Institute Genome Data Analysis Center (http://portals.broadinstitute.org/cgi-bin/cancer/datasets.cgi). See Table 2 for detailed information.

To compare the performance of the machine learning classification algorithms in different dimensions, the macro accuracy, macro precision, macro recall and macro F1 are used; their definitions are as follows.
Suppose that the data set D has k categories. The i-th category is considered the positive class, and the remaining k − 1 categories are deemed the negative class. We use P_i, R_i and F1_i to denote the precision, recall and F1 of the i-th category, respectively.
From Eqs (28) to (30), the macro measures are obtained by calculating the precision, recall and F1 of each category and averaging them, so as to evaluate the performance of an algorithm on multi-class problems. The larger the macro precision, macro recall and macro F1, the better the performance of the algorithm. The AUC, the area under the ROC curve, is also used as an evaluation criterion [39].
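The one-vs-rest averaging just described can be sketched directly. This is a plain NumPy rendering of the macro measures as described in the text (the tie-breaking for empty classes is our choice):

```python
import numpy as np

def macro_scores(y_true, y_pred, k):
    """Macro precision / recall / F1 by one-vs-rest averaging over the
    k categories, per the description of Eqs (28) to (30)."""
    P, R, F1 = [], [], []
    for i in range(k):
        tp = np.sum((y_pred == i) & (y_true == i))
        fp = np.sum((y_pred == i) & (y_true != i))
        fn = np.sum((y_pred != i) & (y_true == i))
        p = tp / (tp + fp) if tp + fp else 0.0   # precision P_i
        r = tp / (tp + fn) if tp + fn else 0.0   # recall R_i
        P.append(p); R.append(r)
        F1.append(2 * p * r / (p + r) if p + r else 0.0)
    return np.mean(P), np.mean(R), np.mean(F1)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
mp, mr, mf = macro_scores(y_true, y_pred, 3)
print(mp, mr, mf)  # approx 0.722 0.667 0.656
```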
Since the number of categories in each of the 6 data sets is more than 2, the definition of AUC for multi-classification problems given by Hand and Till [40] is adopted. A nonlinear SVM based on the Gaussian kernel function is used. The parameters of the SVM and KNN classification methods are tuned with the parameter-tuning functions tune.svm and tune.kknn in the R language [41]. In tune.svm, the parameter grid search range is set to 0.1 to 4 with step length 0.1. In tune.kknn, the parameter grid search range is set to 1 to 30 with step length 1. RF is set to 500 trees by default, and the number of IRF iterations is set to 6.
To evaluate the overall classification performance of WKPCA combined with various machine learning algorithms, the definition of the optimal performance rate (OPR) of WKPCA is given in this paper.
OPR = PN / (MN × DN × EN), where MN is the number of machine learning algorithms, DN is the number of data sets, EN is the number of evaluation indexes, and PN is the number of times the WKPCA dimension reduction algorithm reaches the maximum under each evaluation index.

By extending Eq (31), the cumulative optimal performance rate (COPR) of WKPCA is given as COPR = Σ_{j=1}^s PN_j / (MN × DN × EN), where PN_j is the number of times the WKPCA dimension reduction algorithm reaches the j-th-ranked value under each evaluation index and s is the number of methods compared with WKPCA.
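The two rates are simple ratios, which a short sketch makes concrete. The normalising denominator MN·DN·EN is our reading of Eqs (31) and (32), not an exact recovery of the paper's formulas, so the functions below are an assumption:

```python
def opr(pn, mn, dn, en):
    # Eq (31), under the assumption that the denominator is MN*DN*EN:
    # the fraction of (method, dataset, index) cells where WKPCA wins.
    return pn / (mn * dn * en)

def copr(pn_by_rank, mn, dn, en):
    # Eq (32): cumulative version, summing the per-rank win counts.
    return sum(pn_by_rank) / (mn * dn * en)

# With MN = 5 learners, DN = 6 data sets and EN = 4 indexes there are
# 120 evaluation cells; 71 first-place finishes would give:
print(round(opr(71, 5, 6, 4), 3))  # 0.592
```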

Comparison experiment
Based on the t-class weighted kernel function, WKPCA dimension reduction is performed on the 6 gene expression data sets in Table 2. Extensive comparative experiments show that, for different data sets and different classification methods, different kernel combination formulas yield different classification performance after dimension reduction. To achieve relatively optimal performance of the classification algorithms after kernel principal component dimension reduction, the following three forms of kernel combination formula are mainly adopted.
The above equations are used to reduce the dimensionality of the original data sets based on kernel principal components, and the results are compared with the traditional Gaussian kernel; the experimental results are shown in Tables 3 to 8. For SKPCA dimension reduction, the selected single kernel function is the Gaussian kernel. The weights in Eqs (29), (30) and (31) are determined according to Eq (12). The scale parameters c_1, c_2 and γ are determined with a wrapper learning approach: kernel parameter selection is combined with the subsequent machine learning classification algorithm, and the parameters that optimize classification performance are selected through cross validation. Finally, these parameters are set to c_1 = 0.1, c_2 = 0.2 and γ = 0.1.
According to the experimental results in Tables 3 to 8, it is not difficult to find relatively optimal parameters.
The above five machine learning methods are used to classify and predict the following 4 data sets: (1) the data set with all variables; (2) the data set obtained by linear principal component analysis dimension reduction; (3) the data set obtained by single kernel function dimension reduction; and (4) the data set obtained by weighted kernel function dimension reduction. The comparison results obtained through nested 5-fold cross validation are shown in Tables 3 to 8, in which the optimal performance index values are shown in bold.

We combine the five machine learning methods with AV, PCA, SKPCA and WKPCA, so each table (Tables 3 to 8) contains 20 methods. In these six tables, the machine learning classification algorithms combined with WKPCA generally achieve the best performance. Taking the Breast data set as an example, compared with AV, PCA and SKPCA, NB_WKPCA, SVM_WKPCA, KNN_WKPCA and RF_WKPCA were the largest on all four evaluation indexes; however, IRF_WKPCA did not reach the maximum on the four evaluation indexes. The other tables show similar results. According to the experimental results in Tables 3 to 8, among the 5 machine learning methods combined with WKPCA, there are, respectively, 4, 14, 5, 13, 10 and 3 cases that do not reach the maximum on the four evaluation indexes. The optimal performance rate of WKPCA on these 6 data sets then follows from Eq (31), and the cumulative optimal performance rate from Eq (32). Through the OPR and COPR values, it can be concluded that the WKPCA algorithm is optimal in 71 cases and suboptimal in 37, and the cumulative optimal performance rate of the first two positions reaches 95%. This indicates that WKPCA dimension reduction can effectively improve the classification performance of current mainstream machine learning algorithms. In other words, WKPCA is superior to AV, PCA and SKPCA in most cases.
It should be noted that for the SVM classification algorithm, if all variables are involved in the modelling without dimension reduction, the classification accuracy of SVM_AV on the 6 data sets is only 0.5200, 0.4833, 0.3803, 0.3188, 0.1933 and 0.7053. After WKPCA dimension reduction, the SVM classification accuracy improves greatly, reaching 0.9184, 0.9556, 0.8071, 0.9758, 0.9805 and 0.9796, respectively. This shows that when the number of features in a data set is much larger than the number of samples, the classification performance of some algorithms degrades, or the algorithms even fail, if all variables are included in the model. After WKPCA dimension reduction, a few mutually uncorrelated principal components are retained, redundant information (noise interference) is eliminated, and the main information related to the sample category is preserved, which improves the classification performance of the machine learning algorithms. In Tables 5 and 7, NB_AV has missing values (NA) on the four performance indexes. Experimental analysis shows that the cause is that the sample variance of at least one variable is 0; if all variables are included in the NB model, the algorithm fails. After dimension reduction by WKPCA, PCA or SKPCA, zero variance is avoided, and normal classification results are obtained.
To intuitively compare the classification effects of AV, PCA, SKPCA and WKPCA combined with the above five machine learning methods, the SVM, KNN and RF classifiers are taken as examples (the other classifiers behave similarly). Bar charts of the nested 5-fold cross-validation AUC values on the six data sets are shown in Figs 3 to 5.
As seen from Fig 3, except that SVM_WKPCA is slightly inferior to SVM_PCA on the Leukaemia data, the AUC values of SVM_WKPCA on the other 5 data sets reach the maximum, performing significantly better than SVM_AV and SVM_SKPCA and slightly better than SVM_PCA. In Fig 4, the AUC value of RF_WKPCA on the Breast data set is lower than those of RF_AV and RF_PCA, while on the other 5 data sets the AUC values of RF_WKPCA all achieve the optimal values, although the advantage is not very pronounced. As seen from Fig 5, the AUC values of KNN_WKPCA reach the maximum on the Multi-A and Lung data sets. The AUC value of KNN_WKPCA is similar to that of KNN_AV or KNN_PCA on the Breast, DLBCL-B and Leukaemia data sets. For the DLBCL-D data set, the AUC value of

KNN_WKPCA is the lowest. This shows that, for different data sets, WKPCA dimension reduction cannot make every classification algorithm achieve optimal performance. Overall, Figs 3 to 5 show that the AUC values of the SVM, RF and KNN classifiers improve after WKPCA dimension reduction for most data sets. The results show that WKPCA dimension reduction can effectively improve the predictive performance of current mainstream machine learning classification algorithms.

Conclusion
Aiming at the high dimensionality, high redundancy and small sample sizes of gene expression data sets, a principal component dimension reduction algorithm based on the weighted kernel function is proposed in this paper to improve machine learning classification performance and reduce the complexity of the classification process. The kernel function weights are constructed from the eigenvalues of the kernel matrices, and the t-class kernel function is constructed to further improve the dimension reduction efficiency of WKPCA. Finally, the cumulative optimal performance rate is constructed to evaluate the overall classification level of WKPCA combined with mainstream machine learning algorithms. Experimental results on 6 real data sets show that, compared with the all-variable model, traditional linear principal component analysis dimension reduction and single kernel principal component analysis dimension reduction, the WKPCA dimension reduction algorithm proposed in this paper can effectively improve the classification prediction performance of current mainstream machine learning methods.
The key to WKPCA dimension reduction lies in choosing a 'suitable kernel function'. Our weighted kernel function makes the form of the kernel function more diversified and its selection more flexible, allowing better adaptation to data sets from different fields. In real-world problem analysis, to achieve the desired performance of machine learning on each data set in this paper, we had to try different kernel function combinations with different parameter settings. In other words, the best algorithm configuration is data set dependent.

However, our WKPCA dimension reduction algorithm is quite insensitive to parameter settings.