Gene shaving using a sensitivity analysis of kernel based machine learning approach, with applications to cancer data

Md. Ashad Alam; Mohammd Shahjaman; Md. Ferdush Rahman; Fokhrul Hossain; Hong-Wen Deng

doi:10.1371/journal.pone.0217027

Abstract

Background

Gene shaving (GS) is an essential and challenging tool for biomedical researchers due to the large number of genes in the human genome and the complex nature of biological networks. Most GS methods are not applicable to non-linear and multi-view data sets. While kernel based methods can overcome these problems, a well-founded positive definite kernel based GS method has yet to be proposed for biomedical data analysis.

Methods and findings

Since kernel based methods on genomic information can improve the prediction of diseases, we propose a novel method, “kernel based gene shaving”, which is based on the influence function of kernel canonical correlation analysis. To investigate the performance of the proposed method in comparison to state-of-the-art methods in gene shaving, we analyzed extensive simulated and real microarray gene expression data sets. Performance metrics including the true positive rate, true negative rate, false positive rate, false negative rate, misclassification error rate, false discovery rate and area under the curve were computed for each method. In the colon cancer data analysis, the proposed method identified a significant subset of 210 genes out of 2000 genes and showed suggestively superior performance compared with other methods. The proposed method can be applied to the study of other disease processes where two-view data analysis is a common task.

Conclusions

We addressed the challenge of finding a unique kernel based GS method by using the influence function of kernel canonical correlation analysis. The proposed method has been shown to perform better than state-of-the-art methods in gene shaving and has identified many more significant gene interactions, suggesting that genes function in a concerted effort in colon cancer. In similar biomedical data analyses, kernel based methods could be applied to select a potential subset of genes. Positive definite kernel based methods can overcome the non-linearity problem and improve the prediction process.

Citation: Alam MA, Shahjaman M, Rahman MF, Hossain F, Deng H-W (2019) Gene shaving using a sensitivity analysis of kernel based machine learning approach, with applications to cancer data. PLoS ONE 14(5): e0217027. https://doi.org/10.1371/journal.pone.0217027

Editor: Enrique Hernandez-Lemus, Instituto Nacional de Medicina Genomica, MEXICO

Received: February 1, 2019; Accepted: May 2, 2019; Published: May 23, 2019

Copyright: © 2019 Alam et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The data set and summary statistics are available at Princeton University, gene expression project, http://genomics-pubs.princeton.edu/oncology/. This data set is also available at ‘rda’ R package, https://cran.r-project.org/web/packages/rda/.

Funding: Our research was partially supported by grants from the National Institutes of Health [R01AR057049, R01AR059781, P20 GM109036, R01MH107354, R01MH104680, R01GM109068, U19AG055373, and R01AR069055], and the Edward G. Schlieder Endowment fund to Tulane University.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Gene shaving (GS), the identification of significant subsets of genes, is an important research area in the analysis of DNA microarray gene expression data for biomedical discovery. GS methods aim to remove redundant and irrelevant genes so that supervised learning becomes more accurate [1, 2]. GS leads to the discovery of genes relevant to a particular target annotation and contributes to better medical diagnosis and prognosis. GS differs from hierarchical clustering and other methods widely used for analyzing gene expression in genome-wide association studies. The genes selected by GS play an important role in gene expression data analysis since they can differentiate samples from different populations [3–6]. Despite their successes, these studies are often hampered by relatively low reproducibility, nonlinearity and multi-view data.

The incorporation of statistical machine learning approaches into genomic analysis is a rather recent area of study. Since large-scale microarray data present significant challenges for statistical data analysis, there is a need for advanced methods in addition to the classical approaches. Kernel methods (methods based on positive definite kernels) are appropriate tools for such data sets: they map data from a high dimensional space to a feature space using a nonlinear feature map. The main advantage of these methods is that they combine statistics and geometry in an effective way [7–9]. As a machine learning approach, kernel canonical correlation analysis (kernel CCA) has been extensively studied for decades to analyze multi-view data sets [10–12]. Using the influence function (IF) of kernel CCA, we propose a novel kernel method to select a significant subset of genes in biomedical data analysis.

IF based methods (e.g., sensitivity analysis) are now used to detect influential observations. The IF is used to find a set of vectors that have a much greater effect on the estimator of a parameter [13]. A visualization method for detecting influential observations using the IF of kernel principal component analysis has been proposed by Debruyne et al. [14]. Filzmoser et al. also developed a method for outlier identification in high dimensions [15]. However, these methods are limited to single-view data sets. Due to the properties of eigen-decomposition, kernel CCA and its variants are still widely used methods for biomedical data analysis [16–18].

The contribution of this paper is three-fold. First, we address the IF of kernel CCA. Second, we use distribution based methods to confirm the influential observations. Finally, the proposed method is applied to identify a set of genes in both synthesized and real gene expression data. Based on the area under the curve (AUC), the proposed method is more accurate than the state-of-the-art methods in gene shaving. In the colon cancer data analysis, we used the proposed method to identify genes and performed pathway analysis [gene ontology (GO) biological process categories and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways] and gene-gene interaction network analysis. We found that the identified genes function in a concerted effort and have biological relevance to colon cancer. In addition, classification based on the selected genes is superior to classification based on genes selected by other methods as well as classification using all genes. For any biomedical data analysis, the proposed method could be applied to select a potential subset of genes.

The remainder of the paper is organized as follows. In the materials and methods section, we provide a brief review of positive definite kernels, kernel CCA and the IF of kernel CCA. The utility of the proposed method is demonstrated by both simulated and real data analysis from a colon cancer study in the experimental results section. In the discussion section, we summarize our findings and give a perspective for future research.

Materials and methods

Positive definite kernel

In kernel methods, a nonlinear feature map is defined by a positive definite kernel. A positive definite kernel $k$ is associated with a Hilbert space $\mathcal{H}$, called the reproducing kernel Hilbert space (RKHS), consisting of functions on $\mathcal{X}$ such that the function value is reproduced by the kernel [19]. For any function $f \in \mathcal{H}$ and a point $X \in \mathcal{X}$, the function value is $f(X) = \langle f(\cdot), k(\cdot, X) \rangle_{\mathcal{H}}$, where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is the inner product of $\mathcal{H}$; this is called the reproducing property. Replacing $f$ with $k(\cdot, \tilde{X})$ yields $k(X, \tilde{X}) = \langle k(\cdot, X), k(\cdot, \tilde{X}) \rangle_{\mathcal{H}}$ for any $X, \tilde{X} \in \mathcal{X}$. A symmetric kernel $k(\cdot, \cdot)$ defined on a space $\mathcal{X}$ is called positive definite if, for an arbitrary number of points $X_1, X_2, \ldots, X_n \in \mathcal{X}$, the Gram matrix $(k(X_i, X_j))_{ij}$ is positive semi-definite. To transform data for extracting nonlinear features, the mapping $\Phi: \mathcal{X} \to \mathcal{H}$ is defined as $\Phi(X) = k(\cdot, X)$, which is a function of the first argument. This map is called the feature map, and the vector $\Phi(X)$ in $\mathcal{H}$ is called the feature vector. The inner product of two feature vectors is then $\langle \Phi(X), \Phi(\tilde{X}) \rangle_{\mathcal{H}} = k(X, \tilde{X})$. This is known as the kernel trick. By this trick the kernel can evaluate the inner product of any two feature vectors efficiently without knowing an explicit form of $\Phi(\cdot)$ [7–9].
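The kernel trick can be illustrated with a small sketch. The degree-2 polynomial kernel and its explicit feature map below are illustrative choices that do not come from the paper; they are used only because this kernel has a finite feature map that can be written out and compared against the kernel value.

```python
import numpy as np

def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: k(x, y) = (x . y)^2."""
    return float(np.dot(x, y)) ** 2

def poly2_feature_map(x):
    """Explicit feature map for 2-D inputs with <phi(x), phi(y)> = (x . y)^2."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# Kernel trick: the same inner product, with and without the feature map.
via_kernel = poly2_kernel(x, y)
via_features = float(poly2_feature_map(x) @ poly2_feature_map(y))

# Gram matrix on a few toy points; positive definiteness of the kernel
# means this matrix has no negative eigenvalues (up to round-off).
points = np.array([[1.0, 2.0], [3.0, -1.0], [0.5, 0.5], [-2.0, 1.0]])
gram = np.array([[poly2_kernel(p, q) for q in points] for p in points])
min_eigenvalue = float(np.linalg.eigvalsh(gram).min())
```

For kernels such as the Gaussian kernel the feature space is infinite dimensional, so only the kernel evaluation is available in practice; the explicit map here exists purely for comparison.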

Kernel canonical correlation analysis

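Kernel CCA has been proposed as a nonlinear extension of linear CCA [10]. Researchers have extended the standard kernel CCA with an efficient computational algorithm [20]. Over the last decade, kernel CCA has been used for various tasks [21–23]. Given two sets of random variables $X$ and $Y$ with two functions in the RKHSs, $f_X(\cdot) \in \mathcal{H}_X$ and $f_Y(\cdot) \in \mathcal{H}_Y$, the optimization problem for the random variables $f_X(X)$ and $f_Y(Y)$ is

$$\max_{\substack{f_X \in \mathcal{H}_X,\, f_Y \in \mathcal{H}_Y \\ f_X \neq 0,\, f_Y \neq 0}} \frac{\mathrm{Cov}(f_X(X), f_Y(Y))}{[\mathrm{Var}(f_X(X))]^{1/2}\,[\mathrm{Var}(f_Y(Y))]^{1/2}}. \tag{1}$$

The optimizing functions $f_X(\cdot)$ and $f_Y(\cdot)$ are determined up to scale.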

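Using a finite sample, we are able to estimate the desired functions. Given an i.i.d. sample $(X_i, Y_i)_{i=1}^{n}$ from a joint distribution $F_{XY}$, by taking the inner product with elements or “parameters” in the RKHS, we have features $f_X(\cdot) = \sum_{i=1}^{n} a_X^{i} k_X(\cdot, X_i)$ and $f_Y(\cdot) = \sum_{i=1}^{n} a_Y^{i} k_Y(\cdot, Y_i)$, where $k_X(\cdot, X)$ and $k_Y(\cdot, Y)$ are the associated kernel functions for $\mathcal{H}_X$ and $\mathcal{H}_Y$, respectively. The kernel Gram matrices are defined as $K_X = (k_X(X_i, X_j))_{i,j=1}^{n}$ and $K_Y = (k_Y(Y_i, Y_j))_{i,j=1}^{n}$. We need the centered kernel Gram matrices $M_X = C K_X C$ and $M_Y = C K_Y C$, where $C = I_n - \frac{1}{n} B_n$ with $B_n = 1_n 1_n^{T}$, and $1_n$ is the vector with $n$ ones. The empirical estimate of Eq (1) is then given by

$$\max_{\substack{f_X \in \mathcal{H}_X,\, f_Y \in \mathcal{H}_Y \\ f_X \neq 0,\, f_Y \neq 0}} \frac{\widehat{\mathrm{Cov}}(f_X(X), f_Y(Y))}{[\widehat{\mathrm{Var}}(f_X(X))]^{1/2}\,[\widehat{\mathrm{Var}}(f_Y(Y))]^{1/2}},$$

where

$$\widehat{\mathrm{Cov}}(f_X(X), f_Y(Y)) = \frac{1}{n} a_X^{T} M_X M_Y a_Y, \quad \widehat{\mathrm{Var}}(f_X(X)) = \frac{1}{n} a_X^{T} M_X^{2} a_X, \quad \widehat{\mathrm{Var}}(f_Y(Y)) = \frac{1}{n} a_Y^{T} M_Y^{2} a_Y,$$

and $a_X$ and $a_Y$ are the directions of $X$ and $Y$, respectively.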
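The centering step and the empirical optimization can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the ridge term `kappa` is an assumption added here (a common device, because the unregularized empirical problem typically admits trivial solutions), and the Gaussian Gram matrices, the bandwidth of 1.0 and the toy data are likewise illustrative.

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Gram matrix with entries exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def center_gram(K):
    """Return M = C K C with C = I_n - (1/n) 1_n 1_n^T."""
    n = K.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    return C @ K @ C

def kernel_cca(K_X, K_Y, kappa=0.1):
    """First regularized kernel canonical correlation and its directions.

    Assumption: the constraints use (M + kappa I)^2 instead of M^2.
    """
    n = K_X.shape[0]
    M_X, M_Y = center_gram(K_X), center_gram(K_Y)
    R_X_inv = np.linalg.inv(M_X + kappa * np.eye(n))
    R_Y_inv = np.linalg.inv(M_Y + kappa * np.eye(n))
    # With u = (M_X + kappa I) a_X and v = (M_Y + kappa I) a_Y the problem
    # becomes max u^T T v over unit vectors, solved by the leading
    # singular triplet of T.
    T = R_X_inv @ M_X @ M_Y @ R_Y_inv
    U, s, Vt = np.linalg.svd(T)
    a_X = R_X_inv @ U[:, 0]
    a_Y = R_Y_inv @ Vt[0, :]
    return float(s[0]), a_X, a_Y, M_X, M_Y

rng = np.random.default_rng(1)
n = 40
X = rng.normal(size=(n, 1))
Y = X ** 2 + 0.1 * rng.normal(size=(n, 1))  # nonlinear dependence between views

rho, a_X, a_Y, M_X, M_Y = kernel_cca(gaussian_gram(X, 1.0), gaussian_gram(Y, 1.0))
cv_X, cv_Y = M_X @ a_X, M_Y @ a_Y  # canonical variates at the sample points
```

The returned `rho` is the value of the regularized objective, so it is strictly below 1 and shrinks as `kappa` grows; in practice `kappa` and the bandwidths would need to be tuned.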

Influence function of the kernel canonical correlation analysis

Since 1974, the IF has played an important role in detecting outlying multivariate observations in statistical analysis. The IF is usually defined as a first order approximation for estimators of parameters in a multivariate population; it indicates where in the n-dimensional space of observations the observed vectors have a large effect on the value of the estimator. For a sample of observation vectors, we can define the IF based on the empirical distribution (EIF) to find the set of vectors that have a much greater effect on the estimator. These vectors are called outlying vectors [13]. In many situations outliers are the special points of interest, and their recognition is the main goal of the investigation. Although there are several approaches to identifying outliers in multivariate data analysis, the goal of this paper is to identify a set of outlying observations in two-view data sets using the IF of kernel CCA.

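Using the ideas behind the IF of linear PCA, kernel PCA, and linear CCA, the IF of kernel CCA has been proposed by Alam et al. [18]. Given two sets of random variables $(X, Y)$ having the distribution $F_{XY}$, the $j$-th kernel CC ($\rho_j$) and the kernel CVs ($f_{jX}(X)$ and $f_{jY}(Y)$), the influence function of the kernel CC at $Z' = (X', Y')$ is given by

$$\mathrm{IF}(Z', \rho_j^2) = -\rho_j^2 \bar{f}_{jX}^2(X') + 2 \rho_j \bar{f}_{jX}(X') \bar{f}_{jY}(Y') - \rho_j^2 \bar{f}_{jY}^2(Y'),$$

where $\bar{f}_{jX}(X') = \langle f_{jX}, \tilde{k}_X(\cdot, X') \rangle_{\mathcal{H}_X}$ and $\bar{f}_{jY}(Y') = \langle f_{jY}, \tilde{k}_Y(\cdot, Y') \rangle_{\mathcal{H}_Y}$, with $\tilde{k}_X$ and $\tilde{k}_Y$ the centered kernels. The above theorem has been proved on the basis of previously established results, such as the IF of linear PCA [24, 25], the IF of linear CCA [26], and the IF of kernel PCA. The detailed proof is given in [18].

Let $(X_i, Y_i)_{i=1}^{n}$ be a sample from the empirical joint distribution $F_{nXY}$. The EIF of the kernel CC at $(X', Y')$ for all points $(X_i, Y_i)$ is defined as

$$\mathrm{EIF}(X_i, Y_i, X', Y', \rho_j^2) = \widehat{\mathrm{IF}}(X', Y', \hat{\rho}_j^2) = -\hat{\rho}_j^2 \hat{\bar{f}}_{jX}^2(X') + 2 \hat{\rho}_j \hat{\bar{f}}_{jX}(X') \hat{\bar{f}}_{jY}(Y') - \hat{\rho}_j^2 \hat{\bar{f}}_{jY}^2(Y'), \tag{2}$$

where the hats denote estimates computed from the sample.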
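To make the use of the EIF concrete, the sketch below evaluates an influence value per observation from a canonical correlation and a pair of canonical variates. It assumes the form −ρ²f_X² + 2ρ f_X f_Y − ρ²f_Y² reported for the IF of kernel CCA in [18] (the same structure as the linear CCA case); the standardization of the variates and the toy "variates" standing in for real kernel CVs are assumptions made only for illustration.

```python
import numpy as np

def empirical_influence(rho, f_x, f_y):
    """Influence value -rho^2 fx^2 + 2 rho fx fy - rho^2 fy^2 per observation.

    Assumption: the variates are centered and scaled to unit variance first.
    """
    f_x = (f_x - f_x.mean()) / f_x.std()
    f_y = (f_y - f_y.mean()) / f_y.std()
    return -rho ** 2 * f_x ** 2 + 2.0 * rho * f_x * f_y - rho ** 2 * f_y ** 2

rng = np.random.default_rng(0)
n = 200
# Toy stand-ins for the canonical variates of the two views.
f_x = rng.normal(size=n)
f_y = 0.8 * f_x + 0.6 * rng.normal(size=n)
# Planted observation that contradicts the overall association.
f_x[0], f_y[0] = 4.0, -4.0

rho = float(np.corrcoef(f_x, f_y)[0, 1])
influence = empirical_influence(rho, f_x, f_y)
most_influential = int(np.argmax(np.abs(influence)))
```

In this toy setting the planted observation receives by far the largest absolute influence value, which is the behavior the second step of the method relies on.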

Using the above result, we can identify a set of observations based on their influence values. To demonstrate, we propose a novel method with application to DNA microarray gene expression data. This method can be applied to the study of any disease process where two-view data analysis is a common task. The proposed approach consists of two basic parts: a step that calculates the influence value of each gene and a step that determines the outlying genes. For the first step, we use the EIF in Eq (2); for the second, any univariate outlier detection tool can be used. To extract the outlying genes, we have considered distribution based tools.
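The second step, flagging outlying influence values with a univariate rule, might look like the sketch below. The paper does not state which distribution based tool was used, so the median/MAD rule, the consistency constant 1.4826 and the cutoff of 3.5 are assumptions chosen for illustration, and the influence values are simulated.

```python
import numpy as np

def flag_outliers(values, cutoff=3.5):
    """Flag values whose robust z-score (median/MAD based) exceeds the cutoff.

    Assumption: this rule is a stand-in for the unspecified
    distribution based outlier detection tool.
    """
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    robust_z = (values - median) / (1.4826 * mad)
    return np.abs(robust_z) > cutoff

rng = np.random.default_rng(42)
influence = rng.normal(size=500)                # hypothetical influence values
influence[:5] = [8.0, -9.0, 10.0, 7.5, -8.5]    # five planted outlying "genes"
selected = np.flatnonzero(flag_outliers(influence))
```

Any other univariate detector (for example a boxplot rule) could be substituted for `flag_outliers` without changing the first step.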

Kernel choice

In kernel based learning, choosing a suitable kernel is key to favorable results. Most unsupervised kernel methods suffer from the problem of kernel choice. The linear kernel simply uses the underlying Euclidean space to define the similarity measure. Whenever the dimensionality of the input space X is very high, this might allow for more complexity in the function class than what we could measure and assess otherwise, but the linear kernel is limited to linear structure. With a polynomial kernel it is possible to use higher order correlations in the data for different purposes. However, due to its finite bounded degree, such a kernel does not guarantee a good dependency measure. In addition, both linear and polynomial kernels are non-robust.

The Gaussian kernel is a radial basis function kernel that maps X into an infinite dimensional space. The Gaussian kernel is defined as

$$k(X_i, X_j) = \exp\left(-\frac{\|X_i - X_j\|^2}{2\sigma^2}\right), \quad \sigma > 0.$$

This kernel, the most widely applicable in kernel methods, has a number of theoretical properties (e.g., boundedness, consistency, characteristic, universality, robustness) [27]. In this paper we consider the Gaussian kernel and use the median of the pairwise distances as the bandwidth [28, 29].
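A minimal sketch of the median bandwidth choice follows. It assumes the Gaussian kernel is parameterized as exp(−‖X_i − X_j‖² / (2σ²)) and that the median is taken over the distinct pairs i < j; both conventions vary between papers, and the toy data are random.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))

def median_bandwidth(X):
    """Median of the pairwise distances over distinct pairs (i < j)."""
    D = pairwise_distances(X)
    upper = np.triu_indices(X.shape[0], k=1)
    return float(np.median(D[upper]))

def gaussian_gram(X, sigma):
    """Gram matrix with entries exp(-d_ij^2 / (2 sigma^2))."""
    D = pairwise_distances(X)
    return np.exp(-D ** 2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))     # 6 toy observations with 4 features
sigma = median_bandwidth(X)
K = gaussian_gram(X, sigma)
```

With this choice the kernel value at the median distance is exp(−1/2), so roughly half of the off-diagonal Gram entries lie above that value and half below, which keeps the Gram matrix away from both the identity and the all-ones matrix.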

The only assumption of kernel methods (methods based on positive definite kernels) is that the data lie in a non-empty set. Kernel methods are independent of the dimension of the data. They allow us to construct spaces of functions on an arbitrary set with the appropriate structure of a Hilbert space. By the reproducing property, computing the inner product in the RKHS is easy, and the computational cost depends only on the sample size. Kernel methods may have computational issues for very large data sets because the Gram matrices scale with the sample size. However, recent developments in approximation methods such as random Fourier features enable us to apply kernel methods to data sets with millions of samples.
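The random Fourier feature idea mentioned above can be sketched as follows for the Gaussian kernel exp(−‖x − y‖² / (2σ²)). The paper gives no implementation; the function names, the number of features and the toy data are illustrative assumptions. The construction draws frequencies from N(0, σ⁻²I) and uses cosine features with random phases.

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Exact Gram matrix with entries exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def random_fourier_features(X, sigma, n_features, rng):
    """Map X to features Z such that Z @ Z.T approximates the Gaussian Gram matrix."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))  # frequencies ~ N(0, sigma^-2 I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)       # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
sigma = 2.0

Z = random_fourier_features(X, sigma, n_features=10000, rng=rng)
K_exact = gaussian_gram(X, sigma)
K_approx = Z @ Z.T
max_abs_error = float(np.max(np.abs(K_exact - K_approx)))
```

The point of the approximation is that downstream computations can work with the n × D feature matrix `Z` instead of the n × n Gram matrix, which matters when n is in the millions and D is a few thousand.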

Relevant approaches

While the proposed approach is designed for two-view data sets, we compare its performance against other relevant algorithms for univariate or multivariate (one-view) data sets only, since a two-view comparison is not feasible. To demonstrate the performance of the proposed method, we examine four popular gene selection methods: the T-test, significance analysis of microarrays (SAM), Linear Models for Microarray and RNA-Seq Data (LIMMA) and principal components to identify outliers (PCout) [15, 30–32]. Computing a t-test statistic can be problematic because the variance estimates can be skewed by genes having a very low variance [30]. For each gene, SAM gives a score on the basis of the change in gene expression relative to the standard deviation of repeated measurements. For genes with scores greater than an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of genes identified by chance, the false discovery rate (FDR) [31]. LIMMA contains rich features for handling complex experimental designs and for information borrowing to overcome the problem of small sample sizes. This linear modelling strategy has been found to have many applications beyond the intended analysis of gene expression data [32]. PCout is a computationally fast procedure for identifying outliers that is particularly effective in high dimensions. The algorithm utilizes simple properties of the transformed space, needs less computational time than existing methods for outlier detection, and is suitable for use on very large data sets [15]. However, it is limited to linear structure and single-view data sets. We compared all of these methods to the proposed method.

Experimental results

We used both simulated data and a real microarray gene expression data set of colon cancer [33]. For the relevant approaches (T-test, SAM, LIMMA and PCout) we used the four R packages STATS, SAMR, LIMMA and PCout, respectively. The performance measures, including the true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), false negative rate (FNR), misclassification error rate (MER), FDR and AUC, were evaluated for each of the methods as previously described [34]. To compute the performance measures, we used R packages available from the Comprehensive R Archive Network or Bioconductor.
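The performance measures listed above can be sketched as follows. The authors computed them with R packages; this Python version uses the standard confusion-matrix definitions and a pairwise (Mann-Whitney) form of the AUC, which are assumed here to match those of [34]. The toy truth labels and scores are made up for illustration.

```python
import numpy as np

def selection_metrics(truth, selected):
    """Confusion-matrix based rates for a gene selection result."""
    truth = np.asarray(truth, dtype=bool)
    selected = np.asarray(selected, dtype=bool)
    tp = int(np.sum(truth & selected))
    tn = int(np.sum(~truth & ~selected))
    fp = int(np.sum(~truth & selected))
    fn = int(np.sum(truth & ~selected))
    return {
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
        "MER": (fp + fn) / truth.size,
        "FDR": fp / (fp + tp) if (fp + tp) > 0 else 0.0,
    }

def auc(truth, scores):
    """Probability that a true DE gene scores above a non-DE gene (ties count 1/2)."""
    truth = np.asarray(truth, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[truth], scores[~truth]
    greater = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return float((greater + 0.5 * ties) / (pos.size * neg.size))

# Toy example: 10 genes, the first 3 are truly DE.
truth = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
selected = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.25, 0.15, 0.05]

metrics = selection_metrics(truth, selected)
auc_value = auc(truth, scores)
```

In the toy example two of the three DE genes are selected along with one false positive, so TPR = 2/3, FDR = 1/3 and MER = 0.2, and 19 of the 21 DE/non-DE pairs are ranked correctly.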

Simulation study

To investigate the performance of the proposed method in comparison with the four popular methods mentioned above with k = 2 groups, we considered gene expression profiles from both the normal distribution and the t-distribution. We also considered data sets for both small- and large-sample cases with different percentages of differentially expressed (DE) genes.

Simulated gene expression profiles generated from normal distribution

We used a one-way ANOVA model to generate simulated data sets from the normal distribution,

$$x_{ijk} = \mu_{ik} + \epsilon_{ijk}, \tag{3}$$

where x_ijk is the expression of the ith gene for the jth sample in the kth group, μ_ik is the mean of all expressions of the ith gene in the kth group, and ϵ_ijk is the random error, which follows a normal distribution with mean zero and variance σ².

To compare the proposed method with the four popular methods mentioned earlier for k = 2 groups, we generated 100 data sets for both the small-sample (n₁ = n₂ = 3) and large-sample (n₁ = n₂ = 15) cases using Eq (3). The means and the common variance of both groups were set as (μ_i1, μ_i2) ∈ (3, 5) and σ² = 0.1, respectively. Each data set represented the gene expression profiles of G = 1000 genes with n = (n₁ + n₂) samples. The proportion of DE genes (pDEG) was set to 0.02 and 0.06 for each of the 100 data sets. We computed average values of the performance measures (TPR, TNR, FPR, FNR, MER, FDR and AUC) based on the 20 and 60 DE genes estimated by the five methods (T-test, SAM, LIMMA, PCout and Proposed) for each of the 100 data sets. Fig 1a and 1b show the ROC curves based on the 20 DE genes estimated by the five methods for the small- and large-sample cases, respectively. We observe that the proposed method performed better than the other four methods in the small-sample case (Fig 1a). In the large-sample case (Fig 1b), the proposed method performs almost equally to the other four methods. Fig 2 shows boxplots of the AUC values over the 100 simulated data sets for each of the five methods in the small- and large-sample cases. Fig 2a and 2b show the boxplots of AUC values with pDEG = 0.02 and 0.06, respectively. These boxplots give results similar to the ROC curves for every pDEG value. We also notice that the performance of the methods increases when pDEG increases from 0.02 to 0.06. The average values of the performance measures based on the 20 (pDEG = 0.02) and 60 (pDEG = 0.06) estimated DE genes are summarized in Table 1. In this table, the values outside and inside the brackets are the averages of the performance measures for the small- and large-sample cases, respectively. Table 1 leads to the same interpretation as the ROC curves and boxplots.
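One simulated data set under the one-way ANOVA model x_ijk = μ_ik + ϵ_ijk can be generated as in the sketch below. Several details are assumptions, because the text does not spell them out: DE genes are given group means 3 and 5, non-DE genes share the mean 3 in both groups, and the DE genes are placed first. The function name and seed are hypothetical.

```python
import numpy as np

def simulate_expression(n_genes=1000, p_deg=0.02, n1=3, n2=3,
                        mu_low=3.0, mu_high=5.0, sigma2=0.1, seed=0):
    """Generate one two-group data set from x_ijk = mu_ik + eps_ijk."""
    rng = np.random.default_rng(seed)
    n_de = int(round(n_genes * p_deg))
    is_de = np.zeros(n_genes, dtype=bool)
    is_de[:n_de] = True                         # assumption: DE genes come first
    mu1 = np.full(n_genes, mu_low)              # group 1 means
    mu2 = np.where(is_de, mu_high, mu_low)      # group 2 means differ only for DE genes
    sd = np.sqrt(sigma2)
    group1 = mu1[:, None] + rng.normal(scale=sd, size=(n_genes, n1))
    group2 = mu2[:, None] + rng.normal(scale=sd, size=(n_genes, n2))
    return group1, group2, is_de

# Small-sample case with pDEG = 0.02, i.e. 20 DE genes out of 1000.
g1, g2, is_de = simulate_expression()
mean_shift_de = float((g2[is_de].mean(axis=1) - g1[is_de].mean(axis=1)).mean())
mean_shift_null = float((g2[~is_de].mean(axis=1) - g1[~is_de].mean(axis=1)).mean())
```

Calling the function with `n1=15, n2=15` or `p_deg=0.06` gives the other settings described above; repeating it with 100 different seeds would reproduce the replication scheme.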

Fig 1. Performance evaluation using ROC curves produced by the five methods (T-test, SAM, LIMMA, PCout and Proposed) based on 100 datasets with pDEG = 0.02.

Datasets were generated from the normal distribution for (a) and (b) and from the t-distribution for (c) and (d); (a) and (c) represent the ROC curves for the small-sample case (n₁ = n₂ = 3), and (b) and (d) represent the ROC curves for the large-sample case (n₁ = n₂ = 15).

https://doi.org/10.1371/journal.pone.0217027.g001

Fig 2. Performance evaluation using boxplots of AUC values produced by the five methods (T-test, SAM, LIMMA, PCout and Proposed) based on 100 datasets generated from the normal distribution for small- and large-sample cases.

(a) Boxplot of AUC values with proportion of DE genes = 0.02. (b) Boxplot of AUC values with proportion of DE genes = 0.06. Each dataset contains G = 1000 genes.

https://doi.org/10.1371/journal.pone.0217027.g002

Table 1. Performance evaluation of different methods based on simulated gene expression dataset generated from normal distribution.

https://doi.org/10.1371/journal.pone.0217027.t001

Simulated gene expression profiles generated from t- distribution

We also investigated the performance of the proposed method in comparison with the other four methods for the non-normal case. Accordingly, we generated 100 simulated data sets from the t-distribution with 10 degrees of freedom. We set the mean and variance as previously mentioned. We estimated the performance measures (TPR, TNR, FPR, FNR, MER, FDR and AUC) based on the 20 DE genes estimated by each method for each of the 100 data sets. The average values of the performance measures are summarized in Table 2. From this table we note that the performance of all methods deteriorates when the data sets come from the t-distribution. We also observed that the proposed method performed better than the other four methods. For example, the proposed method produces AUC = 0.469 (0.887), which is larger than 0.316 (0.830), 0.326 (0.832), 0.411 (0.880) and 0.316 (0.830) for the competitors T-test, SAM, LIMMA and PCout, respectively. The boxplots in Fig 3 and the ROC curves in Fig 1(c) and 1(d) show results similar to Table 2. We also noticed from the boxplots that the proposed method has less variability than the other four methods. From this analysis we may conclude that the proposed method improves on the four well-known gene selection methods.

Fig 3. Performance evaluation using boxplots of AUC values produced by the five methods (T-test, SAM, LIMMA, PCout, and Proposed) based on 100 data sets generated from the t-distribution for small- and large-sample cases.

(a) Boxplot of AUC values with proportion of DE genes = 0.02. (b) Boxplot of AUC values with proportion of DE genes = 0.06. Each data set contains G = 1000 genes.

https://doi.org/10.1371/journal.pone.0217027.g003

Table 2. Performance evaluation of different methods based on simulated gene expression data set generated from t-distribution.

https://doi.org/10.1371/journal.pone.0217027.t002

Application to colon cancer microarray data

The data consist of expression levels of 2000 genes obtained from a microarray study on 62 colon tissue samples collected from colon-cancer patients [33]. Among the 62 colon tissues, tumor tissues (40) and normal tissues (22) were coded as 2 and 1, respectively. The goal here is to characterize the underlying interactions between genetic markers for their association with colon-cancer patients and healthy persons. In the simulation studies, we observed that the multivariate approaches (PCOut and the proposed method, KCCOut) performed better than the univariate approaches. In addition to PCOut and KCCOut, we applied linear CCA (CCOut, labeled LCCOut in the figures) to the colon cancer data. We used each of these three methods to calculate the influence value of each gene. Fig 4 shows the absolute influence values for the 2000 genes. Applying the outlier detection technique to the one-dimensional influence values of each method, we obtained 31, 133 and 210 genes using PCOut, CCOut and KCCOut, respectively. To compare the selected genes, we made a Venn diagram of the genes selected by the three methods (Fig 5). From this figure, we observed that the numbers of genes selected exclusively by PCOut, LCCOut, and KCCOut are 19, 61, and 144, respectively. The numbers of genes shared only by PCOut and LCCOut, by PCOut and KCCOut, and by LCCOut and KCCOut were 7, 1, and 61, respectively. All methods selected 4 common genes: J00231, T57780, M94132 and M87789.
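As a consistency check, the Venn diagram regions reported above add up to the per-method totals, provided the pairwise counts are read as genes shared by exactly two methods (an interpretation, not something the text states explicitly):

```latex
\begin{aligned}
\text{PCA based:}        \quad & 19 + 7 + 1 + 4 = 31, \\
\text{linear CCA based:} \quad & 61 + 7 + 61 + 4 = 133, \\
\text{kernel CCA based:} \quad & 144 + 1 + 61 + 4 = 210.
\end{aligned}
```

Under this reading, 144 of the 210 genes found by the kernel CCA based method (about 69%) are found by neither of the other two methods.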

Fig 4. The influence value of genes using three methods: The principal components analysis (PCOut), the linear canonical correlation analysis (LCCOut), and the kernel canonical correlation analysis (KCCOut).

https://doi.org/10.1371/journal.pone.0217027.g004

Fig 5. The Venn diagram of the selected genes using three methods: The principal components analysis (PCOut), the linear canonical correlation analysis (LCCOut), and the kernel canonical correlation analysis (KCCOut).

https://doi.org/10.1371/journal.pone.0217027.g005

Genes do not function alone; rather, they interact with each other. When genes share a similar set of gene ontology (GO) terms, they are more likely to be involved in similar biological mechanisms. To verify this, we extracted the GO biological process categories and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway annotations of the 210 genes detected by the proposed KCCA method using the Database for Annotation, Visualization and Integrated Discovery (DAVID) [35]. The GO analysis revealed that most of the genes are significantly enriched in biological adhesion, cell adhesion, viral process, multi-organism cellular process, regulation of cellular amide metabolic process, etc. (see supplementary S1 Table). Table 3 presents the KEGG pathway analysis. From the table, we found that these genes are mostly enriched in toxoplasmosis, antigen processing and presentation, proteoglycans in cancer, neurotrophin signaling pathway, small cell lung cancer, etc. (also see supplementary S2 Table). We also constructed the gene-gene interaction networks using STRING [36]. STRING imports protein association knowledge from databases of both physical interactions and curated biological pathways. In STRING, the simple interaction unit is the functional relationship between two proteins or genes that can contribute to a common biological purpose. Fig 6 shows the gene-gene network based on the protein interactions among the selected 210 genes. In this figure, the color saturation of the edges represents the confidence score of a functional association. Further network analysis shows that the number of nodes, number of edges, average node degree and clustering coefficient are 75, 214, 5.71 and 0.473, respectively (p ≤ 8.22 × 10⁻¹⁵). This network of genes has significantly more interactions than expected, which indicates that the genes may function in a concerted effort.
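The reported average node degree agrees with the node and edge counts, assuming the usual definition for an undirected network in which each edge contributes to the degree of two nodes:

```latex
\bar{d} = \frac{2E}{N} = \frac{2 \times 214}{75} = \frac{428}{75} \approx 5.71 .
```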

Fig 6. The network of the selected genes by the proposed method of colon cancer microarray data.

https://doi.org/10.1371/journal.pone.0217027.g006

Table 3. Top ten significant KEGG pathways for the 210 genes detected by the proposed method for Colon cancer data set.

https://doi.org/10.1371/journal.pone.0217027.t003

The proposed method can be applied to the study of other disease processes where two-view data analysis is a common task. To confirm this, we applied the proposed method to another real data set: an RNA-sequencing study of osteoporosis risk (Source: Tulane Center of Bioinformatics and Genomics). The details of the data and the results are provided in the supplementary material, S1 File.

In addition, the data set was used to classify colon cancer patients and healthy controls via PCOut and the CCA based feature extraction techniques (CCOut and KCCOut), followed by two classifiers: k-nearest neighbors (KNN) and a linear support vector machine (SVM). We considered the 31, 133 and 210 features that have influence effects according to PCOut, CCOut and KCCOut, respectively. PCOut, CCOut, and KCCOut serve as feature extraction tools, on the basis of which the classifier separates patients from healthy controls. Table 4 presents the classification error using cross-validation (2-fold and 5-fold). From these results, it is evident that the KCCOut based classification is significantly more accurate than the other methods as well as methods using all features, demonstrating that the proposed method is a better tool for feature extraction.
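The evaluation scheme, a cross-validated classification error computed on all features versus a selected subset, can be sketched as below. This is not the authors' pipeline: the k-NN classifier is written from scratch, the data are synthetic (62 samples, 2000 features, with the first 210 features made informative by construction), and the labels are coded 0/1 rather than 1/2. The function names are hypothetical.

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k=3):
    """Majority vote among the k nearest training samples (labels 0/1)."""
    d = np.linalg.norm(test_X[:, None, :] - train_X[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    votes = train_y[nearest]
    return (votes.mean(axis=1) > 0.5).astype(int)

def cv_error(X, y, n_folds=5, k=3, seed=0):
    """Misclassification rate estimated by n_folds-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = 0
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        pred = knn_predict(X[train_idx], y[train_idx], X[test_idx], k)
        errors += int(np.sum(pred != y[test_idx]))
    return errors / len(y)

# Synthetic stand-in for the colon data: 22 "normal" (0) and 40 "tumor" (1) samples.
rng = np.random.default_rng(0)
y = np.array([0] * 22 + [1] * 40)
X = rng.normal(size=(62, 2000))
X[y == 1, :210] += 1.0           # assumption: the first 210 features carry the signal

error_all = cv_error(X, y, n_folds=5)
error_selected = cv_error(X[:, :210], y, n_folds=5)
```

In a real analysis the feature subset must be chosen inside each training fold; selecting features on the full data before cross-validation, as this toy example effectively does, gives optimistic error estimates.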

Table 4. The classification error of discriminating colon cancer patients from healthy controls with cross-validations.

https://doi.org/10.1371/journal.pone.0217027.t004

Discussion

Kernel based machine learning methods are vital for biomedical data analysis. Kernel based methods provide more powerful and reproducible outputs, although the interpretation of the results remains challenging. In this paper, a gene shaving method based on the influence function of kernel CCA is proposed. The performance of the proposed method was evaluated on both simulated and real data sets. The extensive simulation studies show the power gained by the proposed method relative to the alternative methods. The utility of the proposed method is further demonstrated by its application to cancer microarray data, e.g., colon cancer microarray data. The proposed method ranks genes according to their influence values, and the genes identified are highly related to the disease. Using a distribution based outlier detection method, the proposed method extracts 210 genes out of 2000 genes, which are considered to have a significant impact on the patients. Incorporating biological knowledge (e.g., GO) can provide additional evidence for the results. By conducting GO, pathway, and network analyses including visualization, we find evidence that the selected genes have a significant influence on the manifestation of colon cancer and can serve as a distinct feature for the stratification of colon cancer patients from healthy controls. This novel method is applicable to the study of other disease processes, including other cancers, where gene shaving is a common task.

Supporting information

S1 Table. GO biological process categories for 210 genes for Colon cancer data set.

https://doi.org/10.1371/journal.pone.0217027.s001

(XLSX)

S2 Table. KEGG (whole) Pathways for 210 genes for Colon cancer data set.

https://doi.org/10.1371/journal.pone.0217027.s002

(XLSX)

S1 File. The details of the RNA-seq data and its results.

https://doi.org/10.1371/journal.pone.0217027.s003

(PDF)

References

1. Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC, Botstein D, Brown P. ‘Gene shaving’ as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology. 2000; 1(2):1–21.
2. Hira ZM, Gillies DF. A review of feature selection and feature extraction methods applied on microarray data. Advances in Bioinformatics. 2015; ID 198363, 13 pages.
3. Ruan L, Yuan M. An empirical bayes’ approach to joint analysis of multiple microarray gene expression studies. Biometrics. 2011; 67, 1617–1626. pmid:21517790
4. Sheng J, Deng HW, Calhoun VD, Wang YP. Integrated analysis of gene expression and copy number data on gene shaving using independent component analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2011; 8(6), 1568–1579. pmid:21519112
5. Chen X, Ishwaran H. Random forests for genomic data analysis. Genomics 2012; 99, 323–329. pmid:22546560
6. Castellanos-Garzón J, Romos J. A gene selection approach based on clustering for classification task in colon cancer. Advances in distributed computing and artificial intelligence journal. 2015; 4(3), 1–10.
7. Hofmann T, Schölkopf B, Smola AJ. Kernel methods in machine learning. The Annals of Statistics. 2008; 36, 1171–1220.
8. Alam MA. Fukumizu K. Hyperparameter selection in kernel principal component analysis. Journal of Computer Science. 2014; 10(7), 1139–1150.
- View Article
- Google Scholar
9. Charpiat G, Hofmann M, Schölkopf B. Kernel methods in medical imaging, Chapter 4, Berlin, Germany, Springer, 2015.
10. Akaho S. A kernel method for canonical correlation analysis. International meeting of psychometric Society. 2001;35, 321–377.
- View Article
- Google Scholar
11. Alam MA, Fukumizu K. Higher-order regularized kernel canonical correlation analysis. International Journal of Pattern Recognition and Artificial Intelligence. 2015; 29(4), 1551005 (1–24).
- View Article
- Google Scholar
12. Alam MA, Fukumizu K. Higher-order regularized kernel CCA. In the 12th International Conference on Machine Learning and Applications, Miami, USA. 2013; 374-377.
13. Hampel FR, Ronchetti EM, Rousseuw PJ, Stahel WA. Robust Statistics: the approach based on influence functions. John Wiley & Sons, New York, 2011.
14. Debruyne M, Hubert M, Horebeek JV. Detecting influential observations in kernel PCA. Computational Statistics and Data Analysis. 2010; 54, 3007–3019.
- View Article
- Google Scholar
15. Filzmoser P. Maronna R. and Werner M. Outlier identification in high dimensions. Computational Statistics and Data Analysis. 2008; 52, 1694–1711.
- View Article
- Google Scholar
16. Alam MA, Nasser M. Fukumizu K. Sensitivity analysis in robust and kernel canonical correlation analysis. In proceedings of the 11th International Conference on Computer and Information Technology, Bangladesh, IEEE. 2008; 399–404.
17. Alam MA, Calhoun, V. and Wang, Y-P. (2016). Influence function of multiple kernel canonical analysis to identify outliers in imaging genetics data. In proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB’16, Seattle, USA. 2016; 210–219.
18. Alam MA, Fukumizu K, Wang YP. Infuence function and robust variant of kernel canonical correlation analysis. Neurocomputing. 2018; 304, 12–29. pmid:30416263
- View Article
- PubMed/NCBI
- Google Scholar
19. Aronszajn N. Theory of reproducing kernels. Transactions of the American Mathematical Society. 1950; 68, 337–404.
- View Article
- Google Scholar
20. Bach FR, Jordan MI. Kernel independent component analysis. Journal of Machine Learning Research. 2002; 3, 1–48.
- View Article
- Google Scholar
21. Alzate C, Suykens JAK. A regularized kernel CCA contrast function for ICA. Neural Networks. 2008; 21, 170–181. pmid:18280110
- View Article
- PubMed/NCBI
- Google Scholar
22. Huang SY, Lee M, Hsiao CK. (2009b). Nonlinear measures of association with kernel canonical correlation analysis and applications. Journal of Statistical Planning and Inference. 2009; 139, 2162–2174.
- View Article
- Google Scholar
23. Richfield O, Alam MA, Calhoun V, Wang YP. Learning schizophrenia imaging genetics data via multiple kernel canonical correlation analysis. In proceedings- 2016 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2016, Shenzhen, China. 2017; 5, 507–5011.
24. Tanaka Y. Sensitivity analysis in principal component analysis: in uence on the subspace spanned by principal components. Communications in Statistics-Theory and Methods. 1988; 17(9), 3157–3175.
- View Article
- Google Scholar
25. Tanaka Y. Inuence functions related to eigenvalue problem which appear in multivariate analysis. Communications in Statistics-Theory and Methods. 1989; 18(11), 3991–4010.
- View Article
- Google Scholar
26. Romanazzi M. Inuence in canonical correlation analysis. Psychometrika. 1992; 57(2), 237–259.
- View Article
- Google Scholar
27. Sriperumbudur BK, Fukumizu K, Gretton A, Lanckriet GRG. Schölkopf B. Kernel choice and classifiability for RKHS embeddings of probability distributions. In Advances in Neural Information Processing Systems. 2009; 21, 1750–1758.
- View Article
- Google Scholar
28. Gretton A, Fukumizu K, Teo CH, Song L, Schölkopf B, Smola A. A kernel statistical test of independence. In Advances in Neural Information Processing Systems. 2008; 20, 585–592.
- View Article
- Google Scholar
29. Song L, Smola A, Gretton A, Bedo J. Borgwardt K. Feature selection via dependence maximization. Journal of Machine Learning Research. 2012; 13, 1393–1434.
- View Article
- Google Scholar
30. Jeanmougin M, de Reynies A, Marisa L, Passard C, Nuel G, Guedj M. Should we abandon the t-test in teh analysis of gene expression microarry data: a comparison of variance modeling strategies, PlOS One.2010; 5(9):e12336. pmid:20838429
- View Article
- PubMed/NCBI
- Google Scholar
31. Tusher JG, Tibshirani R, Chu G. Ssignificance analysis of microarrays applied to the ionizing radiation response, Proceedings of the National Academy of Sciences. 2001; 98(9): 5116–21.
- View Article
- Google Scholar
32. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research. 2015; 43(7), e47. pmid:25605792
- View Article
- PubMed/NCBI
- Google Scholar
33. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. In proceedings of the National Academy of Sciences of the United States of America, 1999; 96(12), 6745–6750.
- View Article
- Google Scholar
34. Fundamentals of Biostatistics. 8th edition, Cengage Learning, United States, 2016.
35. Huang DW, Sherman BR, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols. 2009; 4(1), 44–57.
- View Article
- Google Scholar
36. Szklarczyk D, Franceschini A, Wyder S, Forslund K, Heller D, Huerta-Cepas J. et al. STRING v10: Protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Research. 2007; 43, 531–543.
- View Article
- Google Scholar

