Abstract
The focus of analyzing data from microarray experiments has shifted from the identification of individual associated genes to that of associated biological pathways or gene sets. In bioinformatics, a feature selection algorithm is usually used to cope with the high dimensionality of microarray data. In addition to algorithms that use the biological information contained within a gene set a priori to facilitate feature selection, various gene set analysis methods can be applied directly, or modified readily, for the purpose of feature selection. The significance analysis of microarray gene-set reduction (SAM-GSR) algorithm, a novel direction in gene set analysis, is one such method. Here, we explore the feature selection property of SAM-GSR and propose a modification to better achieve the goal of feature selection. In an application to multiple sclerosis (MS) microarray data, both SAM-GSR and our modification of SAM-GSR perform well. Our results show that SAM-GSR can indeed carry out feature selection, and that the modified SAM-GSR outperforms SAM-GSR. Given that pathway information is far from complete, a statistical method capable of constructing biologically meaningful gene networks is of interest. Consequently, both SAM-GSR algorithms will be re-evaluated in our future work, and thus better characterized.
Citation: Zhang L, Wang L, Tian P, Tian S (2016) Identification of Genes Discriminating Multiple Sclerosis Patients from Controls by Adapting a Pathway Analysis Method. PLoS ONE 11(11): e0165543. https://doi.org/10.1371/journal.pone.0165543
Editor: Klaus Brusgaard, Odense University Hospital, DENMARK
Received: February 17, 2016; Accepted: September 13, 2016; Published: November 15, 2016
Copyright: © 2016 Zhang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The first data set is available from the ArrayExpress repository (http://www.ebi.ac.uk/arrayexpress) and stored there as E-MTAB-69. The second data set contains data obtained from a third party (i.e., the sbv IMPROVER challenge). Readers may go to https://sbvimprover.com/challenge-1/challenge/ms-diagnostic to request access to the data.
Funding: This study was supported by the Natural Science Foundation of China (No 31401123 for ST and No 31270758 for PT). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
With the development of major pathway databases, e.g., the Kyoto Encyclopedia of Genes and Genomes (KEGG) [1] and Gene Ontology (GO) [2], the coordinated effect of all genes inside a pathway or gene set on a phenotype has been increasingly explored. These databases organize different types of biological pathway or gene set information and record co-expressed/co-regulated patterns. Consequently, many pathway or gene-set analysis methods have been proposed [3–11]. In this article, the phrases “gene set” and “pathway” are used interchangeably.
Feature selection is usually implemented to cope with the high-dimensionality issue in bioinformatics [12]. It has been shown that when a feature selection method incorporates pathway knowledge, it has better predictive power and more meaningful biological implications [8,13,14]. The supervised group LASSO method proposed by Ma et al [15] is one such method. Briefly, this method consists of two steps. First, LASSO is used to identify relevant genes within each cluster/group. Then the method selects relevant clusters/groups using a group LASSO. In their work, the clusters are generated using a K-means method, and thus are mutually exclusive. In reality, however, it is common for a gene to be involved in many gene sets or pathways. An alternative way to account for pathway knowledge is suggested by [16]. In this algorithm, a pseudo-gene taking the average expression value of all genes inside a gene set is created to represent the whole gene set, and the downstream analysis is then conducted on those pseudo-genes. However, this method is incapable of selecting individual relevant genes.
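The pseudo-gene construction of [16] is straightforward to sketch. The analyses in this paper were carried out in R, but the following minimal Python illustration conveys the idea (function and variable names are ours, not from the original software):

```python
import numpy as np

def pseudo_genes(expr, gene_sets):
    """Collapse an expression matrix into one pseudo-gene per gene set
    by averaging the expression rows of the set's member genes.

    expr: dict mapping gene symbol -> 1-D array of expression values
          across samples.
    gene_sets: dict mapping set name -> list of member gene symbols.
    """
    out = {}
    for set_name, members in gene_sets.items():
        rows = [expr[g] for g in members if g in expr]
        if rows:  # skip gene sets with no measured members
            out[set_name] = np.mean(rows, axis=0)
    return out
```

The downstream classifier then runs on these set-level pseudo-features instead of individual genes, which is exactly why this approach cannot select individual relevant genes.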
A novel direction of gene set analysis was proposed by [17], which aims at further reducing a significant gene set to a core subset. This reduction to a smaller core subset is essential for understanding the underlying biological mechanisms. The method proposed by [17] was named significance analysis of microarray-gene set reduction (SAM-GSR). The issue addressed by SAM-GSR is also of interest for a feature selection algorithm, which motivated us to carry out feature selection using the SAM-GSR algorithm.
Multiple sclerosis (MS) is the most prevalent demyelinating disease and the principal cause of neurological disability in young adults [18]. Currently, MS can only be confirmed using invasive and expensive tests such as magnetic resonance imaging (MRI). Therefore, researchers are searching for an easier and cheaper diagnosis of MS with the aid of other technologies such as microarrays [19–21]. However, the number of microarray experiments on MS is limited and the sample sizes of those studies are predominantly small [22]. Consequently, a feature selection algorithm that downsizes the number of genes under consideration to a manageable scale is highly desirable for the classification of MS samples.
As part of the recently launched Systems Biology Verification (sbv) Industrial Methodology for Process Verification in Research (IMPROVER) challenge [23], the MS sub-challenge specifically targeted the utilization of gene expression data for the purpose of MS diagnosis. Among the top-ranked participants in this sub-challenge, two used methods accounting for pathway knowledge. First, Lauria [24] used Cytoscape [25] to construct two separate clusters/networks to discriminate MS samples from controls. Since modeling parsimony is not a concern in this method, the resulting signature might not be applicable in the clinical setting. Second, Zhao et al [26] implemented the method of Chuang et al [16] and generated one pseudo-gene for each pathway by averaging the expression values of all genes in that pathway. A logistic regression with elastic net regularization was then fitted on the resulting pseudo-features. This method was shown to be inferior to the regularized logistic regression model on individual genes.
In this paper, we apply SAM-GSR to MS microarray data to explore if SAM-GSR can be used for the purpose of feature selection. Also, we propose an extension to SAM-GSR that explicitly accomplishes feature selection.
Materials and Methods
Experimental data
We considered two microarray datasets in this study. The first included arrays from experiment E-MTAB-69, stored in the ArrayExpress [27] repository (http://www.ebi.ac.uk/arrayexpress). All samples were hybridized on Affymetrix HGU133 Plus 2.0 chips. In this study, there were 26 patients with relapsing-remitting multiple sclerosis (RRMS) and 18 controls with neurological disorders of a non-inflammatory nature. The second dataset was provided by the IMPROVER MS sub-challenge and is accessible on the project website (http://www.sbvimprover.com). It was also hybridized on Affymetrix HGU133 Plus 2.0, with 28 RRMS patients and 32 normal controls.
Gene sets were downloaded from the Molecular Signatures Database (MSigDB) [5]. We considered both the c2 and c5 categories. The c2 category includes gene sets from curated pathway databases such as KEGG, together with sets manually curated from the gene expression literature. The current version (version 4.0) of the MSigDB c2 category includes 4,722 gene sets annotating 11,844 unique genes. The c5 category includes 1,454 gene sets annotated by GO terms.
Data pre-processing
Raw data of the first dataset (E-MTAB-69) were downloaded from the ArrayExpress repository, and expression values were obtained using the GCRMA algorithm [28] and normalization across samples was carried out using quantile normalization. The resulting expression values were on log2 scale. When there were multiple probe sets representing the same gene, the one with the largest fold change was chosen. Then the resulting expression values of 19,851 unique genes were fed into downstream analysis. Raw data of the second set were downloaded from the sbv challenge website, and were separately pre-processed in the same way.
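The probe-set-to-gene collapsing rule described above (when several probe sets map to one gene, keep the one with the largest fold change) can be sketched as follows. This is an illustrative Python version under our own naming, assuming expression values are already on the log2 scale:

```python
import numpy as np

def collapse_probesets(expr, probe_to_gene, case_idx, ctrl_idx):
    """For each gene measured by several probe sets, keep the probe set
    with the largest absolute log2 fold change between the two groups.

    expr: dict probe-set id -> 1-D array of log2 expression values.
    probe_to_gene: dict probe-set id -> gene symbol.
    case_idx, ctrl_idx: integer index arrays for the two sample groups.
    """
    best = {}  # gene -> (|log2 fold change|, expression row)
    for probe, row in expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # unannotated probe set
        fc = abs(row[case_idx].mean() - row[ctrl_idx].mean())
        if gene not in best or fc > best[gene][0]:
            best[gene] = (fc, row)
    return {g: row for g, (fc, row) in best.items()}
```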
Statistical Methods
SAM-GSR.
SAM-GSR is an extension of the SAM-GS algorithm [29], with the objective of identifying the core gene subset within each selected pathway. It consists of two steps: SAM-GS to select relevant pathways, and the reduction step to obtain the core subset. In the SAM-GS step, the following statistic, named SAM-GS, is defined for gene set j,

$$\mathrm{SAMGS}_j = \sum_{i=1}^{|j|} d_i^2,$$

where $d_i$ is the SAM statistic [30] of gene $i$, calculated for each gene in gene set $j$ as

$$d_i = \frac{\bar{x}_{i,d} - \bar{x}_{i,c}}{s(i) + s_0},$$

where $\bar{x}_{i,d}$ and $\bar{x}_{i,c}$ are the sample averages of gene $i$ for the diseased and control groups, respectively. The parameter $s(i)$ is a pooled standard deviation, estimated by pooling the samples of the two groups; $s_0$ is a small positive constant used to offset the small variability in microarray expression measurements; and $|j|$ represents the number of genes within gene set $j$. Basically, the SAM-GS statistic for a gene set is the squared L2 norm of the SAM statistics over all genes within the gene set.
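As a concrete illustration of these two quantities, here is a hedged Python sketch. Note that the real SAM procedure chooses $s_0$ from the data; the fixed default below is ours, for illustration only:

```python
import numpy as np

def sam_stat(x_case, x_ctrl, s0=0.1):
    """SAM-like statistic d_i for one gene: group-mean difference over
    (pooled standard error + s0). s0 = 0.1 is an arbitrary illustrative
    value; SAM estimates it from the data."""
    n1, n2 = len(x_case), len(x_ctrl)
    pooled = np.sqrt(
        ((n1 - 1) * x_case.var(ddof=1) + (n2 - 1) * x_ctrl.var(ddof=1))
        / (n1 + n2 - 2) * (1.0 / n1 + 1.0 / n2)
    )
    return (x_case.mean() - x_ctrl.mean()) / (pooled + s0)

def sam_gs(d_values):
    """SAM-GS statistic of a gene set: sum of squared SAM statistics."""
    return float(np.sum(np.asarray(d_values) ** 2))
```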
Inside a significant gene set $S$ (whose statistical significance is estimated using a permutation test that perturbs the phenotype labels several hundred times), the reduction step gradually partitions the entire set $S$ into two subsets: the reduced subset $R_k$ and the residual subset $S \setminus R_k$, for $k = 1, \ldots, |j|$. After ordering the genes in gene set $j$ increasingly by the p-values of their SAM statistics, the first $k$ genes are enrolled into $R_k$. Let $c_k$ be the SAM-GS p-value of the residual subset $S \setminus R_k$; the final size of $R_k$ is set as the smallest $k$ at which $c_k$ exceeds a pre-determined threshold for the first time. For more details on the SAM-GSR algorithm, see the original work [17]. In addition, Fig 1A provides a graphical elucidation.
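The reduction loop can be sketched as below. This is our illustrative Python rendering, not the authors' R code; `residual_pval` is a hypothetical helper standing in for the permutation-based SAM-GS p-value of a residual subset:

```python
def reduce_gene_set(genes, sam_pvals, residual_pval, threshold=0.05):
    """Sketch of the SAM-GSR reduction step: order genes by the p-values
    of their SAM statistics, grow R_k one gene at a time, and stop at the
    smallest k whose residual subset (the set with R_k removed) is no
    longer significant.

    sam_pvals: dict gene -> p-value of its SAM statistic.
    residual_pval: callable taking a list of genes and returning the
                   SAM-GS permutation p-value of that subset (assumed).
    """
    order = sorted(genes, key=lambda g: sam_pvals[g])
    for k in range(1, len(order) + 1):
        residual = order[k:]
        if not residual or residual_pval(residual) > threshold:
            return order[:k]  # core subset R_k
    return order
```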
A. The SAM-GSR algorithm. B. The modified SAM-GSR algorithm.
When using the SAM-GSR algorithm to carry out feature selection, ck can be regarded as a tuning parameter. Its optimal cutoff value is determined by conducting a sensitivity analysis in which a grid of values (i.e., 0.05 to 0.5 with an increment of 0.05) is considered. For each value, a support vector machine (SVM) [31] on the genes inside the resulting reduced subsets is fitted to calculate the misclassification error, i.e., the number of samples falsely classified divided by the total sample size, on the training set. The optimal cutoff value of ck is the one having the minimal misclassification error and the least number of selected genes. Lastly, we fit an SVM model on the selected genes with ck set at the optimal cutoff, and evaluate the predictive performance of this final model using the test set.
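The cutoff search described above amounts to a small grid search with a tie-break on model size. A hedged Python sketch follows; `reduce_fn` and `train_error_fn` are hypothetical helpers standing in for the reduction step and the SVM training-error computation:

```python
def choose_cutoff(cutoffs, reduce_fn, train_error_fn):
    """Pick the cutoff with minimal training misclassification error,
    breaking ties by the smaller number of selected genes (a sketch).

    reduce_fn: callable cutoff -> list of selected genes (assumed).
    train_error_fn: callable gene list -> training error (assumed).
    """
    best = None
    for c in cutoffs:
        genes = reduce_fn(c)            # reduced subsets at cutoff c
        err = train_error_fn(genes)     # SVM misclassification error
        key = (err, len(genes))         # tuple order encodes the tie-break
        if best is None or key < best[0]:
            best = (key, c, genes)
    return best[1], best[2]
```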
Modified SAM-GSR.
In SAM-GSR, whether a gene is selected into the core reduced subset Rk depends on the magnitude of its SAM statistic. This implies that if |di| > |dk| for genes i and k in a gene set, gene k can enter the reduced subset only if gene i is already in it. When the goal is feature selection, however, the magnitude of an individual SAM statistic might not matter so critically.
In this study, we propose to use a penalized machine learning method to perform feature selection and classify samples simultaneously. Because SVM is one of the most widely used supervised learning methods, and is especially suitable for two-class classification tasks on microarray data [32], we propose to use an SVM with a Smoothly Clipped Absolute Deviation (SCAD) [33,34] penalty for feature selection. In a linear SVM model, the subjects from two distinct classes are separated by the hyperplane

$$w^T x + b = 0,$$

where $x = (x_1, \ldots, x_G)$ are the gene expression profiles, and $x_i$ ($i = 1, \ldots, G$), a vector of length $n$, represents gene $i$'s expression profile for the $n$ patients ($n$ is the sample size and $G$ is the number of genes under consideration). The class label $y$ takes values in $\{-1, 1\}$, and $w = (w_1, \ldots, w_G)$ are the coefficients of the gene expression values, representing the contribution of those genes to the hyperplane. An SVM model aims at finding the optimal hyperplane with maximal margin, which can be obtained by solving

$$\min_{w, b} \; \frac{1}{n} \sum_{s=1}^{n} \left[1 - y_s \left(w^T x^{(s)} + b\right)\right]_{+} + \mathrm{pen}_\lambda(w),$$

where $x^{(s)}$ and $y_s$ denote the expression profile and class label of sample $s$, and the penalty term $\mathrm{pen}_\lambda(w)$ is the sum of the SCAD penalty function over all coefficients. The SCAD penalty for coefficient $w_i$ is defined by [34] as

$$p_\lambda(w_i) = \begin{cases} \lambda |w_i| & \text{if } |w_i| \le \lambda, \\[4pt] -\dfrac{|w_i|^2 - 2\alpha\lambda|w_i| + \lambda^2}{2(\alpha - 1)} & \text{if } \lambda < |w_i| \le \alpha\lambda, \\[4pt] \dfrac{(\alpha + 1)\lambda^2}{2} & \text{if } |w_i| > \alpha\lambda, \end{cases}$$

where both $\alpha$ and $\lambda$ are tuning parameters. For small coefficients, SCAD behaves like the L1/LASSO penalty [35], shrinking those coefficients to zero. For large coefficients, however, its constant penalty produces smaller bias in the estimates. SVM-SCAD is implemented in the R penalizedSVM package [36]. The default value for $\alpha$ is 3.7, and $\lambda$ is optimized over the grid $2^{-8}, 2^{-7}, \ldots, 2^{14}$ via 5-fold cross-validation (CV).
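The piecewise SCAD penalty is easy to verify numerically. Below is a Python transcription of the Fan-and-Li formula with the conventional default α = 3.7 (the surrounding analysis used the R penalizedSVM package; this sketch is only for illustrating the penalty's shape):

```python
def scad_penalty(w, lam, a=3.7):
    """SCAD penalty of Fan & Li (2001) for a single coefficient w.
    lam is the tuning parameter lambda; a corresponds to alpha in the
    text, with the conventional default 3.7."""
    w = abs(w)
    if w <= lam:                    # LASSO-like region: linear in |w|
        return lam * w
    if w <= a * lam:                # quadratic transition region
        return (2 * a * lam * w - w ** 2 - lam ** 2) / (2 * (a - 1))
    return (a + 1) * lam ** 2 / 2   # constant region: no extra shrinkage
```

The three branches join continuously at |w| = λ and |w| = αλ, which is the "smoothly clipped" property that lets SCAD shrink small coefficients like LASSO while leaving large coefficients nearly unpenalized.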
The procedure in which an SVM-SCAD model is implemented to select features, with the genes under consideration restricted to those inside the significant gene sets identified by SAM-GS, is referred to herein as modified SAM-GSR. Fig 1 graphically elucidates both the SAM-GSR and modified SAM-GSR algorithms.
Statistical Metrics
Usually, using a single metric to evaluate an algorithm introduces biases. An algorithm may be erroneously claimed to be superior if a metric in favour of it is chosen or to be inferior if an unfavourable metric is used [23]. To avoid such biases, we used four metrics, namely, Belief Confusion Metric (BCM), Area Under the Precision-Recall Curve (AUPR), Generalized Brier Score (GBS), and error rate to evaluate the performance of a classifier.
Specifically, GBS is defined using the equation by Yeung et al [37], divided by the sample size n,

$$\mathrm{GBS} = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} \left(Y_{ik} - p_{ik}\right)^2,$$

where $Y_{ik}$ (1 if subject $i$ belongs to class $k$, and 0 otherwise) are indicator functions for class $k$ ($k = 1, \ldots, K$), and $p_{ik}$ denotes the predicted probability that $Y_{ik} = 1$. GBS lies in the interval (0,1), with values closer to zero indicating better predictive performance. For a more detailed description of GBS, see the work by [37,38].
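A direct Python transcription of this score (our own sketch; the paper's computations were done in R):

```python
import numpy as np

def gbs(Y, P):
    """Generalized Brier score: sum of squared differences between class
    indicators Y (n x K, 0/1) and predicted probabilities P (n x K),
    divided by the number of samples n."""
    Y, P = np.asarray(Y, float), np.asarray(P, float)
    return float(np.sum((Y - P) ** 2) / len(Y))
```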
BCM and AUPR are the two metrics used by the sbv challenge. As summarized by [39], BCM captures the average belief/confidence that a sample belongs to a class when it indeed belongs to this class. AUPR summarizes the ability to correctly rank the samples known to be in a given class when sorted decreasingly by the belief values for that class. For these two metrics, the closer to 1 they are, the better the classifier.
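AUPR can be computed directly from a ranking of the samples. A basic Python sketch (score ties are not handled specially here, and the real challenge scoring may differ in such details):

```python
def aupr(labels, scores):
    """Area under the precision-recall curve, computed by stepping
    through samples sorted by decreasing score.

    labels: 0/1 class membership indicators.
    scores: belief values; higher means more likely in the class.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    tp = fp = 0
    area, prev_recall = 0.0, 0.0
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        recall = tp / n_pos
        precision = tp / (tp + fp)
        area += precision * (recall - prev_recall)  # step integration
        prev_recall = recall
    return area
```

A perfect ranking (all positives ahead of all negatives) yields an AUPR of 1, matching the interpretation above.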
Statistical language and packages
Statistical analysis was carried out in the R language version 3.1 (www.r-project.org), and R codes for SAM-GSR were downloaded from Dr. Yasui’s webpage (www.ualberta.ca/~yyasui/homepage.html).
Results and Conclusions
The study schema is presented in Fig 2. First, we trained both SAM-GSR and modified SAM-GSR models on E-MTAB-69. The selected pathways and genes by both algorithms are provided in Figs 3 and 4.
Graphical illustration on how to analyze the multiple sclerosis (MS) microarray data.
Gene symbols in purple are the genes indicated as being directly related to MS by the GeneCards database. The overlapped gene symbols between the SAM-GSR and modified SAM-GSR algorithms are in bold.
Gene symbols in purple are the genes indicated as being directly related to MS by the GeneCards database. The overlapped gene symbols between the SAM-GSR and modified SAM-GSR algorithms are in bold.
To evaluate both algorithms, we computed their predictive statistics on the training set (i.e., E-MTAB-69) and the test set (i.e., the sbv test set). As shown in Table 1, the performance of modified SAM-GSR was superior to SAM-GSR on all performance statistics except for one AUPR value (0.612 versus 0.644, using the MSigDB c2 category). Then we reversed the order of these two datasets and reanalyzed them using the sbv MS test set as the training set. The performance statistics for the resulting signatures are given in Table 2. It is observed that the modified SAM-GSR algorithm outperforms the SAM-GSR algorithm with respect to both BCM and AUPR; e.g., using the pathways in the MSigDB c5 category, the modified SAM-GSR achieves a BCM of 0.5 and an AUPR of 0.75, whereas the SAM-GSR algorithm has a BCM of only 0.457 and an AUPR of 0.422.
Interestingly, we observed that the model parsimony of the modified SAM-GSR algorithm suffers when trained on E-MTAB-69, while its parsimony is better than that of the SAM-GSR algorithm when trained on the sbv test set. We note that, because the SAM-GS statistic determines the significance level of a gene set, the decision of whether or not a gene is included in a reduced subset depends mainly on the magnitude of this gene’s SAM statistic and the additive effect of the genes in the reduced subset. Certainly, the number of gene sets in which a gene is involved also plays an important role. When a gene is involved in many gene sets, its likelihood of being selected increases several-fold compared with a gene contained in only one or two gene sets. In contrast, this decision in the modified SAM-GSR algorithm hinges solely on the genes’ contributions to the optimal hyperplane (i.e., the weights) in the final SVM model.
Also, in E-MTAB-69 the controls are patients with neurological disorders of a non-inflammatory nature, so the difference in expression values between MS and control samples in this dataset is not as dramatic as in the sbv test set, in which the controls are normal individuals. Even after adjusting for the batch effect among different experiments using the ComBat algorithm, the difference in expression profiles between a normal control and a control with a non-inflammatory neurological disorder remains distinct. This also explains why the predictive performance when trained on the sbv test set is not satisfying.
Therefore, we hypothesize that the modified SAM-GSR algorithm compromises on model parsimony in order to obtain good predictive performance when trained on E-MTAB-69. While the observation that the number of differentially expressed genes (DEGs) identified in the sbv test set is more than 10 times that in E-MTAB-69 provides some support for this conjecture, further investigation is definitely needed.
Comparison with other relevant signatures
We compared several MS diagnosis signatures in the literature with the ones we obtained using both SAM-GSR algorithms. Here, we only compared the performance of the different signatures on the sbv IMPROVER test set. The performance statistics of those signatures are tabulated in Table 3.
Most relevantly, Guo et al [40] obtained an 8-gene signature using the same training set. This 8-gene signature ranked second worst, outperforming only our original submission to the sbv IMPROVER challenge. Comparing with the top three teams in the sbv MS diagnosis challenge, we note that had we submitted the results of the modified SAM-GSR analysis to the sbv IMPROVER challenge, we would have ranked among the top five.
For the worst-performing signature, our original submission to the sbv challenge, the Threshold Gradient Descent Regularization (TGDR) [41] algorithm was used to conduct feature selection, and the training data included E-MTAB-69 together with five other microarray studies. Among these five microarray experiments, chips from normal controls were included. Here, we reran the TGDR analysis using E-MTAB-69 alone as the training set. The predictive performance improved dramatically, as indicated by the statistics in Table 3. There always exists data dependency for a feature selection algorithm [42]. Additionally, we think that the expression profiles may still be subject to batch effects even though we adjusted for them using the ComBat algorithm [43]. Lastly, the distinct difference between normal controls and controls with other diseases might also play a role.
Further verification using lung adenocarcinoma (AC) datasets
To further evaluate both SAM-GSR algorithms, we applied them to another set of real-world datasets. The objective is to discriminate histologic stage I from stage II lung adenocarcinoma patients. We trained both algorithms on a microarray dataset (GEO accession no. GSE50081) and then evaluated the resulting signatures using 70 AC patients at early stages (i.e., stages I and II) from the RNA-seq data stored in The Cancer Genome Atlas (https://tcga-data.nci.nih.gov/tcga/). In this application, we only considered the pathways in the MSigDB c5 category.
For the RNA-seq data, counts-per-million (CPM) values were calculated and log2-transformed using the voom function [44] in the R limma package. For the microarray data, expression values were obtained using the fRMA algorithm [45], then quantile normalization was carried out and the expression values were log2-transformed.
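The log2-CPM transformation applied by voom can be approximated as follows. This Python sketch mirrors voom's offset convention (count + 0.5, library size + 1) to avoid taking the log of zero, though the real voom additionally estimates mean-variance precision weights for the linear-model step:

```python
import numpy as np

def log2_cpm(counts, prior=0.5):
    """Approximate voom-style log2 counts-per-million.

    counts: genes x samples matrix of raw read counts.
    prior: small count added to every entry (voom uses 0.5).
    """
    counts = np.asarray(counts, float)
    lib = counts.sum(axis=0)                      # per-sample library size
    return np.log2((counts + prior) / (lib + 1.0) * 1e6)
```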
The results for both SAM-GSR algorithms in the AC application are given in Table 4. Moreover, we compared both SAM-GSR algorithms with three other feature selection algorithms, namely, SVM-SCAD, LASSO, and the moderated t-test. These three algorithms are either well known in the field (e.g., LASSO) or closely relevant (e.g., SVM-SCAD). The performance statistics are presented in Table 4 as well. Modified SAM-GSR performs best with respect to GBS and BCM, while SAM-GSR performs worse than SVM-SCAD in terms of predictive error, GBS, and BCM but ranks first in terms of AUPR. Overall, the modified SAM-GSR algorithm is the best among these five methods when all performance statistics are considered together.
Discussion
The results of the real-world applications show that the modified SAM-GSR algorithm has similar or better performance compared with the SAM-GSR algorithm and other novel feature selection algorithms. Moreover, the modified SAM-GSR algorithm has distinct merits. First, it incurs a smaller computational burden, since it applies a penalized SVM once instead of successively evaluating SAM-GS statistics of the reduced subsets. Second, it automatically produces a final model that can be used to calculate a new sample’s posterior probability, whereas SAM-GSR needs an extra application of SVM to obtain such a probability.
To conclude, by incorporating the additional pathway knowledge contained in gene sets, both SAM-GSR algorithms perform well and can indeed be utilized for feature selection. The modified SAM-GSR algorithm has advantages over the SAM-GSR algorithm. In the clinical setting, a feature selection algorithm that downsizes the number of genes to an understandable scale is imperative when using gene expression profiles for diagnostic purposes. Focusing on a smaller number of genes facilitates biological insight into disease processes and thus informs targeted therapies and intervention strategies. Furthermore, feature selection makes it possible to replace a high-throughput microarray technology with cheaper and quicker alternatives such as real-time PCR, thus increasing the applicability of gene biomarkers in routine practice.
As indicated by the simulations in S1 File, both SAM-GSR algorithms have one drawback: when the true markers are involved in only a few gene sets, both algorithms are highly unlikely to identify them. To alleviate or even eliminate this disadvantage, some specific modification of the SAM-GS step is needed. Moreover, the way the SAM-GSR algorithms account for pathway knowledge is obviously not seamless: ignoring pathway topology completely, they weigh heavily on the number of gene sets in which a gene is contained. Future study on these topics is warranted.
Given that pathway information is far from complete, especially for an under-investigated disease such as MS, the de novo construction of biologically meaningful gene networks using a statistical method is recommended. The basic requirement for such a method is that it must take interactions and interplay among genes into account so that a gene can appear in multiple gene sets. Then, using this more appropriate and comprehensive pathway information, both SAM-GSR algorithms will be re-evaluated and better characterized.
Supporting Information
S1 File. Simulations to further evaluate both SAM-GSR algorithms.
https://doi.org/10.1371/journal.pone.0165543.s001
(DOCX)
Author Contributions
- Conceived and designed the experiments: ST PT.
- Analyzed the data: LZ LW ST PT.
- Wrote the paper: ST LZ PT LW.
References
- 1. Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M (1999) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 27: 29–34. pmid:9847135
- 2. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25–29. pmid:10802651
- 3. Luo W, Friedman MS, Shedden K, Hankenson KD, Woolf PJ (2009) GAGE: generally applicable gene set enrichment for pathway analysis. BMC Bioinformatics 10: 161. pmid:19473525
- 4. Kim S, Kon M, DeLisi C (2012) Pathway-based classification of cancer subtypes. Biol Direct 7: 21. pmid:22759382
- 5. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. (2005) Gene set enrichment analysis : A knowledge-based approach for interpreting genome-wide. Proc Natl Acad Sci U S A 102: 15545–15550. pmid:16199517
- 6. Kim SY, Volsky DJ (2005) PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics 6: 144.
- 7. Lim K, Wong L (2014) Finding consistent disease subnetworks using PFSNet. Bioinformatics 30: 189–196. pmid:24292362
- 8. Ma S, Shi M, Li Y, Yi D, Shia B-C (2010) Incorporating gene co-expression network in identification of cancer prognosis markers. BMC Bioinformatics 11: 271. pmid:20487548
- 9. Tsai C-A, Chen JJ (2009) Multivariate analysis of variance test for gene set analysis. Bioinformatics 25: 897–903. pmid:19254923
- 10. Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ (2005) Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci U S A 102: 13544–13549. pmid:16174746
- 11. Wu D, Smyth GK (2012) Camera: a competitive gene set test accounting for inter-gene correlation. Nucleic Acids Res 40: e133. pmid:22638577
- 12. Saeys Y, Inza I, Larrañaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23: 2507–2517. pmid:17720704
- 13. Ma S, Huang J, Shen S (2009) Identification of cancer-associated gene clusters and genes via clustering penalization. Stat Interface 2: 1–11. pmid:20057914
- 14. Huang J, Ma S, Xie H, Zhang C-H (2009) A group bridge approach for variable selection. Biometrika 96: 339–355. pmid:20037673
- 15. Ma S, Song X, Huang J (2007) Supervised group Lasso with applications to microarray data analysis. BMC Bioinformatics 8: 60. pmid:17316436
- 16. Chuang H-Y, Lee E, Liu Y-T, Lee D, Ideker T (2007) Network-based classification of breast cancer metastasis. Mol Syst Biol 3: 140. pmid:17940530
- 17. Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, et al. (2009) Gene-set analysis and reduction. Brief Bioinform 10: 24–34. pmid:18836208
- 18. Fontoura P, Garren H (2010) Multiple sclerosis therapies: Molecular mechanisms and future. Results Probl Cell Differ 51: 259–285. pmid:20838962
- 19. Chabas D, Baranzini SE, Mitchell D, Bernard CC, Rittling SR, Denhardt DT, et al. (2001) The influence of the proinflammatory cytokine, osteopontin, on autoimmune demyelinating disease. Science 294: 1731–1735. pmid:12649465
- 20. Mycko MP, Papoian R, Boschert U, Raine CS, Selmaj KW (2003) cDNA microarray analysis in multiple sclerosis lesions: detection of genes associated with disease activity. Brain 126: 1048–1057. pmid:12690045
- 21. Tajouri L, Fernandez F, Griffiths L (2007) Gene Expression Studies in Multiple Sclerosis. Curr Genomics 8: 181–189. pmid:18645602
- 22. Kemppinen AK, Kaprio J, Palotie A, Saarela J (2011) Systematic review of genome-wide expression studies in multiple sclerosis. BMJ Open 1: e000053. pmid:22021740
- 23. Meyer P, Hoeng J, Rice JJ, Norel R, Sprengel J, Stolle K, et al. (2012) Industrial methodology for process verification in research (IMPROVER): toward systems biology verification. Bioinformatics 28: 1193–1201. pmid:22423044
- 24. Lauria M (2013) Rank-based transcriptional signatures: a novel approach to diagnostic biomarker definition and analysis. Syst Biomed 1: 35–46.
- 25. Shannon P, Markiel A, Ozier O (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13: 2498–2504. pmid:14597658
- 26. Zhao C, Deshwar AG, Morris Q (2013) Relapsing-remitting multiple sclerosis classification using elastic net logistic regression on gene expression data. Syst Biomed 1: 247–253.
- 27. Parkinson H, Sarkans U, Kolesnikov N, Abeygunawardena N, Burdett T, Dylag M, et al. (2011) ArrayExpress update—an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res 39: D1002–D1004. pmid:21071405
- 28. Wu Z, Irizarry RA, Gentleman R, Martinez-Murillo F, Spencer F (2004) A Model-Based Background Adjustment for Oligonucleotide Expression Arrays. J Am Stat Assoc 99: 909–917.
- 29. Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, et al. (2007) Improving gene set analysis of microarray data by SAM-GS. BMC Bioinformatics 8: 242. pmid:17612399
- 30. Tusher VG, Tibshirani R, Chu G (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 98: 5116–5121. pmid:11309499
- 31. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20: 273–297.
- 32. Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16: 906–914. pmid:11120680
- 33. Fan J, Li R (2001) Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. J Am Stat Assoc 96: 1348–1360.
- 34. Zhang HH, Ahn J, Lin X, Park C (2006) Gene selection using support vector machines with non-convex penalty. Bioinformatics 22: 88–95. pmid:16249260
- 35. Tibshirani R (1996) Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B. 58: 267–288.
- 36. Becker N, Werft W, Toedt G, Lichter P, Benner A (2009) PenalizedSVM: A R-package for feature selection SVM classification. Bioinformatics 25: 1711–1712. pmid:19398451
- 37. Yeung KY, Bumgarner RE, Raftery AE (2005) Bayesian model averaging: development of an improved multi-class, gene selection and classification tool for microarray data. Bioinformatics 21: 2394–2402. pmid:15713736
- 38. Tian S, Suárez-Fariñas M (2013) Multi-TGDR: A Regularization Method for Multi-Class Classification in Microarray Experiments. PLoS One 8: e78302. pmid:24260109
- 39. Tarca AL, Than NG, Romero R (2013) Methodological approach from the Best Overall Team in the IMPROVER Diagnostic Signature Challenge. Syst Biomed 1: 1–11.
- 40. Guo P, Zhang Q, Zhu Z, Huang Z, Li K (2014) Mining gene expression data of multiple sclerosis. PLoS One 9: e100052. pmid:24932510
- 41. Friedman JH, Popescu BE (2004) Gradient Directed Regularization for Linear Regression and Classification.
- 42. Boulesteix AL (2010) Over-optimism in bioinformatics research. Bioinformatics 26: 437–439. pmid:19942585
- 43. Johnson WE, Li C, Rabinovic A (2007) Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostat Oxford Engl 8: 118–127.
- 44. Law CW, Chen Y, Shi W, Smyth GK (2014) Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol 15: R29. pmid:24485249
- 45. McCall MN, Irizarry RA (2011) Thawing Frozen Robust Multi-array Analysis (fRMA). BMC Bioinformatics 12: 369. pmid:21923903