SurvExpress: An Online Biomarker Validation Tool and Database for Cancer Gene Expression Data Using Survival Analysis

Validation of multi-gene biomarkers for clinical outcomes is one of the most important issues for cancer prognosis. An important source of information for virtual validation is the high number of available cancer datasets. Nevertheless, assessing the prognostic performance of a gene expression signature across datasets is a difficult task for biologists and physicians and is also time-consuming for statisticians and bioinformaticians. Therefore, to facilitate performance comparisons and validations of survival biomarkers for cancer outcomes, we developed SurvExpress, a cancer-wide gene expression database with clinical outcomes and a web-based tool that provides survival analysis and risk assessment of cancer datasets. The only required input of SurvExpress is the biomarker gene list. We generated a cancer database collecting more than 20,000 samples and 130 datasets with censored clinical information covering tumors in more than 20 tissues. We implemented a web interface to perform biomarker validation and comparisons in this database, where a multivariate survival analysis can be accomplished in about one minute. We show the utility and simplicity of SurvExpress in two biomarker applications for breast and lung cancer. Compared to other tools, SurvExpress is the largest, most versatile, and quickest free tool available. The SurvExpress web tool can be accessed at http://bioinformatica.mty.itesm.mx/SurvExpress (a tutorial is included). The website was implemented in JSP, JavaScript, MySQL, and R.


Introduction
Cancer causes millions of deaths around the world. To improve treatments, several biomarkers have been proposed for risk prognosis and treatment response. Recently published biomarkers in many types of cancer contain numerous genes and are mainly based on gene expression. They have been generated using microarray profiling and, lately, RNA-Seq technologies. Often, identified biomarkers are developed for a specific cancer tissue and its subtypes. In breast cancer, for example, more than 40 biomarkers have been proposed, containing between 3 and 512 genes, whose prognostic or predictive performance depends on therapy, hormone receptor status, and the number of genes [1,2]. On the other hand, assessing the performance of proposed biomarkers in different populations or evaluating competing biomarkers are difficult tasks even though hundreds of public datasets are available. The main limitations are the time and resources needed for acquiring, processing, normalizing, filtering, and statistical modeling of large gene expression datasets. This is important since several of the reasons involved in the failure of biomarkers in clinical trials are related to data analysis [3]. For the analysis of biomarkers, tools such as ITTACA, KMPlot, RecurrenceOnline, bc-GeneExMiner, GOBO, and PrognoScan have been proposed [1,4-9]. However, these tools have serious restrictions (Table 1), complicating and limiting the assessment of multi-gene biomarkers in cancer. Some of the main limitations include considering just one gene at a time or a specific set of genes; focusing on breast or ovarian cancer datasets or on a particular Affymetrix gene expression platform; requiring the upload of Affymetrix gene expression data (.CEL files); and using a single quantity per gene even though some microarray platforms provide more than one probeset per gene.
To solve these issues and to facilitate performance comparisons and validations of prognostic and predictive biomarkers for cancer outcomes, we developed SurvExpress. SurvExpress is a comprehensive gene expression database and web-based tool providing survival analysis and risk assessment in cancer datasets using a biomarker gene list as input. The tool is available at http://bioinformatica.mty.itesm.mx/SurvExpress. The tool includes a tutorial that describes the analysis options, plots, tables, key concepts related to survival analysis, and representative methods to identify biomarkers from gene expression data.

Database Acquisition
Datasets were obtained mainly from GEO (http://www.ncbi.nlm.nih.gov/geo/) and TCGA (https://tcga-data.nci.nih.gov) after searching for keywords related to cancer, survival, and gene expression technologies. Additionally, a few were obtained from authors' websites and from ArrayExpress (http://www.ebi.ac.uk/arrayexpress/). The data source used is shown in the web interface. We favored cancer types with more than two different cohorts and datasets containing survival data for more than 30 samples in which a censoring indicator and time to death, recurrence, relapse, or metastasis were provided. Clinical data were provided by dataset authors via personal email when not available online in the corresponding repositories. Datasets were annotated from provider files as found up to September 2012, and were quantile-normalized and log2 transformed when needed. From TCGA, all datasets were obtained at the gene level (level 3). RNA-Seq count data were log2 transformed. In some cancer types where many datasets were found for the same gene expression platform, we also provide a merged meta-base. In meta-bases, datasets were quantile-normalized; probeset means were equalized, conserving the standard deviation within each cohort; and datasets were merged by probeset id. At the moment we provide meta-bases for breast, lung, and ovarian cancer. To facilitate gene searches and conversions between gene identifiers, human gene information was obtained from the NCBI FTP site (ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz). To simplify the user interface, datasets were grouped by related organ or tissue using disease ontologies [10].
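The two preprocessing steps named above, log2 transformation and quantile normalization, can be sketched as follows. This is a minimal illustration in plain Python, not the pipeline actually used for the database: the function names, the pseudocount added before taking logarithms, and the tie handling (ties broken by row order) are assumptions. The mean-equalization step applied to meta-bases is not covered.

```python
import math

def log2_transform(matrix, pseudocount=1.0):
    """Return log2(value + pseudocount) for every entry.

    The pseudocount is an assumption made here to avoid log2(0) on count data.
    """
    return [[math.log2(v + pseudocount) for v in row] for row in matrix]

def quantile_normalize(matrix):
    """Quantile-normalize the columns (samples) of a probesets x samples matrix.

    Every sample ends up with the same value distribution: the mean of the
    k-th smallest values across samples. Ties are broken by row order.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    # For each sample, the row indices sorted by increasing expression.
    orders = [
        sorted(range(n_rows), key=lambda i, j=j: matrix[i][j])
        for j in range(n_cols)
    ]
    # Reference distribution: average of the k-th ranked value over samples.
    reference = [
        sum(matrix[orders[j][k]][j] for j in range(n_cols)) / n_cols
        for k in range(n_rows)
    ]
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        for rank, i in enumerate(orders[j]):
            result[i][j] = reference[rank]
    return result

# Toy matrix: 3 probesets (rows) x 2 samples (columns), hypothetical values.
toy = [[1.0, 4.0], [2.0, 2.0], [3.0, 6.0]]
normalized = quantile_normalize(toy)
```

After normalization, both toy samples contain the same set of values and differ only in which probeset holds which rank.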

Web Interface Implementation
Two simple and lightweight HTML user interfaces based on JavaServer Pages, JavaScript, R, Ajax, Apache, and MySQL were implemented (Figure 1A). In the Input page, users enter the gene list using NCBI-compatible gene identifiers (official symbol, Entrez, Ensembl, HGNC, or others) and select the target dataset. Users can also choose how to treat genes having more than one probe. The Analysis page extracts the dataset rows related to genes in the biomarker and delivers a web interface. Then, users can assess the biomarker in a variety of ways, including switching specific genes on and off, stratifying samples by available clinical information (e.g. stage, grade, age, biochemical results, and mutation status), specifying training and test samples, and weighting genes instead of using the Cox fitting. The results are displayed in common and flexible publication-ready plots and tables within the Analysis page. A PDF version of the results can also be obtained.
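One of the settings for genes with more than one probe, used later in this paper, is the "maximum row average". A minimal sketch of one plausible reading of that setting (keep the probeset whose mean expression across samples is highest) is given below. The function name, the probeset identifiers, and the expression values are hypothetical, and the exact rule implemented by SurvExpress may differ.

```python
def max_average_probeset(probesets):
    """Pick the probeset with the highest mean expression across samples.

    probesets: dict mapping a probeset id to its list of per-sample values,
    all belonging to the same gene.
    Returns (probeset_id, values) for the selected probeset.
    """
    best_id = max(
        probesets,
        key=lambda pid: sum(probesets[pid]) / len(probesets[pid]),
    )
    return best_id, probesets[best_id]

# Hypothetical probesets measuring the same gene in three samples.
gene_probesets = {
    "probe_a": [5.1, 5.3, 4.9],
    "probe_b": [8.2, 7.9, 8.4],
}
selected_id, selected_values = max_average_probeset(gene_probesets)
```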

Prognostic Index Estimation
The prognostic index (PI), also known as the risk score, is commonly used to generate risk groups. The PI is the linear component of the Cox model [11], $PI = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$, where $x_i$ is the expression value of gene $i$ and the $\beta_i$ can be obtained from the Cox fitting. Each $\beta_i$ can be interpreted as a risk coefficient. SurvExpress implements two procedures to estimate the $\beta$ coefficients. The first procedure is the classical Cox model, in which all genes are included in a single model. The fitting is performed in R (http://cran.r-project.org) using the survival package. In the second procedure, the user can specify a weight for each gene instead of using the values from the Cox fitting. This option is useful for comparisons with biomarkers computed with mathematical models other than Cox.
[Figure 1 caption, truncated in the source: "... Figure 2). The whole process can be achieved in less than one minute for a sensible number of genes." doi:10.1371/journal.pone.0074250.g001]
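Once the coefficients are available, whether from a Cox fit or from user-specified weights, the prognostic index is a weighted sum of expression values per sample. The sketch below illustrates only that second step. The function name, gene symbols, coefficients, and expression values are hypothetical, and the Cox fitting itself (done in R with the survival package) is not reproduced here.

```python
def prognostic_index(expression, coefficients):
    """Compute PI = sum_i beta_i * x_i for every sample.

    expression: dict mapping gene -> list of per-sample expression values.
    coefficients: dict mapping gene -> beta (Cox estimate or user weight).
    Returns one PI value per sample, in sample order.
    """
    genes = list(coefficients)
    n_samples = len(expression[genes[0]])
    return [
        sum(coefficients[g] * expression[g][s] for g in genes)
        for s in range(n_samples)
    ]

# Hypothetical two-gene biomarker measured in two samples.
expression = {"GENE_A": [1.0, 2.0], "GENE_B": [3.0, 0.5]}
weights = {"GENE_A": -0.5, "GENE_B": 1.0}
pi_values = prognostic_index(expression, weights)
```

A negative coefficient lowers the index as expression rises (a protective gene in this convention), while a positive coefficient raises it.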

Risk Estimation
SurvExpress implements two methods to generate risk groups. The first method (default) generates the risk groups by splitting the ordered PI (higher values for higher risk) into the requested number of risk groups, leaving an equal number of samples in each group. For two risk groups, this is equivalent to splitting the PI at the median. The second method uses an optimization algorithm on the ordered PI. Briefly, for two groups, a log-rank test is performed at every candidate split point along the ordered PI. Then, the algorithm chooses the split point where the p-value is minimum. This procedure is generalized to more than two groups by repeatedly optimizing one risk group at a time until no changes are observed. Details of this procedure are described in the tutorial provided on the SurvExpress web site.
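Both grouping methods can be sketched for the two-group case as follows. This is an illustrative plain-Python version under stated assumptions, not the SurvExpress implementation: the function names, the `min_size` parameter, and the toy data are hypothetical, the log-rank p-value uses the chi-square approximation with one degree of freedom, and the iterative generalization to more than two groups is not shown.

```python
import math

def equal_size_groups(pi, n_groups=2):
    """Assign group 0..n_groups-1 by rank of PI, with (near) equal group sizes."""
    order = sorted(range(len(pi)), key=lambda i: pi[i])
    groups = [0] * len(pi)
    for rank, i in enumerate(order):
        groups[i] = rank * n_groups // len(pi)
    return groups

def logrank_p(times, events, groups):
    """Two-group log-rank test p-value (groups coded 0/1, events coded 0/1)."""
    obs_minus_exp, variance = 0.0, 0.0
    for t in sorted({u for u, e in zip(times, events) if e}):
        n = sum(1 for u in times if u >= t)                       # at risk
        n1 = sum(1 for u, g in zip(times, groups) if u >= t and g == 1)
        d = sum(1 for u, e in zip(times, events) if e and u == t)  # events at t
        d1 = sum(1 for u, e, g in zip(times, events, groups)
                 if e and u == t and g == 1)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    chi2 = obs_minus_exp ** 2 / variance
    return math.erfc(math.sqrt(chi2 / 2))  # chi-square tail, 1 df

def optimized_split(pi, times, events, min_size=1):
    """Try every cut of the ordered PI and keep the one with the lowest p-value."""
    order = sorted(range(len(pi)), key=lambda i: pi[i])
    best_groups, best_p = None, float("inf")
    for cut in range(min_size, len(pi) - min_size + 1):
        groups = [0] * len(pi)
        for i in order[cut:]:
            groups[i] = 1  # samples above the cut form the high-risk group
        p = logrank_p(times, events, groups)
        if p < best_p:
            best_groups, best_p = groups, p
    return best_groups, best_p

# Hypothetical data: higher PI goes with shorter time to event.
pi = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]
times = [10, 9, 8, 3, 2, 1]
events = [1, 1, 1, 1, 1, 1]
median_groups = equal_size_groups(pi)
opt_groups, opt_p = optimized_split(pi, times, events)
```

Because the cut point is chosen to minimize the p-value, the resulting p-value is optimistic and is best read as descriptive rather than as a formal test.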

Outputs
The outputs correspond to common metrics and plots used to assess performance in survival analyses. An example of the outputs generated by SurvExpress is shown in Figure 2. Panel A shows the Kaplan-Meier plots by risk group, the log-rank test of differences between risk groups, the hazard-ratio estimate, and the concordance indexes, which estimate the probability that subjects with a higher risk will experience the event earlier than subjects with a lower risk [12]. Panel B displays a visual association of available clinical information to risk groups. Panel C illustrates a heat map of gene expression values. Panel D shows box plots of gene expression values across risk groups together with the p-value of the corresponding difference. Panel E shows the risk group optimization plot. Panel F shows fragments of the tables for the beta coefficients including corresponding Cox p-values, prognostic index per sample, and Cox fitting information from the survival package in R. Other advanced plots are also available, such as SurvivalROC, which estimates time-dependent sensitivities and specificities for survival risk groups [13] but needs a few minutes to compute. Additional plots, details, and interpretations of the outputs are described in the tutorial provided on the SurvExpress web site.
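The concordance index can be illustrated with a simplified version of Harrell's C, which counts the comparable pairs in which the subject with the higher risk score has the earlier observed event. The sketch below is an assumption-laden illustration: the function name and data are hypothetical, pairs with tied times are ignored, and the specific estimators reported by SurvExpress are not reproduced.

```python
def concordance_index(times, events, risk):
    """Simplified Harrell's C for right-censored data.

    A pair (i, j) is comparable when subject i has an observed event
    strictly before subject j's time. It is concordant when subject i
    also has the higher risk score; ties in risk count as one half.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Hypothetical data: the last subject is censored.
times = [1, 2, 3, 4]
events = [1, 1, 1, 0]
c_perfect = concordance_index(times, events, [4.0, 3.0, 2.0, 1.0])
c_reversed = concordance_index(times, events, [1.0, 2.0, 3.0, 4.0])
```

A value of 1 indicates perfect ranking, 0.5 is what random scores would give on average, and values below 0.5 indicate that the score ranks subjects in the wrong direction.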

Database
Although data collection will continue, to date we have collected around 20,000 cancer samples distributed in 140 datasets covering more than 20 tissues (Table 2). The main limitation to including more datasets was the absence of censoring information in repositories. Nevertheless, the SurvExpress collection surpasses that of similar tools in terms of tissue coverage, number of samples, multivariate predictor estimation, and functionality (Table 1).
From the 20 cancer types, the most represented by number of datasets were breast, hematologic, lung, brain, and ovarian, reaching around 70% of the database collection. It is surprising that most of the existing tools concentrate mainly on breast cancer even though a similar number of datasets is available for other cancer types. Consequently, one of the immediate advantages of SurvExpress is the ability to perform powerful analyses for these highly studied types of cancer. In addition, SurvExpress will allow the validation of biomarkers in cancer types that have not been considered by other tools, such as kidney, liver, gastrointestinal, pancreatic, bone, head and neck, and uterine. In the web interface, we also encourage users to suggest or send data to increase cancer and dataset coverage.

Web Interface
The two web interfaces comprise three sections: Input, Analysis, and Results (Figure 1B). The Input page is operated by typing or pasting a list of genes and specifying the target dataset (numbers 1 to 3 in Figure 1B). It also includes a link to the tutorial that describes all options and provides comprehensive interpretations of the outputs. The subsequent Analysis and Results page is obtained in a few seconds (about 1 second per gene and 200 samples). In the Analysis section, the user specifies the outcome of the selected dataset on which the analysis will be performed (number 4 in Figure 1B). The Results section (Figure 2) is obtained a few seconds after submitting an analysis. This section includes outputs such as Kaplan-Meier curves for risk groups, visual comparison of the clinical information to risk groups, a heat map of the gene expression values, box plots of the gene expression per gene and risk group, a plot of the risk group optimization process, tables of the Cox coefficients, prognostic indexes, and Cox fitting information, and a link to obtain the R scripts used.
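The Kaplan-Meier curves listed among the outputs follow the standard product-limit estimator, which can be sketched in a few lines. This is a generic illustration with a hypothetical function name and toy data. SurvExpress produces its curves in R, and that code is not reproduced here.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times: follow-up time per subject; events: 1 if the event was observed,
    0 if the subject was censored.
    Returns a list of (event_time, survival_probability) steps.
    """
    survival = 1.0
    curve = []
    for t in sorted({u for u, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        n_events = sum(1 for u, e in zip(times, events) if e and u == t)
        survival *= 1 - n_events / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical risk group of four subjects; the second one is censored.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Applying the function separately to the samples of each risk group gives one curve per group, which is what a Kaplan-Meier plot by risk group displays.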

Validation and Applications
Because of limitations in other tools, multi-gene comparisons across tools were not possible. Still, SurvExpress can provide results similar to those of other tools when only one gene is used. Nevertheless, to assess the functionality and estimations of SurvExpress, we performed two analyses evaluating the performance of well-known and proposed prognostic biomarkers. We used the OncotypeDX biomarker for recurrence in breast cancer and two published biomarkers for lung cancer survival.
OncotypeDX biomarker for breast cancer. As an example of testing one biomarker in several datasets, we used the 16 OncotypeDX genes [14]. OncotypeDX estimates a recurrence score that is mainly offered for early-stage, estrogen receptor-positive, lymph node-negative breast cancers. The genes included are AURKA, BAG1, BCL2, BIRC5, CCNB1, CD68, CTSL2, ERBB2, ESR1, GRB7, GSTM1, MKI67, MMP11, MYBL2, PGR, and SCUBE2 (the ACTB, GAPDH, GUSB, RPLP0, and TFRC genes used as reference in the RT-PCR assay were not used here). To estimate the score, OncotypeDX uses a weighting algorithm equivalent to a weight multiplied by the corresponding gene expression normalized by a reference [14]. In SurvExpress we used Cox fitting (as an approximation, since gene expression data are not normalized to reference genes) in four breast cancer datasets (Table 3). Other settings were the maximum row average for genes with multiple probesets, and two risk groups split at the median of the prognostic index. To test the biomarker in several conditions, the datasets were chosen to reflect patients suitable for the test (Wang [27] and Ivshina [26]), patients with partial information and a different event (TCGA [25]), and patients without clinical information (Kao [15]). The results shown in Figure 3 and summarized in Table 4 suggest that, overall, OncotypeDX can significantly separate low- and high-risk groups in the four datasets tested. Moreover, satisfactory indexes of concordance and areas under the ROC curve were obtained. These results can be obtained using SurvExpress in a few minutes. To demonstrate the analytical features of SurvExpress, we also performed the survival evaluation stratifying the samples using the tumor grades provided by the authors (AJCC stage in the TCGA dataset and grade in the Ivshina dataset). Representative results for the Ivshina dataset are shown in Figure 4. The figure suggests that the performance, given by the concordance index and the log-rank test for risk groups, decreases as grade increases. Results for the TCGA dataset are shown in the tutorial available on the SurvExpress web site.
Comparison of two lung cancer biomarkers. For non-small-cell lung cancer (NSCLC), at least 16 biomarkers have been proposed [16]. Here we compared two biomarkers proposed for survival of NSCLC that attempt to predict the same event (survival) and use a similar number of genes; however, the genes are different. The first NSCLC biomarker was proposed by Boutros et al. [17] and contains the following genes: STX1A, HIF1A, CCT3, HLA-DPB1, RNF5, and MAFK. The second NSCLC biomarker was proposed by Chen et al. [18] and contains the genes DUSP6, MMD, STAT1, ERBB3, and LCK. Therefore, it is of clinical interest to compare their performance. For this, we performed an analysis in SurvExpress using the maximum row average for genes with multiple probesets, two risk groups split at the prognostic index median, and Cox fitting. We used a special lung meta-base built in our research group, which is composed of more than 1,000 samples obtained from six authors (Bild [19], Raponi [20], Zhu [21], Hou [22], NCI [23], Okayama [24]), all using an equivalent Affymetrix gene expression platform and containing all biomarker genes.
[Figure 5 caption, truncated in the source: "... Corresponding beta coefficients from the Cox fitting are shown. Two stars (**) mark genes whose fitting p-value is <0.05, one star (*) marks marginally significant genes with p-value <0.10, and no stars mark genes whose p-value is >0.1. Box plots compare the difference of gene expression between risk groups using a t-test." doi:10.1371/journal.pone.0074250.g005]
The results show that both biomarkers are able to separate risk groups characterized by differences in their gene expression (see the Kaplan-Meier and box plots, respectively, in Figure 5). Nonetheless, the p-value of the risk group separation, the concordance index, and the significance of the coefficients were slightly better for the Chen biomarker. To analyze the biomarkers more deeply, we tested each biomarker per database author using the SurvExpress stratification functionality (this can also be achieved by performing a SurvExpress analysis per author dataset). The results for the six authors are summarized in Table 5. Three representative examples are shown in Figure 6. The results show that the Boutros biomarker fails in four datasets (the log-rank test of the difference in risk groups is not significant) while the Chen biomarker works better in almost all datasets. In summary, these results suggest that the performance of the Chen biomarker is superior.

Conclusion
Compared with other tools, SurvExpress is the largest and the most versatile free tool to perform validation of multi-gene biomarkers for gene expression in human cancers. The analysis requires only the list of genes and can be performed in approximately one minute per dataset. Common applications for testing the performance of biomarkers include the evaluation of a biomarker in other populations or clinical conditions and the comparison of competing biomarkers. We have shown these two applications of SurvExpress by comparing the performance of a breast cancer biomarker in several datasets, including stratification by tumor grade, and by determining the better of two alternative lung cancer biomarkers. We conclude that SurvExpress is a valuable and comprehensive web tool and cancer database with clinical outcomes tailored to rapidly evaluate gene expression biomarkers.

Author Contributions
Conceived and designed the experiments: VT. Performed the experiments: RAG HGR EML AMT RCH ARB VT. Analyzed the data: VT JTP. Contributed reagents/materials/analysis tools: RAG HGR EML AMT RCH ARB VT. Wrote the paper: VT RAG JTP.