Combining Gene Signatures Improves Prediction of Breast Cancer Survival

Background Several gene sets for prediction of breast cancer survival have been derived from whole-genome mRNA expression profiles. Here, we develop a statistical framework to explore whether combination of the information from such sets may improve prediction of recurrence and breast cancer specific death in early-stage breast cancers. Microarray data from two clinically similar cohorts of breast cancer patients are used as training (n = 123) and test set (n = 81), respectively. Gene sets from eleven previously published gene signatures are included in the study. Principal Findings To investigate the relationship between breast cancer survival and gene expression on a particular gene set, a Cox proportional hazards model is applied using partial likelihood regression with an L2 penalty to avoid overfitting and using cross-validation to determine the penalty weight. The fitted models are applied to an independent test set to obtain a predicted risk for each individual and each gene set. Hierarchical clustering of the test individuals on the basis of the vector of predicted risks results in two clusters with distinct clinical characteristics in terms of the distribution of molecular subtypes, ER, PR status, TP53 mutation status and histological grade category, and associated with significantly different survival probabilities (recurrence: p = 0.005; breast cancer death: p = 0.014). Finally, principal components analysis of the gene signatures is used to derive combined predictors used to fit a new Cox model. This model classifies test individuals into two risk groups with distinct survival characteristics (recurrence: p = 0.003; breast cancer death: p = 0.001). The latter classifier outperforms all the individual gene signatures, as well as Cox models based on traditional clinical parameters and the Adjuvant! Online for survival prediction. Conclusion Combining the predictive strength of multiple gene signatures improves prediction of breast cancer survival. The presented methodology is broadly applicable to breast cancer risk assessment using any new identified gene set.


I: Materials and Microarray protocol Training set (MicMa)
The published data [1] have 123 human breast cancer cases, mainly stage I and II. Patients treated for localized breast cancer were included in this study [2] between 1995 to 1998. This cohort is a subset of a larger study where the patients were monitored for presence of disseminated tumor cells in bone marrow (DTC). The selection of patients to adjuvant treatment was based upon the prevailing National Guidelines, where postmenopausal hormone receptor (HR) positive patients received tamoxifen only, postmenopausal HR negative patients received CMF (Cyclophosphamide, Methotrexate, Fluorouracil) and premenopausal patients received CMF followed by tamoxifen if HR positive. Five patients received high dose chemotherapy and another five, preoperative chemotherapy due to large tumor size [2]. After completed primary therapy, the patients were followed at 6-12 months intervals. Fresh frozen tissue samples were available from all the 123 individuals. One patient was diagnosed with ductal carcinoma in situ.
Total RNA was isolated from primary tumor tissues using the TRIzol reagent (Invitrogen). The integrity and quality of RNA samples were evaluated on the 2100 Bioanalyzer (Agilent) and the concentration measured using a NanoDrop spectrophotometer (NanoDrop Technologies). Amplification, and labeling of RNA from tumor with Cy5 and from the Universal Human Reference (Stratagene) with Cy3, was performed as previously described [3]. Hybridization of labeled cRNA to arrays containing 42,000 features representing 23169 unique cluster IDs (UniGene Build Number 215) produced at the Stanford Functional Genomics Facility (http://www.microarray.org/sfgf/jsp/home.jsp) was performed at 65°C overnight as previously described. The hybridized arrays were scanned on an Agilent DNA microarray scanner and images analyzed by GenePix Pro v 4.1. All procedures are available at http://genomewww.stanford.edu/breast_cancer and all raw data can be obtained from the Stanford Microarray Database (SMD; http://smd.stanford.edu/). The data have been deposited in NCBIs Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/) and are accessible through GEO Series accession number GSE3985.

Validation set (Ull)
The data published by Langerød et. al [4] was used as validation dataset. Patient samples were sequentially collected at Ullevål University Hospital from 1990 to 1994. 80 cases were selected from the total series based only on sufficient amount of fresh frozen tissue for microarray. The last update of patient information was done in 2006, providing an observation time of 12 to 16 years. Patients were followed until death or emigration, and only 12 patients were lost to followup. The average age of the 80 cases analyzed by cDNA microarrays was 65.0 years at time of primary surgery (range 28.2 to 87.7 years).
Total RNA was isolated from snap frozen tumor tissue using TRIzol ® solution (Invitrogen™, Carlsbad, California, USA). The concentration of total RNA was determined using an HP 8453 spectrophotometer (Hewlett Packard) and the integrity of the RNA was assessed using a 2100 Bioanalyzer (Agilent, Santa Clara, California, USA). Linear amplification of total RNA was performed using an optimized protocol previously described [3]. Amplified tumor RNA was

II: A brief discussion of the various gene sets & Adjuvant! Online model
Except for the RS predictor, all gene sets were originally developed as microarray-based predictors.
1. 16-gene recurrence-score predictor (RS) These 16 genes are the cancer-related subset from the 21-gene recurrence score RT-PCRbased assay (Oncotype DX ® ) in Paik et al. (2004) [5]. The expression for each cancer-related gene on RT-PCR assay is normalized relative to the expression of the five reference genes, and further used to calculate a recurrence score, RS, which quantifies the likelihood of distant recurrence in adjuvant-tamoxifen-treated patients with lymph node-negative, ER-positive breast cancer into categories of high risk (RS ≥ 31), intermediate risk (18 ≤ RS < 31), and low risk of recurrence (RS < 18). They claimed Recurrence Score ® performance is superior to that of patient age, tumor size or tumor grade in either predictive power or reproducibility.

26-gene stroma-derived prognostic predictor (SD)
This gene set was identified from an analysis of microdissected stroma from breast cancer specimens. A logistic regression was used to score and rank each gene in the expression profile on the basis of its statistical significance in predicting recurrence in a model that included the gene expression level, lymph node status, and ER, PR and HER2 status. And naive Bayes' classifiers were trained to predict outcome using the ranked gene expression profile of the recurrence-positive stroma cluster. The original study [6] reported that 26-SD prognostic predictor stratified disease outcome independently of standard clinical prognostic factors and published expression-based predictors.
3. 54-lung-metastasis-gene signature (LM) By means of in vivo selection, transcriptomic analysis, functional verification and clinical validation, a set of 54 genes was identified that mediates risk of breast cancer metastasis to lung and is clinically correlated with the development of lung metastasis when expressed in primary breast cancers [7]. A cohort of 82 breast cancer patients was used in a univariate Cox proportional hazards model to relate the expression level of each lung metastasis signature gene with clinical outcome. A 'leave-one-out' cross-validated multivariate analysis was used to distinguish between patients with a high risk and those with a low risk for developing lung metastasis. In each round, each of the 54 gene weights was estimated in a univariate Cox using only training set, and a 10-year lung-metastasis risk index for test case (held out from the full set of tumors) was defined as a linear combination of gene expression values weighted by their univariate result; and the held out tumor was assigned to the high-or lowrisk groups. The 20th percentile of the risk index scores was chosen as the threshold to separate high risk (top 20%) and low risk group.

70-gene predictor (AMST)
This breast cancer prognostic signature [8], also known as MammaPrint ® genomic test, consists of 70 most informative genes derived from a supervised statistical model. Based on the expression levels of the 70 genes, it predicts distant metastases status 5 years after the diagnosis on younger patients with stage I and II node-negative breast cancer. The signature associated with a poor prognosis demonstrates overexpression of genes regulating the cell cycle, invasion, metastasis, and angiogenesis. In 2007, MammaPrint ® test was approved by FDA for use in the U.S. for lymph node negative breast cancer patients under 61 years of age with tumors of less than 5cm. It is the first cleared molecular test that profiles genetic activity.
In a recent preliminary clinical trial, Mook et al. (2008) [9] reported that the 70-gene prognosis-signature outperforms traditional prognostic factors in predicting metastasis in patients with 1-3 positive nodes. According to the authors, the signature can accurately identify patients with a good disease outcome in node-positive breast cancer, who would not benefit from adjuvant chemotherapy.

76-gene predictor (ROT)
This gene set was used to predict development of distant metastases within 5 years. It was developed from 286 patients not receiving systemic treatment, with lymph node negative primary breast cancer of variable size and from all age groups [10].
6. 97-grade-associated markers (Grade) Sotiriou et al. (2006) [11] identified 97 grade-associated genes from 64 ER positive tumor samples by comparing expression profiles between histologic grade 3 and grade 1 tumors. The signed sum of gene expression values of these 97 genes were used to calculate the gene expression grade index (GGI). In validation datasets, the GGI was strongly associated with histologic grade. 7. 127-gene classifier (Robust) 127 genes were identified as predictors of prognosis from a pooled dataset consisting of six breast cancer datasets, totaling 947 samples [12]. The study showed a significant positive correlation between the number of datasets that is pooled, the validation performance, the number of genes selected, and the enrichment of specific functional categories. The authors claimed that the 127 gene signature is one of the most robust signatures currently available.
8. 168-hypoxia-gene signature (Hypoxia) The epithelial hypoxia signature [13] consists of 168 genes that were consistently induced by hypoxia. They were defined from an in vitro experiment by Chi et al. (2006) [13] where four epithelial cell lines were exposed to hypoxia. A gene expression classier was developed through a "hypoxia score" for the patient by averaging expression levels for the hypoxiaresponse genes. Patients were assigned into high or low hypoxia response group by a cutoff hypoxia-score at zero. Using published data sets, the authors found that the "high hypoxia response" group tends to be higher grade, and more likely to have p53 and oestrogenreceptor deficiencies, and, most importantly, a significant association with a poorer prognosis in breast and ovarian cancer. 9. 186-invasiveness-gene signature (Stem) Liu et al. (2007) [14] report on a 186-gene "invasiveness" gene signature (IGS) that discriminates between normal breast epithelium and tumorigenic breast-cancer cells that are characterized by CD44 expression and low or undetectable levels of CD24. This signature was reported to be associated with survival among patients with breast cancer.

306-intrinsic/UNC gene list (Intrinsic)
The original intrinsic gene list was previously identified using DNA microarray in the study of Perou et al. (2000) [15] and refined in Sørlie et al. (2001) [16]. It has been used to identify five distinct subtypes of breast tumors (basal-like, ERBB2+, luminal A, luminal B and normal-breast like) based on the expression profile of the ~ 500 intrinsic genes, whose variation was significantly greater between samples from different tumors than between samples from the same tumor before and after treatment. These subtypes were associated with significant differences in overall and disease-free survival [16,17]. The intrinsic/UNC gene list [18] is a new intrinsic-subtype classifier, which uses gene expression profiles to distinguish among breast cancers on the basis of either their cell type of origin -the luminal cell (which is ER-positive) or the basal cell (which lacks expression of ER, the progesterone receptor, and HER2) -or whether the tumor is HER2-positive. Among the UNC intrinsic genes, a proliferation signature was not present in previous breast intrinsic gene sets.
11. 512-core serum response gene list (WR) The wound response/core serum response gene list [19] based on a wound-response geneexpression signature derived from the transcriptional response of normal fibroblasts to serum in cell culture, has been shown to improve the risk stratification of early breast cancer over that provided by standard clinic pathological features, in that the development of distant metastases is more likely among patients whose breast cancers have activated pathways for matrix remodeling, cell motility, and angiogenesis than among those whose cancers do not.

Adjuvant! Online model (AOL)
Adjuvant! Online [20] is a computer based model using patient age, comorbidity level, ER status, tumor grade, tumor size and number of positive lymph nodes to predict breast cancer specific mortality and recurrence risk, as well as the benefit of adjuvant therapy for women with early-stage breast cancer.
Because Adjuvant! was directly derived from mortality data and because details of local therapy (sugery and initial radiation) can strongly influence local relapse rates more so than mortality, Adjuvant!'s estimates of mortality are more firmly based than those for relapse. Breast cancer outcome estimates made by Adjuvant! are for "patients who have unilateral, unicentric, invasive adenocarcinoma of the breast, who have undergone definitive primary breast surgery and axillary node staging, and who have no evidence of metastatic or known residual disease; no evidence of T4 features (extension to skin or chest wall); no evidence of inflammatory breast cancer. If they have had breast conserving therapy there should be plans for them to receive radiation therapy. They should not yet have received systemic therapy (neoadjuvant therapy), or radiation prior to their surgical staging." (Adjuvant! Breast Cancer Help Files).
To calculate the 10-year mortality risk for the 80 patients in the test data (Ull), 3 patients with NEOadjuvant; 2 additional patients with metastasis; 3 additional patients with T4 were excluded. Among the rest 72 patients, 6 patients are with "other" histologic subtypes. Due to the differences between the coding used in Adjuvant! Online and in Ull set for node status and tumor size, "positive node = 0" in AOL was specified for Ull patients with unknown node status; "tumor size = 0.1-1.0cm" in AOL was specified for patients with unknown tumor size; "tumor size = 1.1-2.0cm" was specified for patients with T1 (<=2cm), "tumor size = 3.1-5.0cm" was specified for Ull patients with T2 (>2 & <=5cm). The default level "Minor Problems" was kept for comorbidity in AOL to give an estimate of the general health of an individual.

III. Cross-platform gene mapping
The annotations for the Stanford 43k clones were retrieved from SMD (http://smd.stanford.edu/; UniGene Build Number 215). For gene sets that developed from the same platform as Stanford 43k cDNA array, we matched the genes in the gene set to MicMa and Ull through clone IDs. For those developed from other platforms, the matching was done by first using UniGene ID and gene symbols to link to Stanford 43k Clone IDs; and then using gene aliases for those genes found no match in the preceding step. See table 1 for an example. Table 1: Illustration of matching gene set to Stanford 43k array.
Step 1, UniGene IDs for all the probes were retrieved (column "Linker ID 1"). Probe 1 and 2 were found to share the same UniGene IDs with Stanford 43k array clone 566115 and 310356/69002, respectively. Step 2, gene symbols were retrieved for probes 3, 4, 5 and 6 (column "Linker ID 2") that had no match in step 1. Probe 3 and 4 were matched to clone 122274 and 358030, respectively, linking through shared gene symbols.
Step 3, gene aliases were retrieved for probe 5 and 6 (column "Linker ID 3") that had no match after step 1 and 2. Probe 5 was matched to clone 33045 by common gene aliases. There is no corresponding clone on Stanford 43k array for probe 6. Furthermore for the matched clones representing the same gene symbol were mean-averaged. For clones without information of gene symbol, we mean-averaged the multiple matched clones that represent the original probe IDs from the corresponding gene signature.
Details of the matching results for individual gene set are describes as follows.
Retrieved ID: Corresponding UniGene IDs were retrieved using gene symbols from SMD (Build #215).
Results: 13 matches were found on Stanford 43k array using gene symbols. For those didn't find match by symbol, additional 2 matches were found using gene aliases.
Retrieved ID: Corresponding UniGene IDs were retrieved using gene symbols from SMD (Build #215).
Results: 21 matches were found on Stanford 43k array by symbols. For those didn't find match by symbol, additional 1 match was found by using retrieved UniGeneID.
Retrieved ID: Corresponding UG & symbols were retrieved using 54 Affy probes from chip annotation file for Affy U133A chip.
Results: 45 matches were found on Stanford 43k array using UniGeneID. For those didn't find match by UniGene, additional 2 matches were found using symbols. For those still didn't find match, additional 1 match was found by using gene aliases. Results: 53 matches were found on Stanford 43k array using UniGeneID. For those didn't find match, additional 4 matches were found using gene aliases.
Retrieved ID: Corresponding UniGeneIDs & gene symbols were retrieved using Affy probes from chip annotation file for Affy U133A chip.
Results: 59 matches were found on Stanford 43k array using UniGeneID. For those didn't find match by UniGene, additional 3 matches were found using symbols. For those still didn't find match, additional 4 matches were found by using gene aliases.
Retrieved ID: Corresponding UniGeneIDs & gene symbols were retrieved using Affy probes from chip annotation file for Affy U133A chip.
Results: 101 matches were found on Stanford 43k array using UniGeneID. For those didn't find match by UniGene, additional 6 matches were found using symbols. For those still didn't find match, additional 4 matches were found by using gene aliases.
Retrieved ID: Corresponding UniGeneIDs & gene symbols were retrieved using Affy probes from chip annotation file for Affy U133A chip.
Results: 108 matches were found on Stanford 43k array using UniGeneID. For those didn't find match by UniGene, additional 6 matches were found using symbols.
Results: all 253 matches were found on Stanford 43k array using CLID.
Retrieved ID: Since the probes were from different arrays, and information about which probes came from which array were not published, 186 gene symbols were retrieved from the published annotation file (http://content.nejm.org/content/vol356/issue3/images/data/217/DC1/NEJM_Liu_217sa2.xls ); and then UniGeneIDs were retrieved by gene symbols from (build#215), further used for the mapping.
Results: 109 matches were found on Stanford 43k array using gene symbol. For those didn't find match, additional 52 matches were found using gene aliases.
Retrieved ID: Since the 327 probes were from different arrays, and information about which probes came from which array were not published. Corresponding UniGeneIDs & gene symbols for 327 probes were retrieved from the published 1500-probes-Annotation-table (https://genome.unc.edu/pubsup/breastTumor/Hu-et-al-Intrinsic-List-with-Annotation.xls).
*304 EntrezIDs were also provided by UNC author. There are 294 overlapping with published-EntrezIDs. 290 out of 304 UNC-EntrezIDs can be mapped on Stanford43k. Did not achieve 100% match (according to UNC, it should be 100% match using Entrez alone). Therefore, we used UG & symbols & aliases for the mapping instead of Entrez.
Results: 111 probes were mapped on Stanford 43k array by using UniGene; additional 157 probes were found by using gene symbol; additional 22 probes by gene aliases.

Total: 290/306=95%
11. WR (512 wound response/core serum response gene list): Original ID: 573 Clone IDs were published Results: 573 matches were found on Stanford 43k array using CLID. (Although the platforms are the same, it's possible some clones were removed from the newer version of the Stanford 43k array.

IV. Statistical analysis
A total 118 patients with available information of systemic recurrence status and time to recurrence (n = 118) were considered in the analysis. For breast cancer specific survival, information was available for 123 patients (n = 123), and these were used in the further analyses.

Theoretical background: Cox regression
The Cox proportional hazards model [21] is widely used to model survival data. No assumption has to be made for the underlying parametric distribution of survival times. We applied this technique to model and predict our time-to-event data. We aimed to predict survival probability for systemic recurrence and breast cancer specific death, respectively. The variables included in the model were expressions of individual genes from gene sets and also the different clinical parameters in the later stage of the analysis.
Our data are (t j , d j , X j ), j = 1,…, N, where t j is the time on study of the jth patient, d j is an indicator with d j = 1 indicating an event (systemic recurrence or breast cancer specific death) and d j = 0 indicating no event at the observed time, and X j is the vector of risk variables associated with the jth patient. In a Cox proportional hazards model, the hazard rate for the jth individual with risk vector X j at time t is as follows: (1) Here, is a parameter vector to be estimated and is an arbitrary baseline hazard function. Even though the form of the baseline hazard function is unspecified, coefficients for the risk vector in the Cox model can still be estimated by maximizing a partial likelihood [21]: (2) where R j is the set of individuals still alive and uncensored just before time t j .

Theoretical background: Cox-ridge regression
In our case, the number of explanatory variables genes (p) is large and sometimes even exceeds the number of individuals N used for training of the Cox model. Thus overfitting may easily arise, leading to a fitted model that describes that training data well, but performs poorly on prediction of new observations. In addition, a high degree of collinearity among the variables is likely to emerge, thereby leading to a situation in which the estimated regression coefficients may change substantially even after slight perturbations of the training data. In a linear model, dimension reduction techniques and penalized regression are the strategies to control and stabilize the variance of the estimates and further achieve better prediction rules. Some widely used regression regularization methods such as ridge regression, partial least squares and principle components regression were compared in the study by Frank and Friedman in 1993 [22]. In Cox-ridge regression, the coefficients are estimated by maximization of the penalized partial log-likelihood (using Newton-Raphson procedure): (3) where i =1, …, p, indicates the covariates; the first term is the partial log-likelihood and the second term is the penalty term [23]. Here, λ>0 is a tuning parameter that controls how much weight to put on the penalty function.

Theoretical background: Model selection by cross-validation
The ridge regression estimates depends on the tuning parameter λ. In our study, λ was determined by maximizing the cross-validated partial log-likelihood criterion proposed by Verweij and van Houwelingen (1993) [24]. Let denote the partial log-likelihood when observation i is left . Then, the contribution of observation i to the likelihood is . Let β (− i) be the value of that maximizes . Then the leave-one-out cross-validated partial log-likelihood criterion proposed by Verweij and van Houwelingen (1993) [24] is given by: Maximizing the cross-validated partial log-likelihood with respect to λ yields the optimal value for the tuning parameter.
Note that in this paper, we focus on obtaining a prediction rule that performs well on new observations. Therefore, the value λ is determined by maximizing the predictive value of the model. In parametric models [25], the measure for the prediction error is to use the predictive log-likelihood, which is the expected value of the model log-likelihood for new observations. Due to the semi-parametric nature of Cox model, the partial log-likelihood instead of the full loglikelihood is used as a predictive log-likelihood so that the baseline hazard does not play a role.
In a comparative study of survival prediction performance using microarray data [26], it has been found that Cox-ridge regression often outperforms other common regularization techniques for Cox regression, such as principal components regression, supervised principal components regression, partial least squares regression and the lasso.

Construct a gee-set prediction model based on gene expression
On the training set, we apply leave-one-out cross validation using Van Houwelingen's criteria [24] to determine the tuning parameter , which is specific for each gene set. We then built a Cox-ridge model using for each gene set by incorporating the intensity log ratio of genes from the gene set as covariates. Let the vector of gene expression values for the ith patient and the jth For all patients on jth gene set, j = 1,…, 11, we assume a Cox-ridge model: where X j Training is the notation for the training set for j th gene set, which is the gene expression matrix matched on MicMa for j th gene set. Gene set j consists a total n j genes.
β j = (β j1 ,...,β j,n j ) T are coefficients associated with corresponding genes and h 0 j (t) is the gene-set specific baseline hazard at certain time point t. It has been recommended to standardize the covariates before applying the ridge regression, since the ridge estimates are not invariant to the scaling of the input data [22,27]. In our case, the expression for each gene in a matched geneset expression matrix was transformed to center around 0 with standard deviation 1. Standardization was carried out both for the training set and the test set.
From Model 5 on the training data, we obtained the estimated coefficients β j . The estimated Prognostic Index for each MicMa patient in training set was calculated by The predicted Prognostic Index for an Ull patient in test was calculated by sum of weighted expression for each gene in test dataset by the corresponding coefficient for the same gene estimated from training data:

Model evaluation: proportion of variation explained (PVE)
Comparable with the R 2 in regression modeling, the importance of covariates in the Cox model can be quantified using the proportion of variation explained in the outcome variable (PVE) [28] by one or more covariates: where n denotes total sample size. The relative importance of a covariate in a multivariate Cox model was measured by the partial PVE, which was calculated as the different of for the full model and for a model with a factor of interest excluded.

Model evaluation: concordance index (C-index)
The concordance index (C-index) [29] is a widely accepted measurement for predictive discrimination of a given model. In survival analysis, it is a generalization of the area under the receiver operating characteristic (ROC) curve, and it measures the probability of concordance between the predicted and observed responses in terms of lengths of survival of any two patients: | Ω | where r i and r j are the predicted risk for ith and jth patient; respectively. Ω is a set of all possible pairs of patients, at least one of whom has experienced an event and time to event t i < t j . If the predicted risk is larger for the patient who lived shorter, the predictions for that pair are said to be concordant with the outcomes. The C-index is ranging from 0 to 1. The larger C-index, the better is the predictability of a survival model.

Evaluation on the training set for systemic relapse as clinical endpoint
The estimated Prognostic Index for each MicMa patient in training set was calculated by model 6. It should be emphasized that the results on the training set should not be interpreted as more than a preliminary screening of the model applicability. The performance of the model on the training set was examined as follows. The estimated prognostic index for each training patient was obtained by weighted sum of the expression values of corresponding genes in each gene set in training set (formula 6). Hierarchical clustering using Spearman correlation and average linkage performed on the estimated PIs from all gene sets revealed two distinct risk groups, namely: high-risk and low-risk group ( Figure S2A) associated with significantly different survival probabilities (χ 2 = 49.7, df = 1, p < 0.001) ( Figure S2B). The median survival time, where half of the subjects have reached the event of interest, was not observed in the low-risk group, while the high-risk group had a 47.2 month median survival time. Distinct clinical characteristics were observed in the two risk clusters: a total of 30 out of 49 Luminal A tumors (61%) were clustered in the low risk group, and 19 luminal A tumors (39%) were found in the high risk group ( Table 2). The results suggested that the two risk groups derived from the estimated prognostic indices had distinct survival patterns on the training set.