Combining Gene Signatures Improves Prediction of Breast Cancer Survival

doi:10.1371/journal.pone.0017845

Figure 1.

Flowchart of the analysis.

(A) Construction of the gene-set predictor (gene signature) for risk prediction. Input: a set of genes of interest (gene 1, …, m), which can be traced by their corresponding colors throughout the diagram, and gene expression data for the training and test cohorts, with genes in the rows and patients in the columns. Step 1. Gene identity mapping and extraction of the expression matrix. Step 2. Using the observed event status of the patients in the training set, a Cox model with an L2 penalty is used to model the relationship between survival probability and the expression pattern of the gene set. The coefficients, or “gene weights” (β₁, …, β_m), associated with the individual genes are estimated from the Cox-ridge model. The size of each bubble in the gene-weights matrix reflects the importance of the corresponding gene for survival prediction. Step 3. A Prognostic Index (PI), the predicted risk score for test patient i (i = 1, …, n), is calculated as the weighted sum of that patient's gene expression values, using the gene weights estimated in Step 2. (B) Integration of multiple gene signatures by dimension reduction. Input: multiple gene sets of interest together with their gene expression data. Module 1: For the jth gene set (j = 1, …, R), the procedure described in panel A is used to predict a risk score PI for each test patient. The resulting PI matrix has dimension R × n and represents the risk predictions for the n test patients by each of the R gene sets. Module 2: Integrate the predictions from multiple gene signatures by dimension reduction using principal components analysis (PCA). Module 3: Dichotomize the risk scores on PC1 at the median (higher than the median indicates high risk), resulting in two predicted risk groups for survival outcome.
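
Step 3 of panel A and Modules 2 and 3 of panel B can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the gene weights are taken as already estimated (the Cox-ridge fit of Step 2 is not reproduced), the toy numbers are made up, the function names are hypothetical, and the rule used to orient the sign of PC1 is an assumption, since the sign of a principal axis is arbitrary.

```python
import numpy as np

def prognostic_index(weights, expr):
    """Step 3: PI_i = sum over genes k of beta_k * x_ki for each test patient i.

    weights: (m,) gene weights; expr: (m, n) with genes in rows, patients in columns.
    """
    return weights @ expr

def combine_by_pc1(pi_matrix):
    """Modules 2 and 3: score patients on PC1 of the R x n PI matrix, split at the median."""
    # Treat patients as observations and gene sets as variables, then center.
    X = pi_matrix.T - pi_matrix.T.mean(axis=0)
    # PCA via SVD: the rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    pc1 = vt[0]
    # Assumption: orient PC1 so that higher PIs give a higher PC1 score.
    if pc1.sum() < 0:
        pc1 = -pc1
    scores = X @ pc1
    high_risk = scores > np.median(scores)
    return scores, high_risk

# Toy data (made up): R = 3 gene sets, m = 4 genes each, n = 6 test patients.
rng = np.random.default_rng(0)
latent = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])  # hidden risk driving expression
loadings = [np.array([1.0, 0.5, 0.0, 0.2]),
            np.array([0.3, 0.3, 0.3, 0.3]),
            np.array([0.0, 1.0, 0.5, 0.0])]
weights = [np.array([0.4, 0.2, 0.1, 0.1]),            # stand-ins for estimated gene weights
           np.array([0.25, 0.25, 0.25, 0.25]),
           np.array([0.1, 0.5, 0.3, 0.1])]
expr_sets = [np.outer(a, latent) + 0.05 * rng.standard_normal((4, 6)) for a in loadings]

# Module 1: one row of PIs per gene set gives the R x n PI matrix.
pi_matrix = np.vstack([prognostic_index(w, x) for w, x in zip(weights, expr_sets)])
scores, high_risk = combine_by_pc1(pi_matrix)
print(pi_matrix.shape, high_risk)
```

In this toy setting the three gene sets agree closely, so PC1 recovers the ordering of the hidden risk and the median split separates the three lowest-risk patients from the three highest-risk ones.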

Table 1.

Published gene sets included in the analysis.

Table 2.

Acronyms for original gene sets and coverage on training & test set.

Table 3.

Number of overlapping genes between gene sets.

Table 4.

Individual gene set prediction characteristics (optimal λ by LOOCV in model building, change in deviance on test set, standard deviation for PIs).

Figure 2.

Boxplot of predicted PIs on test data.

(A) Systemic recurrence: The predicted PIs across all the studied gene sets were roughly centered around 0, a result of standardizing the expression matrix on both the training and the test set for each gene set in the model-building stage. The standard deviations of the PIs for the individual gene sets are as follows: RS: 0.109; SD: 0; LM: 0; AMST: 0.195; ROT: 0.095; Grade: 0.078; Robust: 0.121; Hypoxia: 0.245; Stem: 0.044; Intrinsic: 0.137; WR: 0.037. Due to lack of convergence, the predicted PIs for gene sets SD and LM were calculated by setting the tuning parameter λ to a large value. (B) Breast cancer specific death (BC specific death): Boxplot of predicted PIs on the test set. Gene set LM failed to converge in model training.

Figure 3.

Hierarchical clustering of predicted PIs on test set and Kaplan-Meier analysis of the clusters.

Results for systemic recurrence are in (A–B); for BC specific death in (C–D). In the heatmaps (A, C), rows are labeled with the gene sets and columns are annotated with patient characteristics; data outside of the 1% quantile were trimmed. “Average” linkage based on Spearman correlation was used to construct the dendrograms. Panels A and C share legends for the clinical parameters. (A) Heatmap of predicted PIs on the test set for systemic recurrence from each gene set. Two risk groups were observed from the hierarchical clustering: cluster I and cluster II. The control sample in Ull DNR_N_100, marked in green, was classified in the cluster associated with a lower risk (cluster II). (B) The Kaplan-Meier curves for the two clusters. A significant separation between the two groups was observed (χ² = 7.8, df = 1, p = 0.005). (C) Heatmap of predicted PIs on the test set for BC specific death from each gene set. Two risk groups were observed from the hierarchical clustering: cluster I and cluster II. The control sample in Ull DNR_N_100, marked in green, was classified in the cluster associated with a lower risk (cluster I). (D) A significant separation between the two Kaplan-Meier curves associated with the clusters was observed (χ² = 5.996, df = 1, p = 0.014).

Table 5.

Clinical and molecular characteristics of the two risk groups from hierarchical clustering of the test patients based on the predicted PI matrix.

Figure 4.

Correlation structure of predicted PIs from gene sets with convergence in model-building stage.

Heatmap of the Spearman correlation matrix of predicted PIs from the individual gene sets for the corresponding survival endpoint. (A) For systemic recurrence, the nine gene sets that reached convergence during model building are displayed. (B) For BC specific death, the ten gene sets that reached convergence during model building are displayed. Panels A and B share the same color legend.
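
A Spearman correlation matrix of this kind can be sketched as the Pearson correlation of the ranked PIs, with one row per gene set. The example below is an illustration under assumptions: the PI values are made up, and `rank_with_ties` and `spearman_matrix` are hypothetical helper names, not functions from the paper.

```python
import numpy as np

def rank_with_ties(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    v = np.asarray(values, dtype=float)
    order = np.argsort(v, kind="mergesort")
    ranks = np.empty(len(v))
    ranks[order] = np.arange(1, len(v) + 1)
    for value in np.unique(v):
        tied = v == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman_matrix(pi_matrix):
    """Spearman correlation between the rows (gene sets) of an R x n PI matrix."""
    ranked = np.vstack([rank_with_ties(row) for row in pi_matrix])
    # np.corrcoef treats each row as one variable.
    return np.corrcoef(ranked)

# Made-up PIs for 4 gene sets (rows) and 4 test patients (columns).
pi_matrix = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 8.0, 27.0, 64.0],   # monotone transform of row 0
    [4.0, 3.0, 2.0, 1.0],     # reversed ordering
    [1.0, 3.0, 2.0, 4.0],     # one swapped pair
])
rho = spearman_matrix(pi_matrix)
print(np.round(rho, 2))
```

Because only ranks enter the calculation, two gene sets whose PIs differ in scale but order the patients identically have a correlation of 1, which is why the measure suits PIs with very different standard deviations.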

Figure 5.

Systemic recurrence: Kaplan-Meier plot of the PI-risk groups for each of the individual gene sets.

The Kaplan-Meier curves and the associated logrank p values for dichotomized PI-risk groups from each of the 9 converged gene-set models.
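
For readers who want to see how a two-group log-rank statistic and its p value are obtained, the following is a minimal sketch. It uses the standard hypergeometric variance at each event time and converts the χ² statistic with df = 1 to a p value through the complementary error function. The survival data are invented and the function name is hypothetical; it is not the authors' implementation.

```python
import math
import numpy as np

def logrank_two_groups(time, event, group):
    """Log-rank test for two groups.

    time: follow-up times; event: 1 if the event was observed, 0 if censored;
    group: 1 for the high-risk group, 0 for the low-risk group.
    Returns (chi2, p_value) with one degree of freedom.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    group = np.asarray(group, dtype=bool)
    observed_minus_expected = 0.0
    variance = 0.0
    for t in np.unique(time[event]):
        at_risk = time >= t
        n = at_risk.sum()               # patients still at risk
        n1 = (at_risk & group).sum()    # of those, in the high-risk group
        failed = event & (time == t)
        d = failed.sum()                # events at time t
        d1 = (failed & group).sum()     # of those, in the high-risk group
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = float(observed_minus_expected ** 2 / variance)
    # Upper tail of a chi-square distribution with df = 1.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Invented example: three early events in the high-risk group, three later ones in the low-risk group.
time = [1, 2, 3, 4, 5, 6]
event = [1, 1, 1, 1, 1, 1]
group = [1, 1, 1, 0, 0, 0]
chi2, p_value = logrank_two_groups(time, event, group)
print(chi2, p_value)
```

In practice the `group` vector would come from dichotomizing each gene set's PIs, so the same function could be applied once per gene set.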

Figure 6.

Systemic recurrence: PCA of predicted PIs from converged gene sets and performance of the resulting groups from PC1.

Results for systemic recurrence are in (A–B); for BC specific death in (C–D). (A) Scatter plot of predicted PIs from the 9 converged gene-set models in the space of the two leading PCs. Black circles indicate censored observations; red dots indicate patients with relapse. (B) The Kaplan-Meier curves for the high- and low-risk groups are significantly different (χ² = 8.76, df = 1, p = 0.003). (C) Scatter plot of predicted PIs from the 10 converged gene-set models in the space of the two leading PCs. Black circles indicate censored observations; red dots indicate patients with BC specific death; brown stars indicate death from other causes. (D) The Kaplan-Meier curves for the high- and low-risk groups are significantly different (χ² = 10.26, df = 1, p = 0.001).

Figure 7.

Univariate comparison of predictors for systemic recurrence.

Comparison of the combined-PI risk predictor with clinical parameters and individual gene-set predictors using univariate Cox models. (A) The Y axis indicates the C-index associated with each predictor and the X axis indicates the p values (on a minus log10 scale) from the likelihood ratio test in the univariate Cox model. C-index = 0.5 and the significance level α = 0.05 for the likelihood ratio test are indicated by dotted lines. The size and the color of each bubble indicate the PVE and the deviance in the univariate Cox model, respectively. The combined-PI risk predictor had the most significant marginal effect for predicting systemic recurrence (p = 0.003). It was associated with the second highest C-index score (C = 0.75), following TP53 mutation status (C = 0.76). It had the second highest deviance (8.61), following tumor size (9.36), and the combined-PI predictor alone explained 10.6% of the variability as indicated by PVE, following tumor size (11.7%) and stage (11.1%). (B) The X axis indicates the HR from the univariate Cox model; the 95% CIs are shown along with the point estimates. “LR test” stands for likelihood ratio test. Insignificant predictors (likelihood ratio test p > 0.05) are grayed out. To keep the results interpretable, only predictors with two levels are compared. The combined-PI risk predictor had the second largest HR (2.82 with 95% CI 1.37–5.80), following TP53 mutation status (2.87 with 95% CI 1.42–5.83).
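
The C-index on the Y axis of panel A can be illustrated with a short sketch of Harrell's concordance index, assuming that is the variant used. A pair of patients is comparable when the one with the shorter follow-up had an observed event; the pair is concordant when that patient also has the higher predicted risk, and ties in predicted risk count as one half. The data and the function name are hypothetical.

```python
def concordance_index(time, event, risk):
    """Harrell's C: fraction of comparable patient pairs ordered correctly by risk."""
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # Pair (i, j) is comparable if i had an event strictly before j's follow-up time.
            if event[i] and time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5   # tied predictions count as half
    return concordant / comparable

# Invented follow-up data for four patients; the third is censored.
time = [1, 2, 3, 4]
event = [1, 1, 0, 1]

c_perfect = concordance_index(time, event, [4, 3, 2, 1])   # risk falls as survival lengthens
c_reversed = concordance_index(time, event, [1, 2, 3, 4])  # risk ordered the wrong way
c_binary = concordance_index(time, event, [1, 1, 0, 0])    # two-level predictor, as in a risk group
print(c_perfect, c_reversed, c_binary)
```

A two-level predictor such as a dichotomized risk group produces many tied pairs, which pulls its C-index toward 0.5 relative to a continuous score with the same ordering.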

Table 6.

PCA-combined PI risk predictor in univariate and multivariate Cox regression.
