Fig 1.
Schematic diagram of this study.
a Definition of gene signatures to recapitulate the pathways underlying driver genomic aberrations. Here we use three genomic aberrations (TP53 mutation, ERBB2 amplification and ATM deletion) as examples. In ER+ breast cancer, we defined a total of 72 gene signatures, each for a specific genomic aberration. b To predict patient prognosis in ER+ breast cancer, we constructed prediction models to integrate the 72 gene signatures (intrinsic features), 6 types of infiltrating immune cells (extrinsic features), and clinical factors (e.g., age, tumor stage). Gene signature scores and immune cell scores were calculated based on gene expression of tumor samples. Random Forest models were used to classify good versus poor prognosis, and Cox regression models were used to predict prognostic risk scores.
Table 1.
Summary of features in each model discussed in the manuscript.
Fig 2.
Gene signatures recapitulate the downstream pathways of mutated driver genes.
a Spearman correlation coefficients (Correlation) between different gene signatures defined based on the TCGA ER+ breast cancer data. b-e Signature scores distinguish ER+ breast cancer samples with TP53_mut (b), ERBB2_amp (c), PIK3CA_mut (d), and GATA3_mut (e) from samples without the aberrations. Panels b and c are based on the Curtis data, d on GSE41994, and e on GSE101780. f, g ROC curves showing that the TP53_mut (f) and ERBB2_amp (g) signature scores predict the status of their respective driver genomic aberrations.
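As an illustration of the ROC analysis in panels f and g, the area under the curve can be computed directly from signature scores and aberration status. The sketch below is not the authors' code: the function name `roc_auc` and the toy scores and labels are hypothetical, and it uses the rank-based definition of AUC (the probability that a randomly chosen aberration-positive sample scores higher than a randomly chosen negative one).

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive sample outscores
    a random negative sample; tied scores count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up): signature scores and aberration status (1 = present).
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to a score that carries no information about aberration status, and 1.0 to perfect separation.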
Fig 3.
Gene signatures and immune infiltration scores predict patient prognosis.
a-c Signature scores for TP53_mut (a), TNFRSF17_amp (b), and MemB (c) distinguish patients with good and poor prognosis. d ROC curves showing that the TP53_mut, TNFRSF17_amp, and MemB infiltration scores predict prognosis at a level comparable to Onco-score. e AUC scores of random forest models with different combinations of predictive features. Our Sig+Imm model achieves higher AUC scores than the Onco-score models with and without clinical features. f Relative importance of the 20 most important genomic aberration and immune infiltration features.
Fig 4.
Optimized model outperforms Oncotype DX risk scores for prognostic prediction.
a Results from backward selection to find an optimized set of features: the AUC score of the model is plotted as a function of the number of features removed. The optimized model is the one with the highest AUC score at the smallest number of features. b ROC curves comparing the performance of our optimized model with the Onco-score + Cli model and the Cli-only model, showing that our optimized model outperforms both Oncotype DX and clinical features. c ROC curves of our optimized model when trained in Curtis discovery and validated in Curtis validation, and vice versa. d ROC curves of our optimized model when trained in Curtis discovery and validated in the test dataset (the Ur-Rehman dataset), and vice versa. e ROC curves of the Onco+Cli model trained and validated in the same way as (d), showing decreased performance compared to our optimized model.
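The selection rule described for panel a (highest AUC at the smallest number of features) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pick_optimized_step`, the optional tolerance `tol`, and the AUC values are all hypothetical.

```python
def pick_optimized_step(auc_by_removed, tol=0.0):
    """Pick the backward-selection step to keep.

    auc_by_removed[k] is the model AUC after k features have been removed.
    Among steps whose AUC is within `tol` of the best AUC, return the one
    with the most features removed (the smallest remaining feature set).
    `tol` is a hypothetical knob; tol=0.0 means ties only.
    """
    best = max(auc_by_removed)
    candidates = [k for k, a in enumerate(auc_by_removed) if a >= best - tol]
    return max(candidates)


# Toy AUC trajectory (made up): index = number of features removed.
aucs = [0.70, 0.72, 0.74, 0.74, 0.71]
step = pick_optimized_step(aucs)
print(step)
```

Here steps 2 and 3 share the top AUC, so the rule keeps step 3, which uses one fewer feature.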
Fig 5.
Individual signature and immune infiltration scores can identify prognostic patient groups.
Signature and immune infiltration scores fitted in a univariate Cox proportional hazards model without clinical adjustment (a) and with clinical adjustment (b) are associated with survival time. Patients dichotomized by their TP53_mut (c), TNFRSF17_amp (d), and MemB (e) scores differ significantly in survival. f Risk groups dichotomized by the median Onco-score show lower or comparable significance relative to some of our signature and immune infiltration scores.
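A median split such as the one described for the Onco-score in panel f can be sketched as below. The function name `dichotomize_by_median`, the patient identifiers, and the score values are hypothetical, and assigning scores equal to the median to the "low" group is an assumption; the legend does not state how such ties are handled.

```python
from statistics import median


def dichotomize_by_median(scores):
    """Split patients into 'high' and 'low' groups at the median score.

    `scores` maps a patient ID to a numeric score. Scores equal to the
    median go to 'low' (an assumption made for this sketch).
    """
    cutoff = median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}


# Toy scores (made up) for four hypothetical patients.
toy_scores = {"P1": 0.2, "P2": 0.8, "P3": 0.5, "P4": 0.9}
groups = dichotomize_by_median(toy_scores)
print(groups)
```

The resulting group labels would then be passed to a survival comparison between the two groups, which is outside the scope of this sketch.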
Fig 6.
Optimized Cox regression model for prognostic risk prediction.
a Patients in the Curtis validation dataset are significantly separated into risk groups by the optimized model trained in the Curtis discovery dataset. b Patients remain significantly separated into risk groups when clinical variables are removed from the optimized model. c Onco class achieves slightly lower performance than our optimized model without clinical information in grouping patients into risk categories. d-f Our optimized model further stratifies the Onco high (d), intermediate (e), and low (f) risk classes.