Fig 1.
(a) Using a gene network as background knowledge (lower left), NetNorM normalises each mutation profile in a collection of somatic mutation profiles (upper left) into a new, binary representation (right) which encodes additional information about the tumours’ overall mutational burden and the mutational burden in the neighbourhood of hub genes. This new representation can then be used for patient stratification with unsupervised clustering techniques, or for survival analysis. (b) NetNorM normalises every patient mutation profile to k mutations. Patients with fewer than k mutations get ‘proxy’ mutations in the genes with the highest number of mutated neighbours until they reach k mutations. Patients with more than k mutations have mutations ‘removed’ from the genes with the lowest degree until they reach k mutations.
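The procedure in panel (b) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `netnorm`, the dictionary-of-sets network format and the toy network are hypothetical. The sketch also assumes that ties are broken alphabetically by gene name and that mutated neighbours are counted once on the raw profile, neither of which is specified in the caption.

```python
def netnorm(mutated, network, k):
    """Normalise one patient's mutation profile to k mutated genes.

    mutated: set of genes mutated in the raw profile.
    network: dict mapping each gene to the set of its neighbours.
    Tie-breaking by gene name is an assumption made for this sketch.
    """
    mutated = set(mutated)
    if len(mutated) > k:
        # Too many mutations: keep the k mutated genes with the highest
        # degree, i.e. 'remove' mutations from the lowest-degree genes.
        ranked = sorted(mutated, key=lambda g: (-len(network.get(g, ())), g))
        return set(ranked[:k])
    if len(mutated) < k:
        # Too few mutations: add 'proxy' mutations to the non-mutated genes
        # that have the most mutated neighbours in the raw profile.
        candidates = [g for g in network if g not in mutated]
        ranked = sorted(candidates,
                        key=lambda g: (-len(network[g] & mutated), g))
        return mutated | set(ranked[:k - len(mutated)])
    return mutated


# Toy undirected network (hypothetical): A is a hub linked to B, C and D.
network = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A"},
    "D": {"A", "E"},
    "E": {"D"},
}

low_burden = netnorm({"B", "C"}, network, k=3)             # A becomes a proxy
high_burden = netnorm({"A", "B", "D", "E"}, network, k=2)  # B and E are removed
```

In this toy case the hub A receives a proxy mutation because two of its neighbours are mutated, while the high-burden patient keeps only the two highest-degree mutated genes. If the network has fewer candidate genes than needed, this sketch returns fewer than k genes.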
Table 1.
Summary of the full exome mutation profiles used in this study.
We analysed a total of 3,278 samples from 8 cancer types, downloaded from the TCGA portal.
Fig 2.
Comparison of the survival predictive power of the raw mutation data, NSQN and NetNorM (with Pathway Commons as gene network) for 8 cancer types.
For each cancer type, samples were split 20 times into training and test sets (4 repetitions of 5-fold cross-validation). Each time, a sparse survival SVM was trained on the training set and the test set was used for performance evaluation. Asterisks indicate that the test CI is significantly different between 2 conditions (Wilcoxon signed-rank test, P < 5 × 10⁻² (*) or P < 1 × 10⁻² (**)).
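The splitting scheme described above (4 repetitions of 5-fold cross-validation, giving 20 train/test splits) can be sketched as follows. The helper name `repeated_kfold` and the seed are hypothetical. The sketch assumes plain random shuffling without stratification, which the caption does not specify.

```python
import random


def repeated_kfold(n_samples, n_folds=5, n_repeats=4, seed=0):
    """Return a list of (train_indices, test_indices) pairs.

    Each repeat reshuffles the samples and partitions them into n_folds
    test sets, so the list has n_folds * n_repeats entries.
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        for fold in range(n_folds):
            # Every n_folds-th shuffled sample goes to this fold's test set.
            test = indices[fold::n_folds]
            test_set = set(test)
            train = [i for i in indices if i not in test_set]
            splits.append((train, test))
    return splits


splits = repeated_kfold(n_samples=23)
```

Each of the 20 splits then yields one test-set performance value per method, and these paired values are what a signed-rank test would compare.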
Fig 3.
Effect of network randomisation on survival prediction performance.
(a-b) Performance obtained for 20 cross-validation folds with Pathway Commons (real network) and 10 randomised versions of Pathway Commons (randomised network) with NetNorM (left) and NSQN (right) for LUAD (a) and SKCM (b).
Fig 4.
Genes frequently selected in the survival prediction model for LUAD (left) and SKCM (right) learned using the NetNorM representation of mutations with Pathway Commons as gene network.
The genes reported are those that were selected at least 10 times in 20 cross-validation folds. For each cancer, genes are ordered from the most frequently selected (left) to the least frequently selected (right). The top panel reports the number of raw mutations in the selected genes (black), as well as the number of “proxy” mutations (red) and the number of mutations removed (blue) after application of NetNorM. The bottom panel reports the coefficients of a gene in the survival SVM model across the cross-validation folds where this gene was selected. Gene names marked in red indicate proxy genes.
Fig 5.
(a) Comparison of survival prediction performance according to patients’ mutational burden for LUAD. Three different representations of the mutations are used to perform survival prediction using a ranking SVM: raw (the raw binary mutation data), NSQN (network smoothing with quantile normalisation) and NetNorM. Performance for the half of the patients with fewer (resp. more) mutations is derived from the predictions made using the whole dataset. (b) Scatter plot of the total number of mutations in a patient of the LUAD cohort (x-axis) against the number of mutated neighbours of KHDRBS1 in that patient (y-axis). Only patients with fewer than k_med = 295 mutations are shown, where k_med is the median value of k learned across cross-validation folds. Red (resp. blue) indicates patients mutated (resp. non-mutated) in KHDRBS1 after processing with NetNorM using k = k_med. The black line was fitted by linear regression and by definition indicates the expected number of mutated neighbours of KHDRBS1 given the mutational burden of a patient.
Fig 6.
Survival predictive power of mutation data (raw binary mutations, mutations preprocessed with NSQN or NetNorM with Pathway Commons), clinical data, and the combination of both for LUAD and SKCM.
The two data types were combined by averaging the predictions obtained with each data type separately. For both cancers, samples were split 20 times into training and test sets (4 repetitions of 5-fold cross-validation). Each time, a sparse survival SVM was trained on the training set and the test set was used for performance evaluation.
Fig 7.
Comparison of patient stratifications obtained with the raw mutation data, NSQN (Pathway Commons) and NetNorM (Pathway Commons) for 8 cancer types.
(a) Association of patient subtypes with survival time. One circle indicates P ≤ 0.05 and two concentric circles indicate P ≤ 0.01 (log-rank test). Cases where clusters were too unbalanced (95% of the patients in one single cluster) are not shown. (b) Evaluation of the clustering stability as measured by the proportion of ambiguous clustering (PAC). The transparency of the triangles indicates the percentage of patients in the largest cluster. The scale ranges from 100% (totally opaque) to (100/N)% (totally transparent), where N is the number of subtypes. Therefore opacity (resp. transparency) indicates unbalanced (resp. balanced) clusters. (c) Kaplan-Meier survival curves for NetNorM subtypes with significantly distinct survival outcomes. The legend gives the subtype number followed by the number of patients in the subtype.
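As an illustration of the stability measure in panel (b), the sketch below computes PAC from a consensus matrix as the fraction of sample pairs whose consensus value is neither clearly 0 nor clearly 1. The function name `pac` and the toy matrix are hypothetical. The thresholds 0.1 and 0.9 are a commonly used choice and an assumption here, since the caption does not state which interval was used.

```python
def pac(consensus, lower=0.1, upper=0.9):
    """Proportion of ambiguous clustering for a square consensus matrix.

    consensus[i][j] is the fraction of clustering runs in which samples
    i and j were assigned to the same cluster. A pair is counted as
    ambiguous when lower < consensus[i][j] <= upper.
    """
    n = len(consensus)
    # Use each unordered pair once (upper triangle, diagonal excluded).
    values = [consensus[i][j] for i in range(n) for j in range(i + 1, n)]
    if not values:
        raise ValueError("need at least two samples")
    ambiguous = sum(1 for v in values if lower < v <= upper)
    return ambiguous / len(values)


# Toy consensus matrix for 4 samples: sample 3 co-clusters with samples
# 0 and 1 only half of the time, so 2 of the 6 pairs are ambiguous.
toy = [
    [1.0, 1.0, 0.0, 0.5],
    [1.0, 1.0, 0.0, 0.5],
    [0.0, 0.0, 1.0, 1.0],
    [0.5, 0.5, 1.0, 1.0],
]
toy_pac = pac(toy)
```

A lower PAC means that most pairs are consistently clustered together or consistently apart, which corresponds to a more stable clustering.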
Fig 8.
Effect of network randomisation on patient stratification.
Log-rank statistic obtained with Pathway Commons (curve) and 10 randomised versions of Pathway Commons (boxplots) with NetNorM (blue) and NSQN (orange) for HNSC, OV, KIRC and SKCM. One circle indicates P ≤ 5 × 10⁻² and two concentric circles indicate P ≤ 1 × 10⁻².
Fig 9.
Characterisation of LUAD patient subtypes obtained with NetNorM (N = 5 groups, k = 315, Pathway Commons).
(a) Kaplan-Meier survival curves for NetNorM subtypes with significantly distinct survival outcomes. The legend gives the subtype number followed by the number of patients in the subtype. (b) Metapatients matrix obtained by applying NMF to mutation profiles processed with NetNorM. The matrix shown is restricted to the genes with the highest variance across metapatients. The genes (columns) are clustered via hierarchical clustering. Clusters are numbered from 1 to 20 from left to right. (c) Distribution of gene replication times across gene clusters. (d) A χ² contingency test was performed for each gene cluster to test its enrichment (or depletion) in mutations across patient subtypes given the subtypes’ marginal number of mutations. The value represents the contribution of a subtype to the test statistic, and the colour indicates an enrichment (red) or a depletion (blue) in mutations. (e) Distribution of patients’ total number of (raw) mutations across patient subtypes.
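The per-subtype values in panel (d) can be illustrated with the following sketch of a χ² contingency computation. It is one plausible reading of the caption, not the authors' code. It assumes a 2 × (number of subtypes) table of mutations inside versus outside a gene cluster, and it assumes a subtype's contribution is the sum of its two cells. The function name and the counts are hypothetical.

```python
def subtype_contributions(in_cluster, totals):
    """Per-subtype contribution to a chi-square contingency statistic.

    in_cluster[s]: number of mutations of subtype s falling in the gene cluster.
    totals[s]: total number of mutations of subtype s (the marginal).
    Returns a list of (contribution, direction) pairs. Assumes the cluster
    holds some but not all of the mutations, so no expected count is zero.
    """
    overall_fraction = sum(in_cluster) / sum(totals)
    results = []
    for obs_in, total in zip(in_cluster, totals):
        # Expected counts under independence, given the subtype's marginal.
        exp_in = total * overall_fraction
        exp_out = total - exp_in
        obs_out = total - obs_in
        contribution = ((obs_in - exp_in) ** 2 / exp_in
                        + (obs_out - exp_out) ** 2 / exp_out)
        if obs_in > exp_in:
            direction = "enriched"
        elif obs_in < exp_in:
            direction = "depleted"
        else:
            direction = "neutral"
        results.append((contribution, direction))
    return results


# Hypothetical counts: two subtypes with 100 mutations each, of which
# 30 and 10 respectively fall in the gene cluster of interest.
contributions = subtype_contributions(in_cluster=[30, 10], totals=[100, 100])
```

Summing the contributions over subtypes gives the χ² statistic for that gene cluster, and the direction corresponds to the red (enriched) or blue (depleted) colouring.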
Fig 10.
Exploring NSQN and NetNorM performance levers.
(a) Subtypes log-rank statistic obtained for LUAD (left) and SKCM (right). One circle indicates P ≤ 5 × 10⁻² and two concentric circles indicate P ≤ 1 × 10⁻² (log-rank test). (b) Consensus clustering matrices for LUAD. (c) Survival prediction performance for LUAD (left) and SKCM (right). (d) Confusion matrices for LUAD (top) and SKCM (bottom) comparing the subtypes obtained with NSQN and SimpNSQN on the one hand, and NSQN and NetNorM on the other hand. (a, b, c, d) were obtained with Pathway Commons.