Fig 1.
(A) TumE integrates a generative sampling process and stochastic simulation of cancer evolution to build synthetic variant allele frequency (VAF) distributions that are well-specified with respect to data observed in bulk sequenced tumour biopsies. Assuming copy-neutral diploid regions of tumour genomes, the generative sampling process uses the observation that neutral VAF distributions can be described by a power-law or Pareto neutral ‘tail’ [16,30] in addition to a dispersed clonal peak. By sampling empirically valid Pareto distributions, realizations of the null hypothesis of neutral evolution encoded in the VAF distribution can be created rapidly (Methods). A stochastic branching process model of tumour evolution is then used to link parameters and latent states relevant to positive subclonal selection back to VAF distributions (Methods). (B) Synthetic supervised learning uses neural networks capable of handling the complete dimensionality of the simulated VAF distributions, x, to solve the inverse problem of identifying the evolutionary parameters and latent states, y, assigned to each synthetic VAF distribution. (C) Model uncertainty is then quantified using a computationally efficient form of Bayesian deep learning called Monte Carlo dropout [26]. Approximate posteriors are generated by performing T stochastic passes through the trained neural network.
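To make the generative sampling step in panel A concrete, the following is a minimal sketch, not the authors' implementation. It assumes a neutral tail whose mutation density scales as 1/f^2 (a Pareto distribution with shape 1), a clonal peak at a true frequency of 0.5, and plain binomial read sampling at a fixed depth. The function name `sample_neutral_vafs` and all default parameter values are hypothetical, and the dispersion of the clonal peak is deliberately simplified.

```python
import random


def sample_neutral_vafs(n_tail, n_clonal, depth=100, f_min=0.05, f_max=0.5, seed=0):
    """Sketch: observed VAFs for a neutrally evolving, copy-neutral diploid tumour."""
    rng = random.Random(seed)

    def observe(true_freq):
        # Binomial read sampling at fixed depth (assumption: no overdispersion).
        alt_reads = sum(rng.random() < true_freq for _ in range(depth))
        return alt_reads / depth

    vafs = []
    # Neutral tail: Pareto with shape 1 and scale f_min, truncated at f_max by rejection.
    while len(vafs) < n_tail:
        f = f_min * rng.paretovariate(1.0)
        if f < f_max:
            vafs.append(observe(f))
    # Clonal peak: heterozygous clonal mutations at a true frequency of 0.5.
    vafs.extend(observe(0.5) for _ in range(n_clonal))
    return vafs


vafs = sample_neutral_vafs(n_tail=200, n_clonal=50)
```

The first 200 values form the low-frequency tail and the last 50 cluster around 0.5; a histogram of `vafs` gives one realization of the neutral null in this simplified setting.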
Fig 2.
(A) In a cohort of 2.8 million synthetic tumours, TumE outperformed all commonly used population genetic [20,21] and cancer evolution-specific [7,12] summary statistics at differentiating between positive selection and neutral evolution, based on AUROC (two-sided Wilcoxon test). (B) For predicting the true frequency of selected subclones, TumE provides comparable or better performance relative to the current state-of-the-art mixture model MOBSTER [16], which properly accounts for neutral dynamics in tumour populations. The panel shows the correlation between the true and predicted subclone frequency in 80,000 synthetic tumours sequenced at 150x mean sequencing depth. (C) In an orthogonal dataset of 150 synthetic tumours [16] with either 0 or 1 detectable subclones, TumE was significantly faster at estimating the number of subclones (two-sided Wilcoxon test) than the existing mixture-model-based methods sciClone [24] and MOBSTER [16] (measured in inference time per sample). In addition, only TumE and MOBSTER consistently identified the correct number of subclones, as both methods directly account for the neutral dynamics observed in tumour populations. (D) TumE estimates in a synthetic tumour sequenced at 120x mean sequencing depth with a subclone at 54% cellular fraction.
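For readers unfamiliar with the metric in panel A, AUROC can be read as the probability that a randomly chosen tumour under selection receives a higher score than a randomly chosen neutral tumour. The sketch below computes it by direct pairwise comparison; the function name `auroc` and the toy scores are illustrative only and are not taken from the study.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count pairwise wins of positives over negatives, with half credit for ties.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: label 1 = positive selection, label 0 = neutral evolution.
toy_scores = [0.9, 0.8, 0.2, 0.1]
toy_labels = [1, 1, 0, 0]
toy_auroc = auroc(toy_scores, toy_labels)
```

The pairwise form is quadratic in the number of samples, so for cohorts of millions of tumours a rank-based implementation would be the practical choice.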
Fig 3.
TumE estimates in deep whole-genome or whole-exome sequenced tumour biopsies.
(A) A deep-sequenced primary acute myeloid leukemia (AML) sample from Griffith et al. [31]. TumE estimated two subclones, a neutral tail, and a clonal peak. P(Selection) indicates the probability of selection. P(0, 1, 2 subclone) indicates the probability estimate for the number of subclones. Each probability estimate is provided with the 89% equal-tailed interval generated from 50 Monte Carlo dropout samples. A sample is labeled as under positive selection if the lower bound of the 89% interval for P(Selection) is above 0.5, and a number of subclones is assigned to a sample if the lower bound of the 89% interval for the corresponding probability is greater than 0.5 (Methods). Subclone frequency estimates are shown with the complete approximate posterior. (B) A deep-sequenced breast adenocarcinoma from the pan-cancer analysis of whole genomes (PCAWG) [11]. TumE estimated two subclones, a neutral tail, and a clonal peak. (C) We applied TumE to a single mismatch repair deficient (MMR) gastro-esophageal tumour [33] sequenced across 5 spatially distinct regions. We first identified an intermediate-frequency subclone in region P with TumE. (D) Under the hypothesis that TumE could reveal the fixation process of region P subclones in other regions, we annotated each of the remaining regions with the clonal, subclonal, and neutral tail mutations identified in region P. We identified ongoing subclonal selection in 2 of the 4 remaining regions (N and T), consistent with an increase in frequency of subclonal and neutral tail mutations from region P. In cases where neutral evolution was the most parsimonious explanation, we observed complete fixation of the region P subclonal mutations (regions AE and H).
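The calling rule described in panel A can be sketched as follows. This is an illustration of the stated rule rather than the authors' code: the function names are hypothetical, the quantiles use linear interpolation (an assumption, since the interpolation scheme is not specified here), and the input is assumed to be the list of probabilities returned by the stochastic forward passes.

```python
def equal_tailed_interval(samples, mass=0.89):
    """Equal-tailed interval covering `mass` of the Monte Carlo dropout samples."""
    xs = sorted(samples)

    def quantile(q):
        # Linear interpolation between order statistics (assumed scheme).
        pos = q * (len(xs) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    tail = (1.0 - mass) / 2.0
    return quantile(tail), quantile(1.0 - tail)


def call_selection(p_selection_samples, threshold=0.5):
    """Label positive selection only if the interval's lower bound exceeds the threshold."""
    lower, _ = equal_tailed_interval(p_selection_samples)
    return lower > threshold


def call_subclone_count(samples_by_count, threshold=0.5):
    """Return the subclone count whose interval lower bound exceeds the threshold, else None."""
    for count, samples in samples_by_count.items():
        lower, _ = equal_tailed_interval(samples)
        if lower > threshold:
            return count
    return None
```

With 50 dropout samples, `call_selection([0.9] * 50)` is true, whereas a set in which ten of the 50 samples sit at 0.4 is not called, because the 5.5% quantile falls below 0.5. Requiring the lower bound, rather than the mean, to clear the threshold makes the call conservative when the dropout samples disagree.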
Fig 4.
Comparison of subclone detection methods in 78 PCAWG samples with mean effective sequencing depth (purity * depth) of greater than 60x.
(A) Estimates of the number of subclones using different methods, including MOBSTER [16], sciClone [24], and TumE. (B) Agreement and disagreement between existing methods in the analyzed PCAWG samples. (C) Comparison of the estimated subclone frequency in PCAWG samples versus simulated data across different methods. Consistent with simulated data where each method properly identified the presence of 1 subclone, TumE consistently captures higher frequency subclones when compared to MOBSTER. (D–F) Visualization of the 3 samples where TumE and MOBSTER each detected 1 subclone. sciClone fits are provided for comparison. Single-nucleotide variants (SNV) with annotations in known driver genes [8,16] were labeled if present.
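The inclusion criterion for this comparison, a mean effective sequencing depth (purity * depth) greater than 60x, can be expressed as a simple filter. The sketch below is illustrative only: the record layout with `purity` and `depth` keys and the sample values are hypothetical and are not drawn from PCAWG.

```python
def effective_depth(purity, mean_depth):
    """Effective sequencing depth: tumour purity multiplied by mean depth."""
    return purity * mean_depth


def select_samples(samples, min_effective_depth=60.0):
    """Keep samples whose effective depth is strictly greater than the cutoff."""
    return [
        s for s in samples
        if effective_depth(s["purity"], s["depth"]) > min_effective_depth
    ]


# Hypothetical records for illustration.
example_samples = [
    {"id": "sample_a", "purity": 0.8, "depth": 100},  # effective depth 80x, kept
    {"id": "sample_b", "purity": 0.5, "depth": 100},  # effective depth 50x, dropped
    {"id": "sample_c", "purity": 0.9, "depth": 70},   # effective depth 63x, kept
]
kept_ids = [s["id"] for s in select_samples(example_samples)]
```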
Fig 5.
(A) Transfer learning approach utilizing ‘renovated’ pre-trained neural networks for alternative evolutionary inference tasks in tumour cellular populations. TEMULATOR is an alternative cancer evolution simulator that generates synthetic tumour sequencing data by deterministically initiating subclones at user-specified fitnesses and time points [38]. (B) Pre-trained models provide significant reductions in testing loss, relative to non-pretrained models, when updating neural network weights on a reduced dataset of 500,000 synthetic VAF distributions (~1.25% of the total dataset size used to originally train TumE). (C) TumE transfer (TumE-T) effectively recovers evolutionary parameters from TEMULATOR simulations (75–200x mean sequencing depth, 100% tumour purity) with mean and median percentage errors less than 10% in all cases. A full description of performance across variable sequencing depths, mutation rates, and subclone frequencies is provided in S23 Fig. (D) We find consistency between the subclone cellular fraction estimated by TumE-T and the subclone frequency (cellular fraction / 2) estimates generated from TumE, indicating that nearly identical tasks are easily transferred through pre-training. (E) Per genome per division mutation rate estimates in 88 WES and WGS samples from von Loga et al. [33] (MMR-GE = mismatch repair deficient gastro-esophageal cancer), Griffith et al. [31] (AML = acute myeloid leukemia), and PCAWG [11]. (F) Subclone fitness (1+s) estimates (relative growth rate advantage of subclone over background population) and (G) subclone emergence time estimates in 36 tumour biopsies identified with 1 subclone in the PCAWG data. Subclone fitness and emergence time estimates were scaled to a final tumour population size of 10^10 cells, similar to [7]. PCAWG sample identifiers are provided on the x-axis.