
The authors have declared that no competing interests exist.

Quantification of the effect of spatial tumour sampling on the patterns of mutations detected in next-generation sequencing data is largely lacking. Here we use a spatial stochastic cellular automaton model of tumour growth that accounts for somatic mutations, selection, drift and spatial constraints, to simulate multi-region sequencing data derived from spatial sampling of a neoplasm. We show that the spatial structure of a solid cancer has a major impact on the detection of clonal selection and genetic drift from both bulk and single-cell sequencing data. Our results indicate that spatial constraints can introduce significant sampling biases when performing multi-region bulk sampling and that such bias becomes a major confounding factor for the measurement of the evolutionary dynamics of human tumours. We also propose a statistical inference framework that incorporates spatial effects within a growing tumour and so represents a further step forwards in the inference of evolutionary dynamics from genomic data. Our analysis shows that measuring cancer evolution using next-generation sequencing while accounting for the numerous confounding factors remains challenging. However, mechanistic model-based approaches have the potential to capture the sources of noise and better interpret the data.

Sequencing the DNA of cancer cells from human tumours has become one of the main tools to study cancer biology. However, sequencing data are complex and often difficult to interpret. In particular, the way in which the tissue is sampled and the data are collected impact the interpretation of the results significantly. We argue that understanding cancer genomic data requires mechanistic mathematical and computational models that tell us what we expect the data to look like, with the aim of understanding the impact of confounding factors and biases in the data generation step. In this study, we develop a spatial computational model of tumour growth that also simulates the data generation process, and demonstrate that biases in the sampling step and current technological limitations severely impact the interpretation of the results. We then provide a statistical framework that can be used to start overcoming these biases and more robustly measure aspects of the biology of tumours from the data.

Cancer is an evolutionary process fuelled by genomic instability and intra-tumour heterogeneity (ITH) [

The problem is usually tackled by performing subclonal deconvolution of the samples to separate the different subpopulations [

Here, we study how the spatial constraints of a growing tumour impact our ability to infer cancer evolutionary dynamics. We combine explicit spatial evolutionary modelling with synthetic generation of multi-region bulk and single-cell data, thus providing a generative framework in which we know the evolutionary trajectories of all cells in a tumour and can examine the genomic patterns that emerge from the sampling experiment. We show that spatial constraints, stochastic spatial growth and sampling biases can have unexpected effects that confound both the interpretation and inference of the perceived evolutionary dynamics from cancer sequencing data. We also present a statistical inference framework that begins to account for some of these confounding factors and recover aspects of the cancer evolutionary dynamics from various types of multi-region sequencing data as well as single-cell data.

Here we develop and analyse a stochastic spatial cellular automaton model of tumour growth that incorporates cell division, cell death, random mutations and clonal selection (Material and Methods). Each tumour simulation starts with a single ‘transformed’ cell in the centre of either a 2D or a 3D lattice, and we model the resulting expansion of this first cancer cell. All events, such as cell proliferation, death, mutation and selection are modelled according to a Gillespie algorithm [

In our model we introduce a mutant at a given time t (blue = background clone; red = mutant subclone; shade is proportional to the number of generations the cell has gone through).

We also model ‘boundary driven’ growth, where only cells that are sufficiently close to the border of the tumour can proliferate. Other cells may remain ‘imprisoned’ in the centre of the tumour unable to proliferate because of the lack of empty space around them. Boundary-driven growth has been observed experimentally [

At each division, a cell has a certain probability to acquire additional somatic mutations, modelled with a Poisson distribution, with mean

Importantly, our spatial model of tumour growth allows for the simulation of tissue sampling and genomic data generation. For instance, we can simulate the collection of punch biopsies, where spatially localised chunks of tumour are collected (

We previously showed, using a non-spatial stochastic branching process model of tumour growth, that assuming a well-mixed population and exponential growth, the expected VAF distribution of subclonal mutations in cancer under neutral growth follows a power-law with a 1/f^{2} scaling behaviour, where 1/f^{2}-like neutral subclonal tail can be observed in all samples of 1/f^{2}-like tail remains in the VAF frequency spectrum of all samples, as a consequence of within-clone neutral dynamics that remain on-going throughout the tumour’s growth [

For each representative simulation of spatial constraints in ^{2} distribution corresponding to neutral evolutionary dynamics [^{2} distribution (^{2} neutral tail, which in this case without cell death was 10 mutations per division (~10^{−9} mutations/bp/division). This was correctly recovered in all samples from

In the case of homogeneous growth with subclonal selection (^{2}-like tail resulting from the within-clone accumulation of passenger mutations remains in the frequency spectrum [

This initial spatial analysis produced similar results to our previous well-mixed non-spatial models [^{2} scaling form within most of the detectable frequency range (f>5%), although at low frequency deviations are expected [

However, because the population is no longer homogeneously distributed, this can lead to significant spatial bias, causing over- or under-representation of mutations in the VAF distributions solely due to spatial effects and not because of selection. This causes deviations from the neutral expectation of the mutant allele distributions that risk being wrongly interpreted as the consequence of on-going subclonal selection, as in

If we combine boundary driven growth and subclonal selection the situation is further complicated: selective effects are now modulated by spatial constraints. In some cases, the selected mutant emerges and remains directly at the front of tumour growth. In this scenario the outgrowth caused by its selective advantage is amplified further just because it occurred at the growing front (

We then looked at the pairwise VAF distributions between samples. The amount of subclonal mutations scattered through the frequency spectrum (

For each of the representative cases:

Most of the confounding factors we have described so far result from the limitations of bulk sequencing, where the genomes of many cells are convolved within samples. Single-cell sequencing does not suffer from this particular limitation and promises high-resolution cancer evolutionary analysis devoid of the drawbacks of bulk sequencing [

To examine the effect of single cell sequencing, we simulated whole-genome sequencing of 10 single cells taken at random from the tumour and reconstructed their phylogenetic relationship (^{2} tails (

However, as whole-genome mutational profiling of single cells is still difficult due to allele dropout [

Moreover, significant sampling bias is still apparent for single-cell sequencing when individual cells are not sampled uniformly at random from the whole tumour, but instead isolated in ‘clumps’ from different bulk samples. In

Whereas taking N random cells from a tumour greatly reduces sampling bias, this is often not how single cells from neoplasms are sampled. Often, small chunks of the tumour are dissected first and single cells are then isolated from those.

The spatial effects of drift and sampling bias one can observe are remarkable and represent a major challenge for the correct subclonal reconstruction of tumours growing in three-dimensional space. Due to the inherent complexity, analytical solutions to this problem that take space into account remain challenging, although some attempts to tackle this difficult question are being undertaken [

Here we devise a statistical inference framework, similar in spirit to what we previously proposed for well mixed populations [

We combined our model with a statistical inference framework (Approximate Bayesian Computation–Sequential Monte Carlo) in order to infer the evolutionary parameters of selection and growth from the data. We tested this framework on 34 synthetic (target) tumours for which we generated genomic data. Out of these 34 target cases, 13 were characterised by homogeneous growth with no cell death

Not surprisingly, the scenario with exponential homogeneous growth without cell death was the one where the evolutionary parameters were the easiest to recover, because spatial constraints were limited and the number of unknown parameters was lowest (

It is now widely accepted that tumour growth is governed by evolutionary principles. Thus, recovering the evolutionary histories of tumours is essential to understanding patient-specific tumour growth and treatment response. However, these analyses are inevitably based on limited information due to sampling biases, noise of known and unknown nature, and lack of time-resolved data, amongst many other factors. Despite these limitations, many approaches based on single sampling, multi-region bulk profiling, or single cell sequencing have been developed. Information from such data is often derived using purely statistical bioinformatics methods such as clustering analyses, without consideration of the confounding underlying influence of the cellular mechanics of tumour growth. Here we explicitly investigated spatial effects on the evolutionary interpretation of typical multi-region sequencing data of tumours. We found that the effects of sampling bias and spatial distributions of spatially intermixed cell populations critically depend on the mode of tumour growth as well as the details of the underlying sampling and data generation procedure. Most surprisingly, we could observe clusters of over-represented alleles in the VAF distribution of some tumour samples that were indistinguishable from positively selected subclonal populations, despite emerging solely due to the spatial distribution of cells. Such clusters vary depending on how one samples a tumour, and would therefore cause a major challenge for the evolutionary interpretation of cancer genomic data based on subclonal reconstruction.

We furthermore presented a Bayesian inference framework to recover evolutionary parameters from our spatial distributions. Evolutionary parameters such as strength of selection or mutation rates may be important surrogate measurements of evolvability, and hence linked to progression and treatment resistance, as it has been demonstrated for the rates of chromosomal instability [

Importantly, future versions of the model could help guide optimal sample collection that would minimise the spatial biases in the data. Due to the current technical limitations of these types of approaches, we are still far from direct application in the clinic. Additional effort should also be directed towards the use of measurements from other clinical data, such as imaging, where estimations of necrosis, for example, can help parameterise computational models. However, we argue it remains extremely important to understand the confounding factors and spatial biases we expect to find in the samples on which clinical decisions often need to be based. Mathematical modelling of cancer evolution is a growing field with a fast expanding repertoire of models and approaches [

We developed a computational stochastic model of spatial tumour growth that allows simulating different strategies of multi-region tissue sampling followed by synthetic generation of high-throughput sequencing data. We consider tumour cells as asexually reproducing individuals that die and divide with certain pre-defined probabilities. If

In addition to cell division, we also model mutation and selection, where the latter can change birth and/or death rates. We model somatic mutations acquired by each cell after division as a Poisson random variable – Pois(

To simulate tumour growth in space with these four stochastic events–birth, death, mutation and selection–we have used a modification of the Gillespie algorithm [

Specifically, the simulation framework works as follows:

Until a cell reaches a predefined grid boundary, repeat the following steps

Compute the reaction propensities according to the Gillespie algorithm. Each reaction event of birth (or death) has a functional form

If the next event is a cell division, we use a heuristic method to place the 2 daughter cells on the grid. We first replace the parent cell with the first daughter, and search for a suitable position to place the second daughter cell. We use a Von Neumann neighbourhood and check if any of the 8 (in 2D grid) neighbouring spots of the parent cell is empty; if one or more are, we locate the second cell in one of those spots at random. Otherwise, with a probability determined by a parameter

If the next event is cell death, we simply free the position allocated to the cell.

At the end of this step, we check if the clock is greater than the time of the next scheduled driver event t_{driver}; if it is, we convert a single wild type (WT) cell into a new mutant and increase its birth rate, or decrease its death rate. This will result in mutant cells having a proliferative advantage. To quantify the effect, we define the fitness
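The loop described above can be illustrated with a heavily simplified sketch. It is not the authors' implementation: it assumes a single clone with fixed birth and death rates, uses the 8 neighbouring sites mentioned in the text, and lets a fully surrounded cell simply skip division instead of pushing its neighbours. Mutation and driver events are left out, and all names (`grow_tumour`, `NEIGHBOURS`) are hypothetical.

```python
import math
import random

# Offsets of the 8 neighbouring sites on a 2D grid, as counted in the text.
NEIGHBOURS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]


def grow_tumour(grid_size=41, birth_rate=1.0, death_rate=0.1, seed=1):
    """Grow one clone from a single central cell until a cell reaches the grid edge.

    Returns (occupied, clock): the set of occupied (x, y) sites and the elapsed
    simulation time. The population can also go extinct, giving an empty set.
    """
    rng = random.Random(seed)
    centre = grid_size // 2
    cells = [(centre, centre)]      # list, so that a cell can be picked uniformly
    occupied = {(centre, centre)}   # set, for fast neighbour look-ups
    clock = 0.0
    edge = (0, grid_size - 1)

    while cells:
        # All cells share the same rates, so the total propensity is n * (b + d).
        total_rate = len(cells) * (birth_rate + death_rate)
        # Gillespie step: exponentially distributed waiting time to the next event.
        clock += -math.log(1.0 - rng.random()) / total_rate
        index = rng.randrange(len(cells))
        x, y = cells[index]

        if rng.random() < birth_rate / (birth_rate + death_rate):
            # Division: the first daughter keeps the parent's site, the second
            # goes to a random empty neighbouring site.
            empty = [(x + dx, y + dy) for dx, dy in NEIGHBOURS
                     if (x + dx, y + dy) not in occupied]
            if not empty:
                continue  # simplification: a fully surrounded cell does not divide
            site = rng.choice(empty)
            cells.append(site)
            occupied.add(site)
            if site[0] in edge or site[1] in edge:
                break  # a cell reached the predefined grid boundary
        else:
            # Death: free the position allocated to the cell.
            cells[index] = cells[-1]
            cells.pop()
            occupied.remove((x, y))

    return occupied, clock
```

Because every cell has the same rates here, choosing a cell uniformly and then choosing birth or death in proportion to the rates is equivalent to selecting the next reaction by its propensity. A model with several clones would need to weight cells by their individual rates.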

Squares, which are referred to in the paper as ‘punch biopsies’

Long thin rectangles that resemble a ‘needle biopsy’
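As a rough illustration of the two sample shapes, the sketch below selects cells from a 2D lattice by position. It assumes the tumour is stored as a dictionary mapping `(x, y)` grid positions to lists of mutation IDs. That layout, the function names and the default needle thickness are illustrative choices, not taken from the paper.

```python
def sample_region(cells, x0, y0, width, height):
    """Return the cells positioned inside [x0, x0 + width) x [y0, y0 + height)."""
    return {pos: muts for pos, muts in cells.items()
            if x0 <= pos[0] < x0 + width and y0 <= pos[1] < y0 + height}


def punch_biopsy(cells, x0, y0, side):
    """Square sample with the given side length."""
    return sample_region(cells, x0, y0, side, side)


def needle_biopsy(cells, x0, y0, length, thickness=2):
    """Long thin rectangle running along the x axis."""
    return sample_region(cells, x0, y0, length, thickness)


# Toy 10 x 10 tumour in which every site is occupied by a cell without mutations.
toy_tumour = {(x, y): [] for x in range(10) for y in range(10)}
punch = punch_biopsy(toy_tumour, 2, 2, side=3)
needle = needle_biopsy(toy_tumour, 0, 4, length=10, thickness=1)
```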

A bulk sample is a set of adjacent cells from the final tumour population. Each cell has a unique ID, a position on the grid and a list of somatic mutations. From the joined list of mutations of the cells sampled in a bulk, we can construct the Variant Allele Frequency (VAF) distribution as in a real sequencing experiment.
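A minimal sketch of how the noise-free allele frequencies could be computed from the joined mutation lists is shown below. It assumes diploid cells in which each mutation is carried on one allele, so a mutation present in every sampled cell has a VAF of 0.5. This ploidy assumption and the function name `true_vafs` are illustrative and not taken from the paper.

```python
from collections import Counter


def true_vafs(sampled_cells, ploidy=2):
    """Map each mutation ID to its allele frequency in the sampled cells.

    sampled_cells is a list with one collection of mutation IDs per cell.
    """
    # Count in how many cells each mutation occurs (once per cell at most).
    carriers = Counter(m for mutations in sampled_cells for m in set(mutations))
    n_alleles = ploidy * len(sampled_cells)
    return {m: count / n_alleles for m, count in carriers.items()}


# Four cells: mutation 1 is clonal, mutations 2 and 3 are subclonal.
bulk = [[1, 2], [1, 3], [1], [1, 2]]
frequencies = true_vafs(bulk)
```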

To construct a VAF distribution from a simulated bulk tumour sample, we mimic realistic next generation sequencing steps, specifically sequencing coverage and limits of detectability of low frequency mutations. We proceed as follows:

We generate (dispersed) coverage values for the input mutations by sampling a coverage from a Poisson distribution

Once we have sampled a depth value

This procedure guarantees that the generated read counts reflect the proportions of mutations in the simulated tumour. To model limits of detection of a mutation, after resampling a mutation, we discard it if the corresponding number of reads containing the variant allele is less than 5 (using the fixed coverage 100, which accounts for a ~0.05 minimum VAF).
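The sequencing step can be sketched as follows. The Poisson coverage and the cut-off of 5 variant reads follow the description above. Drawing the variant read count from a binomial distribution given the sampled depth is an assumption of this sketch, and the function names are hypothetical.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson variate with Knuth's method (fine for moderate means)."""
    limit = math.exp(-mean)
    k, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return k
        k += 1


def simulate_reads(true_vafs, mean_depth=100, min_alt_reads=5, seed=0):
    """Return {mutation_id: (alt_reads, depth)} for the detectable mutations."""
    rng = random.Random(seed)
    observed = {}
    for mutation_id, vaf in true_vafs.items():
        depth = poisson(rng, mean_depth)                     # dispersed coverage
        alt = sum(rng.random() < vaf for _ in range(depth))  # Binomial(depth, vaf)
        if alt >= min_alt_reads:                             # limit of detection
            observed[mutation_id] = (alt, depth)
    return observed


reads = simulate_reads({"clonal": 0.5, "subclonal": 0.2, "absent": 0.0})
observed_vafs = {m: alt / depth for m, (alt, depth) in reads.items()}
```

With a mean depth of 100, the cut-off of 5 variant reads corresponds to the minimum VAF of about 0.05 quoted above.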

We also performed single cell sequencing taking either random single cells across the whole tumour population, or from spatially structured biopsies (mimicking bulk tissue collection followed by single-cell isolation). We used the obtained single cells to construct maximum parsimony phylogenetic trees. In addition to single cell sequencing, we also model genotyping cells with a given list of mutations, corresponding to targeted sequencing of mutations found using e.g. exome or whole-genome sequencing. To implement this, we take one of the bulk samples as reference genotype and check for the presence of each individual mutation in a random set of 200 cells. Similarly, we use the obtained genotyped single cells to infer phylogenetic trees and check how much the genotyped trees differ from the single cell trees.

Due to the complexity captured by our spatial model of tumour growth, we do not have explicit formulas for the stationary probabilities of the stochastic process, and hence cannot derive a likelihood function. Thus, we have to use likelihood-free methods to perform statistical inference on the parameters and compute the posterior distribution of the parameters

Here we use Approximate Bayesian Computation (ABC) [

There are different approaches to implementing ABC; the simplest is rejection sampling. More advanced implementations such as ABC with Markov Chain Monte Carlo (MCMC) can result in significant increases in efficiency. In our paper we first implemented a simple rejection-sampling algorithm, and then added Monte Carlo simulation techniques to speed up convergence. The simple ABC rejection-sampling algorithm consists of the following steps:

1. Sample parameter vector

2. Run the model with the given parameter set and generate the synthetic dataset

3. Evaluate the distance between the simulated dataset and the target data

4. If the distance is less than a desired threshold, accept the parameters.

5. Return to step 1 and repeat until
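The rejection scheme above can be sketched generically. The example below is a toy, not the tumour model: it infers the success probability of a coin from the observed fraction of successes. The stopping rule (a fixed number of accepted parameters with a cap on the number of trials) and all names are assumptions made for illustration.

```python
import random


def abc_rejection(simulate, distance, sample_prior, target, threshold,
                  n_accept, rng, max_trials=100_000):
    """Collect parameters whose simulated data fall within threshold of target."""
    accepted = []
    trials = 0
    while len(accepted) < n_accept and trials < max_trials:
        trials += 1
        theta = sample_prior(rng)                    # sample a parameter
        data = simulate(theta, rng)                  # run the model
        if distance(data, target) < threshold:       # compare with the target
            accepted.append(theta)                   # accept the parameter
    return accepted


def coin_fraction(theta, rng, n=200):
    """Toy simulator: fraction of successes in n Bernoulli(theta) trials."""
    return sum(rng.random() < theta for _ in range(n)) / n


rng = random.Random(42)
posterior_sample = abc_rejection(
    simulate=coin_fraction,
    distance=lambda a, b: abs(a - b),
    sample_prior=lambda r: r.uniform(0.0, 1.0),
    target=0.3,
    threshold=0.05,
    n_accept=200,
    rng=rng,
)
posterior_mean = sum(posterior_sample) / len(posterior_sample)
```

The accepted values approximate the posterior. A smaller threshold gives a better approximation at the cost of a lower acceptance rate.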

In this study we use uniform priors for all parameters: t_{driver}~Uniform(0, 15). Among the most important factors that affect the ABC outcome are the number of simulations that one can afford to run and the summary statistics chosen to evaluate the distance between a target and a simulated dataset. A summary statistic can be any quantitative measurement that condenses the multidimensional data without losing too much information. As for our distance metric, we use Euclidean and Wasserstein distances between summary statistics for different parameters as discussed below.

The Wasserstein metric estimates the distance between probability distributions by treating each distribution as a unit amount of dirt piled up on a given metric space and calculating the minimum cost required to convert one pile into another. If
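For one-dimensional samples, such as vectors of VAFs or branch lengths, the first-order Wasserstein distance equals the area between the two empirical cumulative distribution functions. The sketch below computes it that way. It is an illustration of the metric with a hypothetical function name, not the routine used in the study.

```python
def wasserstein_1d(a, b):
    """First-order Wasserstein distance between two non-empty 1D samples."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    total = 0.0
    i = j = 0
    # Integrate |F_a(x) - F_b(x)| over each interval between consecutive values.
    for left, right in zip(points, points[1:]):
        while i < len(a) and a[i] <= left:
            i += 1
        while j < len(b) and b[j] <= left:
            j += 1
        total += abs(i / len(a) - j / len(b)) * (right - left)
    return total


# Shifting every value by 1 moves the whole "pile" a distance of 1.
shifted = wasserstein_1d([1, 2, 3], [2, 3, 4])
```

The two samples may have different sizes, which is convenient when comparing trees with different numbers of branches.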

We used different summary statistics for each sampling scheme. For punch biopsy, needle biopsy and whole-tumour sampling, we used the VAF distribution to compute our summary statistics. For the whole tumour VAFs, our ABC procedure was similar to the one in ref [

With single cell samples, we constructed phylogenetic trees per tumour and used different tree-based summary statistics to evaluate the distance. Since the inferred phylogenetic tree branch length is proportional to the number of unique mutations belonging to a node, we decided to compare the vectors of all branch lengths (between the simulated and target tumour trees) by computing the Wasserstein distance. For the subclone introduction time t_{driver}, death rate

Due to computational costs, we are limited to running the ABC framework with a small tumour size (~100k cells) or simulating smaller datasets per inference, both of which can significantly affect the outcome. Therefore, to speed up our ABC framework, we implemented a Sequential Monte Carlo (SMC) algorithm to increase the acceptance rate of the simple ABC rejection algorithm. Our ABC SMC algorithm uses sequential importance sampling by running several rounds of resampling around the accepted parameters (correlating the rounds), and gradually decreasing the acceptance threshold while converging to the posterior distribution. This approach significantly increases the acceptance rate of the simulated datasets [

Our implementation of the ABC SMC algorithm is as follows:

Initialise the indicator to rounds

Run the simple ABC rejection algorithm (described above).

Order the simulated parameter sets according to their corresponding distance values.

Keep the top Q per cent of the parameters.

Sample next particle _{r−1}.

Perturb each sampled parameter _{i} using uniform perturbation kernel _{i} − _{i} +

Simulate data from the model using the sampled particle

Calculate distance D between the target and the simulated data.

Calculate the weights for all accepted particles 1 ≤

_{(j,r)} = 1

Update the threshold

Repeat until
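The overall shape of such an ABC SMC loop is sketched below for a single parameter with a uniform prior and a uniform perturbation kernel. Several details are assumptions of this sketch, not the authors' settings: the importance-weight formula (prior density divided by the weighted kernel mixture), the threshold update (a fixed quantile of the accepted distances), the fixed number of rounds and the toy coin simulator. All names are hypothetical.

```python
import random


def abc_smc(simulate, distance, target, prior, n_particles=100, n_rounds=3,
            keep_fraction=0.5, half_width=0.1, max_trials=100_000, seed=0):
    """Refine a weighted particle population for one parameter with a Uniform prior."""
    rng = random.Random(seed)
    low, high = prior

    # Round 0: plain rejection ABC, keeping the closest fraction of the draws.
    draws = []
    for _ in range(n_particles):
        theta = rng.uniform(low, high)
        draws.append((distance(simulate(theta, rng), target), theta))
    draws.sort()
    kept = draws[:max(1, int(keep_fraction * n_particles))]
    particles = [theta for _, theta in kept]
    weights = [1.0 / len(particles)] * len(particles)
    threshold = kept[-1][0]

    for _ in range(n_rounds):
        accepted, trials = [], 0
        while len(accepted) < n_particles and trials < max_trials:
            trials += 1
            # Resample a particle from the previous round, then perturb it.
            theta = rng.choices(particles, weights)[0]
            theta += rng.uniform(-half_width, half_width)
            if not low <= theta <= high:
                continue  # outside the prior support
            d = distance(simulate(theta, rng), target)
            if d <= threshold:
                accepted.append((d, theta))
        if len(accepted) < n_particles:
            break  # acceptance rate collapsed; keep the previous population

        # Importance weights: prior density over the kernel mixture density.
        # Both are uniform here, so the constants cancel on normalisation.
        new_particles = [theta for _, theta in accepted]
        raw = []
        for theta in new_particles:
            mixture = sum(w for p, w in zip(particles, weights)
                          if abs(theta - p) <= half_width + 1e-12)
            raw.append(1.0 / mixture)
        total = sum(raw)
        particles, weights = new_particles, [r / total for r in raw]

        # Tighten the threshold to the keep_fraction quantile of the distances.
        distances = sorted(d for d, _ in accepted)
        threshold = distances[max(0, int(keep_fraction * n_particles) - 1)]

    return particles, weights


def coin_fraction(theta, rng, n=200):
    """Toy simulator: fraction of successes in n Bernoulli(theta) trials."""
    return sum(rng.random() < theta for _ in range(n)) / n


particles, weights = abc_smc(coin_fraction, lambda a, b: abs(a - b),
                             target=0.3, prior=(0.0, 1.0))
posterior_mean = sum(p * w for p, w in zip(particles, weights))
```

Each round proposes new particles close to previously accepted ones, so far fewer simulations are wasted than when sampling from the prior at a tight threshold.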

Our ABC-SMC framework tries to recover all the parameters (together referred to as a particle in the algorithm above) at the same time. We noticed that once one of the parameters converges, the acceptance rate decreases significantly. We therefore decided to fix the converged parameter at its inferred value (the mode of its posterior), rerun the inference varying the rest of the parameters until another parameter converges, and repeat the procedure. We found that this significantly improved the convergence speed. For the 2D inference in

The package implements three sampling strategies for the inference:

Bulk samples (punch or needle biopsies)—ABCSMCwithBulkSamples()

Single cell sample phylogenetic trees—ABCSMCwithTreeSampleBL() and ABCSMCwithTreeSampleBT() (using Branch Lengths or Branching Times as summary statistics)

Whole tumour bulk sample—ABCSMCwithWholeTumour()

Depending on the strategy, a user would need to provide real or synthetic target data in the form of tumour bulk sample VAFs (list of R data.frames where each row should correspond to a unique mutation with the following columns: clone (Clone type label set to 0), alt (Number of reads with the variant), depth (Sequencing depth), id (Unique mutation ID)), an array of whole tumour sample VAFs or single cell sampling phylogenetic trees. Alternatively, a user can provide a set of parameters (please refer to the package documentation for the details of each input parameter format) to simulate a synthetic target tumour to then recover these input parameters.

The functions output a sequence of files containing the sets of inferred parameters corresponding to each SMC round, which can then be used to construct the posterior distributions for each parameter.

For

To test for the presence of selection and to infer the mutation rate, we fit a 1/f^{2} distribution to the empirical cumulative distributions of sampled VAFs using the R package developed in ref [
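The idea behind this fit can be illustrated in a few lines. Under a 1/f^{2} density, the cumulative number of mutations with frequency between f and f_max grows linearly in 1/f - 1/f_max, so a straight-line fit through the origin and its R^2 indicate how well the neutral expectation describes the data. In the well-mixed neutral model the slope is interpreted as the mutation rate per effective division. The sketch below is a plain least-squares version with a hypothetical function name and caller-chosen frequency bounds. It is not the R package referred to above.

```python
def fit_neutral_tail(vafs, f_min, f_max, n_points=50):
    """Fit M(f) = slope * (1/f - 1/f_max) by least squares through the origin.

    M(f) is the number of mutations with f <= VAF <= f_max.
    Returns (slope, r_squared).
    """
    in_range = [v for v in vafs if f_min <= v <= f_max]
    xs, ys = [], []
    for k in range(n_points):
        f = f_min + (f_max - f_min) * k / (n_points - 1)
        xs.append(1.0 / f - 1.0 / f_max)
        ys.append(sum(1 for v in in_range if v >= f))
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - slope * x) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, 1.0 - ss_res / ss_tot


# Synthetic VAFs placed so that the cumulative count follows 100 * (1/f - 2).
synthetic_vafs = [1.0 / (k / 100 + 2.0) for k in range(1, 1801)]
slope, r_squared = fit_neutral_tail(synthetic_vafs, f_min=0.1, f_max=0.4)
```

On this synthetic input the fitted slope is close to the value of 100 used to generate the frequencies. A low R^2 on real data would indicate a deviation from the neutral tail.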

Tumour cell population growth curves for each of the representative cases:


Two examples where fitness advantage is modelled by decreasing cell death for the mutant subpopulations and increasing it for the wild type.

Example of selective boundary driven growth when the driver mutant subpopulation gets trapped within the wild type population despite being fitter than the WT clone.


For each of the representative cases:


Example of a selective exponential growth when the mutant subpopulation has higher ‘push power’ than the wild type population.


For each of the representative cases:


Tumour cell population growth curves for each of the representative cases:


We construct the allele frequency distributions from sequencing the randomly sampled 400 single cells (same as in


We simulated 100 different tumours for each of the 4 representative growth models and tested intermixing of subpopulations within each simulation lattice using Moran’s entropy-based test. Each individual test output a significant p-value, indicating high spatial correlation between tumour cell types (mutant vs WT) and their location on the tumour lattice. However, the test effect sizes (the observed values of Moran’s test statistic) differ, as can be seen from their distributions per model scenario. The median value of each observed statistic is reported at the bottom of each violin plot.

The violin plots of the posterior distributions for each model parameter per synthetic tumour inferred by our ABC-SMC framework. The three sets of tumours corresponding to the three tumour growth scenarios are plotted separately: exponential


To explore the interdependence of the parameter pair


ABC SMC inference for a selective homogeneous growth simulation in 3D space. Real ‘target’ values are reported as dashed lines. We run this ABC framework similarly to the 2D simulations, recovering one parameter at a time: we first vary all parameters and, once one has converged, fix it at its inferred value and rerun the simulation varying the parameters left to infer. Here we first recovered mutation rate, then time and selective advantage (together), and finally death rate and aggression (together as well). As with the 2D models, our ABC framework performs best with whole tumour sampling compared to the other sampling strategies.

We thank Daniel Nichol and Haider Tari for the fruitful discussion.