Signatures of neutral evolution in exponentially growing tumors: A theoretical perspective

Recent work of Sottoriva, Graham, and collaborators have led to the controversial claim that exponentially growing tumors have a site frequency spectrum that follows the 1/f law consistent with neutral evolution. This conclusion has been criticized based on data quality issues, statistical considerations, and simulation results. Here, we use rigorous mathematical arguments to investigate the site frequency spectrum in the two-type model of clonal evolution. If the fitnesses of the two types are λ0 < λ1, then the site frequency spectrum is c/fα where α = λ0/λ1. This is due to the advantageous mutations that produce the founders of the type 1 population. Mutations within the growing type 0 and type 1 populations follow the 1/f law. Our results show that, in contrast to published criticisms, neutral evolution in an exponentially growing tumor can be distinguished from the two-type model using the site frequency spectrum.


Introduction
Following up on the introduction of the Big Bang model by Sottoriva et al [1], Sottoriva and Graham [2] described what they called "a pan-cancer signature of neutral tumor evolution:" the number of mutations with frequency � f will have the form c/f. The derivation of this result is remarkably simple and is given in Methods. In 2016, Williams et al. [3] found that 323 of 904 samples from 14 cancer types showed excellent straight line fits when the cumulative number of mutations of frequency � f is plotted versus 1/f. See Fig 2B in [3]. This paper has been cited 200 times, but among these works, there are a number of papers criticizing the result. See [4][5][6]. The December 2018 issue of Nature Genetics contains three letters raising objections to the conclusion [7][8][9]. Four common criticisms are 1. Inferring the allele frequency f requires accurate estimates of local copy number and ploidy. In addition, Wu et al [5] point out that local samples may not be indicative of overall frequencies.
2. Failure to reject the null model is not the same as proving it is true. To quote McDonald, Chakrabarti, and Michor [8] "The fact that a model of neutral evolution leads to a linear relationship between M(f) (the number of mutations with frequency � f) and 1/f does not imply . . . the presence of neutral evolution." 3. Tarabichi et al [7] applied methods that look at the dN/dS ratio, which compares the number of nonsynonymous and synonymous mutations, to look for signs of selection. They claim to have found significant signs of selection in tumors that were classified as neutral. However when the analysis was repeated on publicly available pancreatic cancer data, Graham, Sottoriva et al found no values significantly different from 1. [7] say "the deterministic models of tumor growth described by Williams et al [3] rely on strong biological assumptions. Using simple branching process to simulate neutral and nonneutral growth, they show that R 2 > 0.98 is neither necessary nor sufficient for neutral evolution."

Tarabichi et al
To try to shed some light on the controversy, we will do a mathematically rigorous computation of the site frequency spectrum produced by the two-type model of clonal evolution. We will describe the model in Results. The two-type model and its m-type generalization have been extensively studied. See [10] for results and references. This model is relevant to the discussion of [3] because it appears in the criticisms of McDonald, Chakrabarti, and Michor [8] and Bozic, Patterson, and Waclaw [6]. Before we describe the math, we want to make it clear that that this work only discusses the theoretical aspects of cancer genomics and is not concerned with practical problems in making inferences on cancer genomic data, which of course could hide some of the theoretical effects due to errors, bias, sampling, and other issues discussed in the criticisms listed above.

A two-type model
McDonald, Chakrabarti, and Michor [8] consider two alternative evolutionary models in order to argue that other underlying models can produce a linear relationship between 1/f and the cumulative number of mutations with frequency � f. Their second model is an infinite alleles branching process model previously studied by McDonald and Kimmel [11]. We will ignore this model, since in studying DNA sequence data the appropriate mutation scheme is the infinite sites model. In their first model, clonal expansion begins with a single cell of the original tumor-initiating type (type 0). To make it easier to connect with previous mathematical work, we will describe their model using the notation used in [10] and [12]. We suppose that type 0 individuals give birth at rate a 0 and die at rate b 0 , so the exponential growth rate is λ 0 = a 0 − b 0 . For simplicity, we will suppose that neutral mutations accumulate during the individual's life time at rate ν, instead of only at birth.
Type 0 individuals mutate to type 1 at rate u 1 . Type 1 individuals give birth at rate a 1 and die at rate b 1 . Their exponential growth rate is λ 1 = a 1 − b 1 where λ 1 > λ 0 . In [8], different type 1 families have different increases in their growth rates that follow a normal distribution. In this section, we will assume all type 1 mutations have the same growth rate. Later, we will consider the implications of random fitness changes for the behavior of the model.
The reader will see many complicated formulas in this paper, so it will be useful to have a concrete set of parameters to plug into these formulas. Borrowing an example from [10], we will set We do not pretend that these parameters apply to any specific cancer, but for a mental picture, you can imagine that type 0s are colon cancer cells in which both copies of APC have been knocked out, while type 1 cells in addition have a KRAS mutation. Limit theorems. As in [8], we will, for simplicity, restrict our attention to two types. The type 0's are a simple branching process, so well-known results show that where W 0 = 0 with probability b 0 /a 0 and has a rate λ 0 /a 0 exponential distribution with probability λ 0 /a 0 . The study of the second wave is simpler if we suppose that Z � 0 ðtÞ ¼ V 0 e l 0 t for all t 2 (−1, 1), where V 0 has the same distribution as (W 0 |W 0 > 0), that is exponential with rate λ 0 /a 0 . Mutations from type 0 to 1 occur at rate u 1 . Let σ 1 be the time of the first successful type 1 mutation, i.e., one whose branching process does not die out. Durrett and Moseley [13] showed, see (29) in [10], that σ 1 has median In the concrete example, s 1 1=2 ¼ 460:51. In colon cancer where cells divide every four days, s 1 1=2 is 1842 days or a little more than 5 years. Durrett and Moseley were the first to rigorously prove results about the asymptotic behavior of the size of the type 1 population Z � 1 ðtÞ, see Section 9 of [10]. Durrett [12] noticed that the constants are simpler if we use a different normalization. Here we are assuming a 0 = a 1 = 1 to simplify the constants.
Using Eq (3), and doing some algebra In our concrete example, � rðx; 1Þ ¼ 0:1772V 0 x À 1=2 : Note that due to shifting time by s 1 1=2 , the measure � r does not depend on the mutation rate.
Site frequency spectrum. There are three classes of mutations in the two-phase model • type 0: Neutral mutations that occur to type 0 individuals.
• type 1A: Advantageous mutations that turn type 0 individuals into type 1.
• type 1: Neutral mutations that occur to type 1 individuals.
By the argument in Methods, the type 0 mutations will have a 1/f site frequency. The argument can also be used to prove the next result so the details are hidden away in Methods.
Theorem 2 The number of type 1 mutations with frequency � f with in the type 1 population will be asymptotically ν/(λ 1 f).
The points in the Poisson process in Theorem 1 indicate the contributions of the various type one families to the limit � . . be the points, then the jth largest family makes up a fraction x j = � V 1 of the population. Intuitively, this implies that the number of type 1A mutations with frequency � f will be asymptotically Cf −α where α = λ 0 /λ 1 . However, the fact that the sum of the points in the Poisson process is random makes this difficult to study. Fortunately for us, the work has already been done in 1997 by Pitman and Yor [14], who proved that the points in the Poisson process divided by their sum follow the Poisson-Dirichlet distribution PD(α, 0). See the remark after Theorem 5 in [12]. This gives us that when 0 < α < 1 the site frequency spectrum of 1A mutations is: When α = 1/2, the constant is 2/π = 0.6366. Including type 0 passenger mutations in type 1A families does not significantly change the f −α shape in (4). This is because all important 1A mutations happen soon after the first mutation, which implies that all important 1A mutations have roughly the same number of passengers. See Methods.
To illustrate the results proved above, we turn to simulations seen in Figs 1 and 2.

Random fitness increases
McDonald, Chakrabarti, and Michor [8] considered the case in which type 1 individuals have growth rates that are normal with mean m and standard deviation d. Early work on models with random fitness increases in the two-type model led to very unusual behavior in the limit t ! 1, see [15]. Results in that paper show • If the fitness distribution was bounded then, as t ! 1, individuals with fitnesses that were close to the upper limit dominated the population.
• If the distribution was unbounded, then the population could grow faster than exponential.
In this section, we will modify our example from Fig 1 so that type 1 individuals have growth rates drawn from the normal distribution with mean m = 0.04 and standard deviation d = 0.005. We will see that in contrast to the limiting results just mentioned, random fitnesses do not substantially change the behavior.
To find the distribution of the growth rates of the mutations with the largest family sizes, we note that a mutant that occurs at time s i and has growth rate λ 1,i will grow to size W 1 exp (λ 1,i (1000 − s i )) at time 1000. The number of i that are successful and have λ 1,i (1000 − s i ) > x The simulation was performed with parameters ν = 0.02, u 1 = 2 × 10 −4 , λ 0 = 0.02, λ 1 = 0.04 and a 0 = a 1 = 1 and is the average site frequency spectrum of 1000 runs. We simulated the 1A families and type 0 passenger mutations on their founders. Then, we obtained type 1 mutations for each 1A family by applying (8) in Methods. We only consider mutations present in the type 1 population because, as t ! 1, the proportion of the population that is type 0 cells approaches 0. As suggested from Theorem 2, the type 1 site frequency spectrum is linear when plotted against 1/f. The 1A + 0 line looks similar to a power law, as suggested by (4).
where ϕ and F are the density function and distribution function, of a normal distribution with mean m = 0.04 and standard deviation d = 0.005. The equality follows from substituting u = (λ − 0.04) 2 for the inner integral. Fig 3 graphs (5). The random fitnesses cause the relative sizes of the contributions of mutations to the final population to change, but as Fig 4 shows, the site frequency still has the form C/f β , where β � α and achieves equality in the case of non-random changes, i.e. d = 0.
The authors of [8] claim that the site frequency spectrum in the two-type model is 1/f. However, their simulation methods take the very crude approach of considering the binary split process until 1,000 or 1,000,000 cells are produced. This corresponds to 10 and 20 generations respectively. To make it possible for something to happen in this short amount of time the mutation rate for advantageous mutations is set to be 0.1 in the 1000 cell scenario, and to 0.03 when there are 1,000,000 cells. At birth, each cell acquires a Poisson mean 100 number of mutations. In contrast our simulations run for approximately 1000 generations, leading to populations of order 10 9 cells, and neutral mutations occur slowly, leading to genealogical relationships that are more like those found in growing cancer tumors.

Subclonal mutation frequencies
Bozic, Paterson, and Waclaw [6] argue that "the fact that no subclonal driver is present at intermediate frequencies cannot be taken as proof of neutral or effectively neutral evolution. It can be a consequence of population dynamics which create only a short window during which the driver mutation can be detected but not fixed in the population." In this section we will describe their results and give a simple analytic derivation.
To argue for this viewpoint, they use the two-type model but with different notation In addition they define c = r 1 /r > 1, and g = c − 1. They assume that the mutation to type 1 occurs at time 0 and run the process until the time t at which the total population size is M. Let X 0 be the population of type 0's when the mutation occurs. Since X 0 is large, X t � X 0 e rt . The type 1 population at time t is Y t � W 1 e rct , where W 1 is an exponentially distributed random variable with rate cr/b 1 . Note that as in Bozic et al [16] the possibility of subsequent driver mutations is ignored. As Fig 5 shows, that change does not lead to a substantial error. Writing f sub = Y t /(X t + Y t ) they prove that when the total tumor size is M = X t + Y t the subclonal mutation frequency has which is (1) in [6]. From this they can compute the probability of a subclonal driver being detectable, that is, To see what this complicated formula implies, the authors turn to simulation. The mutation rate to produce an additional driver is u = 10 −5 . Their Fig 2A shows a moderately growing tumor b = 0.14, r = 0.01, 2B a fast growing tumor b = 0.25, r = 0.07, and 2C a slowly growing tumor b = 0.33, r = 0.0013. For moderate values of selection, e.g. g = 30%, the probability that a driver mutation is in the detectable range [0.2, 0.8] is < 15% for population sizes up to M = 10 9 cells and remain below 1/3 for M � 10 11 . For other cases considered there (g = 70% and 100%) the chance of detecting the subclonal driver is always < 60% and for a broad range of sizes is less than 30%. Panels d,e,f in their Fig 2 show the frequency of a subclonal driver in the case of moderate growth when the size M d = 10 7 , M e = 5 � 10 10 and M f = 2 � 10 8 . In the three cases the frequency is near 0, near 1, and almost uniformly distributed on [0, 1].
Rather than study the tumor when it reaches a fixed size, we will derive results at a fixed time by using Theorem 1. Recall that we have set Z � 0 ðtÞ ¼ V 0 e l 0 t and have shown Combining the last two results, we see that Inserting the values of the λ i rðt þ sÞ rðtÞ ¼ e ðl 1 À l 0 Þs ¼ e 0:015s  This graph gives the probability of having a driver with frequency greater than y once the tumor reaches size 10 9 . The parameters used are a 0 = a 1 = 1, λ 0 = 0.02, λ 1 = .035 and u 1 = 10 −5 and the data was generated from 1000 runs. Single 1A refers to approach taken by Bozic et al. where there is only 1 selective mutation. Multiple 1A is our approach. The theory curve comes using a Riemann sum with interval size 500 to evaluate the integral in Eq (6). https://doi.org/10.1371/journal.pcbi.1008701.g005

Discussion
Work of Sottoriva and Graham [2] and their co-authors [3] has shown that in many cases an exponentially growing tumor has a 1/f site frequency spectrum. This result has a simple derivation but the claim has drawn a large amount of criticism. Many of these concern the quality of the data used. Here, we have performed a mathematical analysis to show that given enough sequence data the site frequency spectrum can be used to distinguish neutral evolution from one specific type of selection. This analysis provides a useful complement to studies based solely on simulation.
We have studied the two-type model of cancer evolution in which the exponentially growing population of type 0 cells can mutate to a fitter type 1, and all cells can experience neutral mutations. In this model there are three types of mutations that we call 0, 1A, and 1. Type 0 mutations are neutral, occur to type 0 individuals, and have a 1/f site frequency spectrum. Type 1 mutations are neutral, occur to type 1 individuals, and again have a 1/f site frequency spectrum. Type 1A mutations are selective, occur to type 0 individuals, and result in type 1 individuals. When the two types have growth rates λ 0 < λ 1 , where α = λ 0 /λ 1 , then the site frequency spectrum has the shape 1/f α due to 1A mutations and the type 0 neutral mutations present in the founders of the type 1 population. These mutation types are more numerous than the others.
McDonald, Chakrabarti, and Michor [8] have used the two-type model to suggest that models with selection can have a 1/f site frequency spectrum. Our results show this is not true when type 1 mutations all have the same fitness increase. Their model has random increases in fitness, but we also show that this feature does not significantly change the qualitative features of the site frequency spectrum.
Bozic, Paterson, and Waclaw [6] study the two-type model and show that it is difficult to capture a subclonal driver mutation at intermediate frequency. Their model allows only one type 1A mutation. Using our simple analytical results and computer simulations, we confirm that this prediction holds in the two type model without that restriction. Sottoriva and Graham say in their original paper [2] that "the power law signature is common to multiple tumor types and is a consequence of the effectively-neutral evolutionary dynamics that underpin the evolution of a large proportion of cancers." To explain the source of the 1/f curve in an exponentially growing tumor, we give the derivation of the 1/f frequency distribution from [3]. They assumed that cells divide at rate λ and use N(t) to be the number of cells at time t. If we assume that the mutation rate is μ (which we assume takes into account their ploidy parameter π), then the expected number of new mutations before time t, M(t), satisfies dM dt ¼ mlNðtÞ:

NðsÞ ds:
Since N(s) = e λs (we have set β in [3] to be 1 for simplicity), we observe that a mutation that occurs at time s will have frequency e −λs in the population. Evaluating the integral in the previous formula, we have MðtÞ ¼ mðe lt À 1Þ: Ignoring the −1, if we set t f = −(1/λ)log f to make N(t f ) = 1/f so that mutations before time t f will have frequency � f, then Theorem 3 The number of mutations with frequency � f is Note that in this derivation, mutations occur only at birth. If we instead let mutations happen continuously throughout a cell's lifetime and call the mutation rate ν, then Durrett [12] has shown From the derivation given above, we see that the 1/f site frequency spectrum comes from the fact that mutations occur at a rate proportional to the size of the population and the fact that the population is growing exponentially fast.

Proof of Theorem 2
Proof. We follow the derivation of Theorem 3. If we let NðsÞ ¼ Z � 1 ðsÞ, then the number of type 1 mutations by time t satisfies M 1 ðtÞ ¼ n

Passengers do not change the shape of the SFS
To show that the important 1A mutations happen soon after the first, and that therefore all important 1A mutations have roughly the same number of passengers, consider two successful mutations at times s 0 and s 1 which have sizes W 0 e λ 1 (t−s 0 ) and W 1 e λ 1 (t−s 1 ). For the second mutation to be larger, we'd need W 0 /W 1 � e λ 1 (s 0 −s 1 ). Since the cdf of the quotient of two exponentials with the same rate is P(W 0 /W 1 � x) = x/(x+ 1), we find that P W 0 =W 1 � e l 1 ðs 0 À s 1 Þ ð Þ ¼