Estimating the number of genetic mutations (hits) required for carcinogenesis based on the distribution of somatic mutations

Individual instances of cancer are primarily a result of a combination of a small number of genetic mutations (hits). Knowing the number of such mutations is a prerequisite for identifying specific combinations of carcinogenic mutations and understanding the etiology of cancer. We present a mathematical model for estimating the number of hits based on the distribution of somatic mutations. The model is fundamentally different from previous approaches, which are based on cancer incidence by age. Our somatic mutation based model is likely to be more robust than age-based models since it does not require knowing or accounting for the highly variable mutation rate, which can vary by over three orders of magnitude. In fact, we find that the number of somatic mutations at diagnosis is weakly correlated with age at cancer diagnosis, most likely due to the extreme variability in mutation rates between individuals. Comparing the distribution of somatic mutations predicted by our model to the actual distribution from 6904 tumor samples we estimate the number of hits required for carcinogenesis for 17 cancer types. We find that different cancer types exhibit distinct somatic mutational profiles corresponding to different numbers of hits. Why might different cancer types require different numbers of hits for carcinogenesis? The answer may provide insight into the unique etiology of different cancer types.


Author summary
Cancer is primarily a result of genetic mutations. Each individual instance of cancer is initiated by a specific combination of a small number of mutations (hits). In trying to identify these combinations of mutations, it is important to know how many hits to look for. However, there are conflicting estimates for the number of hits. We present a fundamentally different model for estimating the number of hits. We found that the number hits ranges from two-eight depending on cancer type. These findings may provide insight into the unique characteristics of different cancer types.

Introduction
Cancer is known to result primarily from genetic mutations [1][2][3]. Moreover, models of carcinogenesis-the multi-stage and multi-hit models-suggest that cancer results from a small number of (two-seven) mutations [4][5][6][7][8][9][10][11]. Yet the availability of extensive genomic data [12,13] and decades of investigation have failed to reveal, for the vast majority of cancers, the specific mutations that result in carcinogenesis. One reason for the inability to identify these carcinogenic mutations is that there is no single combination of mutations (hits) responsible for all instances of cancer, even within a cancer subtype. Instead, carcinogenesis is a result of one of many possible combinations of a small number of hits. A reliable estimate for the number of such hits will help us understand how cancers originate, and to find the specific combination of mutations responsible for individual instances of cancer [14]. However, estimates from current mathematical models are questionable because they are based on simple assumptions about mutation rate, which can vary by over three orders of magnitude as discussed below [4,[15][16][17][18][19].
The goal of this work is to estimate the number of genetic mutations that may be required for carcinogenesis, without making assumption about the mutation rate. We derived, from first principles, a multi-combination multi-hit mathematical model to estimate the average number of hits required for carcinogenesis. Unlike current models, our estimate is based on the distribution of somatic mutations instead of age (which requires knowing or assuming a mutation rate). In fact, we find that somatic mutations at diagnosis are weakly correlated with age at cancer diagnosis, most likely due to a large variation in mutation rates across individuals, relative to each individual's intrinsic mutation rate. Comparing the distribution of somatic mutations calculated by the model, to the actual distribution for seventeen different cancer types, we estimate that the average number of hits varies from two to eight, depending on cancer type. Sensitivity analysis shows that, for sample size greater than 200, the distribution of somatic mutations and the corresponding estimated number of hits are robust. Some cancer types show a multimodal distribution indicating a mix of different number of hits. A key finding is that each cancer type has a distinct distribution. We discuss how the corresponding differences in the estimate for the number of hits may suggest differences in the etiology of the different cancer types.
Several factors, other than genetic mutations, may also affect carcinogenesis, such as epigenetic modifications [20], tumor environment [21], and adaptive evolution [22]. However, carcinogenesis is primarily a result of genetic mutations [23,24]. Somatic mutations in particular are a critical driver of carcinogenesis, as exemplified by the occurrence of cancer in Caenorhabditis elegans and Drosophila melanogaster. Cancers in C. elegans are only found in the indeterminately differentiating germ-line cells and not in somatic tissues, which are terminally differentiated at birth [25]. Cancers in D. melanogaster are only found in larvae or in the gut and germline cells of adults, and not in other adult tissues with little or no cell turnover after adulthood [26,27]. Retinoblastoma is the classic example where the disease is known to be caused by mutations to both copies of the RB1 gene, with 55-65% being non-hereditary, indicating that somatic mutations are the primary cause in these cases [28]. The causal relationship between somatic mutations and cancer is further supported by the mutagenic effects of carcinogens [29]. cancer diagnosis. We compare the distribution calculated by the model to the actual distribution in somatic mutations to estimate the number of hits required for carcinogenesis. The estimates were robust to sample size when sample size is greater than 200. We discuss possible reasons for differences in number of hits for different cancer types. We also discuss the effect of possible deviations from model assumptions on the estimated number of hits.

First-principles based multi-combination multi-hit model
We derived, from first-principles (see Materials and Methods), a mathematical model for carcinogenesis resulting from multiple different combinations of carcinogenic mutations (hits). The probability of carcinogenesis is formulated as a function of accumulated somatic mutations, the number of hits, and the number of possible multi-hit combinations. Fig 1 illustrates the process of carcinogenesis represented by the model. The probability P that the m th somatic mutation results in "hitting" one of k possible combinations of h carcinogenic mutations, can be approximated as PðmÞ � P k ðmÞ À P k ðm À 1Þ ð1Þ where P k is the cumulative probability of hitting one or more of k multi-hit combinations with m or less accumulated mutations, P h is the probability of hitting one combination of h mutations, and G is the number of all possible somatic mutations that can become fixed, i.e. do not cause cell death. Here k is the number of possible carcinogenic multi-hit combinations, e.g. for 2-hit combinations it would include a set of genetic mutations [g 11 , g 12 ], [g 21 , g 22 ], . . . [g k1 , g k2 ], where g ij represents a genetic mutation. In this model, mutations are counted separately for each copy of a gene. The derivation for the above equations use basic principles from probability theory, as detailed in Materials and Methods. Although this model does not make assumptions regarding mutation rate, as current models do, we do make two other simplifying assumptions that are similar to the other assumptions made by current models [4][5][6][7][8][9][10]. One, mutations at any one of the loci are equally likely. This assumption does not imply that mutation rate (variants per year) is fixed. It only implies that the distribution of mutations across the genome is approximately uniform subject to random variations. Two, carcinogenesis is primarily the result of somatic mutations. We discuss below the potential effect of deviations from these assumptions, on our results. Our model, which is based on accumulated somatic mutations, does not make any assumptions regarding mutation rate, which can be highly variable [15][16][17][18][19]. Whereas current models are based on incidence by age, which requires knowing or making assumptions about mutation rate, which limits the robustness of these models [4][5][6][7][8][9]. For example, the multi-stage models by Armitage and Dole, Ashley, Luebeck and Moolgavkar, Little and Wright, and Zhang and Simon, assume a fixed probability of cellular change (rate-limiting carcinogenic mutations) for each stage, and subjective explanations are provided for differences in this probability between stages [5][6][7][8][9]. Knudson's 1971 study of retinoblastoma and Frank's 2005 study of colon cancer compare incidence by age for cases with and without inherited mutations, assuming a constant mutation rate [28,30]. Tomasetti et. al. found that mutation rates vary by a factor of three between smokers and never-smokers, and by a factor ten between those with and without a mismatch repair deficiency. They partially mitigate the effect of these variable mutation rates, by partitioning tumor samples into two subsets based on an estimated constant mutation rate for each subset [4]. Current models use age as an indirect measure of somatic mutations, which are the primary cause of cancer, i.e. age-at-carcinogenesis = somatic-mutations-at-carcinogenesis / mutation-rate, which requires making assumptions about the mutation rate. By directly estimating the probability of carcinogenesis as a function of somatic mutations instead of age, our model eliminates the need for making assumptions about mutation rates.

Somatic mutations are weakly correlated with age at cancer diagnosis
The number of accumulated somatic mutations in tumor samples is weakly correlated with age at cancer diagnosis, as shown in Fig 2. The Pearson's correlation coefficient ranges from -0.2 to +0.2 for whole exome sequencing data for 31 of 32 cancer types in The Cancer Genome Atlas (TCGA) with matched tumor and blood derived normal samples. For kidney chromophobe, with correlation coefficient of -0.4, only nine samples were available. The associated pvalues are greater than 0.05 for all but four of the 32 cancer types, also suggesting a lack of correlation. For breast invasive carcinoma, rectum adenocarcinoma, stomach adenocarcinoma, and uterine endometrial carcinoma, the p-values were 0.02, 0.02, 7E-4 and 4E-6 respectively, with correlation coefficients of 0.07, -0.20, 0.18, and -0.21. The calculation of p-value is described in the SI. Longitudinal studies show that for any given individual the number of accumulated somatic mutations increase with age [15]. However, this relationship can be obscured by the extensive variation in the rate of mutations between individuals. For a given individual accumulated mutations will of course increase with age. However, we find that for a population of individuals diagnosed with cancer, the correlation is weak at best. Consider for example, a smoker who is likely to get cancer at a much younger age than a nonsmoker is. It is possible that, due to the mutagenic activity of cigarette smoke, the younger smoker will have accumulated more mutations than the older nonsmoker. Some of the factors that contribute to this variation include age itself, exposure to mutagens, and mutations in DNA repair genes [4,[15][16][17][18][19]. Tomasetti et. al. show that the average mutation rate for smokers is 3 times higher than for never-smokers, and over ten times higher for individuals with a mismatch repair deficiency compared to those without the deficiency [4]. Tomlinson et. al. suggest that mismatch repair deficiency can increase mutation rate by up to four orders of magnitude [19]. Bavarva et. al. report that the nonsynonymous mutation rate varies by age from 9.6E-7 to 8.4E-6 per base pair per year.

The number of carcinogenic mutations (hits) varies from two to eight depending on cancer type
Comparing the actual distribution of somatic mutations for different cancer types, to the distribution of somatic mutations estimated by our model (Figs 3 and S1-S3), suggests that the average number of hits varies from two to eight depending on cancer type (Table 1, Fig 4). The number of hits were determined by minimizing root mean squared difference (RMSD) between the actual distribution of somatic mutation and the distribution calculated by the multi-combination multi-hit model (Figs 3 and S1-S3). See Materials and Methods. Table 1 shows that the calculated distributions closely match the actual distribution with root mean squared difference (RMSD) less than 2.2%. These results are based on whole exome sequencing data from 6904 matched tumor and blood derived normal samples for 17 of the 33 cancer types in TCGA for which there were at least 200 matched tumor and blood derived normal samples. Sensitivity analysis (next section) shows that the distribution of somatic mutations are robust for sample size greater than 200. We compared our results to those of other mathematical models (Table 1). These mathematical models include both probabilistic [4,[7][8][9] and mechanistic models [5,6]. The probabilistic models assume a constant mutation rate for each driver gene, which results in an exponential relationship between the probability of cancer incidence and age. The mechanistic models explicitly represent each stage of cancer progression, and the processes of cell growth, death and mutation, with a constant mutation rate for each stage. As shown in Table 1, the number of hits estimated by our model (two-eight) is approximately in the same range as previous estimates (two-seven) [4][5][6][7][8][9]11] based on fundamentally different methodologies. More importantly the most recent, and arguably the most accurate, estimates by Tomasetti et. al. in 2015 for colon and lung cancers exactly match our estimates, lending support for the multicombination multi-hit model, the reasonableness of the assumptions used, and the estimated number of hits.
As a further test of the robustness of our probabilistic model, we implemented and compared our results to a mechanistic model. The mechanistic model incorporates characteristics of newer mechanistic models [31][32][33][34][35], not included in the older mechanistic models [5,6] considered in the comparisons shown in Table 1. The model is general enough that it can be  Table 1. Number of hits estimated by the multi-combination multi-hit model range from two-eight depending on cancer type. For the 17 cancer types with at least 200 samples, the RMSD between the distribution of somatic mutations and the probability distribution for the optimal model is less than 2.2% (top section of the table). Number of hits estimated by this somatic mutations based model is in the same range as those estimated by previous models based on incidence by age (middle section). Calculation of 95% confidence interval is described in the SI. These averages may consist of a mix of different number of hits (hi and hj) as illustrated in the bottom section of the    The parameters for the model include the numbers of stem cells, progenitor cells, and differentiated cells, the rates of cell division, differentiation, and death, and the oncogenic mutation rates. We identified four cancer types (colon, lung, and stomach adenocarcinoma, and thyroid carcinoma) for which values for key model parameters were found in the literature [11,[36][37][38][39], as listed in the SI, S3 Table. When compared to the incidence of these cancer types by age from a 2013-2015 UK population study [40], the three-hit mechanistic model best fits the population study for all four cancer types (SI, S6 Fig). The estimated number of hits for the mechanistic model matches the number of hits for the probabilistic model for colon and lung adenocarcinoma and is within the 95% confidence interval for stomach adenocarcinoma (Table 1). However, the model is sensitive to the parameters used. In particular, estimates for oncogenic mutation rate vary from 10 −8 to 10 −3 [8,11,28,41]. The number of hits estimated by the mechanistic model can exactly match our probabilistic model (Table 1) depending on the value used for this parameter (SI, S3 Table and S7 Fig). These findings provide further support for the robustness of our probabilistic model, and may give insight into, and help narrow, the range estimates for oncogenic mutation rate. In our probabilistic model, the number of multi-hit combinations that minimize RMSD range from 2x10 9 to 1x10 40 depending on cancer type. This number may seem large, but is reasonable considering the much larger number of all possible combinations of mutations. The average length of a protein in human cells is estimated to be approximately 500 amino acids (1500 DNA base pairs). The Catalog of Somatic Mutations in Cancer (COSMIC) has identified over 700 confirmed cancer genes, with many more genes that have been experimentally implicated in cancer [42]. If we assume that mutations in 1000 of the 1500 loci within a gene alters the function or activity of the gene, and that altering the function or activity of any combination of h (number of hits) out of just 50 different genes is carcinogenic. Then, for the diploid human genome, there are approximately 10 10 possible carcinogenic 2-hit combinations ((1000x2x50) 2 ) and approximately 10 40 possible carcinogenic 8-hit combinations.

Results are robust for sample size greater than 200
To determine the sensitivity of our results to sample size, we estimated the number of hits using a randomly selected 80% subset of the samples. The number of matched tumor and blood derived normal samples in TCGA vary from nine for kidney chromophobe to 929 for breast invasive carcinoma. We found that for sample size greater than 200, the estimated number of hits did not change in any of the 17 cancer types, the estimated number of combinations did not change in 14 of 17 cancer types, and the change in RMSD was less than 0.11% in 16 of 17 cancer types. Results for all 32 cancer types with all samples and 80% of the samples are shown in S1 Table.

Average number of hits may comprise a mix of different number of hits
Some of the cancer types exhibit multi-modal distributions, suggesting a mix of different number of hits, such as thyroid carcinoma (THCA) and prostate adenocarcinoma (PRAD) as seen in Fig 3(B) and 3(C). In addition, in some cases there is little difference in the RMSD between adjacent numbers of hits, also indicating a mix of different number of hits. For example, for head and neck squamous cell carcinoma (HNSC) the minimum RMSD for the 4-hit model is 2.03 compared to 2.06 for the 5-hit model. To assess the effect of a mix of different number of hits, we identified the optimal pairwise combination of hits from two to nine for each cancer type. The combined probability for a mix of a% hi-hits and (100-a)% hj-hits was calculated as P hihj = (a.P hi + (100 − a).P hj )/100. The RMSD between the actual and calculated distributions were lower in all cases with the maximum RMSD being reduced from 2.1 to 1.7 (Table 1). For THCA, for example, a mix of 40% 3-hit combinations and 60% 8-hit combinations fit the distribution with RMSD of 0.85 compared to the 5-hit combination with RMSD 1.69. For PRAD, a mix of 30% 6-hit combinations and 70% 8-hit combinations fit the distribution with RMSD of 1.2 compared to 1.6 for the 7-hit combination alone. For HNSC, a mix of 50% 3-hit and 9-hit combinations reduced the RMSD from 2.0 to 1.1. We speculate that the mix of different number of hits may represent a mix of different cancer subtypes. Determining the number of hits associated with different subtypes will require a statistically significant number of samples for each subtype, more than what is currently available in TCGA.

The number of hits may depend on tissue specific cellular growth characteristics
Different cancer types exhibit distinct distributions for somatic mutations, as seen in Figs 3 and S1-S3. In addition, these distributions are robust for sample sizes greater than 200 (S1 Table). Corresponding to these different distributions, the number of hits estimated by our model are also different ( Table 1). We speculate that there may be a relationship between the number of hits and the cellular growth characteristics of the tissue of origin for the different cancers. Previous studies have also identified tissue and cell specific differences in carcinogenesis [21,36,43,44]. Tomasetti & Vogelstein linked the tissue specific variation to lifetime stem cell divisions [36,43]. Nunney suggests that tissue specific variations may be a result of adaptive evolution, and that these adaptations may be a function of the number of tissue specific cell divisions [44]. Schneider et. al. have attributed tissue-specific difference in cancer to differences in cellular environment and signaling networks [21].
Tomasetti and Vogelstein [43], for example, found a strong correlation (Pearson's linear correlation = 0.804) between the incidence of different cancer types and the lifetime stem cell divisions in the corresponding tissue, highlighting the importance of tissue specific cellular growth characteristics on cancer incidence. We also found a moderate correlation (Pearson's linear correlation = 0.52) between the estimated number of hits and lifetime stem cell division for eight cancer types that match the Tomasetti and Vogelstein classification (Fig 5). However, this relationship may be coincidental (95% confidence interval = -0.29 to 0.90) due to the small number of data point.  An analysis of breast invasive carcinoma (BRCA) subtypes shows that the distribution of somatic mutations are similar across subtypes (Fig 6(A)), resulting in identical optimal multicombination multi-hit models. This again suggests that the number of hits may depend on the cellular growth characteristics of the tissue. The 709 BRCA samples, for which subtype information was available in TCGA, contain 62% hormone receptor positive (HR+) / tyrosine kinase-type cell surface receptor negative (HER2-), 17% HR-/HER2-(triple negative), 17% HR +/HER2+, and 4% HR-/HER2+ subtypes (roughly comparable to a recent study showing 73%, 12%, 10%, and 5% respectively) [45]. We subdivided the samples into four subsets corresponding to these subtypes. The estimated number of hits, and the number of multi-hit combinations was identical for all four subtypes. One possible explanation is that the number of hits may be determined by the growth characteristics of the tissue. Although the different subtypes may result from different combinations of mutations, the estimated number of hits are the same.
Another example of a possible link between the number of hits and tissue specific cellular growth characteristics are the tissues whose growth is regulated by female sex hormones: breast, ovarian and uterine epithelia [46]. These three cancers require on average only twothree hits. We speculate that since these tissues are actively growing for a relatively short time (from puberty to menopause), the associated cells may not require a more robust cell proliferation control mechanism, i.e. larger number of hits. Which may explain why continued exposure to female sex hormones past menopause increases the risk of breast, uterine and ovarian cancers [47][48][49].
For BRCA, the distribution of somatic mutations is also similar across the four cancers stages, corresponding to the same number of hits (Fig 6(B)), suggesting that there is no significant increase in somatic mutations with cancer stage. One possible explanation is that tumor growth originates from a relatively small number of cancer stem cells with limited subsequent differentiation [50]. A second possible explanation is that the variance in somatic mutations across individuals may obscure the relatively smaller increase in somatic mutations with  TCGA for which  cancer stage information was available, was 1551, 1433, 1299 and 1429 for stages I, II, III and IV respectively, also suggesting no significant increase in somatic mutations with cancer stage.

Potential effect of deviations from model assumptions
As noted above, unlike current models, we do not make assumptions about mutation rate. However. to derive a closed-form mathematical model (Eqs (1)-(3)) for this complex biological process, we make two simplifying assumptions. These assumptions are similar to the assumptions made by current models [4][5][6][7][8][9][10]. (1) Somatic mutations are equally likely at any location in the genome, and (2) carcinogenesis is driven primarily by somatic mutations. While the first assumption is unlikely to affect our results, the second assumption will likely result in the number of hits being understated, as discussed below.
Contrary to assumption (1), susceptibility to mutations varies across the genome. For example, CpG sites and tandem repeat regions are known to be much more mutable than the overall average rate [51][52][53]. However, these regions are distributed across the genome and across genes. Therefore, although tumors may be more likely to include hits to these regions, the estimated number of hits themselves are unlikely to be affected, e.g. more of the h-hit combinations found in the tumor samples may include mutations to tandem repeat regions, but the number of hits and number of possible combinations remain the same. In addition, our results are robust to large changes in the value of G, the number of possible somatic mutations. S1 and S2 Tables show that, when the value of G is changed by a factor of eight, the estimated number of hits for all cancers with at least 100 samples remain the same. As an extreme example, if we assume that all somatic mutations only occur in 12.5% (1/8 th ) of the genome, with no somatic mutations in the rest of the genome. Then, the value of G would change by a factor of eight, and as we have shown (S1 and S2 Tables), even such an extreme change does not affect the estimated number hits. It is also unclear if the variance in mutability can be incorporated into a closed form solution, such as Eqs (1)-(3). If it could be done, any potential benefit is outweighed by the increased dimensionality and higher uncertainty due to the additional parameters needed.
Contrary to assumption (2), in addition to somatic mutations, other changes, such as germline mutations and epigenetic changes are also known to be responsible for tumor initiation. Germline mutations, in particular, are known to indicate a significant predisposition to cancer. For example, potentially pathogenic germline BRCA mutations were found in approximately 5% of ovarian cancers and in 12% of young-onset breast cancer, representing 1% of all breast cancer cases [54,55]. If we conservatively assume that 10% of hits are present in the germline, then our model underestimates the number of hits by 10%. For example, when we estimated that BRCA could consist of a mix of 80% 3-hit and 30% 4-hit combinations ( Table 1), some of the 3-hit combinations could be 4-hit combinations with one germline hit and some of the 4-hit combinations could be 5-hit combinations with one germline hit, i.e. even if you are born with one strike against you, an additional 3-4 hits are required for carcinogenesis.
The mathematical model itself does not exclude mutations to non-exome regions, such as mutations in RNA, regulatory, and untranslated regions [56]. However, due to the limited availability of whole genome sequencing data for matched tumor and blood derived normal samples, we used data from whole exome sequencing. The value of G in Eq(3) was set accordingly to correspond to the size of the exome. In general, we find that the number of hits is determined primarily by the variance in distribution around the mode, as seen in Fig 3. Incorporating mutations to non-exome regions will increase the mode, but it is not clear if it will significantly alter the variance about the mode. For example, when we exclude silent mutations, the variance in the distribution of protein-altering mutations is similar enough that the estimated number of hits is the same for all but one of the 17 cancers.

Conclusions
Carcinogenesis can result from many different combinations of a small number of hits. However, identifying these specific combinations from an exponentially large number of possible combinations requires knowing with some degree of certainty the number of hits. In this study, we developed a mathematical model, from first principles, for predicting the distribution of somatic mutations at carcinogenesis. The model allows us to estimate the number of hits without knowing or making assumptions about the highly variable and difficult to estimate mutation rate, which is a critical limitation of current models. The number of hits required for carcinogenesis, estimated by our model, varies from two to eight depending on cancer type. The distinct distribution of somatic mutations for different cancer types, suggests tissue and cell specific carcinogenic mechanisms. The number of hits corresponding to these distinct distributions may provide insight into the etiology of different cancers types.

Materials and methods
Described below are the derivation for the probability function for the multi-combination multi-hit model, and data and procedure used for estimating the number of hits.

Probability of carcinogenesis as a function of somatic mutations
We derive here an expression for the probability of carcinogenesis for the multi-combination multi-hit model. The probability is expressed in terms of G the number of all possible somatic mutations that can become fixed in the human genome, h the number of specific mutations (hits) required for carcinogenesis, k the number of different combinations of h, and m the total number of somatic mutations. For the purpose of this model, we assume that: (1) mutations in any location within the genome is equally likely, and (2) carcinogenesis is primarily a result of somatic mutations. In the Results and Discussion section, we examine the potential effect of deviations from these assumptions. The probability of carcinogenesis is formulated as a function of m, h, k and G.  5. Probability of h hits in one of k possible combinations: P k (m)�1−[1−P h (m)] k . This relationship is an approximation because the individual combinations are not independent. However, it is a reasonable approximation since G�m. In addition, the cumulative probability approaches 1 as m increases, providing support for the reasonableness of the approximation. The term in brackets represents the probability of not hitting a given combination of h-hits. The exponential term represents the probability of not hitting any of the k possible combination.
6. wProbability of the last (h th ) hit occurring on the last (m th ) somatic mutation, P(m)� P k (m)−P k (m−1), i.e. the difference between the probability of hitting after m and m-1 somatic mutations.
In the above derivation, there are several probability terms with distinct meanings. The lower case p refer to a number of permutations, while the upper case P refer to a probability.

The Cancer Genome Atlas (TCGA) data used to estimate the number of hits
The TCGA data repository includes somatic mutations calculated by comparing whole exome sequencing data from matched blood derived normal and tumor tissue samples (MAF files). For this study we used the somatic mutations calculated using the MuTect2 variant caller [57]. We had access to data for 8292 samples for 32 cancer type. Of these, 6904 samples were for the 17 cancer types with over 200 samples.

Identifying the optimal multi-combination multi-hit model
To estimate the number of hits, we compared the actual distribution of somatic mutations in TCGA to the distribution calculated by the model for different numbers of hits and combinations. The estimated number of hits is the value that minimizes the root mean squared difference (RMSD) between the actual and calculated distribution. The process consisted of the following steps: 1. For each cancer type, using MAF data from TCGA, we calculated the number of somatic mutations per sample.
2. The samples were grouped by the number of somatic mutations per sample into bins of size 100 somatic mutations, i.e. 0-100, 101-200, etc., to calculate the distribution.
4. To identify the optimal multi-combination multi-hit model we minimized RMSD between the actual and calculated distributions.