KJB and LC are employees of NantOmics, LLC. This does not alter our adherence to PLOS ONE policies on sharing data and materials.
Cancer is sometimes depicted as a reversion to single cell behavior in cells adapted to live in a multicellular assembly. If this is the case, one would expect that mutation in cancer disrupts functional mechanisms that suppress cell-level traits detrimental to multicellularity. Such mechanisms should have evolved with or after the emergence of multicellularity. This leads to two related, but distinct hypotheses: 1) Somatic mutations in cancer will occur in genes that are younger than the emergence of multicellularity (1000 million years [MY]); and 2) genes that are frequently mutated in cancer and whose mutations are functionally important for the emergence of the cancer phenotype evolved within the past 1000 million years, and thus would exhibit an age distribution that is skewed to younger genes. In order to investigate these hypotheses we estimated the evolutionary ages of all human genes and then studied the probability of mutation and their biological function in relation to their age and genomic location for both normal germline and cancer contexts. We observed that under a model of uniform random mutation across the genome, controlled for gene size, genes less than 500 MY were more frequently mutated in both cases. Paradoxically, causal genes, defined in the COSMIC Cancer Gene Census, were depleted in this age group. When we used functional enrichment analysis to explain this unexpected result we discovered that COSMIC genes with recessive disease phenotypes were enriched for DNA repair and cell cycle control. The non-mutated genes in these pathways are orthologous to those underlying stress-induced mutation in bacteria, which results in the clustering of single nucleotide variations. COSMIC genes were less common in regions where the probability of observing mutational clusters is high, although they are approximately 2-fold more likely to harbor mutational clusters compared to other human genes. 
Our results suggest this ancient mutational response to stress that evolved among prokaryotes was co-opted to maintain diversity in the germline and immune system, while the original phenotype is restored in cancer. Reversion to a stress-induced mutational response is a hallmark of cancer that allows for effectively searching “protected” genome space where genes causally implicated in cancer are located and underlies the high adaptive potential and concomitant therapeutic resistance that is characteristic of cancer.
A defining quality of life is its phenotypic plasticity, generated through the ability to regulate gene expression and other cellular functions in response to environmental factors, critical properties that enable organisms to respond to a wide variety of environmental challenges in a coordinated and systematic way [
Cancer is a disease of bodies, and therefore of multicellular organisms, yet many of the hallmarks of cancer [
Here we present evidence demonstrating that cancer manifests as an atavistic recapitulation of pre-metazoan [
Gene homologies represent the evolutionary history of gene families. Accordingly, an ortholog of a human gene found in any other species can be assumed to have diverged from a common ancestor. Thus, by grouping orthologous genes into gene families, the age of the human gene can be identified by the divergence time of the last common ancestor of all the species contained within the gene family.
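The age assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `gene_family_age`, the species labels, and the divergence times are hypothetical placeholders, not values from any homology or divergence-time database. The deepest split from human among the family members corresponds to the last common ancestor of the whole family, so the gene age is taken as the maximum human-to-species divergence time.

```python
from typing import Dict, Iterable


def gene_family_age(species_in_family: Iterable[str],
                    divergence_from_human_my: Dict[str, float],
                    human: str = "Homo sapiens") -> float:
    """Age (in MY) of the human member of a gene family.

    The age is the largest divergence time between human and any other
    species represented in the family (hypothetical helper, for illustration).
    """
    times = [divergence_from_human_my[s] for s in species_in_family if s != human]
    if not times:
        # Family contains only human sequences: no divergence is observed.
        return 0.0
    return max(times)


# Placeholder divergence times in MY (illustrative only, not real estimates).
divergence = {"species_a": 100.0, "species_b": 400.0, "species_c": 1200.0}
family = ["Homo sapiens", "species_a", "species_b"]
example_age = gene_family_age(family, divergence)
```

In this toy example the family reaches back only as far as `species_b`, so the human gene is assigned that divergence time even though an older species exists in the lookup table.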
Given that this approach is contingent on the definition of homology, more accurate gene family builds will lead to better estimations of gene ages. We looked at three pertinent homology databases to identify the one with the most coverage across all kingdoms of life and the most robust human gene families. We considered Ensembl Compara / Ensembl Pan-Taxonomic Compara [
We then determined gene ages as the maximum phylogenetic divergence time between humans and all the species represented in each corresponding gene family according to the TimeTree database [
Sanger’s Catalogue Of Somatic Mutations In Cancer (COSMIC) is a comprehensive resource of somatic mutations in human cancer [
We obtained variants called from whole genome sequence (WGS) samples from the International Cancer Genomics Consortium (ICGC) data portal [
A priori, we removed all events occurring in regions known to be involved in somatic hypermutation [
We used the Functional Enrichment clustering tool of DAVID [
The atavistic model of cancer presumes that the cancer phenotype is to some degree an evolutionarily conserved ‘genetic subroutine’ that is suppressed by multicellularity but becomes re-activated through oncogenic progression [
We tested the first hypothesis by establishing the evolutionary ages of 19,756 human genes by assigning them to gene families according to the Ensembl Compara homology database. We then defined the age of the human member of the gene family as the maximum phylogenetic divergence time between humans and the species represented in the corresponding gene family. Next we examined mutational frequencies as a function of the evolutionary age of each gene in both normal tissue (“normal”) and cancer. In normal tissue, we analyzed the private SNVs from 129 individuals derived from 1000 Genomes Project whole genome sequencing trio data [
The Enrichment Ratio is the observed rate of mutation of a gene (in mutations per base-pair) divided by the value expected under the null hypothesis of uniform random mutation. We categorized genes in three main age groups, corresponding to post-metazoan (less than 500 MY), metazoan (between 500 and 1000 MY) and pre-metazoan (more than 1000 MY) ages, and produced the distribution of Enrichment Ratios for each group. Genes younger than 500 MY are mutated significantly more frequently in both normal (A) and cancer (B). In addition, the frequency of mutation declines as the age of the gene increases. P-values in each case are taken as the maximum of the p-value given by a Tukey's range test across the three groups and the p-value of a pair-wise t-test comparison.
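As a rough illustration of the Enrichment Ratio under the uniform null model, the sketch below computes the expected number of mutations in a gene from its length and divides the observed count by it. The function names, the example numbers, and the handling of genes exactly 500 or 1000 MY old are assumptions made for this sketch, not details taken from the analysis.

```python
def enrichment_ratio(observed_mutations: int,
                     gene_length_bp: int,
                     total_mutations: int,
                     genome_length_bp: int) -> float:
    """Observed mutations over the count expected under uniform random mutation.

    Under the null model every base pair is equally likely to be hit, so the
    expected count is proportional to gene length.
    """
    expected = total_mutations * gene_length_bp / genome_length_bp
    return observed_mutations / expected


def age_group(age_my: float) -> str:
    """Assign a gene to one of the three age groups (boundary handling assumed)."""
    if age_my < 500:
        return "post-metazoan"
    if age_my <= 1000:
        return "metazoan"
    return "pre-metazoan"


# Toy numbers: a 10 kb gene carrying 20 of 1,000,000 mutations in a 1 Gb genome.
example_er = enrichment_ratio(20, 10_000, 1_000_000, 1_000_000_000)
```

Here the gene is expected to carry 10 mutations, so an observed count of 20 gives a ratio of 2, meaning the gene is mutated twice as often as its size alone would predict.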
We then addressed whether a similar pattern exists in cancer. We looked at 764 samples from the ICGC (release 19) that had whole genome sequencing with calls for both simple somatic mutations and structural mutations. Under the same null model assumption, genes in cancer cells had 15% less mutation compared to non-gene regions of the genome. Thus cancer recapitulates the pattern seen in normal tissue: mutation occurs predominantly outside of genes, and mutation that occurs within genes is more frequent in genes younger than 500 MY (Table A in
We also examined the patterns of mutation in cancer relative to what was observed in the normal tissue, which is equivalent to a non-uniform but random distribution in the genome, as shown in
For each human gene, the expected number of mutations is obtained based on the normal mutation pattern: frequency of normal mutations times the total number of cancer mutations recorded in the data set. Accordingly, the Enrichment Ratio (ER) is calculated as the ratio of observed cancer mutations to the number of expected mutations in the gene. Over-mutated genes have ER > 1.5; under-mutated genes have ER < 0.67 (that is, less than 1/1.5). Numbers in legend indicate the size of each gene set. Cross marks (X) on bar tips indicate the enrichment in that category is statistically significant at p < 0.01 according to a bootstrap test taking random samples from the set of all human genes (BSQ < 1%, see
To test the second hypothesis that genes that are both frequently and causally mutated in cancer are evolutionarily younger than the emergence of multicellularity as a whole (<1000 MY), we evaluated the evolutionary ages of genes demonstrated to be causally mutated in cancer as compiled by COSMIC in the Cancer Gene Census [
9.26x10-5 | 0.0126 | 0.00196 | |||||||
2.49x10-16 | 4.38x10-17 | 1.02 | 0.118 | 81.32% | |||||
1.05x10-5 | 2.17x10-4 | 1.32 | 0.0625 | 9.96% | |||||
0.71 | 0.298 | 6.94% | 0.59 | 0.15 | 2.08% | 1.23 | 0.495 | 39.48% | |
0.97 | 0.907 | 81.60% | 0.93 | 1 | 67.42% | 0.96 | 0.931 | 81.44% | |
0.93 | 1 | 64.70% | 0.72 | 0.438 | 15.66% | 1.4 | 0.365 | 23.02% | |
NA | NA | NA | NA | NA | NA | NA | NA | NA | |
0.72 | 0.298 | 5.40% | 0.00145 | 0.0395 | 2.14% | ||||
2.90x10-4 | 0.177 | 7.51x10-4 | |||||||
5.46x10-8 | 7.54x10-4 | 4.34% | 4.72x10-5 | ||||||
0.91 | 0.781 | 44.86% | 0.92 | 0.781 | 60.84% | 0.9 | 0.781 | 52.28% | |
1.18 | 0.566 | 39.40% | 1.33 | 0.566 | 27.66% | 1.04 | 0.781 | 71.66% | |
1.18 | 0.566 | 37.76% | 1.19 | 0.689 | 44.62% | 1.17 | 0.689 | 46.78% | |
0.95 | 1 | 77.96% | 0.84 | 1 | 57.98% | 1.04 | 0.781 | 71.64% | |
NA | NA | NA | NA | NA | NA | NA | NA | NA | |
1.4 | 0.126 | 6.40% | 1.12 | 0.781 | 56.66% | 1.65 | 0.0873 | 3.86% |
Score indicates the enrichment (> 1) or depletion (< 1) of genes in the age category. Scores that are statistically different from 1 as determined by either the p-value or BSQ are bolded.
a Enrichment and p-values were computed as indicated in Zeeberg, et al. [
b Bootstrap quantile (BSQ) score: for a given gene set, 10,000 random samples of the same size are taken without replacement from the parental list of 19,756 aged human genes, and the distribution of ages is calculated for each one. The BSQ score for an age group is the percentile at which the observed frequency for that age group falls within the sampling ensemble. Hence, any BSQ value of less than 1% indicates that the observation is highly unlikely under random sampling.
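The bootstrap procedure in note b can be sketched as follows. This is a minimal illustration, not the exact procedure behind the table. The note does not say which tail of the sampling ensemble is reported, so the sketch assumes the smaller of the two tails. The function name `bsq_score`, the `in_bin` predicate, and the toy ages are likewise hypothetical.

```python
import random
from typing import Callable, Sequence


def bsq_score(gene_set_ages: Sequence[float],
              all_gene_ages: Sequence[float],
              in_bin: Callable[[float], bool],
              n_samples: int = 10_000,
              seed: int = 0) -> float:
    """Percent of random gene sets at least as extreme as the observed one.

    Random sets of the same size are drawn without replacement from the
    parental list; the smaller tail is returned (an assumption of this sketch).
    """
    rng = random.Random(seed)
    k = len(gene_set_ages)
    observed = sum(in_bin(a) for a in gene_set_ages) / k
    lower = upper = 0
    for _ in range(n_samples):
        sample = rng.sample(list(all_gene_ages), k)  # sampling without replacement
        freq = sum(in_bin(a) for a in sample) / k
        lower += freq <= observed
        upper += freq >= observed
    return 100.0 * min(lower, upper) / n_samples


# Toy parental list: half young (100 MY) and half old (2000 MY) genes.
parental = [100.0] * 50 + [2000.0] * 50
is_young = lambda age: age < 500
skewed = bsq_score([100.0] * 10, parental, is_young, n_samples=2000)
balanced = bsq_score([100.0] * 5 + [2000.0] * 5, parental, is_young, n_samples=2000)
```

A gene set made up entirely of young genes lands far in the tail of the ensemble, whereas a set that mirrors the parental age distribution does not.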
(A) Age distribution of dominant (green) and recessive (orange) genes from COSMIC Cancer Gene Census. Grey bars represent the age distribution of all human genes in ENSEMBL, and blue the age distribution of all COSMIC genes. Numbers in legend are the sizes of each gene set. Cross marks (X) on bars tips indicate the enrichment in that category is statistically significant according to Gene Enrichment Score method and a bootstrap test (BSQ < 1%, see
COSMIC contains 458 genes with different mutational modes of action: those that yield dominant phenotypes and therefore require a single mutant allele (343), and those that give recessive phenotypes (101), requiring that all alleles within the cell be altered. Twelve genes in this list have no clearly defined molecular genetics. It should be noted that the set of dominant genes overlap to a large degree with oncogenes (260 out of 264 genes considered oncogenes [
The paradox of cancer-causing genes being under-represented in the age bin with the highest frequency of mutation suggests there may be an underlying mechanism that explains the shift in mutational frequency revealed by determining the functions of the dominant versus recessive genes. Functional annotation and enrichment analysis of COSMIC genes using DAVID [
Each node in this network represents a group of functionally related genes as returned in DAVID (gene ontology, orthology, functional annotations, etc.). The size of the node represents the number of genes in it. Links between nodes represent gene overlaps between groups, with the width representing the number of genes. Node colors indicate the general functional categories defined in the legend revealing an additional layer of clustering of gene groups. The number in the node indicates the group label as given in
In bacteria, the process of adaptive mutation results in a molecular fingerprint in the form of a cluster of SNVs around each DSB[
To address these shortcomings, we used whole genome sequencing data (ICGC release 19) that showed evidence for DSBs. We then evaluated whether or not SNVs clustered in each sample. Out of 764 tumor samples from seven different sites (pancreas, prostate, bone, ovary, skin, blood, and brain), 668 (87.4%) had evidence of SNV clustering. These clusters do not necessarily represent kataegis, defined as 6 or more mutations with inter-mutational distance of 1kb or less [
We observed a distinctive difference in the non-random spatial distribution of clusters across the genome in both normal and cancer (
Circos plot showing the distribution of SNV clustering for chromosomes 1, 3, 13 and 17. Tracks from inside out are: blue, evolutionarily re-used breakpoint regions (EBR); green, amniote homologous synteny regions (mHSB); orange, hot spots of CM clusters in normal; and red, hot spots of CM clusters in cancer. The outside text track shows symbols for COSMIC genes at their corresponding genomic locations. Dominant genes are in black font and recessive genes are in red font.
When human genes were classified as either “metazoan” (less than 1000 MY old) or “pre-metazoan” (older than 1000 MY), we found that the set of pre-metazoan genes overlapped with HSBs and was excluded from EBRs, as might be expected (
Gene set | Overlap with | Odds Ratio | 95% Confidence Interval | p-value |
---|---|---|---|---|
0.6886 | 0.6503–0.7290 | <2.2x10-16 | ||
1.096 | 1.024–1.174 | 8.342 x 10−3 | ||
1.452 | 1.372–1.538 | <2.2x10-16 | ||
0.9124 | 0.8521–0.9769 | 8.342 x 10−3 | ||
1.7240 | 1.409–2.118 | 3.859 x 10−8 | ||
0.7689 | 0.5965–0.9814 | 0.03454 |
Odds ratios >1 indicate enrichment, while odds ratios <1 indicate depletion. HSB, homologous synteny region; EBR, evolutionarily re-used breakpoint region.
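For readers who want to see how an odds ratio and its confidence interval relate to a 2x2 overlap table, a minimal sketch follows. The text does not state how the tabulated values were computed. The sample odds ratio with a Wald interval on the log scale is used here purely as a common illustrative choice, and the counts are invented.

```python
import math
from typing import Tuple


def odds_ratio_wald(a: int, b: int, c: int, d: int,
                    z: float = 1.96) -> Tuple[float, float, float]:
    """Sample odds ratio and Wald confidence interval for a 2x2 table.

    a: genes in the set that overlap the region
    b: genes in the set that do not overlap the region
    c: other genes that overlap the region
    d: other genes that do not overlap the region
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(odds_ratio) - z * se_log)
    high = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, low, high


# Invented counts for illustration only.
or_value, ci_low, ci_high = odds_ratio_wald(10, 20, 5, 40)
```

An interval lying entirely above 1 would indicate enrichment of the gene set in the region, and one entirely below 1 would indicate depletion, matching the reading of the table given in its note.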
When considering mutations in normal samples, clustering hotspots co-localized with EBRs and were excluded from HSBs, independently of whether the analysis included all SNV clusters or only those that overlapped genes (
Normal | Hotspots in | Odds Ratio | 95% Confidence Interval | p-value |
---|---|---|---|---|
0.3053 | 0.3012–0.3093 | <2.2x10-16 | ||
1.135 | 1.119–1.152 | <2.2x10-16 | ||
0.3256 | 0.3183–0.3331 | <2.2x10-16 | ||
1.757 | 1.715–1.800 | <2.2x10-16 |
The comparison was run looking at private SNVs (determined from trio comparison) clustering across the entire genome as well as clustering that only overlapped genes. HSB, homologous synteny region; EBR, evolutionarily re-used breakpoint region.
Category | Odds Ratio | 95% Confidence Interval | p-value |
---|---|---|---|
0.7666 | 0.7189–0.8172 | <2.2x10-16 | |
0.3246 | 0.2897–0.4034 | <2.2x10-16 |
For mutations observed in cancer samples, clustering hotspots typically overlapped with younger genes (mean gene age in hotspots was 1035 MY versus 1360 MY for genes outside of hotspots, t = -10.412, df = 1007.8, p<2.2x10-16). Unlike the normal data, the overlap of hotspots with either HSBs or EBRs depended on the clusters included in the analysis. Genome-wide, hotspots were excluded from HSBs, but among clusters that overlapped genes, there was no exclusion or enrichment (
Cancer | Hotspots in | Odds Ratio | 95% Confidence Interval | p-value |
---|---|---|---|---|
0.5443 | 0.532–0.5569 | <2.2x10-16 | ||
0.8291 | 0.8078–0.851 | <2.2x10-16 | ||
0.9598 | 0.9164–1.005 | 0.08166 | ||
1.4354 | 1.370–1.504 | <2.2x10-16 |
The comparison was run looking at all clusters as well as only those clusters that overlap genes. HSB, homologous synteny region; EBR, evolutionarily re-used breakpoint region.
Interestingly, the hotspot enrichment of young genes was even more evident when we compared the age distribution of mutated genes for both normal and cancer data (
(A) Age distribution of all genes mutated in normal samples data (blue), genes that have neutral level of mutation, as expected from a uniform random distribution (green) and genes in hotspots (orange). Grey bars represent the age distribution of all human genes. Numbers in legend are the sizes of each gene set. Cross marks (X) indicate the enrichment in that category is statistically significant according to a bootstrap test (BSQ < 1%, see
Our work highlights the deep evolutionary roots of cancer and the importance of the evolutionary history of the genome in mutational processes driving oncogenesis. Previous studies of cancer gene ages rely on sparse phylogenetic trees [
The answer would seem to lie in the inherent conflict between different levels of selection that operate in a multicellular organism, where, particularly during development, there is selection both at the cellular level and at the organismal level. Many of the genes that are causal in cancer have significant roles in development [
The functional annotation of old recessive cancer genes led to the hypothesis that stress-induced mutation plays a role in genomic instability in cancer and the mutational clusters seen in cancer represent the molecular signature of a conserved stress-induced mutagenesis response. The genes in humans that are orthologous to the error-prone polymerases that mechanistically drive the stress-induced mutation response in bacteria have become specialized DNA polymerases for translesion synthesis (TLS) employed during replication by-pass of DNA damage [
In bacteria the stress-induced mutation response leaves behind a molecular signature that can be detected in the form of SNV clusters around DSBs [
In cancer, the role of TLS in the generation of genomic instability has been recognized but attributed to oncogene-induced replication-stress, not the induction of a programmed mutational response [
Altogether, our results on the age of the recessive genes, the homology of the proteins involved in stress-induced mutation in bacteria to non-mutated genes in DNA repair and cell cycle pathways in humans, and the observation of the molecular signature of stress-induced mutation in human tumors are strong evidence for the restoration of a stress-induced mutational response in somatic cells. Our analysis supports the idea that the stress-induced mutational program remains functional but has become cell-lineage constrained. Based on our analysis we propose that, in multicellular organisms, the restriction of mutational processes that promote evolution to the germline and the immune system was brought about by re-wiring the input for the mutational response to be a developmental signal, rather than a cellular stress signal. This in turn suggests epigenetic control. A variety of conditions, such as chronic inflammation, may lead to microenvironments where the epigenetic regulation that keeps the mutational program under developmental and lineage control is altered, allowing somatic cells inappropriate access to a stress-induced mutational response. Thus, we propose stress-induced mutation as a hallmark of cancer reflected by genomic instability.
Our results have important implications for the clinical management of cancer. There is already evidence that TLS polymerase expression contributes to both intrinsic and acquired resistance to genotoxic therapies [
In conclusion, our analysis suggests that the observed phenotype of evolvability in cancer is driven by re-activation of an evolutionarily ancient stress-induced mutational response. Understanding the parameters of this response will be key to maximizing the effectiveness of cancer treatment.
(DOCX)
HomoloGene fails to reveal tree nodes corresponding to events of early evolution (older than 1500 MY), in turn giving a relative over-representation of recent events (less than 500 MY). The evolutionary time spanned by HomoloGene is later than the evolution of multicellularity.
(TIFF)
Frequency of gene mutation according to gene age. The distribution of mutation frequencies for each age group is estimated and shown as vertical violin and box plots. Horizontal lines are the medians, circles are the means, and black dots are distribution outliers in each case. The vertical axis is in log scale. Corresponding plots are shown for both normal (A) and cancer data (B). In both cases it is evident that genes in the first age bin (age < 500 MY) are typically mutated less frequently than the rest. (C) Distribution of gene lengths according to age group membership. Young genes are typically shorter than other genes. Frequency of gene mutation normalized by gene length for both normal (D) and cancer data (E) shows that young genes are more likely to be mutated. Groups were compared via ANOVA followed by Tukey’s post-hoc test to determine which relationships were driving the partitioning of variation. In normal (D), the <500 MY age bin is more frequently mutated compared to all other age bins (for all pair-wise comparisons, p<2.2x10-16). In cancer (E), the <500 MY age bin is more frequently mutated compared to all other age bins (for all pair-wise comparisons, p<2.2x10-16). Additionally, the 500–1000 MY age bin was more frequently mutated compared to 1000–1500 MY (p = 10−7), 1500–2000 MY (p = 3.2x10-6), and 2000–2500 MY (p = 3.6x10-4).
(TIF)
For each human gene, the expected number of mutations is obtained according to the normal mutation pattern: frequency of normal mutations times the total number of cancer mutations. The Enrichment Ratio (ER) is the ratio of observed cancer mutations to the number of expected mutations in the gene. We define six different gene categories according to the level of enrichment and produce age distributions. Unexpected mutated genes are those genes that are never normally mutated but are mutated in cancer; Severely over-mutated genes are those with over 10 times more mutations in cancer than normal (ER>10); Moderately over-mutated genes are mutated 1.5 to 10 times more in cancer than normal (10>ER>1.5); Unaffected genes have more or less the same number of mutations in cancer as in normal (1.5>ER>0.67); Moderately under-mutated genes are mutated up to ten times less than normal (0.67>ER>0.1); and Severely under-mutated genes are mutated more than 10 times less than normal, including a few genes that normally mutate but are never found mutated in cancer. Numbers in legend are the sizes of each gene set. Cross marks (X) on bar tips indicate the enrichment in that category is statistically significant according to a bootstrap test.
(TIFF)
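The six categories defined in the caption above can be illustrated with a short sketch. It assumes that the frequency of normal mutations for a gene is its share of all normal mutations. How values falling exactly on a threshold are assigned is also an assumption, as are the function name `classify_gene` and the extra "not mutated" label for genes with no mutations in either data set.

```python
def classify_gene(normal_count: int, cancer_count: int,
                  total_normal: int, total_cancer: int) -> str:
    """Place a gene in an enrichment category from its normal and cancer counts.

    Expected cancer mutations = (gene's share of normal mutations) x
    (total cancer mutations); ER = observed / expected.
    """
    if normal_count == 0:
        # ER is undefined when nothing is expected; handled as its own category.
        return "unexpected" if cancer_count > 0 else "not mutated"
    expected = normal_count * total_cancer / total_normal
    er = cancer_count / expected
    if er > 10:
        return "severely over-mutated"
    if er > 1.5:
        return "moderately over-mutated"
    if er > 0.67:
        return "unaffected"
    if er > 0.1:
        return "moderately under-mutated"
    return "severely under-mutated"  # includes genes never mutated in cancer


# Toy counts: the gene holds 1% of normal mutations, so 100 of 10,000
# cancer mutations are expected in it.
label_over = classify_gene(10, 300, 1000, 10_000)
label_unexpected = classify_gene(0, 5, 1000, 10_000)
label_under = classify_gene(10, 0, 1000, 10_000)
```

With 300 observed cancer mutations against 100 expected, the toy gene has ER = 3 and falls in the moderately over-mutated category.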
Chromosomes 1 to 8.
(TIF)
Chromosomes 9 to 16.
(TIF)
Chromosomes 17 to 22 and X.
(TIF)
(XLSX)
The network plot for the enrichment is shown in
(XLSX)
Human orthologs of
(XLSX)
(XLSX)
We thank Susan Rosenberg and Bob Austin for insightful discussions into the role of genomic instability in cancer within an evolutionary context. We also acknowledge the work of the clinical collaborators, data analysis teams, and funders in generating the WGS data in the ICGC database, release 19. In particular, data from the Ewing sarcoma sequencing project was supported by grants from the Institut National de la Santé et de la Recherche Medicale (Inserm) in the frame of the ICGC program. The ICGC/SKCA-BR project was supported by Barretos Cancer Hospital. The ICGC/PACA-IT project was supported by Italian Ministry of Education, University, and Research University of Verona.