Mutation Rate Variation is a Primary Determinant of the Distribution of Allele Frequencies in Humans

doi:10.1371/journal.pgen.1006489

Fig 1.

Hypothesis about information in parallel mutations.

If an identical substitution occurred independently in a closely related species, then the variant is unlikely to be deleterious in humans. An identical substitution in a more distantly related species may be less informative because sequence divergence at interacting sites may change the set of preferred alleles, and hence the selective constraint at the site.


Fig 2.

The human SFS conditioned on primate substitution patterns.

(A) An example of the phylogenetic conditioning that defines what we denote as “substituted-orangutan” sites. (B) The cumulative distribution functions (CDF) of the orangutan cSFS (i.e., the SFS of substituted-orangutan sites) and of the SFS of phylogenetically conserved sites. The cSFS is more skewed towards common variants than the SFS of conserved sites. This skew is much more pronounced than the difference between synonymous and nonsynonymous sites. (C) The more closely related the substituted species, the stronger the skew of the cSFS towards common variants (only nonsynonymous mutations shown). The inset shows the rare-variant slice of the CDF for each species, for both synonymous and nonsynonymous variants.
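The comparison in panel B rests on turning per-site derived allele counts into an SFS and then into a CDF. The following is a minimal sketch of that bookkeeping, not the authors' pipeline; the function names and the toy counts are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import accumulate


def unfolded_sfs(derived_counts, n_chromosomes):
    """Entry i-1 is the number of sites carrying i derived copies (i = 1..n-1)."""
    # Keep segregating sites only: drop sites with 0 or n derived copies.
    tally = Counter(c for c in derived_counts if 0 < c < n_chromosomes)
    return [tally.get(i, 0) for i in range(1, n_chromosomes)]


def sfs_cdf(sfs):
    """Cumulative fraction of segregating sites at or below each allele count."""
    total = sum(sfs)
    return [running / total for running in accumulate(sfs)]


# Toy derived-allele counts in a sample of 10 chromosomes
# (illustrative only, not data from the paper).
toy_counts = [1, 1, 1, 2, 3, 1, 5, 0, 10]
toy_sfs = unfolded_sfs(toy_counts, n_chromosomes=10)
toy_cdf = sfs_cdf(toy_sfs)
```

When two such CDFs are plotted together, the one that rises more slowly at low allele counts is the one skewed towards common variants.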


Fig 3.

Mutation rates shape the SFS.

All panels show the SFS of derived alleles constructed from intronic sites. The notation of mutation types refers to mutations on either strand (e.g., A->C indicates an A to C change on either strand). (A) SFS stratified by non-CpG mononucleotide mutation types and CpG transitions, represented by different curves. The fraction of rare variants in CpG transitions is nearly half that of other mutations. (B) Focusing on non-CpG mutations, transitions have an SFS significantly skewed towards common variants compared with transversions. (C) Sharing of polymorphisms between East Asians and Europeans. The excess sharing of CpG polymorphisms at low frequencies is suggestive of multiple occurrences of the mutations. x-axis values are binned on a logarithmic scale. (D) Stratification into coding and template strands reveals differences between the two for some mutation types, suggesting that transcription-associated mutational mechanisms also affect the SFS. CpG mutations were excluded from the analysis in this panel. (E) Recombination rates are negatively correlated with the fraction of rare variants; this could be due to a correlation between recombination rates and mutation rates. x-axis values are standardized to the genome-wide mean and are binned on a logarithmic scale. (F) SFS across chromatin states. Chromatin states in H1 human embryonic stem cells were inferred by ChromHMM. The chromatin state exhibits a substantial association with the fraction of rare variants in CpG mutations, and a modest association in other mononucleotide mutation types. In panels D, E, and F, points show means and lines show 95% confidence intervals computed with a nonparametric bootstrap.
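The confidence intervals in panels D, E, and F come from a nonparametric bootstrap. The caption does not say what unit was resampled or which interval type was used, so the sketch below assumes sites are resampled with replacement and uses a simple percentile interval. The function names and the toy indicator vector are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def bootstrap_ci(values, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(values).

    Assumption: observations (here, sites) are resampled independently
    with replacement; the paper may have used a different scheme.
    """
    rng = random.Random(seed)
    n = len(values)
    # Recompute the statistic on n_boot resamples of the same size.
    stats = sorted(stat(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy indicator per site: 1 if the variant is rare, 0 otherwise
# (illustrative only, not data from the paper).
toy_is_rare = [1] * 30 + [0] * 70
low, high = bootstrap_ci(toy_is_rare, mean)
```

The mean of the indicator vector is the fraction of rare variants, so `low` and `high` bracket that fraction for the toy sample.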


Fig 4.

All panels show the unfolded SFS (i.e., constructed using the derived alleles) of intronic sites. (A) Fit of mutational models to the observed SFS. The x-axis shows previously estimated de novo germline mutation rates [45]. These data illustrate that the fraction of rare variants is strongly negatively correlated with germline mutation rates. Lines show expectations under various mutational models: yellow, infinite-sites model (SFS independent of mutation rate); teal, Jukes-Cantor finite-sites model; red, Jukes-Cantor model with within-mutation-type variation (i.e., variation beyond the mutation rate heterogeneity due to the type of mutation in sequence). (B) SFS subsampling and the effect of mutation rate. Dots show the fraction of rare variants in the full-sample SFS of the European population in ExAC. Lines show the expected fraction of rare variants after subsampling to smaller numbers of individuals. In large samples, the SFS of CpG and non-CpG sites are very different. In smaller samples, these differences shrink. In the shaded region, the trend across mutation types changes (the inflection point is indicated by an arrow); with these sample sizes, CpG transitions exhibit more rare variation than non-CpG transitions.
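The expected SFS after subsampling, as in panel B, can be obtained by hypergeometric projection: a site with i derived copies among n chromosomes contributes to count j in a subsample of m chromosomes with probability C(i, j) C(n - i, m - j) / C(n, m). The sketch below is one standard way to compute this and is not necessarily the authors' procedure. Treating "rare" as a derived count of at most `max_count` is an assumption made here, and all names and toy numbers are hypothetical.

```python
from math import comb


def project_sfs(sfs, n, m):
    """Expected unfolded SFS in m chromosomes, given the SFS of n chromosomes.

    sfs[i-1] is the number of sites with i derived copies (i = 1..n-1).
    Sites that become monomorphic in the subsample are dropped.
    """
    denom = comb(n, m)
    projected = [0.0] * (m - 1)
    for i, sites in enumerate(sfs, start=1):
        if sites == 0:
            continue
        for j in range(1, m):
            # Hypergeometric probability of drawing j derived copies.
            projected[j - 1] += sites * comb(i, j) * comb(n - i, m - j) / denom
    return projected


def fraction_rare(sfs, max_count=1):
    """Fraction of segregating sites with at most max_count derived copies.

    Assumption: 'rare' is defined by derived count, not by the paper's cutoff.
    """
    return sum(sfs[:max_count]) / sum(sfs)


# Toy SFS proportional to 1/i for n = 5 (illustrative only, not ExAC data).
toy_full = [60, 30, 20, 15]
toy_sub = project_sfs(toy_full, n=5, m=3)
```

A useful sanity check is that an SFS proportional to 1/i keeps that shape under projection, which is the case for the toy input here.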


Fig 5.

(A) Some mutation types accumulate at a roughly constant yearly rate across different primate lineages. For these mutation types, the expected number of substitutions on an evolutionary branch is proportional to the branch length in years (pink). The yearly rates of other substitution types (blue) depend on life-history traits such as generation time (“generation time effect”). As a result, the composition of substitution types in a lineage depends on lineage-specific traits; this is illustrated by the blue-to-pink ratio, which differs across lineages. (B) Model-based expectations for the distribution of mutation rates at substituted-species sites. These results were computed using a theoretical model and a set of realistic parameters. At substituted-species sites, we expect a distribution skewed towards higher mutation rates compared to random sites or to random polymorphic sites. In addition, the distribution of mutation rates is skewed towards higher mutation rates for substituted species with longer generation times; for the primates we considered in this work, this would imply higher mutation rates for more closely related substituted species. (C) CpG transition enrichment is a strong predictor of cSFS skewness in real data.


Fig 6.

Depletion of rare variants is correlated with relatedness to substituted species.

The figure shows logistic regression coefficient estimates with their corresponding standard errors. Substituted-species labels are spaced by their split times from humans. The lines are least-squares fits of the coefficients to the split times. (A) Estimates from a simple logistic regression on the substituted species. The trend is partly due to differences in mutational composition between substituted-species categories. To test whether the trend is driven solely by mutation rate differences, we estimate coefficients in models that also include the variation explained by (B) mononucleotide mutation type, and (C) combinations of focal mononucleotide mutations and upstream and downstream nucleotides. Even after controlling for mutational composition with these models, a significant trend persists for nonsynonymous variants.
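The trend lines in this figure are ordinary least-squares fits of one coefficient per substituted species against that species' split time from humans. A minimal sketch of such a fit follows; the function name is hypothetical, and the input numbers are placeholders chosen to make the arithmetic easy to check, not split times or coefficients from the paper.

```python
def least_squares_line(x, y):
    """Slope and intercept of the ordinary least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Placeholder values only: x stands in for split times, y for coefficients.
placeholder_split_times = [0.0, 1.0, 2.0, 3.0]
placeholder_coefficients = [1.0, 3.0, 5.0, 7.0]
slope, intercept = least_squares_line(placeholder_split_times, placeholder_coefficients)
```

This unweighted fit ignores the standard errors shown in the figure; a weighted fit would be a natural variant if the errors differ substantially across species.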
