Large scale variation in the rate of de novo mutation, base composition, divergence and diversity in humans

It has long been suspected that the rate of mutation varies across the human genome at a large scale based on the divergence between humans and other species. It is now possible to directly investigate this question using the large number of de novo mutations (DNMs) that have been discovered in humans through the sequencing of trios. We show that there is variation in the mutation rate at the 100KB, 1MB and 10MB scale that cannot be explained by variation at smaller scales, however the level of this variation is modest at large scales – at the 1MB scale we infer that ~90% of regions have a mutation rate within 50% of the mean. Different types of mutation show similar levels of variation and appear to vary in concert which suggests the pattern of mutation is relatively constant across the genome and hence unlikely to generate variation in GC-content. We confirm this using two different analyses. We find that genomic features explain less than 50% of the explainable variance in the rate of DNM. As expected the rate of divergence between species and the level of diversity within humans are correlated to the rate of DNM. However, the correlations are weaker than if all the variation in divergence was due to variation in the mutation rate. We provide evidence that this is due the effect of biased gene conversion on the probability that a mutation will become fixed. We find no evidence that linked selection affects the relationship between divergence and DNM density. In contrast to divergence, we find that most of the variation in diversity can be explained by variation in the mutation rate. Finally, we show that the correlation between divergence and DNM density declines as increasingly divergent species are considered. Author summary Using a dataset of 40,000 de novo mutations we show that there is large-scale variation in the mutation rate at the 100KB and 1MB scale. We show that different types of mutation vary in concert and in a manner that is not expected to generate variation in base composition; hence mutation bias is not responsible for the large-scale variation in base composition that is observed across human chromosomes. As expected large-scale variation in the rate of divergence between species and the variation within species across the genome, are correlated to the rate of mutation, but the correlation between divergence and the mutation rate is not as strong as they could be. We show that biased gene conversion is responsible for weakening the correlation. In contrast we find that most of the variation across the genome in diversity can be explained by variation in the mutation rate. Finally, we show that the correlation between the rate of mutation in humans and the divergence between humans and other species, weakens as the species become more divergent.


Introduction
Until recently, the distribution of germ-line mutations across the genome was studied using patterns of nucleotide substitution between species in putatively neutral sequences (see [1] for review of this literature), since under neutrality 3 the rate of substitution should be equal to the mutation rate. However, the sequencing of hundreds of individuals and their parents has led to the discovery of thousands of germ-line de novo mutations (DNMs) in humans [2][3][4][5][6]; it is therefore possible to begin analysing the pattern of DNMs directly rather than inferring their patterns from substitutions. Initial analyses have shown that the rate of germ-line DNM increases with paternal age [4], a result that was never-the-less inferred by Haldane some 70 years ago [7], varies across the genome [5] and is correlated to a number of factors, including the time of replication [3], the rate of recombination [3], GC content [5] and DNA hypersensitivity [5].
Previous analyses have demonstrated that there is large scale variation in the rate of DNM in both the germ-line [3,5] and the somatic tissue [8][9][10][11][12]. Here we focus exclusively on germ-line mutations. We use a collection of over 40,000 germ-line DNMs to address a range of questions pertaining to the large-scale distribution of DNMs. First, we quantify how much variation there is, and investigate whether the variation in the mutation rate at a large-scale can be explained in terms of variation at smaller scales. We also investigate to what extent the variation is correlated between different types of mutation, and to what extent it is correlated to a range of genomic variables.
There is now convincing evidence that biased gene conversion plays a role in the generating at least some of the variation in . However, this does not preclude a role for mutation bias or selection. With a dataset of DNMs we are able to test explicitly whether mutation bias causes variation in GC-content.
The rate of divergence between species is known to vary across the genome at a large scale [1]. As expected this appears to be in part due to variation in 4 the rate of mutation [3]. However, the rate of mutation at the MB scale is not as strongly correlated to the rate of nucleotide substitution between species as it could be if all the variation in divergence between 1MB windows was due to variation in the mutation rate [3]. Instead, the rate of divergence appears to correlate to the rate of recombination as well. This might be due to one, or a combination, of several factors. First, recombination might affect the probability that a mutation becomes fixed by the process of biased gene conversion (BGC) (review by [26]). Second, recombination can affect the probability that a mutation will be fixed by natural selection; in regions of high recombination deleterious mutations are less likely to be fixed, whereas advantageous mutations are more likely. Third, low levels of recombination can increase the effects of genetic hitch-hiking and background selection, both of which can reduce the diversity in the human-chimp ancestor, and the time to coalescence and the divergence between species. There is evidence of this effect in the divergence of humans and chimpanzees, because the divergence between these two species is lower nearer exons and other functional elements [29]. And fourth, the correlation of divergence to both recombination and DNM density might simply be due to limitations in multiple regression; spurious associations can arise if multiple regression is performed on two correlated variables that are not known without error. For example, it might be that divergence only depends on the mutation rate, but that the mutation rate is partially dependent on the rate of recombination. In a multiple regression, divergence might come out as being correlated to both DNM density and the recombination rate, because we do not know the mutation rate without error, since we only have limited number of DNMs. Here, we introduce a test that can resolve between these explanations.
As with divergence, we might expect variation in the level of diversity across a genome to correlate to the mutation rate. The role of the mutation rate variation in determining the level of genetic diversity across the genome has long been a subject of debate. It was noted many years ago that diversity varies across the human genome at a large scale and that this variation is correlated to the rate of recombination [30][31][32]. Because the rate of substitution between species is also correlated to the rate of recombination,

Distribution of rates
To investigate whether there is large scale variation in the mutation rate we divided the genome into non-overlapping windows of 10KB, 100KB, 1MB and 10MB and fit a gamma distribution to the number of mutations per region, taking into account the sampling error associated with the low number of mutations per region. We focussed our analysis at the 1MB scale since this has been extensively studied before. However, we show that the variation at 1MB forms part of a continuum of variation. We repeated almost all our analyses at the 100KB scale with qualitatively similar results (these results are reported in supplementary tables).
We find that the amount of variation differs significantly between the four studies (likelihood ratio tests: p < 0.001) with the level of variation being far greater in the Michaelson dataset, than in the Wong, Francioli and Kong datasets ( Figure 2; Figure S1 for 100KB). The latter three datasets also show significantly different levels of variation (likelihood ratio test: p < 0.001).
However, the differences between the three largest datasets are quantitatively small at the 1MB scale ( Figure 2). The variation between datasets might be due to differences in age or ethnicity between the individuals in each study, or methodological problems -for example, there might be differences between studies in the ability to identify DNMs. We can test whether callability is an issue in the largest of our datasets because Wong et al. [6] estimated the number of trios at which a DNM was callable at each site. If we reanalyse the Wong data using the sum of the callable trios per MB, rather than the number of sites in the human genome assembly, we obtain very similar estimates of the distribution: the coefficient of variation (CV) for the distribution is 0.27 when we use the number of sites and 0.25 when we use the sum of callable trios.
As expected the number of DNMs per site is significantly correlated between the Francioli and Wong datasets (1MB r =0.15, p<0.001). The correlation is weak, but this is likely to be in part due to sampling error. If we simulate data assuming a common distribution, estimating the shape parameter as the mean CV of the distributions fit to the individual datasets, the mean simulated correlation is 0.22. This suggests that much of the variation is common to the two datasets. However, subsequent analyses, reported below, show some important differences between the two datasets, and as a consequence we repeated all analyses on the Francioli and Wong datasets individually.
The CV of the gamma distribution fitted to the density of DNMs is 0.18 and 0.27 for the Francioli and Wong datasets respectively ( Table 1). The level of variation is significant (i.e. the lower 95% confidence interval of the CV is greater than zero) (Table 1), however the level of variation is quite modest ( Figure 3A, B). A gamma distribution with a coefficient of variation of 0.18 is one in which 90% of regions have a mutation rate within 30% of the mean (i.e. if the mean is one, between 0.7 and 1.3); for a gamma with a CV of 0.27, 95% of regions lie within 43% of the mean. The gamma distribution fits the distribution of rates qualitatively well ( Figure S2), even though a goodness-of- If we include estimates of the distribution for 10KB, 100KB and 10MB we find that the log of the CV of the gamma distribution is approximately linearly related to the log of the scale ( Figure 3C). This suggests that the variation at the 1MB scale is part of a continuum of variation at different scales. The linearity of the relationship suggests that a simple phenomenon underlies the variation. However, the slope of this relationship differs between the Francioli (slope = -0.25) and Wong (-0.10) datasets suggesting that there are some systematic differences between these two studies.
If all the variation at the larger scales is explainable by variation at a smaller scale, then the CV at scale x should be equal to the CV at some finer scale, y, divided by the square-root of x/y; on a log-log scale this should yield a slope of -0.5. This therefore suggests that there is variation at a larger scale that against the window size, both on a log scale, for the Francioli (blue) and Wong (orange) datasets separately.

Mutational types
If we estimate the distribution for individual mutational types we find that in many cases the lower CI on the CV is zero; this might be because we do not have enough data to reliably estimate the distribution for each individual mutational type. We therefore combined mutations into a variety of nonmutually exclusive categories. In each case we estimated the distribution for the relevant category of sites -e.g. in considering the distribution of CpG rates we consider the number of CpG DNMs at CpG sites, not at all sites. We find that the estimated distributions are similar for different mutational types although there is rather more variation at CpG sites in the Francioli dataset (Table 1; 100KB results Table S1). Although the distributions are fairly similar for different mutational types, likelihood ratio tests demonstrate that there is significantly more variation at CpG sites than nonCpG sites and for S>W than W>S changes in the Francioli dataset (Table S2 for 1MB and 100KB results).
We also find significant differences between non-CpG transitions and transversions at 100KB scale in both datasets. Never-the-less the differences between different mutational categories are relatively small.

Correlations between mutational types
Given that there is variation in the mutation rate at the 1MB scale and that this variation is quite similar in magnitude for different mutational types, it would seem likely that the rate of mutation for the different mutational types are correlated. We find that this is indeed the case. We observe significant correlations between all categories of mutations in both the Francioli and Wong datasets, except between CpG transitions and transversions in the Wong dataset ( However, the ML method does not rule out the possibility that there is some variation in the pattern of mutation. Furthermore, the method does not take into account the difference in the mutation rate between CpG and non-CpG sites. We therefore used a second approach in which we ranked regions by their current GC-content and grouped regions together. We then estimated the mutation rates for all categories of mutation using the DNM data and used these estimated mutation rates in a simulation of sequence evolution, in which we evolved the sequence to its equilibrium GC content. We find no correlation between the equilibrium GC content to which the sequence evolves and the current GC content (Figure 4; Figure S3 for 100KB).
1 6 Figure 4. The predicted equilibrium GC content against the current GC content for the Francioli (blue) and Wong (orange) datasets. Note the rightmost point is coincident for the two datasets.

Mutation models
It has been suggested that the mutation rate at a site is predictable based on genomic features, such as replication time [5], or the 7-mer sequence in which a site is found [38]. To investigate whether these models can explain the variation at large scales we used the models to predict the average mutation rate for each 100KB or 1MB region and correlated these predictions against the observed number of DNMs per site.
We find that the density of DNMs is significantly correlated to the rates predicted under the 7-mer model of Aggarwala et al. [38]. This correlation is significantly positive for the combined and Wong datasets, as we might expect, but significantly negative for the Francioli dataset (Table 3; Table S4 for 100KB  male and female recombination rates, but many of the other significant correlations are in opposite directions (Table 4; 100KB results Table S5).
Many of the genomic variables are correlated to each other. If we use principle components to reduce the dimensionality, the first principle component (PC) explains 58% of the variation, the second 13%, the third and fourth 6.9 and 5.7% of the variation. We find that the density of DNMs is significantly correlated to the first PC in both datasets, but the correlation is negative for the Francioli data (r = -0.14, p<0.001) and positive for Wong (r = 0.14, p<0.001   In order to try and disentangle which factors might be most important in determining the rate of mutation we used stepwise regression. We find, as expected, that the models selected for the Francioli and Wong datasets have some commonalities but some differences as well (  Table 5. The standardised regression coefficients from a stepwise multiple regression with forward variable selection (parameter has to be significant at p<0.05 to be added to the model).

Correlation with divergence
The rate of divergence between species is expected to depend, at least in part, on the rate of mutation. To investigate whether variation in the rate of substitution is correlated to variation in the rate of mutation we calculated the divergence between humans and chimpanzees, initially by simply counting the numbers of differences between the two species. There are at least three for the MZ alignments ( Figure S4). The EPO alignment method shows no such bias and we consider these alignments to be the best of those available.
Therefore, we use the EPO alignments for the rest of this analysis. To gain a more precise estimation of the number of substitutions we used the method of Duret and Arndt [21], which is a non-stationary model of nucleotide substitution that allows the rate of transition at CpG dinucleotides to differ to than that at other sites. As expected the divergence along the human lineage (since humans split from chimpanzees) is significantly correlated to the rate of DNMs (Francioli = 0.21 p<0.001; Wong = 0.11, p<0.001). However, the correlation between the rate of DNMs and divergence is not expected to be perfect even if variation in the mutation rate is the only factor affecting the rate of substitution between species; this is because we have relatively few DNMs and hence our estimate of the density of DNMs is subject to a large amount of sampling error. To investigate how strong the correlation could be, we follow the procedure suggested by Francioli et al.
[3]; we assume that variation in the mutation rate is the only factor affecting the variation in the substitution rate across the genome between species and that we know the substitution rate without error (this is an approximation, but the sampling error associated with the substitution rate is small relative to the sampling error associated with DNM density because we have so many substitutions). We generated the observed number DNMs according to the rates of substitution, and then considered the correlation between these simulated DNM densities and the observed substitution rates. We repeated this procedure 1000 times to generate a distribution of expected correlations. Performing this simulation, we find that we would expect the correlation between divergence and DNM density to be 0.33 and 0.48 for the Francioli and Wong datasets respectively, considerably greater than the observed values of 0.21 and 0.11 respectively.
In none of the simulations was the simulated correlation as low as the observed correlation.
There are several potential explanations for why the correlation is weaker As detailed in the introduction there are at least four explanations for why recombination might be correlated to the rate of divergence independent of its effect on the rate of DNM: (i) biased gene conversion, (ii) recombination affecting the efficiency of selection, (iii) recombination affecting the depth of the genealogy in the human-chimpanzee ancestor and (iv) problems with regressing against correlated variables that are subject to sampling error. We can potentially differentiate between these four explanations by comparing the slope of the regression between the rate of substitution and the recombination rate (RR), and the rate DNM and the RR. If recombination affects the substitution rate, independent of its effects on DNM mutations, because of GC-biased gene conversion (gBGC), then we expect the slope between divergence and the RR to be greater than the slope between DNM density and the RR for Weak>Strong (W>S), smaller for S>W, and unaffected for S<>S and W<>W changes. The reason is as follows; gBGC increases the probability that a W>S mutation will get fixed but decreases the probability that a S>W mutation will get fixed. This means that regions of the genome with high rates of recombination will tend to have higher substitution rates of W>S mutations than regions with low rates of recombination hence increasing the slope of the relationship between divergence and recombination rate. The opposite is true for S>W mutations, and S<>S and W<>W mutations should be unaffected by gBGC. If selection is the reason that divergence is correlated to recombination independently of its effects on the mutation rate, then we expect all the slopes associated with substitutions to be less than those associated with DNMs. The reason is as follows; if a proportion of mutations are slightly deleterious then those will have a greater chance of being fixed in regions of low recombination than high recombination. If the effect of recombination on the substitution rate is due to variation in the coalescence time in the human-chimp ancestor, then we expect all the slopes associated with substitution to be greater than those associated with DNMs; this is 2 5 because the average time to coalescence is expected to be shorter in regions of low recombination than in regions of high recombination. Finally, if the effect is due to problems with multiple regression then we might expect all the slopes to become shallower. Since the DNM density and divergences are on different scales we divided each by their mean to normalise them and hence make the slopes comparable.
The results of our test are consistent with the gBGC hypothesis; the slope of divergence versus RR is greater than the slope for DNM density versus RR for W>S mutations and less for S>W mutations ( Figure 6); we present the analyses using sex-averaged RR, but the results are similar for either male or female recombination rates, and for 100KB windows (Figures S5 and S6 and   Tables S8 and S9). These differences are significant in the expected direction for all comparisons except W>S from the Wong data (  (Table S10). These results suggest that the majority of the variation in diversity at the 1MB scale is due to variation in the mutation rate.
Although much of the variation in diversity appears to be due to variation in the mutation rate we tested for the effect of gBGC. We find that the slope of SNP density versus RR is lower than the slope between DNM density and RR per window is independent of the evolutionary divergence. As expected, we find that the correlation between the density of DNM and the rate of substitution declines as the evolutionary divergence increases (Figure 7).  ethnicity of the individuals sequenced in each study, or they may be due to systematic biases in the discovery of DNMs.
The patterns and results that are common to both datasets are as follows.
There appears to be rather little variation in the mutation rate at a large scale, however, there is variation at a large scale that cannot be explained by variation at smaller scales, and large-scale variation forms part of a continuum. Furthermore, we can conclude that the level of variation for different mutational types is similar and different mutational types covary together. We find no evidence that there is variation in the pattern of mutation that generates variation in GC content. We confirm the correlation between the mutation rate, as measured by DNM density, is not as strong as it could be, and demonstrate that this is in part due to BGC. In contrast, we find that variation in diversity at large scales is largely a consequence of variation in the mutation rate. Finally, we demonstrate that the correlation between the rate of DNM and the rate of substitution, declines as increasingly divergent species are considered.
Whether the differences between the datasets are due to error or differences between the sampled individuals is not clear. it is possible that some of the Divergence between species has often been used to control for mutation rate variation in humans (for example [29,[49][50][51]). This is clearly not satisfactory given that the correlation between divergence and rate of DNM is only about half as strong as it could and this correlation gets worse as more divergent species are considered. Unfortunately, correcting for mutation rate variation is likely to be difficult because attempting to predict mutation rates from genomic features is unreliable, given the failure of the regression analyses to explain more than half the explainable variation. Furthermore, the largest amounts of variation are at the smallest scales ( Figure 3B)  DNMs over the number of sites against the mean MI (see Figure S7). The regression line was estimated to be log(mutation rate) = -6.73 + 0.0103 x MI.
Using this equation, we predicted the mutation rate at each site in the genome. Michaelson et al. [5] give MIs mapped to hg18; we lifted these over the hg19 using the liftover tool.

Genomic features.
Male, female and sex-averaged standardised recombination rate data [52] were downloaded from http://www.decode.com/additional/male.rmap, which provides recombination rates in 10KB steps. For each 100KB and 1MB windows the recombination rate was calculated as the mean of these scores with a score assigned to the window in which the position of its first base

ߙ ‫ݑ‬ ത ݈
where l is the length of the window. This means that the number of mutations per window is a negative binomial. In considering a particular category of mutations, such as CpG transitions, we considered the number of CpG transition DNMs at CpG sites. We fit the distribution using maximum likelihood using the NMaximize function in Mathematica. Initial analyses suggested that the maximum likelihood value of the mutation rate parameter was very close to the mean estimate of the mutation rate; as a consequence to speed up the maximization we fixed the mutation rate to its estimated mean and found the ML estimate of the shape parameter of the gamma distribution.
We investigated the correlation between different types of mutation across windows by fitting a single distribution to both types of mutation, estimating the shape parameter of the shared distribution as the mean of the CV of the ML estimates of distributions fitted to the two categories independently. We then used this distribution to simulate data; we drew a random variate for each window from the distribution assigning this as the rate for that window.
We then generated two Poisson variates with the appropriate means such 3 8 that the total number of DNMs for each type of mutation was expected to be equal the total number of DNMs of those types.
To test whether the mutation pattern varied across the genome in a manner that would generate variation in the mutation rate we fit the following model.
Let us assume that f e is normally distributed. Then the likelihood of observing i S>W mutations out of a total of n S>W and w>S mutations is The total loglikelihood is therefore the sum of the log of equation 2 for each MB or 100KB window across all the windows in the genome. The maximum likelihood values were obtained by using the NMaximize routine in Mathematica.

Simulations
In a number of analyses, we simulate DNMs under assumed model; for example, using the 7-mer model of Aggarwala et al. [38]. In these simulations, we calculate the expected number of DNMs given the window's mutation rate, the number of relevant sites and the total number of DNMs, and then generated a random Poisson variate from this expectation. In each simulation, we generated 1000 simulated datasets.

Data availability
The DNM data is available from the supplementary information of the original papers or from the corresponding authors: (i) Michaelson et al. [5] Supplementary  o  l  f  e  K  H  ,  S  h  a  r  p  P  M  ,  L  i  W  -H  .  M  u  t  a  t  i  o  n  r  a  t  e  s  d  i  f  f  e  r  a  m  o  n  g  r  e  g  i  o  n  s  o  f  t  h  e   m  a  m  m  a  l  i  a  n  g  e  n  o  m  e  .  N  a  t  u  r  e  .  1  9  8  9  ;  3  3  7  :  2  8  3  -5  .   1  8  .  K  e  n  i  g  s  b  e  r  g  E  ,  Y  e  h  u  d  a  Y  ,  M  a  r  j  a  v  a  a  r  a  L  ,  K  e  s  z  t  h  e  l  y  i  A  ,  C  h  a  b  e  s  A  ,  T  a  n  a  y  A  ,  e  t   a  l  .  T  h  e  m  u  t  a  t  i  o  n  s  p  e  c  t  r  u  m  i  n  g  e  n  o  m  i  c  l  a  t  e  r  e  p  l  i  c  a  t  i  o  n  d  o  m  a  i  n  s  s  h  a  p  e  s   m  a  m  m  a  l  i  a  n  G  C  c  o  n  t  e  n  t  .  N  u  c  l  e  i  c  A  c  i  d  s  R  e  s  .  2  0  1 a  t  e  n  B  ,  H  e  r  r  e  r  o  J  ,  B  e  a  l  K  ,  F  i  t  z  g  e  r  a  l  d  S  ,  B  i  r  n  e  y  E  .  E  n  r  e  d  o  a  n  d  P  e  c  a  n  :   g  e  n  o  m  e  -w  i  d  e  m  a  m  m  a  l  i  a  n  c  o  n  s  i  s  t  e  n  c  y  -b  a  s  e  d  m  u  l  t  i  p  l  e  a  l  i  g  n  m  e  n  t  w  i  t  h  p  a  r  a  l  o  g      Results are shown for male and female specific recombination rates.