Inference of Candidate Germline Mutator Loci in Humans from Genome-Wide Haplotype Data

The rate of germline mutation varies widely between species but little is known about the extent of variation in the germline mutation rate between individuals of the same species. Here we demonstrate that an allele that increases the rate of germline mutation can result in a distinctive signature in the genomic region linked to the affected locus, characterized by a number of haplotypes with a locally high proportion of derived alleles, against a background of haplotypes carrying a typical proportion of derived alleles. We searched for this signature in human haplotype data from phase 3 of the 1000 Genomes Project and report a number of candidate mutator loci, several of which are located close to or within genes involved in DNA repair or the DNA damage response. To investigate whether mutator alleles remained active at any of these loci, we used de novo mutation counts from human parent-offspring trios in the 1000 Genomes and Genome of the Netherlands cohorts, looking for an elevated number of de novo mutations in the offspring of parents carrying a candidate mutator haplotype at each of these loci. We found some support for two of the candidate loci, including one locus just upstream of the BRSK2 gene, which is expressed in the testis and has been reported to be involved in the response to DNA damage.

The rate of germline mutation is a key parameter in molecular evolution and population 2 genetics. As the ultimate source of genetic novelty, germline mutations provide the raw 3 material on which selection acts and the basis for genetic drift over time. Mutation 4 rates are known to differ substantially between species [1], and in eukaryotes, the single 5 nucleotide mutation rate, fundamental to many demographic and evolutionary analyses, 6 ranges over two orders of magnitude [2]. Methods to estimate the rate of de novo 7 mutation predate the knowledge that DNA carries hereditary information. If the 8 frequency of a deleterious allele is in mutation-selection balance, the rate at which 9 deleterious alleles are removed through selection is equal to the rate at which novel 10 alleles arise through mutation. This idea was used by Haldane to provide an indirect 11 estimate of the rate of spontaneous haemophilia from prevalence estimates [3]. 12 Subsequent estimates incorporated knowledge of the physical size of the locus (or more 13 precisely, the number of target nucleotides giving rise to the phenotype of interest) to 14 obtain per-base and per-generation mutation rate estimates (e.g. [4,5]). 15 More recently, whole genome resequencing methods have been used to obtain direct 16 measurements of the human mutation rate from parent-offspring trios [6][7][8][9]. As well as 17 enabling measurement of the genome-wide mutation rate, these studies have opened up 18 the possibility of investigating factors contributing to variation in mutation rate 19 between individuals. A study of 78 Icelandic parent-offspring trios estimated that 97% 20 of variation in the number of de novo mutations called in offspring could be explained 21 by the age of the father [8], and other whole-genome studies have shown similar, albeit 22 weaker, relationships between parental age and germline mutation rate [10][11][12]. 23 Genetic and environmental factors may also influence the rate of germline 24 mutation [10,13], and a number of recent studies point to differences in the germline 25 mutation spectrum between human populations [13,17,18], consistent with a genetic 26 contribution to variation in germline mutation. Indeed, several cancer-associated 27 germline human mutations are known to affect genes involved in DNA proofreading and 28 mismatch repair, increasing cancer risk through an elevated rate of somatic 29 mutation [14,15]. For example, Lynch syndrome, which is associated with a very high 30 lifetime risk of development of cancer of the colon and several other organs, results from 31 germline mutations in a number of mismatch repair genes [16]. Although the effects of 32 such mutations on the rate of germline mutation in human have not yet been 33 determined, deficiency in mismatch repair is known to result in an elevated per 34 generation mutation rate in yeast [10]. 35 Here we investigate the feasibility of detecting genetic polymorphism associated with 36 increased germline mutation rate by looking for an increase in the number of derived 37 alleles in haplotypes carrying the mutator allele. Using simulation, we show that a 38 mutator allele can result in a localized peak in the numbers of derived alleles in a subset 39 of haplotypes, against a background of haplotypes with typical numbers of derived 40 alleles. This is because in the region linked to the mutator allele, haplotypes containing 41 it are always subject to the elevated mutation rate, whereas other haplotypes are only 42 affected when they occur together with the mutator allele in a heterozygote, and this 43 will occur only rarely if the mutator allele is rare. Detecting this pattern depends on 44 persistence of the mutator allele for a large number of generations, and we discuss the 45 likelihood of this given estimates of the selective disadvantage of a mutator allele. We 46 search for this pattern in data from the 1000 Genomes Project (G1K) and report and 47 characterise a number of candidate mutator loci. For a subset of the candidate loci the 48 highly derived haplotypes, characteristic of a mutator allele, were found among parents 49 of two human trio datasets obtained from the Genome of the Netherlands (GoNL) [19] 50 and G1K [20] projects and in each case we tested for an elevated number of de novo 51 mutations in the offspring of parents carrying a putative mutator haplotype. 52

2/19
Results and Discussion 53 We hypothesized that mutations that increase the rate of germline mutation in human 54 populations may be detectable from genome-wide polymorphism data. Consider a 55 population with a genetic variant at a locus that increases the germline mutation rate 56 by a factor φ (in heterozygotes). If the variant arose g generations before the present, 57 we can estimate the number of additional derived alleles, per site, that are expected in 58 haplotypes containing the mutator allele over haplotypes with the wild-type allele as 59 gµ(φ − 1), where µ is the wild-type mutation rate. Let ρ be the rate of recombination 60 per generation, which, for simplicity, we assume is uniform across the genome. The 61 lengths of the genomic segments to the right and left of the mutator locus that have 62 been free from recombination for g generations are then exponentially distributed with 63 mean 1 gρ and therefore the segment around the mutator locus that has been free from 64 recombination since the mutator variant arose has an expected length of 2 gρ . The 65 expected number of additional mutations in a segment of this length is 2µ(φ−1) ρ . Note 66 that this is independent of g, the age of the mutator allele. Because the number of additional derived alleles is independent of the time since the 85 mutator allele arose, but the segment length is not, the power to detect a mutator allele 86 above the background is greater for alleles that arose a large number of generations 87 before the present. Given that the mutator allele is also required to be associated with a 88 large effect on the germline mutation rate, it is necessary to consider whether such an 89 allele could persist for a large number of generations despite the reduction in fitness 90 that will result from an increased germline mutation rate. Although the increased 91 burden of de novo mutations associated with the mutator allele is likely to reduce the 92 fitness of individuals carrying it, this reduction may be relatively small [23], and the per 93 generation germline mutation rate is capable of sustaining substantial variation in 94 animals, for example ranging widely across mammalian species [2]. Bear in mind also 95 that de novo mutations will remain linked with the mutator allele only for an average of 96 two generations before recombining away [2]. Indeed, a mutator allele, even at relatively 97 low frequency, may contribute a large number of novel mutations, exerting a 98 disproportionate influence on the evolution of the population as a whole while it persists. 99 The selective disadvantage associated with a (heterozygous) allele that increases the 100 germline mutation rate can be approximated as 2s d ∆U [2], where s d is the mean selective disadvantage of a heterozygous deleterious mutation, U is the genomic 102 deleterious mutation rate and ∆U is the change in U resulting from the mutator allele. 103 Thus given a value for U , a mutator allele that increases the germline mutation rate by 104 a factor φ will be associated with a selective coefficient (for heterozygotes) of The parameters s d and U are not straightforward to estimate. Lynch [2] provides 107 values for multicellular eukaryotes from the literature that range over orders of  Estimates of U in particular genomic regions have been made by comparing 120 between-species sequence divergence within these regions with divergence in putatively 121 neutral regions (such as fourfold degenerate coding sites, pseudogenes or inactive 122 transposable elements). Using this approach, Keightley [24] estimated a relatively high 123 value of 2.2 deleterious mutations per generation across the whole genome in humans.

124
There are caveats associated with this estimate (such as the assumption that genomic 125 mutation rates are the same across sequence categories and possible bias from alignment 126 in calculating genome-wide divergence), but it does seem likely that U is not 127 substantially less than one.

128
The mean selective disadvantage is more difficult to estimate, due in part to the  Consistent with this, Do et al. [27] have shown that the absence of differential removal 143 of deleterious mutations from the genomes of Africans and non-Africans implies that selection, can be calculated [28]. For example a mutation with initial frequency of 0.2 157 and a selective coefficient of -0.0002 (-0.0001 in heterozygotes) has an expected time to 158 extinction of 13,000 generations.

159
Proof of concept simulations 160 We used simulation to investigate the effect of a mutator allele on the numbers of An example of simulated data showing a subset of haplotypes with a peak in the number of derived alleles in a 10 Kb sliding window. A region of 100 Kb was simulated over 40,000 generations using a coalescent approach with recombination. In this example a mutator allele with φ = 5 was introduced 20,000 generations before the present and was assumed to be weakly deleterious, with a selective coefficient of -0.0002. The red and green lines show the maximum and trimmed mean number of derived alleles in the window. Individual sampled haplotypes are shown in grey.

Genome scan for candidate mutator loci in humans 178
To search for human loci that may have carried mutator alleles, we obtained data from 179 phase 3 of the G1K project [20]. The data consisted of inferred haplotypes for a total of 180 2504 individuals in 26 populations, along with putative ancestral alleles for each 181 variable site inferred from Ensembl Compara release 59 [29]. For each haplotype in a 182 population we counted the number of derived alleles in a sliding window of size 10 Kb 183 along the genome. In each window we then calculated the maximum number of derived 184 alleles in the window and the interquartile mean number of derived alleles across all 185 haplotypes in the population (i.e. excluding the upper and lower 25th percentiles).

186
As expected, there was a strong linear correlation between the maximum and the 187 interquartile mean (Fig 2). Our further expectation, supported by the above 188 simulations, is that loci at which a mutator allele has existed in a population for a large 189 number of generations can be identified from a characteristic pattern of variation, 190 comprising a number of haplotypes with an unusually large number of derived alleles 191 against a background of haplotypes with a typical number of derived alleles. 192

5/19
Diversifying selection is likely to increase both the interquartile mean and the maximum 193 number of derived alleles. By contrast, a low-frequency mutator allele should affect only 194 the maximum, and consequently we excluded from consideration windows with large 195 values of the interquartile mean (>75th percentile). The remaining windows were 196 arranged in order of a statistic M , defined as the residual distance to the regression line 197 (the latter shown in red in Fig 2), with windows showing the greatest M -value 198 considered as the best candidates for sites of ancient germline mutator alleles. We also 199 removed candidate loci in which a high proportion of SNPs (> 5%) were not in  Table 1.  Table 1. Note that each row of Table 1 can correspond to multiple points in the figure, due to multiple sliding windows overlapping the locus. this locus is much more frequent (occurring in 302 haplotypes). The opposite is the case 209 for the structural variant at 192.6 Mb on chromosome 1, which is much more frequent 210 (1871 haplotypes) than the highly derived haplotype at this locus (88 haplotypes). 211 Therefore, in neither case is it plausible that mapping errors created by these structural 212 variants are the cause of the observed signal. However, we cannot rule out that 213 unreported structural variants are the cause of some of the candidate mutator loci that 214 we identified.

215
DNA repair genes are enriched at the candidate loci 216 Genes in proximity to the candidate loci shown in Table 1 are significantly enriched for 217 just one cluster of overlapping biological processes, and encouragingly, these processes 218 relate to DNA damage and repair. Five of these genes are annotated with the UniProt 219 keywords DNA damage and DNA repair (p = 0.0004 and p = 0.0009, respectively), 220 namely GEN1, SMC6, TDP1, UBE2T and ZSWIM7. Among these are genes that are 221 annotated with the GO terms double-strand break repair via homologous recombination 222 (GO:0000724; p = 0.005) and DNA repair (GO:0006281; p = 0.047). However, because 223 genes are known to cluster on the genome by function, the above test of enrichment may 224 be biased (for example, one candidate locus is responsible for two of the DNA-repair 225 associated genes). To perform an unbiased test of enrichment of the candidate loci with 226 DNA repair genes, we selected 20 loci at random and counted the number of DNA 227 repair related genes within 100 Kb of each locus. Repeating this process 100,000 times, 228 this number matched or exceeded the observed number 625 times (p = 0.006). This 229 suggests that the enrichment of DNA repair genes is unlikely to be explained by 230 functional clustering, and supports the existence of a subset of true positives among the 231 list of candidate mutator alleles identified.

232
In addition to the known DNA repair genes, several other genes are plausible 233 candidates for affecting the rate of germline mutation. HDAC2 is a Class I histone 234 deacetylase that has been shown to be enriched at replication forks [31] and to have a 235 role in the DNA damage response, at least in the context of double strand break 236 repair [32]. Mutations in UBE2T have recently been reported to cause Fanconi 237 anemia [33], a condition characterised by genome instability and cancer susceptibility 238 that frequently results from defective DNA repair genes. BRSK2 is included among 296 239 genes in the PathCards DNA repair superpath [34] and recently MOB2 has also been 240 strongly implicated in the DNA damage response [35]. Materials and Methods for details). As with the real data, the simulated data showed a 245 strong linear relationship between the maximum (over haplotypes) derived allele count 246 and the interquartile mean (Fig. S2). Two very different demographic scenarios were 247 considered. In the first, a sample of size 2,500 (corresponding to the number of 248 individuals in phase three of G1K) was taken from a single panmictic population. In the 249 second, to explore the effects of population structure, the same number of individuals  Table 1 were found in the simulated data, far 259 more sliding windows exceeding the threshold for inclusion in Table 1 (residual ≥ 18) 260 were found in the real data than in the simulated data (there were 95 such 10 Kb sliding 261 windows in the real data, compared to 11 and 16 in the single and five population 262 simulations, respectively). We also investigated the effect of mutation clustering caused 263 by multinucleotide mutation events by adding these to the coalescent simulation (see 264 Materials and Methods). This had limited impact on the results (the number of windows 265 exceeding the threshold rose from 11 to 13 in the single population simulations). 266 Thus we conclude that the signals observed at loci listed in Table 1 Table 1 fits this description, but 282 the remaining peaks do not (Fig. S3). For all of the remaining peaks, there is 283 substantial diversity among the highly derived haplotypes, consistent with a greater rate 284 of divergence than for the less highly derived haplotypes. Multiple gene conversion 285 events would result in polyphyly of the diverged haplotypes, and this is not observed for 286 most of the peaks; however, accurate phylogenetic inference may be exceptionally 287 difficult in the gene conversion scenario.

288
Another possible mechanism might be an extreme multi-nucleotide mutation event, 289 more substantial than those included in our simulations, giving rise to multiple 290 mutations simultaneously. However, similarly to gene conversion, this would also create 291 two clades, both consisting of short branches, separated by a long branch that reflects 292 the multi nucleotide mutation. Thus for trees of this nature we cannot rule out either 293 gene conversion or extreme multi-nucleotide mutation. However we note that such trees 294 are also consistent with the effects of a mutator allele, since a relatively recent 295 coalescence of the highly derived haplotypes would also produce a long internal branch 296 separating clades consisting mostly of short branches. Indeed, this was frequently 297 observed in the proof-of-concept simulations (Fig. S4).

298
Introgression of haplotypes derived from hybridisation between modern humans and 299 Neanderthals as well as other ancient hominins has been reported [37][38][39][40]. However, this 300 introgression is not likely to result in a subset of haplotypes with an excess of derived 301 alleles. Alleles were designated as derived or ancestral by comparison to other primates 302 and the number of derived alleles on a given Neanderthal haplotype should be similar to 303 the number for a haplotype from modern humans. Moreover, almost all of the peaks in 304 Table 1 are found in Africans, but only non-Africans have Neanderthal ancestry. The 305 one exception is again the peak on chromosome 18. Indeed, for this peak we found that 306 8/19 the highly derived haplotypes are enriched for haplotypes of likely Neanderthal origin 307 (as determined by [41]). This peak is not close to any DNA repair genes and its removal 308 from the list of candidates would in fact slightly increase the statistical significance of 309 the association with DNA repair genes discussed above.

310
As a further potential alternative explanation, we also considered the case of a 311 genomic segmental duplication segregating within the population, which could give rise 312 to mapping errors in individuals carrying it, such that derived alleles called in the 313 candidate locus are due to incorrectly mapped sequence reads. However, a simple 314 duplication could cause at most an apparent doubling of divergence on affected 315 haplotypes in this way, and therefore would be unlikely to be included as a candidate 316 under the thresholds we have applied. A more complicated tandem duplication with 317 several copies might give rise to higher apparent divergence, but such sites are very 318 unlikely to have passed the filtering criteria used in the G1K calling protocol.

319
The peaks we observe in the derived allele count could potentially be caused by strong enough to bring multiple linked derived alleles to high frequencies had acted on 326 human populations we find no evidence for this hypothesis in the data, as, for most of 327 the peaks, the highly derived haplotypes are found in human populations from multiple 328 populations (Fig. S5).

329
Our proof-of-concept simulations suggest that most of the peaks in Table 1 are 330 consistent with the effects of mutator alleles with large effect size which were 331 maintained in the population for a large number of generations. The difference between 332 the maximum and inter-quartile mean number of derived alleles in Table 1 is below 30 333 for all but the first locus, and in the low twenties for loci towards the bottom of the 334 table. The proof-of-concept simulations also had large differences in the maximum and 335 interquartile mean when the signal was detectable (just over 20 for simulations 8 and 9 336 in Fig. S1). The top locus in the table, however, has a much larger difference between 337 the maximum and interquartile mean number of derived alleles. Although we could not 338 find an alternative explanation for this signal, it seems difficult to reconcile with the 339 effects of a mutator allele whose effect size would have enabled it to persist in the active and linked to a highly-derived haplotype, then it might be possible to find 345 supporting evidence in whole genome sequence data from trios. In principle, an active a 346 mutator allele would give rise to an association between the presence of the 347 highly-derived haplotype in the parents and an elevated number of de novo mutations in 348 the offspring. 349 We investigated this possibility for the candidate loci shown in Table 1 by obtaining 350 counts of de novo mutations in the offspring of complete trios and genome-wide parental 351 genotype data from the G1K (n = 59) and GoNL (n = 248) projects [19,20]. For each 352 of the loci shown in Table 1 we tested for a positive association between the number of 353 de novo mutations in the offspring and the presence in the corresponding parent of the 354 highly derived haplotypes at the putative mutator locus ( Table 2). P-values from two 355 tests are reported for each peak, one based on fitting a robust linear model and the 356 second based on permutation (referred to as p t and p perm in Table 2, respectively; see 357

9/19
Materials and Methods for details). The analysis was carried out separately for male 358 and female parents and for the two projects, to avoid biases resulting from differences in 359 study methodology confounded with differences in populations of origin. In each case 360 we required at least five parents with a highly derived haplotype to perform the test.

361
In the case of the G1K dataset 21 tests were performed (10 paternal and 11 362 maternal). Of these, two showed a nominally significant positive association between 363 the number of de novo mutations and the presence of the highly derived haplotype in 364 the parent in both statistical tests ( Table 2). The most significant of these was for the 365 peak just upstream of BRSK2 on chromosome 11 (illustrated in Fig 3). With the GoNL 366 dataset, 15 tests could be performed (seven paternal and eight maternal), but no 367 significant associations were observed ( Table 2). Considering that 36 tests in total were 368 carried out in total, the associations in Table 2 do not remain significant following 369 Bonferroni correction, and additional diverse trio datasets may be necessary to confirm 370 these associations definitively. Data forming the basis for the tests shown in Table 2 are 371 provided as S1 Table and S2 Table. 372 dataset. For the G1K dataset population of origin information was available for the 378 samples; however, there was no statistically significant relationship between population 379 of origin and the de novo mutation count (Fig. S6).

380
Interestingly, the candidate locus that shows the strongest evidence of an association 381 with the de novo mutation count in the G1K trios is just upstream of BRSK2 (Fig 3b), 382 a highly conserved serine threonine kinase which is preferentially expressed in brain and 383 testis and shows enhanced activity in response to DNA damage [42,43]. The observed 384 10/19 association was maternal rather than paternal; however, expression in testis is 385 consistent with a role in germline DNA replication. 386 At the same time, this haplotype is not associated with an elevated de novo 387 mutation count in the GoNL data (Table 2). A possible explanation for this is that 388 despite its much larger sample size, the GoNL cohort, which comprises individuals of 389 Dutch ancestry, is substantially less diverse than the G1K cohort. As noted above, 390 although the highly derived haplotype was observed in parents of trios in the GoNL 391 cohort, its presence does not necessarily imply the presence of the causative mutator 392 allele: a recombination event on the ancestral lineage leading to the highly derived 393 haplotype in the Dutch population may have separated the two. Indeed, in this way it 394 is possible for an extinct mutator allele to leave a fossil peak in the derived allele count 395 of a contemporary population, and it is plausible that many of the signals we have 396 detected are of this nature.

397
The candidate mutator locus on chromosome 14 is also noteworthy. This peak 398 overlaps precisely with an exon of the tyrosyl-DNA phosphodiesterase 1 (TDP1 ) gene 399 (Fig. S7). The product of TDP1 has a role in the repair of stalled topoisomerase I-DNA 400 complexes. This candidate is not supported by the available trio datasets, despite the 401 highly derived haplotypes being found in sufficient numbers in both cohorts, suggesting 402 that its occurrence on the TDP1 gene is either a striking coincidence or that it may be 403 a fossilised peak resulting from a mutator allele that is extinct or at least much rarer 404 than the highly derived haplotypes to which it has given rise.

406
We have shown that when a genomic locus contains a polymorphism in which one allele 407 increases the germline mutation rate, a characteristic signature may result, comprising a 408 subset of haplotypes that are locally more divergent than other haplotypes at that locus. 409 Whether this signature can be detected depends on the magnitude of the effect and on 410 the time since the mutator allele arose. The number of additional mutations on an 411 unrecombined segment of expected length carrying the mutator allele is independent of 412 the time since the mutator allele arose, but the likelihood of detecting a fixed number of 413 additional mutations over the background increases as the segment length decreases and 414 is thus higher for older mutations. 415 We searched for such signatures in data from Phase 3 of the 1000 Genomes Project, 416 and found candidate human genetic loci that may have once contained (and in some 417 cases may still contain) germline mutator alleles. Consistent with this expectation, the 418 set of genes located in the vicinity of these loci is enriched for genes involved in DNA 419 repair. A test for the continued presence of active mutator alleles in these haplotypes, 420 by looking for association with high numbers of de novo mutations passed to offspring 421 in two trio sequencing cohorts, found support for association in two cases, but no 422 associations with overall significance.

423
It may therefore be that mutator alleles are no longer present, or are now very rare, 424 in most of the candidate loci we identified. However the trio cohorts we examined were 425 small relative to what may be feasible in future studies, and in one case (the GoNL 426 data) represented a population which is comparatively homogeneous in terms of its 427 11/19 genetic ancestry. Further trio cohorts from more genetically diverse populations may be 428 necessary to test some of the candidates we identified. Such studies will also shed light 429 on the full extent of polymorphism in the rate of de novo mutation in human 430 populations.

431
It is difficult to conclusively exclude alternative explanations for the signals we have 432 detected. However we have considered several such explanations and have shown that 433 they are either far less likely to generate signals of this nature (in the case of neutral 434 demographic processes) or would generate loci with characteristics not seen in these 435 cases (for example in the case of a mis-mapped segmental duplication).

436
More generally, it is far from implausible a priori that there are genetic factors 437 affecting the germline mutation rate in humans, and that there may arise from time to 438 time mutator alleles with a moderately large effect size. We have also shown that such 439 alleles can plausibly persist for many generations and leave a detectable signal that 440 survives after they have disappeared. It may also be that different such alleles affect the 441 mutation process in different ways, and that as well as changing the overall germline 442 mutation rate they give rise to different mutational signatures. Indeed recent studies 443 have found variation in the mutational spectrum (expressed in terms of the relative 444 frequency of different triplet sequence contexts for de novo mutations) between human 445 populations and over time, including evidence for past activity of mutator alleles that 446 are now rare or extinct ( [18,54]).

447
On longer timescales, comparison of mutation rates estimated from inter-specific 448 divergence (e.g. [44]) with measurements of the mutation rate in contemporary 449 populations points to a slowdown in human and great ape evolution [55]. Some of this 450 may be due to changes in life history parameters such as generation time; however, 451 there may also have been changes in the underlying per-generation mutation rate, or 452 more specifically, changes in cellular mutation rates during development and 453 gametogenesis, perhaps driven by genetic variation at loci such as those detected here. 454 Indeed, it is worth considering that the higher mutation rate inferred for some 455 haplotypes at candidate loci may represent the ancestral rather than derived state, such 456 that the mutations in question resulted in anti-mutator rather than mutator alleles.

458
Proof of concept simulations 459 We used a coalescent approach to simulate the effects of a mutation that increases the 460 germline mutation rate. The mutator allele is introduced at an initial frequency p 0 at 461 time t 0 and we assume no mutation between the mutator allele and wild-type after time 462 t 0 . We use a constant effective population size of 10,000 individuals and incorporate 463 recombination, occurring at a constant rate across the simulated genomic region (see 464 below for simulations of null models with variable recombination rate and 465 demographics). The mutator allele is associated with a selective coefficient, s, and we 466 incorporate selection against the mutator allele using an approach described by Kaplan 467 et al. [46]. We first obtained the deterministic allele frequency trajectory in the forward 468 direction, with s = −0.0002. Because we require that chromosomes carrying the 469 mutator allele have an elevated mutation rate the simulations could not be performed 470 using existing implementations of the coalescent with recombination and selection 471 (e.g. [47]) and were instead implemented as a Perl script, which is available from the 472 authors, on request. To allow for the effects of the mutator allele on the rate of 473 mutation of the homologous chromosome in heterozygous individuals, at each generation 474 the mutation rate was elevated for chromosomes that did not carry the mutator allele 475 with a probability equal to the frequency of the mutator allele in that generation. populations. Simulations were carried out using the coalescent simulator msprime [48], 486 with a variable recombination rate matching that found in the human genome 487 (including recombination hotspots) [49], and a uniform genome-wide mutation rate.  [20]. Putative ancestral alleles (derived by the 1,000 Genomes 496 project from Ensembl Compara release 59 [29]) were obtained from the vcf files, 497 restricting to SNPs with high confidence ancestral alleles. Using a sliding window 498 (window size: 10 Kb; step size: 1 Kb) we calculated the interquartile mean (also called 499 the 25% trimmed mean) and the maximum number of derived alleles in each window.

500
There was a linear relationship between the maximum and the interquartile mean, 501 reflecting the fact that windows within which the interquartile mean number of derived 502 alleles was large also had a large value for the maximum number of derived alleles (over 503 all haplotypes). We fitted linear regression models to the maximum, as a function of the 504 interquartile mean and calculated the distance from the regression line for each window. 505 Windows with an elevated value of the interquartile mean (>75th percentile) were 506 excluded and the remaining windows were sorted by distance to the regression line. χ 2 507 tests of Hardy-Weinberg equilibrium (at alpha = 0.05) were carried out separately 508 within each population for each SNP within the candidate regions. Windows within 509 which a high proportion (>5%) of SNPs rejected Hardy-Weinberg equilibrium were 510 removed. This resulted in the removal of nine loci.

511
Trio validation 512 We obtained de novo mutation counts in offspring and parent variant call data for 513 human parent-offspring trios from the G1K (n = 59) and GoNL (n = 248) 514 projects [19,20]. For each candidate mutator locus we determined the number (0, 1 or 2) 515 of copies of the highly derived haplotype in each parent. For the phased G1k data we 516 plotted the number of derived alleles per haplotype and if this distribution included a 517 subset of highly derived haplotypes we selected a threshold that gave the best 518 separation between the highly derived and typically derived haplotypes (Fig. S8) and 519 used this to classify each haplotype. For unphased data (GoNL) we plotted the number 520 of derived alleles per individual. When the highly derived haplotypes occurred at low 521 frequency this distribution was typically bimodal (all individuals had 0 or 1 highly 522 derived haplotype), becoming trimodal when the highly derived haplotypes occurred at 523 higher frequency (corresponding to individuals with 0, 1 or 2 highly derived haplotypes). 524

13/19
In each case we set a threshold number of derived alleles that gave the best separation 525 of the peaks in the distribution of the derived allele count (Fig. S9). We then tested for 526 an association between the offspring de novo mutation count and the number of copies 527 of the highly derived haplotype by fitting robust linear regression models using the rlm 528 function from the MASS package [50] in R [51], with the default M-estimation method. 529 The G1K and GoNL trios were analysed separately to avoid the potential for 530 confounding between study methodologies and population of origin and results are 531 reported separately for each parent (to enable the detection of parent-of-origin specific 532 effects). Parental age at conception was available for the GoNL trios and included as a 533 covariate in the linear models. P-values for the null hypothesis that the regression 534 coefficient for the offspring de novo mutation count as a function of the number of 535 copies of the highly derived haplotype was not greater than zero were calculated in two 536 ways: from the right tail of the distribution of the regression coefficient t-statistic 537 (referred to as p t in Table 2) and by permutation. For the latter we performed 10,000 538 permutations of the de novo mutation counts and determined the proportion of times 539 the t-statistic from the robust regression for the permuted data was greater than or 540 equal to the value for the unpermuted data (reported as p perm in Table 2). Correction 541 for multiple testing was carried out using the Bonferroni method.

542
Enrichment of DNA repair genes close to candidate loci 543 We identified all genes within 100 Kb of each of the candidate mutator loci (excluding 544 pseudogenes) and tested for functional enrichment using DAVID [52,53], version 6.8 545 with default parameters (on 3/11/2016). Given that there were several genes found 546 within 100 Kb of many of the candidate mutator loci the phenomenon of functional 547 gene clustering means that this set of genes is not independent, creating a potential for 548 bias in the statistical tests of enrichment. We therefore also devised a test based on 549 randomisation of the genomic loci. For this test 20 loci (equal to the number of loci in 550 Table 1) were sampled at random from the regions of the data for which we had variant 551 call data and the number of DNA repair genes (annotated with GO term GO:0006281 552 in Ensembl 84) within 100 Kb of these random loci was determined. This was repeated 553 100,000 times and the numbers of DNA repair genes in the random data was compared 554 to the observed number of genes with this GO term within 100 Kb

Supporting Information
Fig. S1 Proof of concept simulations. Ten random simulations, equivalent to the simulation shown in Fig. 1.   Table 1. For each candidate mutator locus in Table 1 we show a phylogenetic tree (left), constructed from a 10 Kb window overlapping the peak. Due to the difficulty displaying all 5008 haplotypes from the 1000 Genomes Project we sampled 100 haplotypes at random and supplemented these with up to approximately 50 of the highly derived haplotypes, again at random. Bottom right panels illustrate for this reduced set of haplotypes the number of derived (red) and ancestral alleles (blue) in each haplotype across the 10Kb window. Each haplotype is represented as a row on the plot and the haplotypes with the largest number of derived alleles are grouped at the top of the plot. Tick marks indicate the positions of individual SNPs. The top right panels show the maximum (red) and interquartile mean (green) number of derived alleles, as well as the number of derived alleles on actual haplotypes (grey).

Fig. S4
Trees from proof-of-concept simulations. Trees from the subset of the ten proof-of-concept simulations that showed a peak in the maximal derived allele count.

18/19
Fig. S9 Derived allele count thresholds used to decide the number of highly derived haplotypes in each individual for the unphased GoNL data S1 Table. Trio data from 1000 Genomes. The number of de novo mutations and number of copies of the highly derived haplotype in trios from the 1000 Genomes Project.
S2 Table. Trio data from the Genome of the Netherlands Project. The number of de novo mutations and estimated number of copies of the highly derived haplotype in trios from the Genome of the Netherlands Project.