Demography and the Age of Rare Variants

Large whole-genome sequencing projects have provided access to much rare variation in human populations, which is highly informative about population structure and recent demography. Here, we show how the age of rare variants can be estimated from patterns of haplotype sharing and how these ages can be related to historical relationships between populations. We investigate the distribution of the age of variants occurring exactly twice ( variants) in a worldwide sample sequenced by the 1000 Genomes Project, revealing enormous variation across populations. The median age of haplotypes carrying variants is 50 to 160 generations across populations within Europe or Asia, and 170 to 320 generations within Africa. Haplotypes shared between continents are much older with median ages for haplotypes shared between Europe and Asia ranging from 320 to 670 generations. The distribution of the ages of haplotypes is informative about their demography, revealing recent bottlenecks, ancient splits, and more modern connections between populations. We see the effect of selection in the observation that functional variants are significantly younger than nonfunctional variants of the same frequency. This approach is relatively insensitive to mutation rate and complements other nonparametric methods for demographic inference.


Introduction
The recent availability of large numbers of fully sequenced human genomes has allowed, for the first time, detailed investigation of rare genetic variants. These are highly differentiated between populations 1, 2 , may make an important contribution to genetic susceptibility to disease [3][4][5][6][7] , and 1 mutations and assuming that the f 2 variant is derived, those individuals must share an f 2 haplotype at that position. We then scan left and right along the genome, until we reach a point where the two individuals have inconsistent homozygote genotypes (0 and 2, Fig. 1a).
Using both the genetic and physical lengths of the region, and the number of singletons, we compute an approximate likelihood for the age of the haplotype (Fig. 1b). We use the data to estimate error terms to take into account the fact that the algorithm described above does not find the shared haplotypes precisely. Then, for each haplotype, we find the maximum likelihood estimate (MLE) of the age of each haplotype. We investigate the distribution of these MLEs for different classes of f 2 variants, for example those shared within or between specific populations.

Simulation results
To test our approach, we ran whole genome simulations for a sample of 100 diploid individuals with MaCS 16 , using the combined recombination maps from the HapMap 2 project 17 , and an effective mutation rate of 1.2 × 10 −8 per-base per-generation, assuming a constant effective population size (N e ) of 14,000. We investigated both our power to detect the f 2 haplotypes and how accurately we could estimate the distribution of f 2 ages (Fig. 2). We detected around 26% of all f 2 haplotypes. Unsurprisingly we have more power to detect very long haplotypes, but we detected many small haplotypes as well: 19% of our total had true genetic length less than than 0.1cM. Having imperfect power to detect f 2 variants does not have a large effect on our power to detect f 2 haplotypes since most haplotypes carry more than one f 2 variant.
We have higher power for more recent haplotypes because they are longer, but this effect is cancelled to some extent for older haplotypes because the branches above them tend to be longer and therefore more likely to carry mutations. We can estimate the overall distribution of ages fairly well though there is high uncertainly in the age of any individual haplotype (Fig. 2a).
However, we can compute well-calibrated confidence intervals ( Supplementary Fig. 2). Most of the information about the ages comes from the genetic length, and the main advantage of the singleton information is for very old haplotypes which otherwise have their age overestimated ( Supplementary Fig. 3).
In addition, we ran simulations to check that the model was robust to more complicated demographic scenarios including splits, bottlenecks and expansions, as well as mis-specification of N e (Supplementary Fig. 4). We also investigated the effect that these scenarios had on the distribution of the ages of the f 2 haplotypes, demonstrating that we could detect the signatures of demographic events. For example, population bottlenecks lead to a high density of f 2 haplotypes during the bottleneck, and following a population split haplotypes shared between populations have median age roughly equal to the split time ( Supplementary Fig. 5).

1,000 Genomes data
We applied our estimator to the phase 1 data release of the 1,000 Genomes Project, which consists of whole-genome variant calls for 1,092 individuals drawn from 14 populations 15 . Restricting our analysis to the autosomes, we extracted f 1 and f 2 variants from the full callset, generated by combining sequence and array data, and then detected haplotype lengths using only the array data, to minimise the effect of genotyping errors. From 4,066,530 f 2 variants we detected 1,893,391 f 2 haplotypes, with median genetic and physical lengths of 0.7cM and 600kb respectively. The median number of singletons per haplotype was 3. Of the 1.9 million f 2 haplotypes, 0.7 million were shared within populations and 1.5 million were shared within continents.
Sharing of f 2 variants largely reflects expected patterns of relatedness on a population level, and also reveals substructure in some populations, notably GBR ( Supplementary Fig. 6).
We used the combined recombination rate map from HapMap 2 to determine genetic lengths, and assumed a mutation rate of 0.4 × 10 −8 per-base per-generation (reflecting a true mutation rate of 1.2 × 10 −8 and a power of 1 3 to detect singletons 15, 18 ) and N e = 185, 000 (estimated from the number of singletons in the dataset, Methods). We then computed MLEs for the ages of all the f 2 haplotypes shared between every pair of populations (Fig. 3, Supplementary Fig. 7, For haplotypes shared within populations (Fig. 3a), the MLEs of haplotypes within most 4 European and Asian populations are clustered around 100 generations ago. For example, the median age of GBR-GBR haplotypes is 90 generations. PUR and, to a lesser extent, CLM have many very recent haplotypes (peaking around 11 generations ago), consistent with a historical bottleneck in these populations 300-350 years ago. FIN haplotypes peak around 14 generations (400-450 years) ago. African populations have many recent haplotypes but also a much longer tail than the other populations, with ancestry apparently extending back for thousands of generations. For example the median age of LWK-LWK haplotypes is 320 generations, but the 95% quantile is 8,500 generations.
Similarly, between-population sharing is largely consistent with the historical relationships among populations ( Fig. 3b-d). Within continents, sharing within Asia or Europe has a median of 50-160 generations, depending on the populations, and sharing within Africa 170-340 generations. Sharing between continents is much older, with median Asian-European sharing 320-670 generations old, and Asian-African sharing rather older, with a median around 2,300 generations ago for LWK and 1,700 generations ago for YRI. The age of European-African sharing varies between populations, from 1,000 to 2,000 generations ago, but is more recent than Asian-African sharing, perhaps suggesting greater subsequent migration between these continents.
Admixed populations have age distributions which are combinations of the distributions of the admixing populations ( Fig. 3e-f). Even in these populations we can see signs of more subtle history. For example, GBR-CLM haplotypes have an age distribution which looks more like GBR-TSI or GBR-IBS than GBR-CEU, presumably representing the fact that the major contribution to European admixture in the Americas is from southern Europe ( Supplementary   Fig. 8).
We also looked at the distribution of the ages of f 2 variants broken down by functional annotation (Fig. 4, Methods). We found that for variants shared within a single population, loss-of-function (LOF) variants are younger than coding variants, which are younger than functional noncoding variants, and all annotated variants are younger than unannotated variants.
The median ages of these variants are 58, 83, 112 and 125 generations for LOF, coding, func-5 tional noncoding and unannotated variants respectively. This is presumably because purifying selection against damaging mutations means that functional variants are less likely to become old (positive selection for beneficial mutations would have the same effect). This effect has previously been both predicted and observed 19,20 . However, it is not true for variants shared between different populations and, in fact, the effect is partially reversed (median ages are 176, 205, 186 and 195 generations for LOF, coding, functional noncoding and unannotated variants).
A likely explanation is that functional variants which survive long enough to be shared between populations are selectively neutral or recessive and thus unaffected by selection at low frequency.
This suggests that studies looking for disease-causing rare variants should concentrate on variants private to a single population, since variants shared across populations are unlikely to have large phenotypic effects.

Discussion
Our approach provides a simple method for estimating the age of f 2 haplotypes, allowing us to observe the effects of recent demographic events. We have low power for very old events for two reasons. First, our power to detect f 2 haplotypes is age-dependent, although we could correct for this empirically by estimating power from simulations. Second, f 2 variants themselves are not distributed uniformly in time so with this approach we only have power around one timescale. We could increase our resolution back in time by extending this method to more common variants.
One point to note is that we have estimated ages for the haplotypes on which the variants lie, rather than the ages of the variants themselves. This is largely because, for the purposes of demographic inference, the age of the haplotype, or equivalently, the age of the most recent ancestor to carry the variant, is what is really important. If we wanted to estimate the exact time at which the mutation occurred it would be relatively easy to do so, using the observation that the mutations occur uniformly on the branch directly above the MRCA.
An important question is how accurately we can estimate the age of demographic events.
Consider the split between European and East Asian populations. Model based estimates of this split time have ranged from 14 to 40 thousand years ago (kya) [21][22][23] , although these are likely to be too low by a factor of up to 2, because they assumed a mutation rate, µ, of 2−2.5×10 −8 per-base per-generation, now thought to be an overestimate 18  An alternative explanation for the first observation might be a systematic bias in our estimates. Uncertainty in either the mutation or recombination rate used might create this bias.
Our approach is relatively insensitive to the mutation rate, except for very old haplotypes, so it is unlikely that this could have a large effect. Moreover if the true mutation rate were slightly higher than 1.25 × 10 −8 then mutational clock based methods would give lower estimates and our results would be more consistent with them. We also ignore variation in mutation rates across the genome but, for the same reason, this is more likely to affect methods which rely entirely on the mutational clock.
To test our robustness to errors in the recombination rate, we ran simulations with a differ-7 ent genetic map to the HapMap map that we used to determine genetic length. We tested a population-based African American map 26 , a map derived from an Icelandic pedigree 27 and a chimpanzee map from a small population 28 , but none of these made a substantial difference to the results and we conclude that the length of the haplotypes we investigate is sufficiently large that they are robust to the uncertainty in the recombination map ( Supplementary Fig. 10).
Another source of systematic error is homoplasy (where the same mutation occurs independently on two different lineages). While this rate is expected to be low, it may be locally high in some parts of the genome, for example in CpG islands which have an order of magnitude higher mutation rate than the genomic background. If such false positives do occur, they would appear as very short haplotypes that we would infer to be very old, so they cannot explain our systematically lower ages. However it is likely that some of the very old haplotypes we see are, in fact, due to repeat mutations. Finally, we cannot exclude the possibility that unmodelled features of real data might have lead to bias in our results.
However, systematic biases in our estimates cannot explain the second observation that the age distributions vary greatly between different pairs of populations. This strongly suggests that there is variation in the extent of gene flow. For example, Asian-FIN sharing seems to be more recent than other Asian-European sharing, suggesting relatively recent contact between East Asian and Finnish populations, compared to other European populations. It seems likely that the worldwide demographic history is sufficiently complicated that trying to estimate a single Asian-European (or African-Non African) split time is futile, and that a complex model of many splits and migrations is required to explain the relationship between different populations.
We deliberately kept our likelihood function as simple as possible to allow fast and robust analysis, but there are certainly improvements to be made, particularly by better approximating the genetic length distributions. Finally, we note three directions in which this work could be extended. First, there is a clear path to extending this work to cover more common variants (and therefore more ancient events). To estimate the age of f 3 and higher variants, we could simply build small local trees and estimate their heights by using the lengths and mutation 8 counts of each of the branches. Second, the impending availability of samples with sizes on the order of tens or hundreds of thousands will mean that this approach can be used to date variants and observe events much more recent than we observed in this study. Finally, the statistics we developed here could be incorporated into more highly parameterised models, and combined with other sources of information like sequential coalescent methods to provide a more complete picture of genetic history.

Definitions
Suppose we have a sample of size N of genotypes from a single genetic region. Define an f 2 variant to be one which occurs exactly twice in the sample in different individuals. That is, either two individuals have genotype 1 and all the others have genotype 0, or two individuals have genotype 1 and the others have genotype 2. We assume that the minor allele is the derived allele. Under the neutral coalescent, for a sample of 2N chromosomes, the minor allele will be the derived allele with probability 2N −1 2N +1 ≈ 1 for large N so this is a reasonable assumption for large samples. Throughout we assume that the effective population size N e is known and constant.
Define an f 2 haplotype shared between chromosomes a and b to be a region satisfying the following two conditions: 1) The time to the most recent common ancestor (TMRCA) of a and b does not change over the region. 2) At one or more sites in the region, a and b coalesce with each other before either of them coalesce with any other chromosome. In other words, they are unique genealogical nearest neighbours ( Supplementary Fig. 1). We call the coalescent time of a and b the age of the haploype. Additionally, we say that individuals i and j (i = j) share an f 2 haplotype if a is one of i's two chromosomes and b is one of j's two chromosomes.
The problem we solve is to find the f 2 haplotypes and then determine their ages. Since each f 2 variant must lie in an f 2 haplotype, the variants provide a simple way of detecting the 9 haplotypes. We use the algorithm described in the main text to find regions which are strictly larger than the f 2 haplotypes. Then the next problem is to determine the likelihood of the age. We describe our approximate likelihood below, but first, as an example, we describe exact inference in the absence of confounding factors.

Exact case
Suppose we knew the exact genetic and physical lengths of the f 2 haplotype and the number of singletons. Call these exact quantities L * g , L * p and S * . Let the age of this haplotype be t generations, or τ in coalescent time (τ = t 2Ne ). Then, for a randomly chosen f 2 haplotype (but not a haplotype at a randomly chosen position, discussed in the next section), L * g has an exponential distribution with parameter 4N e τ and S * has a Poisson distribution with parameter θL * p τ where θ = 4N e µ and µ is the per-base per-generation mutation rate (approximately 1.2 × 10 −8 in humans 18 ). Therefore (ignoring terms that do not depend on τ ), the log-likelihood of τ given L * g , L * p and S * is τ ; l * g , l * p , s * = (1 + s * ) log (τ ) − 4N e τ l * g − θl * p τ and the maximum likelihood estimator of t is thereforê Note that this does not depend on N e , implying that in the full case, we will depend on N e only through the error terms, and therefore that mis-specifying N e will not have a large impact on the results.

Approximate likelihood for genetic length
There are two corrections to the likelihood for genetic length. The first relates to the ascertainment process of the haplotypes, and the second to the overestimate in the length due to the way we detect the endpoints.
The ascertainment problem is as follows. Suppose we pick a haplotype at random, then its length is exponentially distributed (i.e. gamma with shape parameter 1). However, if we pick a point on the sequence at random then the distribution of the length of the haplotype in which it falls is gamma distributed with shape parameter 2. This is an example of the "inspection paradox" and it is because in the second case, we are sampling haplotypes effectively weighted by their length. In our case, we detect haplotypes if they contain one or more f 2 variants. Therefore the probability that we find a haplotype is increasing with its physical length (because longer haplotypes are more likely to carry f 2 variants), but sub-linearly. The probability is also increasing with genetic length, but in a complex way that depends on the variation of recombination and mutation rate along the genome, and on the age of the haplotype (older haplotypes are likely to have longer branches above them, and therefore to have more f 2 mutations). Rather than trying to take all of these effects into account, we made the simplifying assumption that we could model the genetic length L * g as a gamma distribution with shape parameter k where 1 < k < 2 and rate 4N e τ . Simulations suggested that k around 1.5 was optimal ( Supplementary Fig. 10), and we used this value throughout.
The second correction involves the overestimate of genetic length. We tried to detect the ends of the haplotype by looking for inconsistent homozygote genotypes, but of course in practice, after the end of the f 2 haplotype, there will be some distance before reaching such a site. This (genetic) distance ∆ g is the amount by which we overestimate the length of the haplotype. We estimate the distribution of ∆ g for a given sample by sampling pairs of genotype vectors, then sampling sites at random and computing the sum of genetic distance to the first inconsistent homozygote site on either side. We then fit a gamma distribution with (shape, rate) parameters (k e , λ e ) to this distribution. The likelihood of τ is given by the convolution density of L * g and ∆ g , where f γ (x; (κ, λ)) = 1 Γ(κ) λ κ x κ−1 e −κx is the density of a gamma distribution with (shape, rate) parameters (κ, λ). This integral, and therefore the loglikelihood (τ ; l g ) = log [L(τ ; l g )] can be expressed in terms of the confluent hypergeometric function 1 F 1 (ignoring terms that do not depend on τ ), Where, recall, we assume k = 1.5. Finally, note that the rate at which recombination events occur on the branch connecting the two shared haplotypes is 4N e τ . We assume that the first such event marks the end of the haplotype. However, there is a non-zero probability that a recombination event occurring on this branch does not change the TMRCA of a and b. Simulations suggest that for large numbers of chromosomes, this probability is extremely small ( Supplementary Fig. 11) and so we assume it is 0. In practice, for small samples, this might be a non-negligible effect.

Approximate likelihood for singleton count
Recall that the physical length of the shared haplotype is L p bases. We assume that we can find this exactly. Then assuming a constant mutation rate µ per base per generation, the sum of the number of singletons on the shared haplotypes, S * has a Poisson distribution with parameter Now consider the distribution of singletons on the unshared haplotypes. We make the following two assumptions: 1) There is no recombination on the unshared haplotypes over the region.
2) The distribution of the time to first coalescence of the unshared haplotypes is exponential with parameter N (Recall that N is the number of sampled individuals). In fact the true distribution is a mixture of exponentials but the exponential distribution at least matches the correct mean and variance, 1 N and 1 N 2 respectively 29 . Conditional on the time to first coalescence, τ 1 say, the number of mutations on each of the unshared haplotypes is Poisson with parameter θL p τ 1 and so, using the assumptions above, the unconditional distribution is geometric (on 0, 1 . . . ) with parameter 1 and therefore the distribution of the number of mutations on both unshared haplotypes, ∆ S , is the sum of two geometric distributions which is negative binomial with parameters 2, θLp θLp+2N . Therefore the density of the total number of singletons, S is the convolution of these two densities

Approximate full likelihood
We can now write the approximate log-likelihood for τ as the sum of Equation 2 and the log of Equation 3, assuming that the recombinational process is independent of the mutational process, (τ ; l g , l p , s) = log [L(τ ; l g )] + log [L(τ ; l p , s)] .
We maximise it numerically with respect to τ in order to find the maximum likelihood estimate (MLE).
Estimating N e We counted S = 7, 970, 795 autosomal f 1 variants in the 1000 Genomes dataset. We assume an autosomal genome length L of 2.7Gb, a per-base per-generation mutation rate µ of 1.2 × 10 −8 and power q S of 1 3 to detect f 1 variants. Under a neutral coalescent model, we expect the number of observed f 1 variants to be given by 4N e µLq S . Therefore we estimate that N e = 13 7,970,795 4×0.4×10 −8 ×2.7×10 9 ≈ 185, 000.
If we do the same calculation using the number of f 2 variants (4,066,530) (expected to be half as numerous as f 1 variants under a neutral coalescent), assuming the power to detect them is 2 3 , then we estimate N e ≈ 24, 000, indicating rapid recent growth. We used the value N e = 185, 000 for our analysis since this represents roughly the population size over the timescales over which the f 1 variants arose, which is the same timescale over which the recombinations and mutations that we observe occurred. However, simulations suggest that our results our results are robust to the assumptions we make about N e (Supplementary Fig.   4d).
Detailed explanations of the annotations can be found there, but briefly the classifications are as follows: • Loss-of-function: Includes premature stop codons, and essential splice site disruptions.
• Coding: Variants in coding regions.
• Functional noncoding: Including variants in noncoding RNAs, promoters, enhancers and transcription factor binding sites.
• Unannotated: Any variant not included in any of the above categories.
We included haplotypes in more than one of these categories if they contain multiple variants.

Code
All the code we used in this paper to run simulations and analyse the 1000 Genomes data is available from www.github.com/mathii/f2. For each f 2 variant in the sample (green), we scan left and right until we find inconsistent homozygote genotypes (red), record the physical and genetic length of this region (blue), and the number of singletons (purple). b: Model for haplotype age t. Consider the 4 chromosomes (grey) of the two individuals sharing an f 2 haplotype (blue). We model the total genetic length of the inferred haplotype, L g , as the sum of the true genetic length L * g and an error ∆ g . Similarly, we model the number of singletons S as the sum of the number on the shared chromosome (S * ) and the number on the unshared chromosomes, ∆ S . We ignore the fact that we overestimate L p and therefore that some of the singletons might lie in the unshared part of the chromosome.

Supplementary information
Supplementary Figure 1.
Example of an f 2 haplotype. Recombination events are shown as a series of marginal trees, as we move left to right along a sequence. In the blue region, a and b share an f 2 haplotype. At sites 2 and 3, a and b are unique nearest neighbours. A mutation at site 2 (green), will be detected as an f 2 variant. At site 4, they are no longer unique nearest neighbours, but the TMRCA is unchanged. At site 5, the TMRCA has changed and the haplotype breaks. In the other direction, at site 1 the haplotype breaks because the TMRCA of a and b changes, even though they are still unique nearest neighbours. We count this as breaking the haplotype, even though we cannot detect this event. Coverage of approximate confidence intervals. We performed simulations as described in Figure 1 in the main text, but only on chromosome 20. We computed approximate α-confidence intervals using the χ 2 approximation to the likelihood. This figure shows the proportion of true haplotype ages that lie inside their approximate confidence intervals.  where N e is actually 140,000 but we ran inference assuming it to be 14,000.   Figure 3 in the main text. Each of these subfigures shows the distribution of ages of haplotypes shared between one population and all the others.  Note that in b, the peak of the f 2 age density is shifted to the left relative to a, indicating that many of the f 2 variants shared between populations derive from post-split migration rather than predating the split.   Figure 10. The effect of simulating haplotypes with a different recombination map to the one used to determine genetic length. The left column shows true versus estimated ages for detected haplotypes, and a qq plot of the MLEs, as in Figure 1a in the main text. The right column shows the coverage of the asymptotic confidence intervals as in Supplementary Figure S3. Each row shows the results of simulations using a different map (references in main text), but in every case the HapMap combined map was used to determine the genetic length of the detected haplotypes: a,b: Simulated using the HapMap map. c,d: Simulated using a map derived from African Americans. e,f: Simulated using a map derived from an Icelandic pedigree. g,h: Simulated using a map derived from Chimpanzees, rescaled to have the same total length as the human map. In each case, we simulated Chromosome 20 for 100 individuals using a mutation rate of 1.2 × 10 −8 per-base per-generation and N e = 14, 000. Supplementary Figure 11. Effect on the density estimate of varying k, the shape parameter of the gamma distribution used to model the genetic length of the haplotypes. This shows qq plots generated from simulations as in Figure 2 in the main text, but for chromosome 20 only. Probability that a recombination on the branch between two haplotypes does not change their TMRCA. For varying sample sizes n, we simulated sequences with recombination (Methods). We show the probability that the first recombination on the branch between two samples does not change the TMRCA (τ ) of those two samples, as a function of τ . Solid circles show 5% and 95% quantiles. The dashed lines show the theoretical lower bounds on this probability; 1 n 1 − 1 nt 1 − e −nt , exact for n = 2. The numbers in brackets in the legend show the probability that a recombination does not change the TMRCA, averaged over all events. Note that as n increases, for fixed τ , the probability of not changing the TMRCA decreases, but in addition the distribution of τ becomes smaller which also decreases the probability of not changing the TMRCA.