Dating genomic variants and shared ancestry in population-scale sequencing data

The origin and fate of new mutations within species is the fundamental process underlying evolution. However, while much attention has been focused on characterizing the presence, frequency, and phenotypic impact of genetic variation, the evolutionary histories of most variants are largely unexplored. We have developed a nonparametric approach for estimating the date of origin of genetic variants in large-scale sequencing data sets. The accuracy and robustness of the approach is demonstrated through simulation. Using data from two publicly available human genomic diversity resources, we estimated the age of more than 45 million single-nucleotide polymorphisms (SNPs) in the human genome and release the Atlas of Variant Age as a public online database. We characterize the relationship between variant age and frequency in different geographical regions and demonstrate the value of age information in interpreting variants of functional and selective importance. Finally, we use allele age estimates to power a rapid approach for inferring the ancestry shared between individual genomes and to quantify genealogical relationships at different points in the past, as well as to describe and explore the evolutionary history of modern human populations.


Estimation of variant age in publicly available data sets
We used GEVA to estimate the age of all variants on Chromosomes 1-22 (biallelic SNPs, except singletons and variants at alternate allele frequency >99%) in data from two human genome resources that are available in the public domain: • 1000 Genomes Project (TGP), Phase 3, final release [1]; • Simons Genetic Diversity Project (SGDP), fully public data set [2].
The two data sets are described in detail further below. In total, we estimated the age of 45,393,705 variants across all autosomes, for each clock model (mutation clock, recombination clock, and joint clock). This includes 43,232,520 variants dated in sample data from TGP, and 15,834,824 variants from SGDP. For the 13,673,639 variants identified in both TGP and SGDP, we additionally estimated the age after combining information from both data sources (described in Section 4). Overall, we analyzed 32,087,462,147 (>32 billion) haplotype pairs. A breakdown of variants dated per chromosome, as well as a summary of the haplotype pairs analyzed, is given in S1 Table. We make these results available as a public resource, referred to as the Atlas of Variant Age for the human genome. All data, including the full age estimation profiles for each clock model and the results of every pairwise analysis, are available online: https://human.genome.dating Throughout, we applied GEVA using the following specifications. We set N e =10,000 to internally scale time in units of 2N e , which adheres to the usually quoted value [3].
Though, we note that more recent estimates of N e indicate a much lower effective size and high variability among different human populations [4,5]. Results are reported in units of generations, after rescaling time given the specified value for N e . We assumed a constant rate of mutation, set to µ = 1.2 × 10 −8 per base per generation, following recent estimates of the human mutation rate [6]. We used variable recombination rates according to HapMap genetic maps as available per chromosome (Phase 2; GRCh37) [7].
The maximum number of concordant and discordant haplotype pairs sampled per variant (specified by parameters max C and max D ) differed for the analyses conducted using TGP and SGDP data (see below). Throughout, variant age was estimated after applying the heuristic pair rejection method to exclude outliers in the pairwise TMRCA distributions of concordant and discordant pairs selected per variant, and we report the quality score (QS) for the age estimated under each clock model.

Information about ancestral and derived allelic states
The GEVA method (in its current implementation) assumes that ancestral and derived allelic states have been correctly assigned to the reference and alternate allele, respectively, as seen in a given data set. We acquired information as available for the human genome from Ensembl (human assembly GRCh37; release 92; version 20180221), to determine the ancestral allele as predicted through multi-species alignments in the Ensembl Enredo-Pecan-Ortheus (EPO) pipeline. † We used this information to annotate available variant data, so as to (optionally) retain those variants in downstream analyses for which the ancestral allele was known and mapped to the reference allele. In TGP, we matched variant annotations based on chromosomal position, rsID, and consistent reference and alternate alleles. In SGDP, we did the same but omitted matching by rsID, as this information was absent in the available data set.
We report the mode, mean, and median of the composite posterior distribution as point estimates for variant age, as well as a 95% confidence interval. Correlation between estimators was high throughout (Pearson's r 2 > 0.99; Spearman's ρ > 0.99, based on all 43.2 million variants, in all comparisons between mode, mean, and median, under each clock model).
We computed the quality score (QS) after rejecting outlier haplotype pairs for every

Pathogenic variants
We used information from the Ensembl Variant Effect Predictor (VEP; release 75) [8], available through TGP, where functional consequences have been assigned to a subset of variants in the TGP sample (Phase 3; GRCh37). Variant effects have been predicted using PolyPhen-2 [9] and SIFT [10]. We selected all variants that had annotations from either PolyPhen-2 or SIFT proportions of allelic class found in each QS bin; here defined as the ancestral allele matching the reference allele (blue), the ancestral allele matching the alternate allele (red), or where the ancestral allele was either unknown or did not match the reference or alternate alleles (yellow) at the variants (SNPs) dated in each data source. Note that GEVA attempts to estimate the age of the alternate allele by default, assuming that the reference allele represents the ancestral state.
Data available from SGDP consists of 276 individuals (556 genomes) from 130 populations worldwide (fully public data set; hg19/GRCh37). We used the already phased panel (labelled "PS2") that had been phased using SHAPEIT2. Given the relatively small sample size of the SGDP panel (compared to TGP), we set max C = 100 and max D = 100 as the sampling limits for concordant and discordant pairs per variant. We estimated the age for 15.8 million variants across autosomes, which involved the analysis of 0.7 billion concordant and 1.5 billion discordant pairs. Overall computation time was 33,168 hours (3.8 CPU years; Intel Xeon E5-2650 v2, Ivy Bridge EP, 2.60GHz). Note that the processing time was on average faster compared to the analysis of TGP data, due to the smaller sample size and the lower sampling limits set per variant.
As with variants dated using TGP data, we found that the mode, mean, and median as point estimates of variant age were highly correlated (Pearson's r 2 > 0.99; Spearman's ρ > 0.99, based on all 15.8 million variants, in all comparisons between mode, mean, and median, under each clock model).
We additionally re-estimated the age of >0.36 million variants on Chromosome 20 using max C = 500 and max D = 500, to assess differences resulting from estimation with lower and higher sampling limits. We found that age estimates were highly consistent under each clock model (Spearman's ρ > 0.9), indicating that the lower sampling limits (max C = 100, max D = 100) were sufficiently high for the analysis of variants in SGDP; see figure below.
Consistency of variant age estimated with high and low sampling limits. Density scatterplot showing the relationship between allele age estimated with different sampling limits for concordant and discordant pairs; set to maxC = 500 and maxD = 500 (x-axis), and set to maxC = 100 and maxD = 100 (y-axis). Age was estimated for all variants on Chromosome 20 in SGDP (>0.36 million) under each clock model. Colors indicate the relative density scaled by the maximum per panel. The inserts (bottom) show the correlation between the two analyses; Spearman rank correlation coefficient (ρ) and the square of the Pearson correlation coefficient (r 2 ; calculated on log-scaled allele ages).

Combined age estimation of variants present in TGP and SGDP
The true age of a variant refers to the time of a singular mutation event in the past, which follows from the assumption (infinite sites model) that mutations occur only once per genomic locus in the history of the population. Although this assumption is readily violated in reality, in particular for very old mutations or if the mutation rate is high, we would nonetheless expect it to hold for the vast majority of variants observed in the (human) genome. Estimating the age of the same allele in different data sources, in which it has been identified in genomic sequence data of independent (unrelated) samples, should therefore yield consistent results.
We identified 13,673,639 variants present as biallelic SNPs (non-singletons and below 100% allele frequency) in both TGP and SGDP; matched by genomic position and reference and alternate alleles. The ages of alleles at corresponding sites, independently estimated from TGP and SGDP data, were highly consistent; see S5 Fig. We then, additionally, estimated the age of these variants by combining information from both sources, to implicitly increase the genealogical resolution, but without combining sequence data directly.
Recall that we estimate the age from the composite distribution of locally inferred TMRCA posteriors at concordant and discordant haplotype pairs; see S1 Text. The TMRCA posterior is modeled using the Gamma distribution, where parameters α, β are obtained from the data as defined by the clock model used (mutation clock, recombination clock, or joint clock). For each variant, we combined information by recovering TMRCA posteriors from the parameter values as obtained in the two data sources (without rejecting outliers). We estimated the "combined" age, and computed a quality score, after rejecting outlier pairs in the combined sets of concordant and discordant comparisons.
Again, as with variants dated using TGP or SGDP data alone, we found that the mode, mean, and median as point estimates of variant age were highly correlated (Pearson's r 2 > 0.99; Spearman's ρ > 0.99, based on all 13.7 million variants, in all comparisons between mode, mean, and median, under each clock model).