Fig 1.
A: To apply our method for inferring selection, we begin by sampling the posterior ARG of a set of recombining chromosomes. B: For each sample ARG, we extract local trees at the site of interest (blue). C: For each sample local tree, we run an HMM to calculate the likelihood of selection, marginalizing out the hidden allele frequency trajectory based on coalescence in the sample tree. We later use the recursions performed in this step to calculate the posterior allele frequency trajectory. D: An example of the estimated likelihood function for an allele under neutrality (top) and selection (bottom). E: An example of the inferred allele frequency trajectory compared to the ground truth trajectory under neutrality (top) and selection (bottom). Both (D) and (E) are inferred from data simulated under a European demographic model with n = 50 haplotypes, conditioning on the derived allele segregating at 75% in the present day with s = 0 and s = 0.003, respectively.
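The marginalization in panel C can be illustrated with a generic HMM forward recursion. The sketch below is a minimal illustration rather than the authors' implementation: the function name `forward_likelihood` and the toy matrices are hypothetical. It assumes the hidden states are discretized allele frequency bins, that `trans` depends on a candidate value of s, and that `emit` holds the probability of the coalescence events observed in each time step given each frequency bin.

```python
def forward_likelihood(init, trans, emit):
    """Likelihood of the observations with the hidden path summed out.

    init[i]    : prior probability of hidden state i at the first time step
    trans[i][j]: probability of moving from state i to state j in one step
    emit[t][j] : probability of the observation at step t given state j
    (All three inputs are hypothetical placeholders for this sketch.)
    """
    n_states = len(init)
    # Initialize with the prior weighted by the first emission.
    alpha = [init[j] * emit[0][j] for j in range(n_states)]
    # Propagate one step at a time, summing over the previous state.
    for e in emit[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states)) * e[j]
            for j in range(n_states)
        ]
    # Summing the final forward values marginalizes out the whole trajectory.
    return sum(alpha)


# Toy example with two frequency bins and three time steps.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.3, 0.6], [0.5, 0.1], [0.2, 0.7]]
print(forward_likelihood(init, trans, emit))
```

Evaluating this quantity over a grid of s values would give a likelihood curve of the kind shown in panel D. In practice the recursion is usually carried out in log space to avoid underflow.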
Table 1.
(Companion to Fig 2) The numbers of derived, ancestral, and mixed lineages at each time point.
Numbers in unshaded cells factor into the likelihood calculation, whereas numbers in shaded cells do not.
Fig 2.
(Companion to Table 1) Coalescence conditioned on the allele frequency trajectory (dashed line).
Blue lineages subtend the derived allele, whereas black lineages do not. Black lineages belong to the ancestral class while the derived allele frequency X_t > 0, and to the mixed class while X_t = 0.
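The bookkeeping behind Table 1 and Fig 2 can be sketched as follows. This is an illustrative sketch only: the helper names `lineages_remaining` and `black_lineage_class` are hypothetical, times are assumed to be measured in generations before the present, and each coalescence is assumed to merge exactly two lineages of the same class.

```python
from bisect import bisect_right


def lineages_remaining(n_sampled, coal_times, t):
    """Number of lineages of one class still present at time t.

    n_sampled  : lineages of this class at the present (t = 0)
    coal_times : times of coalescence events within this class
    Each coalescence at or before time t removes one lineage.
    """
    return n_sampled - bisect_right(sorted(coal_times), t)


def black_lineage_class(x_t):
    """Class of a non-derived (black) lineage given the derived frequency x_t."""
    return "ancestral" if x_t > 0 else "mixed"


# Hypothetical example: 4 sampled derived lineages coalescing at 10, 50 and 200.
derived_coal_times = [10, 50, 200]
for t in (0, 60, 250):
    print(t, lineages_remaining(4, derived_coal_times, t))
```

Tabulating these counts at each time point for the derived and black lineages gives a table with the same layout as Table 1.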
Fig 3.
(A-I) ROC curves illustrating performance of tests between selection and neutrality.
Rows correspond to simulations conditioned on the same present-day allele frequency (A-C: 25%; D-F: 50%; G-I: 75%), and columns correspond to simulations with the same value of s (A,D,G: 0.001; B,E,H: 0.003; C,F,I: 0.01). Simulations were performed under a model of constant effective population size (N_e = 10^4) using a locus of 100 kb, n = 25 diploid individuals, μ = 2.5 × 10^-8 mut/bp/gen, and r = 1.25 × 10^-8 recombinations/bp/gen.
Fig 4.
ROC curves illustrating performance of tests between selection and neutrality.
Rows correspond to simulations conditioned on the same present-day allele frequency (A-C: 25%; D-F: 50%; G-I: 75%), and columns correspond to simulations with the same value of s (A,D,G: 0.001; B,E,H: 0.003; C,F,I: 0.01). Simulations were performed under a model of European demography using a locus of 200 kb, n = 25 diploid individuals, μ = 2.5 × 10^-8 mut/bp/gen, and r = 1.25 × 10^-8 recombinations/bp/gen.
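ROC curves such as those in Figs 3 and 4 can be built from per-simulation test statistics. The sketch below assumes each simulation yields one scalar score (for example a log-likelihood ratio) and that larger scores favor selection. The function names and the toy scores are hypothetical and not taken from the paper.

```python
def roc_points(neutral_scores, selected_scores):
    """Return (false positive rate, true positive rate) pairs.

    A simulation is called 'selected' when its score is >= the threshold;
    thresholds sweep over every observed score from high to low.
    """
    thresholds = sorted(set(neutral_scores) | set(selected_scores), reverse=True)
    points = [(0.0, 0.0)]
    for thr in thresholds:
        fpr = sum(s >= thr for s in neutral_scores) / len(neutral_scores)
        tpr = sum(s >= thr for s in selected_scores) / len(selected_scores)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy scores (hypothetical): four neutral and four selected simulations.
neutral = [0.1, 0.4, 0.35, 0.8]
selected = [0.9, 0.7, 0.6, 0.3]
print(auc(roc_points(neutral, selected)))
```

The area under the curve equals the fraction of (selected, neutral) pairs in which the selected simulation scores higher, which makes it a convenient single-number summary of each panel.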
Fig 5.
Inference of selection coefficients of varying strength using the importance sampling method based on ARGweaver local trees.
A: Constant population size. B: Tennessen CEU model. Marker color denotes present-day allele frequency (25/50/75% correspond to yellow/red/purple, respectively). Horizontal dashed lines denote the true value of the selection coefficient.
Fig 6.
Allele frequency trajectories inferred from true trees (A,B) and ARGweaver local trees (C,D).
Stepwise trajectories are inferred (vertical bars denote 25th-75th percentiles); dashed trajectories are the ground truth. Columns correspond to different initial allele frequencies (A,C: 25%; B,D: 75%), and colors correspond to different selection coefficients. For each condition we show 25 randomly selected simulations and their corresponding inferences. All data are simulated under a model of constant effective population size (N_e = 10^4).
Fig 7.
(A) Trajectories inferred from true trees under both hard sweeps and recent selection on a standing variant (i.e., soft sweeps) when both s and the time of selection onset are unknown. (B) The log-likelihood surface for joint inference of s and the onset of selection, averaged over 100 simulations, taking the true tree as observed. (C) ROC curves (using importance sampling) illustrating performance of tests for selection from a standing variant where the onset of selection occurs 100 generations ago. We condition on a present-day frequency of 50%.
Fig 8.
(A) Probability of correctly identifying the selected site in a head-to-head test with a linked neutral site. Vertical bars represent 95% CIs estimated by plug-in bootstrap. (B) Estimates of the selection coefficient at the causal vs. linked neutral sites, given the true tree. Mean estimates are represented by blue hash marks.
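A bootstrap interval of the kind shown in panel A can be sketched by resampling the per-simulation outcomes with replacement. The caption does not specify how the interval is constructed, so the percentile interval used here is an assumption, as are the function name, the number of replicates, and the toy data.

```python
import random


def bootstrap_ci_proportion(outcomes, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a proportion.

    outcomes : list of 0/1 values (1 = selected site correctly identified)
    Resamples from the empirical distribution of outcomes, recomputes the
    proportion each time, and reads off the alpha/2 and 1 - alpha/2 quantiles.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    stats = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical data: the causal site wins 80 of 100 head-to-head tests.
outcomes = [1] * 80 + [0] * 20
print(bootstrap_ci_proportion(outcomes))
```

Fixing the seed makes the interval reproducible. With few simulations or proportions near 0 or 1, other interval constructions may behave better than the percentile interval.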
Fig 9.
Comparison of inferred allele frequency trajectories for a sweep at rs4988235 (MCM6) in GBR under an ancient DNA (aDNA) based method vs. CLUES, which uses only modern data.
The black curve is the posterior median allele frequency, and the gray area is the 95% posterior interval. The red surface is the posterior of the frequency trajectory within Steppe ancestry conditioned on an ancient DNA time series, adapted from [55].
Fig 10.
Allele frequency trajectories inferred for 11 pigmentation-associated SNPs in GBR (A-K, gene names and accession numbers inset).