An approximate full-likelihood method for inferring selection and allele frequency trajectories from DNA sequence data

Most current methods for detecting natural selection from DNA sequence data are limited in that they are based either on summary statistics or on a composite likelihood, and as a consequence do not make full use of the information available in DNA sequence data. Here we present a new importance sampling approach for approximating the full likelihood function for the selection coefficient. Our method, CLUES, treats the ancestral recombination graph (ARG) as a latent variable that is integrated out using previously published Markov chain Monte Carlo (MCMC) methods. The method can be used for detecting selection, estimating selection coefficients, testing models of changes in the strength of selection, estimating the time of the start of a selective sweep, and inferring the allele frequency trajectory of a selected or neutral allele. We perform extensive simulations to evaluate the method and show that it uniformly improves power to detect selection compared to currently popular methods such as nSL and SDS, and can provide reliable inferences of allele frequency trajectories under many conditions. We also explore the potential of our method to detect extremely recent changes in the strength of selection. We use the method to infer the past allele frequency trajectory of a lactase persistence SNP (MCM6) in Europeans. We also infer the trajectory of a SNP (EDAR) in Han Chinese, finding evidence that this allele is much older than previously claimed. Finally, we study a set of 11 pigmentation-associated variants. Several genes show evidence of strong selection, particularly within the last 5,000 years, including ASIP, KITLG, and TYR. However, selection on OCA2/HERC2 appears to be much older, and, in contrast to previous claims, we find no evidence of selection on TYRP1.


Calculating allele frequency transition probabilities
Our likelihood calculations require allele frequency transition distributions for different selection coefficients, population sizes, and spans of time. Rather than employ the more common approach of numerically solving the Wright-Fisher diffusion with drift and selection (e.g., [1,2]), we follow [3] and precompute allele frequency transition distributions on a grid of time spans (i.e., generations) and scaled selection coefficients (i.e., α = 2Ns) under the Wright-Fisher model of reproduction in a finite population experiencing genetic drift and natural selection (see [1]). Specifically, for each value of α, we use simple matrix multiplication to produce allele frequency transition matrices for discrete frequencies in a haploid population of size N_hap = 2000, at a number of generations spanning from g = 1 to g = g_max.

Let X_k be the allele frequency in the kth epoch. Then, conditional on X_k = x_k, Y_{k+1} := N_hap X_{k+1} follows a binomial distribution Bin(N_hap, p‡(x_k)), where

p‡(x) = (1 − u) · x(1 + s)/(1 + sx) + v · [1 − x(1 + s)/(1 + sx)],

and u and v are the mutation rates from the derived to the ancestral type and vice versa, respectively. We note that u and v are also rescaled, similarly to s, in order to approximate mutation in a population of smaller size. Thus, the transition probability i → j is simply the binomial probability that Y_{k+1} = j given Y_k = i, i.e., Bin(j; N_hap, p‡(i/N_hap)).

The spacing of time points for these transition probabilities is chosen a priori; in practice, we use linear spacing for recent history and/or periods of population growth. We bin allele frequencies into d discrete frequency categories unevenly distributed between 0 and 1, such that extreme frequency bins outnumber intermediate frequency bins. To calculate allele frequency transition distributions for time spans and selection coefficients not contained in the grid of precomputed values, we linearly interpolate between the nearest precomputed values. See [3] for details.
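As a concrete illustration of this precomputation, the sketch below builds a one-generation Wright-Fisher transition matrix and powers it up by matrix multiplication. The population size, parameter values, and the exact form of the selection/mutation update are illustrative assumptions, not the implementation used by CLUES:

```python
import numpy as np
from scipy.stats import binom

N_HAP = 200                  # small haploid size for illustration (the text uses 2000)
s, u, v = 0.01, 1e-4, 1e-4   # illustrative selection and mutation parameters

def p_next(x, s, u, v):
    """Expected derived-allele frequency after haploid genic selection and mutation."""
    x_sel = x * (1 + s) / (1 + s * x)          # selection step
    return x_sel * (1 - u) + (1 - x_sel) * v   # mutation between allele types

# One-generation transition matrix: row i is binomial sampling around p_next(i/N_hap)
freqs = np.arange(N_HAP + 1) / N_HAP
P1 = binom.pmf(np.arange(N_HAP + 1)[None, :], N_HAP,
               p_next(freqs, s, u, v)[:, None])

# g-generation transition matrices via repeated matrix multiplication
P = {1: P1}
for g in range(2, 6):
    P[g] = P[g - 1] @ P1

assert np.allclose(P[5].sum(axis=1), 1.0)   # each row is a probability distribution
```

In practice these matrices would be tabulated over a grid of α values and generation counts and interpolated as described above.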
We also note that if the time of the onset of selection, t_s, is to be inferred, then it is necessary to let s depend on the epoch i; specifically, on whether the allele is under selection or neutral during that epoch. Let s_i denote the value of the selection coefficient during epoch i, and s = (s_1, . . . , s_K).
Additionally, we condition the allele frequency process on the present-day frequency X_0 by reweighting the transition probabilities so that trajectories are constrained to agree with the observed present-day frequency.
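This conditioning can be sketched as an endpoint reweighting (a Doob h-transform) of the discretized process. The direction of time, the dimensions, and all numerical values below are toy assumptions, not the exact reweighting used by CLUES:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5                                                   # number of frequency bins (toy)
A = rng.random((d, d)); A /= A.sum(1, keepdims=True)    # one-epoch transition matrix
K = 4                                                   # number of epochs
x_obs = 2                                               # observed endpoint frequency bin

# h[k][x] = P(endpoint bin = x_obs | state x at epoch k), computed backwards
h = [None] * (K + 1)
h[K] = np.eye(d)[x_obs]
for k in range(K - 1, -1, -1):
    h[k] = A @ h[k + 1]

# Reweighted (conditioned) transition kernels; each row renormalizes to 1
A_cond = [A * h[k + 1][None, :] / h[k][:, None] for k in range(K)]
for Ak in A_cond:
    assert np.allclose(Ak.sum(axis=1), 1.0)
```

Sampling from the `A_cond` kernels produces trajectories that always end in bin `x_obs`, which is the effect of the conditioning described above.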

Forward and backward probabilities
Here we derive recursions for the forward and backward probabilities f i (x i ) and b i (x i ), respectively.
These quantities are equivalent to P(C_{1:i}, X_i | N) and P(C_{i+1:K} | X_i, N), respectively; we suppress the dependence on the population sizes N below. The backward probabilities are calculated recursively, moving from i = K to i = 1:

b_i(x_i) = Σ_{x_{i+1}} P(X_{i+1} = x_{i+1} | X_i = x_i) P(C_{i+1} | X_{i+1} = x_{i+1}) b_{i+1}(x_{i+1}),

initialized with b_K(x_K) = 1, and we can apply this recursion to calculate the likelihood function of s given G as

L(s) = Σ_{x_1} P(X_1 = x_1) P(C_1 | X_1 = x_1) b_1(x_1).

The above is commonly known as the backward algorithm when applied to HMMs. In our model, the backward algorithm's recursion proceeds backwards through time. Alternatively, using the forward algorithm, with its recursion proceeding forwards in time,

f_i(x_i) = P(C_i | X_i = x_i) Σ_{x_{i−1}} P(X_i = x_i | X_{i−1} = x_{i−1}) f_{i−1}(x_{i−1}),

initialized with f_1(x_1) = P(C_1 | X_1 = x_1) P(X_1 = x_1), and we can apply this recursion to calculate the likelihood function of s given G as

L(s) = Σ_{x_K} f_K(x_K).

We now turn to the posterior marginal of the allele frequency. First, let us establish that P(X_i | G_k, s) = P(X_i | C_k, s); i.e., that the topology of the tree, conditioned on the allelic labeling of its leaves, does not affect the posterior probability of X_i. We will suppress s for ease of notation.
Because the topology is independent of the allele frequency if we condition on the allelic labeling, we have

P(X_i | G_k) = P(X_i | C_k, topo) = P(X_i | C_k),

where we use topo to denote the topology of the tree, conditioned on its allelic labeling.
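The forward and backward recursions above can be sketched numerically as follows. The transition matrix, the emission terms (standing in for the coalescent likelihoods P(C_i | X_i)), and all dimensions are toy values:

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 5, 4                                             # frequency bins, epochs
A = rng.random((d, d)); A /= A.sum(1, keepdims=True)    # transition matrix
e = rng.random((K + 1, d))                              # emission likelihoods per epoch
pi0 = np.full(d, 1.0 / d)                               # initial distribution over X_1

# Forward: f[i] = P(C_{1:i}, X_i)
f = np.zeros((K + 1, d))
f[1] = pi0 * e[1]
for i in range(2, K + 1):
    f[i] = (f[i - 1] @ A) * e[i]

# Backward: b[i] = P(C_{i+1:K} | X_i), with b[K] = 1
b = np.ones((K + 1, d))
for i in range(K - 1, 0, -1):
    b[i] = A @ (e[i + 1] * b[i + 1])

# The likelihood is the same no matter where the two recursions meet
like = [f[i] @ b[i] for i in range(1, K + 1)]
assert np.allclose(like, like[0])
```

The invariance of `f[i] @ b[i]` across i is the standard forward-backward identity, and either endpoint recovers the likelihood L(s) described above.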
Next, we derive the importance sampling estimator of the allele frequency marginal posterior:

P(X_i | D, s) = κ E_{G | D, s=0}[ (L(s; G) / L(0; G)) P(X_i | C, s) ],

where κ is [L(s)/L(s = 0)]^{−1}, for which we have already established an importance sampling estimator (main text). Thus, our importance sampling estimate of the posterior marginal given s is

P̂(X_i | D, s) = Σ_{m=1}^{M} w_m P(X_i | C^{(m)}, s) / Σ_{m=1}^{M} w_m, where w_m = L(s; G^{(m)}) / L(0; G^{(m)}).
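A minimal sketch of this self-normalized importance sampling estimate, with random stand-ins for the per-ARG likelihood ratios and the neutral posterior marginals:

```python
import numpy as np

rng = np.random.default_rng(2)
M, d = 40, 5                              # number of sampled ARGs, frequency bins
w = rng.random(M)                         # per-ARG likelihood ratios L(s)/L(0) (toy)
post_m = rng.random((M, d))               # P(X_i | C^(m), s) for each sampled ARG (toy)
post_m /= post_m.sum(1, keepdims=True)

# Self-normalized importance sampling estimate of P(X_i | D, s)
post_s = (w[:, None] * post_m).sum(0) / w.sum()
assert np.isclose(post_s.sum(), 1.0)
```

Because each per-sample marginal sums to one and the weights are renormalized, the estimate is automatically a valid probability distribution.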

Bayesian estimates of the selection coefficient
Allowing a prior distribution on s, π(s), the posterior of the selection coefficient π(s | D) follows

π(s | D) ∝ L(s) π(s).

Then the estimate of the posterior marginal is given by

P̂(X_i | D) = ∫ P̂(X_i | D, s) π(s | D) ds,

which can be approximated by a sum over d discretized values of s,

P̂(X_i | D) ≈ Σ_{j=1}^{d} P̂(X_i | D, s_j) π̂(s_j | D),

where π̂ represents a probability mass function over s.
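The discretized posterior over s can be sketched as follows; the grid, the likelihood surface, and the prior are all illustrative values, not output from CLUES:

```python
import numpy as np

s_grid = np.linspace(0.0, 0.05, 11)           # discretized values of s
loglik = -0.5 * ((s_grid - 0.02) / 0.01)**2   # toy log-likelihood surface log L(s)
prior = np.full_like(s_grid, 1.0 / len(s_grid))   # uniform prior pi(s)

# pi(s | D) proportional to L(s) * pi(s), normalized over the grid
post = np.exp(loglik - loglik.max()) * prior
post /= post.sum()
s_map = s_grid[np.argmax(post)]               # maximum a posteriori value of s
assert np.isclose(post.sum(), 1.0)
```

Subtracting the maximum log-likelihood before exponentiating keeps the normalization numerically stable when the surface spans many orders of magnitude.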

Simulations of trajectories, local trees, and haplotypes
To simulate data, we used a slight modification of the standard discoal package [4], available at https://github.com/kern-lab/discoal. Standard discoal has no option to output the allele frequency trajectory, nor to simultaneously output the sample's local trees and haplotypes; both are needed to compare inference against ground truth for the same replicate. Our modified version prints the trajectory, the local trees, and the haplotypes to stdout, and is available on the CLUES Github page. The standard discoal documentation otherwise applies completely to our modified version; we refer the reader to the documentation in that repository for the exact meaning of the arguments and options.
To simulate data under the constant effective population size model, we ran the modified discoal with the appropriate sample size, locus length, and selection arguments; see the discoal documentation for the argument conventions.

Reformatting discoal output
It is necessary to parse the output of discoal not only to prepare the input files for ARGweaver (and CLUES, which uses ARGweaver-formatted data), but also to separate trajectories, local trees, and haplotypes into separate data files. We wrote a Python script to run this process, parseDiscoalOutput.py, which is available on our Github page. The command to run this script is

$ python parseDiscoalOutput.py example.discoal <length_of_sequence> <num_sites> <num_haps> <out>

where the arguments are the length of the sequence in base pairs, the number of sites in the discoal simulation (in the two examples above, this would be 10^5), the total number of haplotypes sampled (n = 51), and a basename for output files. This script generates 3 files, with extensions .traj, .trees, and .sites, holding the trajectory, the true local tree at the site of interest, and the haplotypes reformatted in ARGweaver format, respectively. These files are named by the index of each replicate simulated in the file example.discoal. Note that this script is currently hardcoded to assume the SNP of interest is located at the center of the locus.
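For intuition, here is a minimal sketch of extracting haplotypes from one ms/discoal-style replicate; parseDiscoalOutput.py additionally handles the trajectory, the local trees, and the ARGweaver reformatting. The sample text and field layout below are illustrative:

```python
# One ms/discoal-style replicate: "//" delimiter, segsites count,
# normalized positions, then one 0/1 string per haplotype.
replicate = """\
//
segsites: 4
positions: 0.1 0.4 0.5 0.9
0101
1101
0001
"""

lines = replicate.splitlines()
nseg = int(lines[1].split()[1])                        # number of segregating sites
positions = [float(p) for p in lines[2].split()[1:]]   # site positions in [0, 1]
haplotypes = lines[3:]                                 # one string per sampled haplotype

assert nseg == 4 and len(positions) == 4
assert all(len(h) == nseg for h in haplotypes)
```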

Performing ARG-sampling using ARGweaver
We use the arg-sample function in the ARGweaver package, available at https://github.com/mjhubisz/argweaver, to sample the posterior ARG [5]. This function requires one major input: the .sites file we generated in the previous step. However, it is also necessary to provide the proper demographic model, mutation rate, and recombination rate, and to specify the desired length of the MCMC run. The files specified using --times-file, --age-file, and --popsize-file give, respectively, the time discretization (in generations), the age of the ancient haplotype used to polarize alleles (in generations), and the population size trajectory. We supply all of the corresponding files on the CLUES Github page.
We also want to point out several tuning parameters: --resample-window and --resample-window-iters adjust the size of the resampling window and the number of resamples to perform on a particular window. Adjusting these parameters affects the behavior of the MCMC routine by changing how aggressively changes are proposed to the ARG. Increasing the resample window will decrease the acceptance probability of a given proposal, but increasing the number of iterations will increase the probability that at least one of these proposals is accepted. These parameters should be adjusted to yield roughly a 30-70% acceptance rate (Melissa Hubisz, personal communication).
This procedure will output a series of .smc.gz files.
… log -E > argweaver.example.trees

This saves a list of Newick trees, extracted at site 50000, to argweaver.example.trees.
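For orientation, the following hedged sketch pulls the local tree covering a focal site out of ARGweaver's (uncompressed) .smc output. The TREE line layout assumed here, with start and end coordinates followed by a Newick string, is an assumption for illustration, as is the sample text:

```python
# Toy .smc-style content: NAMES/REGION header lines, then TREE lines
# giving the genomic interval each local tree spans.
smc_text = """\
NAMES\t0\t1\t2
REGION\tchr\t1\t100000
TREE\t1\t40000\t((0:100,1:100):50,2:150);
TREE\t40001\t100000\t((0:80,2:80):70,1:150);
"""

def tree_at(smc_text, site):
    """Return the Newick string of the local tree whose interval covers `site`."""
    for line in smc_text.splitlines():
        fields = line.split('\t')
        if fields[0] == 'TREE' and int(fields[1]) <= site <= int(fields[2]):
            return fields[3]
    return None

assert tree_at(smc_text, 50000) == '((0:80,2:80):70,1:150);'
```

Applied across sampled .smc.gz files, this kind of filtering yields the per-sample Newick trees at the site of interest.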

Preliminaries for CLUES
CLUES depends on a probabilistic model for allele frequency changes. Thus, it is necessary either to download our pre-computed transition probabilities for the constant N = 10^4 or European demography models (formatted in HDF5 using the h5py package [6]), or to compute transition probabilities from scratch for a custom model. We provide an example file, example.f_75.hdf5, precomputed conditioned on X(0) = 0.75. We next describe how to compute transition probabilities from scratch.
To compute transition probabilities for a set of selection coefficients s1, s2, ..., sL from scratch, run the script conditional_transition_matrices.py, provided on the CLUES Github page.

There are also several options you can deploy when running CLUES. Here we show --thin and --burnin, which we use to treat the first 100 trees in example.trees as burnin, and to thin down to every 10th tree in the file after that point. This corresponds to an overall burnin of 1000 samples and an overall thinning rate of 100 trees, assuming you use the baseline ARGweaver thinning rate of 10 trees. The --output option saves the output of CLUES (e.g., the likelihood surface, importance sampling weights, MLEs, trajectory posterior marginals, and more) to an HDF5 file named example.clues.h5.
There are more options available. To run CLUES under an SSV model, simply use the --ssv option. (Note: running --ssv requires a transition probability file computed using the --ssv option in conditional_transition_matrices.py.) To fix the value of s (rather than optimize over all potential values of s), use the option --selCoeff <s>. To place a uniform prior on s, use the option --prior. To specify the position of the site of interest, set --posn <position>, which defaults to 50000.

Runtime
To give a sense of the expected runtime of transition probability pre-computation, ARG sampling in ARGweaver, and CLUES itself, we timed each of these 3 steps for an example analysis. Computing transition probabilities took 36 minutes (14 minutes for unconditional transition probabilities, 22 minutes for conditioning); we used 45 discretized allele frequencies, 39 discrete timepoints, and 25 different values of the selection coefficient, on par with the values used in our analyses in the main text.
We ran ARG sampling and CLUES on the dataset used in our study of background selection (see "Effects of background selection", main text). The time required to perform ARG sampling varied across replicates, but generally fell within 40-60 minutes. The time required to run CLUES was 5 minutes, using M = 40 sampled ARGs after thinning. For analyses of larger regions and/or sample sizes, the ratio of ARGweaver runtime to CLUES runtime will increase.