Polygenic score accuracy in ancient samples: Quantifying the effects of allelic turnover

Polygenic scores link the genotypes of ancient individuals to their phenotypes, which are often unobservable, offering a tantalizing opportunity to reconstruct complex trait evolution. In practice, however, interpretation of ancient polygenic scores is subject to numerous assumptions. For one, the genome-wide association (GWA) studies from which polygenic scores are derived, can only estimate effect sizes for loci segregating in contemporary populations. Therefore, a GWA study may not correctly identify all loci relevant to trait variation in the ancient population. In addition, the frequencies of trait-associated loci may have changed in the intervening years. Here, we devise a theoretical framework to quantify the effect of this allelic turnover on the statistical properties of polygenic scores as functions of population genetic dynamics, trait architecture, power to detect significant loci, and the age of the ancient sample. We model the allele frequencies of loci underlying trait variation using the Wright-Fisher diffusion, and employ the spectral representation of its transition density to find analytical expressions for several error metrics, including the expected sample correlation between the polygenic scores of ancient individuals and their true phenotypes, referred to as polygenic score accuracy. Our theory also applies to a two-population scenario and demonstrates that allelic turnover alone may explain a substantial percentage of the reduced accuracy observed in cross-population predictions, akin to those performed in human genetics. Finally, we use simulations to explore the effects of recent directional selection, a bias-inducing process, on the statistics of interest. We find that even in the presence of bias, weak selection induces minimal deviations from our neutral expectations for the decay of polygenic score accuracy. By quantifying the limitations of polygenic scores in an explicit evolutionary context, our work lays the foundation for the development of more sophisticated statistical procedures to analyze both temporally and geographically resolved polygenic scores.

Introduction Decay in linkage disequilibrium (LD) between tagging and causal sites, population stratification, variation in allele frequencies within and across populations, and environmental heterogeneity, among other factors, are all thought to negatively impact the prediction accuracy of polygenic scores (see e.g., [1][2][3][4][5][6][7], and more recently in humans, e.g., [8][9][10][11][12][13]). Many of these issues likely influence both withinand out-of-sample predictions, where out-of-sample may refer to an individual sampled from a distinct time or location relative to that of the GWA study. While empirical [12,14] and simulation [1,13,15] or combined [16] studies have explored particular population genetic scenarios or experimental contexts, we still do not know the extent to which each of these factors compromises prediction accuracy in general.
In this work, we address an issue pertinent to out-of-sample prediction: that causal loci may have different allele frequencies in the GWA study and focal populations. Variants common in the GWA study may be rare in the focal population, and vice versa. We refer to this phenomenon as allelic turnover. Allelic turnover implies that effect estimates ported across space and time, or both, may not reflect all of the genetic variation relevant to phenotypic variation in an ancient or geographically distinct population. Allelic turnover further suggests that the statistical properties of ancient polygenic scores depend on when an ancient individual was sampled -a feature not currently accounted for in ancient DNA analyses. Similarly, statistical properties of geographically disparate polygenic scores depend on the divergence time between the GWA study and focal populations. An understanding of allelic turnover in these contexts may ultimately improve statistical analyses of temporally (e.g., [17][18][19][20]) and geographically resolved polygenic scores (e.g., [9,10]), analyses which are increasingly commonplace.
We aim to quantify the effect of allelic turnover on the polygenic scores of such out-of-sample individuals when they are computed using effect estimates from a contemporary population. We expect that increases in ancient sampling time or divergence time will be associated with declines in polygenic score accuracy due exclusively to allelic turnover. The question is, by how much does accuracy decline? And, can allelic turnover alone explain the reduced accuracy of out-of-sample predictions observed in numerous human (e.g., [15,16]), animal (e.g., [1,2,4]) and plant (e.g., [21,22]) experiments and simulation studies. The answer is likely to depend on the particular population genetic, trait, and GWA study features of the system under study [3]. We attempt to capture some important aspects of this diversity in our modeling framework.
Here, we consider a standard implementation of the polygenic scoreŶ which attributes non-zero effects to a particular set of loci, S. An individual's polygenic score is a weighted sum of its genotype, where the weights are the estimated allelic effects. The loci in S and their estimated effects are usually identified in large-scale GWA studies, often performed in regional biobanks with sample sizes in the tens to hundreds of thousands of individuals (e.g., the UK Biobank [23], BioBank Japan [24]). Frequently, the set S includes loci which are approximately independent and surpass some allele frequency and p-value thresholds. Though there are numerous ways to define a polygenic score (e.g., [25,26] and see Section 4 in S1 Text), the "prune and threshold" method is commonly used and proves analytically tractable in our framework.
Previous quantitative genetic approaches, such as [27] and [16], largely ignore the underlying population genetic dynamics. For example, Wang et al. [16] estimate the reduction in polygenic score accuracy in a focal population relative to the GWA study population as a function of the fixed population-specific trait heritabilities, allele frequencies, and LD patterns, and the estimated per-locus effects. In contrast, we embed the ancient polygenic score in an explicit population genetic framework, allowing us to take into account changes in allele frequency as well as the statistical constraint imposed by a finite GWA study sample size. And, distinct from previous approaches to the evolutionary modeling of polygenic scores [28], we track the frequencies of all loci that potentially contribute to a trait-not just the loci included in the polygenic score (i.e., loci in S).
Henceforth, we frame our study in terms of ancient polygenic scores. However, we formally demonstrate that our theoretical results apply to out-of-space polygenic scores, where the population divergence time multiplied by two is analogous to the ancient sampling time (see Fig 1 and Section 1 in S1 Text). The latter scenario can represent an ancient individual sampled from a population not directly ancestral to that of the GWA study as the two populations must have diverged at some point in the past. This scenario, to a first approximation, describes the population displacement events thought to be ubiquitous in the history of humans (e.g., [29]). However, human history is additionally characterized by numerous admixture events and population size changes (e.g., [29]) which are not yet captured within our modeling framework.
We use several statistics to characterize ancient polygenic score error in distinct population genetic and GWA study scenarios. Each statistic is indexed by the ancient sampling time τ: the bias, bias(τ), mean-squared error, mse(τ), estimated additive genetic variance,V A ðtÞ, and polygenic score accuracy, ρ 2 (τ), which approximates the expectation of the squared sample correlation coefficient between the polygenic scores and phenotypes of an ancient sample. In addition, we can readily express these statistics as functions of the genetic divergence between the ancient and GWA study populations, as measured by the fixation index, F ST (Section 11 in S1 Text). We first derive general forms for these statistics that are agnostic to almost all of our modeling assumptions and which provide conceptual insights into the effects of allelic turnover. Next, we derive explicit, parameter-dependent expressions for each statistic when the trait is neutrally evolving in a population of constant size subject to recurrent mutationwhich for small mutation rates approximates the infinite sites model. We take advantage of the spectral representation of the transition density function of the Wright-Fisher diffusion (tdf) to execute these computations [30][31][32][33]. We then find interpretable linear approximations for the initial rate of increase (or decrease) of the metrics under study. These approximations apply for the small ancient sampling times typical of ancient humans remains (e.g., see [18]).
Consistent with our expectations, mse(τ) increases and the estimated additive genetic vari-anceV A ðtÞ decreases with increasing sampling age τ. Despite the fact that mse(τ) andV A ðtÞ are measuring distinct quantities-and indeed have different functional forms-our linear approximations reveal that, under our assumptions, both statistics initially change at approximately the same rate. This rate is proportional to the product of the mutation rate and the power to detect trait-associated loci in the GWA study, which in turn, is influenced by both study size, the magnitude of the true per-locus effect, and the underlying distribution of the allele frequencies of causal loci.
Moreover, we show that polygenic score accuracy ρ 2 (τ) is proportional toV A ðtÞ, which, as stated, is sensitive to the GWA study and evolutionary parameters. UnlikeV A ðtÞ, ρ 2 (τ) depends on the trait heritability h 2 , with larger values of h 2 increasing its rate of decay. In contrast, for small mutation rates, relative accuracy, defined as the ratio of ρ 2 (τ) to accuracy measured in a present-day sample ρ 2 (0), is insensitive to h 2 , the true per-locus effect size, and the GWA study parameters, as long as the GWA study size n exceeds some minimum threshold. We show that this result likely holds for an arbitrary distribution of effects. Importantly, accuracy and relative accuracy decay considerably over the short time spans characteristic of ancient human samples and geographically distinct human populations.
With equal probability of detecting positive versus negative effect alleles, and under neutrality, the bias of the polygenic score is zero for all ancient sampling times. In practice, both of these conditions are likely violated. For example, detection imbalances have been observed in case-control GWA studies [34], and many polygenic traits are likely under some form of selection [35,36]. While unequal thresholds do not precisely capture the phenomena described in [34], they do yield a non-zero bias(τ) within our framework. The magnitude of this bias is  A) and (B) portray the two demographic scenarios encompassed by our modeling framework. In (A), the ancient individual is sampled at an earlier time τ from the same population in which the GWA study is conducted. In (B), the ancient individual is sampled at an arbitrary time τ 0 from a population that split from the population in which the GWA study was conducted at some time τ split in the past. The dotted line schematically relates τ 0 to the ancient sampling time τ of (A), i.e., τ = 2τ split − τ 0 . In (C), a graphical model relates the random variables explicit and implicit in the polygenic scoreŶ ðtÞ and phenotype Y(τ) of an ancient individual sampled τ generations in the past, as in (A). Darkly shaded and thickly bordered nodes are observed quantities. Unshaded and thinly bordered nodes are unobserved. Lightly shaded nodes bordered by dashed lines denote estimated quantities. Edges denote direct dependencies between connected nodes. For example, conditional on the ancient genotype X(τ), the polygenic scoreŶ ðtÞ is independent of the population allele frequencies Z(τ). Quantities in blue are associated with the present day only, and include the population allele frequencies Z(0); the genotypes of the n individuals in the GWA study, fX i g n i¼1 and their phenotypes, fY i g n i¼1 ; and, the effects and intercept term estimated in the GWA study,β andĈ, respectively.
https://doi.org/10.1371/journal.pgen.1010170.g001 small, implying that other perturbations would be necessary to explain an observed, appreciable bias. To relax the neutrality assumption, we simulate recent directional selection. We find that when the selection coefficient is large enough (4Ns � 1), selection indeed yields biased polygenic scores. Though this selection-induced bias is several orders of magnitude larger than that induced by asymmetry in the detection thresholds, it is still small relative to the variance explained by segregating genetic variants. Additionally, weak selection only induces small deviations from neutral theoretical expectations for the other statistics, suggesting that our neutral theory may still accurately capture accuracy declines in the presence of weak directional selection. Altogether, our theoretical results suggest that allelic turnover may make large contributions to out-of-sample reductions in accuracy, even under neutrality.

Model and metrics
Our modeling framework readily encompasses two demographic scenarios. In the first, the focal individual is sampled from the same population in which the GWA study was performed, but at a previous point in time τ (Fig 1A). We specify τ in coalescent time units: An ancient sampling time of τ corresponds to 2N � τ generations in the past, with 2N as the diploid population size. When τ = 0, the focal individual is an independent sample from the GWA study population. In the second scenario (Fig 1B), the focal individual is sampled at τ 0 from a population that diverged from the GWA study population at τ split (in coalescent time units) in the past. However, we show in Section 1 in S1 Text that scenario (A) is equivalent to scenario (B) if the ancient sampling time τ is equal to 2τ split − τ 0 . Therefore, we proceed according to the first scenario, while emphasizing that our conclusions readily translate to the second.
We summarize the full model in Fig 1C and detail its constituent parts in the proceeding subsections. Briefly, the genotype of the ancient individual is sampled conditional on the population allele frequencies at τ. The ancient individual's phenotype is then sampled conditional on its genotype. Population allele frequencies for all loci that potentially affect the trait evolve until present day, at which point the GWA study is conducted. In particular, the effect sizes included in the polygenic score model are estimated from the genotypes and phenotypes of n contemporary individuals. Finally, the ancient polygenic score is computed from the ancient individual's genotype and the polygenic score model derived from the results of a contemporary GWA study.

Sampling the genotype of a time-indexed individual
We assume that each site is at most bi-allelic, with possible alleles A 1 and A 2 . We denote the genotype of an individual sampled at some time t (in coalescent units) as X iℓ (t), where i indexes the individual, and ℓ the locus. For the ancient individual(s), t = τ; for the participants in the GWA study, t = 0. For mathematical convenience, we use a symmetric genotype encoding, that is X iℓ (t) 2 {−1, 0, 1}, corresponding to genotypes A 1 A 1 , A 1 A 2 , and A 2 A 2 , respectively. Conditional on the population allele frequency of allele A 2 at t, Z ℓ (t), the distribution of X iℓ (t) is given by the Hardy-Weinberg sampling probabilities:

Modeling the true phenotype
The genetic basis of a polygenic trait, Y, is determined by a set L, consisting of L distinct genetic loci (jLj ¼ L), each with a true per-locus additive effect b ' 2 R (for ℓ = 1, 2, . . ., L). We further assume that the L loci contribute linearly to the trait, such that the true phenotype of the i-th individual sampled at t is specified by the commonly used additive genetic model [37], where C is a constant; β ℓ is the true additive effect of locus ℓ; and � i ðtÞ � N ð0; s 2 e Þ is a normally distributed random variable that incorporates variance in the phenotype due to the environment. The summation in Eq 1 is often referred to as an individual's genetic value [25]. A locus ℓ contributes ±β ℓ to the genetic value (and phenotype) of an individual who is homozygous at ℓ, and zero to that of a heterozygous individual. C is thus the phenotype of an hypothetical all heterozygous individual. Without loss of generality, we set C = 0. In addition, we assume, without loss of generality, that all β ℓ � 0 such that locus ℓ contributes −β ℓ to the genetic values of A 1 A 1 individuals and +β ℓ to the genetic values of A 2 A 2 individuals.
A fixed locus, Z ℓ (t) 2 {0, 1}, will affect the mean phenotype of the population at t by ±β ℓ but will not contribute to phenotypic variation. We illustrate this fact by conditioning on the allele frequencies of all loci in L at t, Z(t) 2 [0, 1] L . Assuming linkage equilibrium between loci as well as independence between the environmental and genetic effects, we have, The summation in Eq 2 is the additive genetic variance at t, V A (t). For a segregating site, the summand is proportional to Z ℓ (t)(1 − Z ℓ (t)), with 0 < Z ℓ (t)(1 − Z ℓ (t)) < 1. For a fixed site, the summand is zero and the site does not contribute to the additive genetic variance V A (t). An important feature of our model is that some of the L loci may not exhibit genetic variation in the population at a given time. More concretely, the set of loci with non-zero estimated effects on the polygenic score, S, may only be a small subset of L. Thus, we assume that L is a superset of S.

Constructing a model for the polygenic score
As our aim is to isolate the effects of allelic turnover on the statistical properties of polygenic scores, we make the additional assumption that the genotyped sites are the causal sites. (We have already assumed that all loci are in linkage equilibrium.) Akin to [38], we employ a simple threshold model for the effect estimates. For a GWA study consisting of n individuals (and 2n chromosomes),b where D ℓ is the allele count of the trait-increasing allele A 2 at the ℓ-th site in the GWA study sample; and d ℓ1 and d ℓ2 are the site-specific detection thresholds. In this simplified model, the true effect is estimated perfectly for all sites with allele counts within the intervals (d ℓ1 , 2n − d ℓ2 ) for ' 2 L. In Section 4 in S1 Text, we relate Eq 3 to two alternative estimation procedures: maximum likelihood estimation (MLE) and the best linear unbiased predictor (BLUP).
We allow the two thresholds to differ in order to encompass scenarios in which power is an asymmetric function of the sample allele frequencies, e.g., there is more power to detect low frequency (D ℓ < n) versus high frequency (D ℓ > n) trait-increasing alleles. Such situations may arise with polygenic disease inheritance and imbalanced case and control sample sizes [34]. In most cases, however, we will consider symmetric detection thresholds, with d ℓ1 = d ℓ2 = d ℓ . The threshold d ℓ depends on on the phenotypic variance, genome-wide significance threshold, true per-locus effect β ℓ , and GWA study size n. In Section 2 in S1 Text, we give an explicit form for this dependency for a continuous focal trait and equal detection thresholds. Varying d ℓ while keeping the GWA sample size fixed is equivalent to varying the true per-locus effect β ℓ . Varying the GWA study size n while keeping β ℓ and the other parameters fixed is akin to varying the GWA study's power to detect loci of a particular effect size. In Analytical Results, we do both.
The threshold model arises in the large GWA study size n limit for the model ofb ' provided in Equation 5 in S1 Text. Namely, as long as D ℓ is not too small, the variance ofb ' goes to zero as n grows. Thus, the threshold model in Eq 3 will necessarily underestimate the true variance ofb (Section 4 in S1 Text). Still, this model captures the dependency ofb ' on the GWA study sample size n and the true per-locus effect β ℓ , while facilitating our analytical treatment.
In order to compare the polygenic score with an individual's true phenotype, we need to account for all sites in the mutational target L, not just those in S, the set of sites with nonzero effect estimates in the polygenic score. Asb ' ¼ 0 for any site in L but not S, we express the polygenic score as a function of all loci in L. The ancient polygenic score of individual i sampled τ generations in the past is then given by, whereĈ is the average phenotype of the GWA sample after subtracting the estimated genetic effects at all loci,Ĉ with � Y ¼ 1 n P n j¼1 Y j and � X ' ¼ 1 n P n j¼1 X j' as the mean phenotype and genotype at locus ℓ in the GWA study sample, respectively. Here, and in the remainder of our study, we omit timeindexing for random variables associated with the GWA study at t = 0. By design, the estimated interceptĈ absorbs the effects of all loci which were not detected as significant in the GWA study, i.e., those sites for whichb ' ¼ 0. Its presence in the polygenic score of Eq 4 is necessitated by the fact that, to facilitate our analytical treatment, we did not center nor scale the genotypes and phenotypes in the GWA study. Importantly, all of our results are independent of this choice (Section 5 in S1 Text). Henceforth, unless otherwise noted, we refer to Eq 4 as the polygenic score and to the summation in Eq 4 as the genetic prediction.

Modeling population genetic dynamics
Population genetic processes govern the correlations between allele frequencies at distinct points in time. We model this correlation using the Wright-Fisher diffusion with recurrent mutation. As we assumed all loci were in linkage equilibrium, their allele frequencies evolve forward in time independently, subject to genetic drift and mutation. At each site, alleles mutate from A 1 ! A 2 with rate μ, and from A 2 ! A 1 with rate ν. While our results readily generalize to arbitrary μ and ν, we restrict ourselves to equal mutation rates, μ = ν.
We further assume that the population is at equilibrium. In this setting, the marginal allele frequencies are beta-distributed, with shape and scale parameters specified by the populationscaled mutation rate; we denote the latter quantity by a, with a = 4Nμ = 4Nν.
The relative magnitudes of mutation and genetic drift determine which force dominates an allele frequency trajectory. For example, as a approaches 0, the effects of mutation on the frequencies of segregating mutations become negligible and genetic drift dominates. In this low mutation regime (a � 1, or equivalently m � 1 2N ), the recurrent mutation model approximates the infinite sites model, while still retaining the features that make it attractive for our analytical treatment. In particular, the stationary allele frequency distribution is a well-defined probability distribution under the recurrent mutation model, but not under the infinite sites model. We concern ourselves almost exclusively with the low mutation regime.

Quantifying out-of-sample prediction errors
To quantify how well the polygenic score approximates the true phenotype of an individual sampled uniformly at random from the population at time τ before the present, we use several statistics: Bias. We define the bias as the expectation of the difference between the polygenic score and true phenotype, where, here and elsewhere, we omit the subscript when there is only one sample. The expectation in Eq 6 is with respect to the entire random process, encompassing the underlying population genetic dynamics, estimation of the per-locus effects in the GWA study, and computation of the ancient polygenic score (illustrated in Fig 1C).

Mean-squared error (mse).
We define the mse as the expectation of the squared prediction error, mseðtÞ≔E½ðŶ ðtÞ À YðtÞÞ 2 �: ð7Þ As in Eq 6, the expectation in Eq 7 is with respect to all sources of randomness in the model. The variance of the prediction error equals the difference of the mse and the square of the bias, and thus it is fully characterized by these two metrics.
Expected estimated additive genetic variance (V A ). The estimated additive genetic variance is an estimate of the amount of phenotypic variance in the ancient population explained by additive genetic effects alone. We useV A ðtÞ to represent the expectation of this quantity,V whereẐ ' ðtÞ is an estimate of the ancient population allele frequency computed from a sample of n a individuals sampled at τ. The expected true additive genetic variance, E½V A �, can be found by taking the expectation of the summation in Eq 2. Polygenic score accuracy (ρ 2 ). Practitioners often compute the sample correlation coefficient r 2 to measure the accuracy of a predictor in a sample. Here, our sample is n a ancient individuals sampled at time τ, thus, where Cov[�, �] and Var [�] are the sample covariance and variance operators, respectively, and Y ðtÞ; YðtÞ 2 R n a are the n a -dimensional vectors of polygenic scores and phenotypes of the ancient individuals, respectively. Ideally, we would compute the expectation of this quantitybut, this is challenging due to the common difficulty of computing an expectation of a ratio of random variables. Thus, we approximate the expectation of r 2 (τ) as the ratio of expectations, where, as above, the covariance and variances are taken with respect to the sample of n a ancient individuals, while the expectation is over all sources of randomness in Fig 1C (see Section 7.4 in S1 Text for more details). We present simulations in the section Polygenic score accuracy of Analytical Results showing that ρ 2 (τ) is a good approximation for the expectation of r 2 (τ) in the parameter regimes of interest.

Analytical Results
By how much does the prediction accuracy of a polygenic score decrease as the time between sampling the ancient individual and conducting the GWA study increases? To answer this question, we consider a trait potentially influenced by L genetic loci, each with true effect β ℓ � 0, ℓ = 1, . . ., L. The forward evolution of sites underlying this trait is modulated by a per site, per generation mutation rate, μ, and a population scaled rate of a = 4Nμ. The diploid population of size 2N chromosomes is assumed to be at equilibrium. The parameters dictating the GWA study are the sample size n and the detection thresholds specified by d 1 The metrics are indexed by the ancient sampling time τ in coalescent time-units. An ancient sampling time of τ corresponds to 2N � τ generations in the past. We omit the time index for variables associated with the GWA study, which occurs at present day (t = 0). (We show in Section 11 in S1 Text, that the metrics can also be expressed as a function of divergence or F ST between the ancient and contemporary populations). Each subsection is structured as follows: We first derive a general expression for the statistic that does not depend on how we model the population genetic dynamics nor the GWA study. Second, we derive an analytical expression for the statistic under the population genetic assumptions and the GWA study threshold model described in Model and metrics.

Bias
We can rewrite the sampling time-dependent bias defined in Eq 6 as, where bias ℓ (τ) is the contribution of locus ℓ to bias(τ). From Eq 11, we see that bias ℓ (τ) � 0 when either or both ofb ' � b ' and � X ' � X ' ðtÞ are true. Thus, bias ℓ (τ) is minimal when (i) effect estimates are accurate, and (ii) the allele frequencies have not changed substantially in the interval [τ, 0].
Under the assumption of equal mutation rates and detection thresholds (d ℓ1 = d ℓ2 ), bias ℓ (τ) = 0 for τ � 0 for a reason distinct from those stated above. Trait-increasing alleles at high frequencies (D ℓ > n) and low frequencies (D ℓ < n) are detected as significant (b ' 6 ¼ 0) with equal probability. An equivalent assumption is that power is not affected by whether the most prevalent allele is trait-increasing or decreasing. Subsequent evolution of the allele frequencies preserves this symmetry and bias(τ) remains equal to zero for all τ. It follows that in the absence of additional perturbing forces, an estimate of the mean polygenic score from a sample of n a ancient individuals will also be unbiased, and therefore will on average accurately reflect the lack of change in the mean phenotype.
However, if we introduce asymmetry in the detection thresholds (d ℓ1 6 ¼ d ℓ2 ), bias(τ) is nonzero for all τ (Section 7.1 in S1 Text). Using the spectral representation of the transition density of the Wright-Fisher diffusion (tdf), we derive the per-locus contribution to the bias, bias ℓ (τ) (Section 7.1 in S1 Text). For a small population-scaled mutation rate a and a large GWA study size n, we approximate this expression (given in Equation 45 in S1 Text) as, where, is the probability that the allele count of site ℓ is less than d ℓi , i.e., D ℓ < d ℓi for i = 1, 2; and, B(�, �) is the beta function. Thus, the magnitude of bias ℓ (τ) is approximately proportional to the difference in the probability of detecting high (D ℓ > n) versus low (D ℓ < n) frequency alleles, and increases exponentially with τ. With a large GWA study size n and a small mutation rate a, this difference is small relative to the square root of the additive genetic variance-the ratio of these two quantities is smaller than OðaÞ (Fig S1a in S1 Text). This is due to the fact that when the mutation rate is small, most alleles are close to fixation or fixed. The stationary population allele frequency density Varying d ℓi then has relatively little impact on P ðd 'i Þ , constraining the difference between the one-sided detection probabilities (Fig S1b in S1 Text).

Mean-squared error
The sampling time-dependent mean-squared error mse(τ) can be expressed as, where s 2 e is the variance in the phenotype due to the environment (Section 7.2 in S1 Text). Note the similarity of the left term in Eq 14 to the form of bias(τ) given in Eq 11-similar heuristics apply. Under the threshold model specified in Eq 3, sites at moderate frequencies in the GWA study sample, D ℓ 2 [d ℓ , 2n − d ℓ ], will not contribute to mse(τ) sinceb ' ¼ b ' . Only sites with frequencies outside this interval (including sites invariant in the GWA study sample) will contribute, and their contributions will be proportional to the squared difference between X ℓ (τ) and � X ' . In practice, moderate frequency loci will also contribute to mse(τ) due to errors in the estimation of the effect estimates and any difference between the ancient genotypes and the average genotypes in the GWA study sample at these sites (Section 4 in S1 Text).
We use the spectral representation of the tdf (Section 6 in S1 Text) to derive an analytical expression for mse ℓ (τ), the per-locus contribution to the mse (Section 7.2 in S1 Text). From this expression, Equation 50 in S1 Text, we derive a linear approximation for the initial perlocus increase in this statistic, Δmse ℓ (τ). With a symmetric detection threshold (d ℓ1 = d ℓ2 = d ℓ ) we have, where mse ℓ (0) is the contribution of site ℓ to mse(τ) for τ = 0 (Equation 76 in S1 Text); and 2P ðd ' Þ , defined in Eq 13, is the probability that the allele count of site ℓ is outside the detection interval such thatb ' ¼ 0. Both mse ℓ (0) and P ðd ' Þ depend on the mutation rate a, the GWA study size n, and the detection threshold d ℓ .
Δmse ℓ (τ) reflects the time-dependent contributions of sites not detected in the GWA study. To see this, we condition on the effect estimateb ' , � a for small τ, and consequently, the combined effects of drift and mutation on mse ℓ (τ) are captured in the product of the mutation rate and sampling time aτ.
In addition, Eq 15 suggests that the rate at which mse ℓ (τ) increases will be shared across parameter regimes when aP ðd ' Þ is similar (Fig S4a in S1 Text). To illustrate this, we use our analytic formula (given in Equation 50 in S1 Text) to compute mse ℓ (τ) for several low mutation rates, a 2 {10 −4 , 10 −3 , 10 −2 }, and three GWA study sizes, n 2 {10 4 , 10 5 , 10 6 } (Fig 2A). These mutation rates and sample sizes span the range of parameter values appropriate for human data. We depict our results in two ways: (i) we plot the change in mse ℓ (τ), and (ii) we plot mse ℓ (τ) normalized by the expected additive genetic variance contributed by a single site. At stationarity the expected additive genetic variance is constant and equal to, for a scaled-mutation rate a. The plot of the former, Fig 2A, exhibits the functional relationship revealed by Eq 15, while the latter, Fig 2B, approximates the noise-to-signal ratio. In Section 9 in S1 Text, we demonstrate that Eq 15 is a good approximation to mse(τ) for τ � 0.2, particularly when the GWA study size n is large (in particular, see Fig S5 in S1 Text).
To find the GWA study size specific detection thresholds used in Fig 2A and 2B, we solve Equation 11 in S1 Text for a given effect size β, phenotypic variance V p , and significance threshold α, while varying the GWA study sample size. For β 2 = 0.01, V p = 1, and α = 10 −8 , the detection thresholds are d = 4142, 3340, 3290 in order of increasing sample size, which corresponds to sample allele frequencies of approximately 0.2, 0.02, amd 0.002, respectively. Thus, for a given effect size, larger sample sizes will lead to the detection of alleles at more extreme allele frequencies, while smaller samples will restrict detection to alleles at more intermediate frequencies.
Due to non-identifiability, the parameter choices are fairly arbitrary.
We find that for small mutation rates, the cumulative change in the mse, Δmse ℓ (τ), is mostly insensitive to differences in the GWA study sample size (Fig 2A and 2B). The approximation in Eq 15 helps to explain this result. The rate of increase is approximately proportional to 2aP ðd ' Þ t. For small mutation rates (a � 1) and an arbitrary detection threshold d ℓ , the probability of not detecting a locus as significantly associated with the trait is roughly 2aP ðd ' Þ � 1 for all sufficiently large n (Fig S1b in S1 Text). In this regime, increasing the GWA study sample size only yields small increases in the probability of detecting a locus as significant. Thus, for small mutation rates, the product of this quantity with the mutation rate is 2aP ðd ' Þ � a, and indeed, we observe a cumulative increase in mse ℓ (τ) that is OðaÞ for τ = 1 (Fig 2A). We note that increasing the GWA study sample size does enable detection of loci with smaller effects.
The result in Fig 2A, however, hides the fact that a small absolute increase in mse(τ) may correspond to a substantial increase in the noise-to-signal ratio. Indeed, for a = 10 −3 (blue lines throughout), mse ℓ (τ) ultimately exceeds the expected additive genetic variance E½V A' � for all GWA study sample sizes (Fig 2B). By τ = 0.2, a sampling time characteristic of ancient humans, mse ℓ (τ) due to allelic turnover is approximately 20% of the additive genetic variance E½V A' �. For sufficiently large τ, mse ℓ (τ) is at least the same order of magnitude as the expected additive genetic variance. In addition, while mse ℓ (τ) increases at approximately the same rate irrespective of study size, its initial value mse ℓ (0) is sample size dependent (Fig 2B and see Fig  S4b and S4e in S1 Text for a larger parameter space). Yet, for a given value of d ℓ , reductions in mse ℓ (0) mediated by sample size diminish once n is large enough (Fig S4b and S4e in S1 Text).
Further, Fig 2A obscures the fact that different mutation rates may yield similar noise-tosignal ratios. As discussed, for small a, mse ℓ (τ) increases with τ at a rate that is OðaÞ. For small a, the additive genetic variance is likewise OðaÞ, yielding a relative increase that is mostly insensitive to the mutation rate. Normalized mse ℓ (0) is also similar across small mutation rates (Fig S4b and S4e in S1 Text), rendering relative mse ℓ (τ) mostly insensitive to a. We thus omitted the other two mutation rates from Fig 2B. Lastly, we fix the GWA study sample size at n = 10 5 and vary the detection threshold d ( Fig  2C). Varying d while keeping n fixed is analogous to varying the true per-locus effect size β, or keeping β fixed while varying the significance threshold α. The minimum threshold is d = 10, whereas d = n = 10 5 maximizes mse ℓ (τ) sinceb ' would equal zero for all ℓ. Consistent with our analysis above, for small a, (i) mse ℓ (0) depends critically on d, while (ii) mse ℓ (τ)'s approximately linear growth rate is largely insensitive to d. Furthermore, by our previous arguments, relative mse ℓ (τ) is similar across small mutation rates, and they are also omitted in Fig 2C. For   Fig 2. Per locus contributions to the mean-squared error and estimated additive genetic variance across sample sizes, mutation rates, and detection thresholds. In (A), we plot the per-locus increase in mse, Δmse ℓ (τ), normalized by β 2 , for three mutation rates a = 10 −4 , 10 −3 , 10 −2 by color, and for the three sample sizes, n = 10 4 , 10 5 , 10 6 by shape, respectively. For a squared effect size of β 2 = 0.01, each sample size, in part, specifies a value of d ℓ , with d = 4142, 3340, 3290, or sample allele frequencies of approximately 0.2, 0.02, and 0.002, in order of increasing sample size. In (B-C), we restrict ourselves to a = 10 −3 as the lines for different mutation rates would otherwise largely coincide. In (B), we plot mse ℓ (τ) normalized by the expected additive genetic variance at stationarity, E½V A' � ¼ b 2 a=ð2a þ 1Þ. In (C), we fix n = 10 4 and vary the detection threshold over several orders of magnitude, d 2 {10, . . ., 10 5 }, plotting mse ℓ (τ) normalized by E½V A' �. In (D-F), we repeat (A-C), but for the statisticV A' ðtÞ, with the following exception: BecauseV A' ðtÞ decreases with τ, we plot the absolute value of its difference fromV independent and identically distributed (iid) loci and s 2 e ¼ 0, the per-locus mse ℓ (τ) values presented in Fig 2B and 2C are equal to the corresponding trait-wide statistics mse(τ).

Additive genetic variance
The per-locus contribution to the expected estimated additive genetic varianceV A ðtÞ is, whereẐðtÞ ¼ 1 2n a P n a i¼1 X i ðtÞ þ 1 ð Þ is the estimated allele frequency at τ, computed in a sample of n a ancient individuals. Whenb ' ¼ 0 or Z ℓ (τ) 2 {0, 1}, site ℓ will not contribute toV A ðtÞ. Thus, a site ℓ has a non-zero contribution to the estimated additive genetic varianceonly when it is segregating at both the present day and τ. This condition is necessary for botĥ Z ' ðtÞð1 ÀẐ ' ðtÞÞ > 0 andb ' 6 ¼ 0 to be true. As with the two previous statistics, we use the spectral representation of the tdf to derive an analytical expression forV A ðtÞ under our population genetic assumptions (Section 7.3 in S1 Text). The resulting expression, Equation 54 in S1 Text, indicates that the expected additive genetic variance decays exponentially. We then, to first order in the ancient sampling time τ, approximate the initial decrease in the per-locus estimated additive genetic variance DV A' ðtÞ, whereV A' ð0Þ isV A' ðtÞ evaluated at τ = 0 (Equation 77 in S1 Text);and 2P (d ℓ ) , defined in Eq 6, is the probability thatb ' ¼ 0. The factor due to finite sampling, 2n a /(2n a − 1), is �1 when the ancient sample size n a is large. Thus, apart from sign, DV A' ðtÞ is equal to Δmse ℓ (τ) of Eq 15. Therefore, for small τ,V A ðtÞ decreases at approximately the same rate as mse(τ) increases. This result further suggests that for a � 1 and a large GWA study size n,V A' ðtÞ=E½V A' � � 1 À mse ' ðtÞ=E½V A' � for small τ (Fig 2C and 2F). Although, this relationship trivially breaks down for large τ as mse ℓ (τ) is not bounded by one.
To compareV A' ðtÞ across mutation rates, we mirror our treatment of mse ℓ (τ) in the previous section. We plot (i) its increase DV A' ðtÞ ( Fig 2D); (ii)V A' ðtÞ normalized by the expectation of the true additive genetic variance at stationarity ( Fig 2E); and (iii) normalizedV A' ðtÞ, varying the detection threshold for a fixed GWA study sample size (Fig 2F). Akin to mse ℓ (τ), normalizedV A' ðtÞ is very similar across small mutation rates. And, while the GWA study size n and the detection threshold d influence the initial estimated additive genetic variancê V A' ð0Þ, its rate of change is mostly insensitive to the two GWA study parameters.
AsV A ðtÞ largely recapitulates our results for mse(τ) with opposing sign, we focus on their differences. Indeed, they have different functional forms and behave differently for modest or large τ (see Equations 50 and 54 in S1 Text, respectively). Conceptually, this discrepancy is not unexpected: In the previous section, we showed that a site only contributes to mse(τ) if its allele count falls outside the detection interval andb ' ¼ 0. Thus, mse(τ) increases with τ due to alleles shifting from intermediate frequencies in the ancient population to frequencies outside of the detection region in the contemporary population. For the expected estimated additive genetic varianceV A ðtÞ, the converse is true: The slope represents the decline inV A ðtÞ due to alleles changing from frequencies near or at fixation in the ancient population to frequencies within the detection interval in the contemporary population. While our results reveal similar functional behavior for these two quantities (with opposing signs) that applies for small τ, we caution that statements aboutV A ðtÞ do not immediately translate to statements about mse(τ), particularly for τ ⪆ 0.2.

Polygenic score accuracy
While our framework, in principle, encompasses a trait with varying effect sizes, we will first assume that all sites are iid with true effect size β. Our approximation to the expectation of the sample correlation coefficient simplifies to, where the compound parameter s 2 e 0 ¼ s 2 e =Lb 2 is the environmental variance normalized by the product of the number of loci in the mutational target L and the squared per-locus effect size β (Section 7.4 in S1 Text). By comparing Eq 18 with Eq 16, we can see that ρ 2 (τ) is closely related to the estimated additive genetic variance. Thus, likeV A ðtÞ, ρ 2 (τ) will decrease with τ due to loci having changed from frequencies close to zero or one in the ancient population to intermediate frequencies in the contemporary population. However, unlikeV A ðtÞ, ρ 2 (τ) does not depend on the ancient sample size. Therefore, to relate the two statistics, we multiply by the inverse of the ancient sample size dependent factor implicit inV A ðtÞ, For s 2 e ¼ 0, barring the sample size factor, Eq 19 is equal toV A ðtÞ normalized by the expected additive genetic variance. By extension, this quantity approximates the expected sample correlation coefficient r 2 (τ). By invoking our additional population genetic and GWA study assumptions, we arrive at an approximation for the decrease in polygenic score accuracy, Now, to relate our theory to empirical and simulation studies, we compute ρ 2 (τ) for a given narrow-sense heritability h 2 and mutation rate a pair. We define h 2 for a trait with a mutational target of L loci of equal effects β, where the equality follows from our population genetic assumptions. Together with a, h 2 fully specifies the compound parameter s 2 e 0 with, We plot our analytical expressions for both accuracy ( Fig 3A) and relative accuracy ( Fig  3B), defined as the ratio of ρ 2 (τ) to ρ 2 (0) for τ 2 [1, 0] spanning 2N generations. For humans, this time span corresponds to approximately 500,000 years in the past, encompassing the "Out-of-Africa" migration event estimated to have occurred 50,000-100,000 years ago [39]. As with the preceding statistics, when τ = 0, ρ 2 (τ) approximates the accuracy of the polygenic score within the GWA study population. Relative accuracy then directly measures reductions in accuracy relative to the GWA study population. We set h 2 = 0.5 and a = 10 −3 , and fix the GWA study sample size at n = 10 5 . We then compute ρ 2 (τ), varying the detection threshold over several orders of magnitude ( Fig 3A). (See Fig S6 in S1 Text for accuracy as a function of the fixation index, or F ST .) Our results for ρ 2 (τ) necessarily recapitulate those ofV A ðtÞ: While increasing the detection threshold d reduces accuracy substantially, it does not have a large impact on relative accuracy for n = 10 5 (Fig 3A). Indeed, for small mutation rates, relative accuracy is insensitive to the mutation rate and threshold, and is well approximated by e −τ (Equation 68 in S1 Text). Thus, its derivative is also exponential. Absolute accuracy ρ 2 (τ) likewise decays exponentially, but its derivative is scaled by a quantity that reflects features of the GWA study and the phenotypic variance. For a small mutation rate a � 1, its derivative is approximately 2P ðdÞ ða=ða þ s 2 e 0 ÞÞe À t , which, in turn, is approximately 2P (d) h 2 e −τ (Equation 67 in S1 Text). The latter expression suggests that the probability of not detecting a significant association P (d) and trait heritability h 2 are the key determinants of prediction accuracy. Importantly, ρ 2 (τ) declines considerably over the interval τ 2 [1, 0] irrespective of the detection threshold d.
In addition, we glean from Eq 18 that while heritability affects the magnitude of ρ 2 (τ) through the compound parameter s 2 e 0 , it does not influence the relative accuracy, consistent with previous results [16]. Our simulations suggest that this is also true of the sample correlation coefficient, as simulated estimates of r 2 (τ) agree extremely well with our theory for ρ 2 (τ) (Fig 3B). We note that this result is contingent on the fact that the environmental variance s 2 e only enters our simple threshold model in the specification of the threshold d (Equation 11 in S1 Text), and does not contribute directly to the variance of the polygenic score (Section 7.4 in S1 Text). Therefore, we expect this result to hold only for large GWA study sample sizes for which the threshold model is a good approximation to the distribution ofb. While the finding that relative accuracy is insensitive to the GWA study parameters relies on the assumption that all loci are iid and share a causal effect β, we provide preliminary theoretical evidence that our results will hold when β varies across loci (see Equation 69 in S1 Text and ensuing comments).

Simulation results for recent directional selection
We use simulations to explore if and how the statistics under study deviate from their neutral expectations in the presence of recent directional selection. Each copy of the A 2 allele at the ℓth site confers a fitness advantage of +s ℓ , and so the fitness ratio of the three possible genotypes A 1 A 1 :A 1 A 2 :A 2 A 2 is 1:(1 + s ℓ ):(1 + 2s ℓ ). In our simulations, the population evolves neutrally until the onset of selection at N generations (or τ s = 0.5 in coalescent time units) before present. Thereafter, the population evolves according to discrete Wright-Fisher dynamics with selection.
In the presence of selection, the allele frequency distributionis no longer symmetric; rather, it is skewed toward the beneficial allele. The severity of the skew depends on the selection coefficient and mutation rate, as well as the amount of time that selection has been acting. As we restrict s ℓ to positive values, designatingthe A 2 or + allele as beneficial, the allele frequency distribution will be skewed toward one. If we instead designated the A 1 allele as the beneficial allele, the allele frequency distribution would be skewed toward zero. The former models "positive" selection whereas the latter models "negative" selection. Because bias(τ) is proportional to β, its sign will be sensitive to this choice, but its magnitude will be unaltered. The other statistics will not be affected as long as the detection thresholds are symmetric. Therefore, our results are general up to the sign of bias(τ).
We conduct simulations over a range of selection coefficients, σ = 4Ns 2 {0, 0.1, 1, 10}, for a mutation rate of a = 10 −3 . Under directional selection, σ is proportional to the locus effect size β; mutations with larger effect sizes will be more likely to establish and achieve appreciable frequencies [40]. In addition, we plot results for two different detection thresholds, d 2 {10 3 , 10 4 }, in a GWA study sample of size n = 10 4 . More details on the simulation procedures are provided in Section 3 in S1 Text.
When σ � 1, the polygenic score is biased towards positive values for τ > 0 for both detection thresholds (Fig 4A). In other words, with directional selection acting to increase the trait value,Ŷ ðtÞ tends to overestimate Y(τ). The magnitude of bias ℓ (τ) depends critically on the strength of selection relative to mutation: We observe a larger bias for σ = 10 relative to σ = 1, and likewise the bias is larger for σ = 1 relative to σ = 0.1. In fact, the smaller selection coefficient σ = 0.1 is not distinguishable from neutral expectations. For 0 � τ < τ s , bias ℓ (τ) increases at an accelerating rate; for τ � τ s , bias(τ) appears constant in this parameter regime.
A higher detection threshold decreases the detection probability. Thus, we expect that the magnitude of bias ℓ (τ) will increase with the detection threshold. Indeed, bias ℓ (τ) is larger and increases more quickly for the larger detection threshold d = 10 4 compared to d = 10 3 (Fig 4A). Further, our simulations suggest that the detection threshold coupled with the time of the onset of selection govern the magnitude of the bias for τ > τ s . For some large τ, bias ℓ (τ) will reach an equilibrium value that depends approximately on the asymmetry of the detection thresholds at the present day, which in turn, depends on both the timing and strength of selection (Section 10 in S1 Text).
The underlying allele frequency dynamics provide some insight into these patterns. Before the onset of selection, the allele frequency distribution is stationary and symmetric around 0.5. After the onset of selection, trait-increasing alleles tend to increase in frequency, skewing the distribution toward one. Thus, alleles not detected in the GWA study will tend be at higher versus lower frequencies at t = 0, yielding E½ � X ' jb ' ¼ 0� > 0 for σ > 0. For large τ, the allele frequencies of sites not detected in the GWA study, i.e., withb ' ¼ 0, may have been substantially different in the ancient population. Each one of these sites will make a contribution to bias(τ) 11). Looking backward in time, the shift in the allele frequency distribution ensures that the conditional expectation of X ℓ (τ) is smaller than that of � X ' , yielding a positive bias ℓ (τ) for τ > 0. Notably, the magnitude of bias ℓ (τ) induced by selection is several orders of magnitude larger than that induced by asymmetry in the detection threshold alone (Fig S1a in S1 Text).
The effects of selection on mse ℓ (τ) are qualitatively consistent with those on bias ℓ (τ) ( Fig  4B). Although, here, the only selection coefficient which induces significant deviations from neutral expectations is σ = 10. And, mse(τ) is larger for d = 10 4 compared to d = 10 3 . As with bias(τ), for 0 � τ < τ s , mse ℓ (τ) increases at an accelerating rate; before τ s (τ � τ s ), mse ℓ (τ) appears to increase linearly. Values of σ < 10 do not induce noticeable deviations from neutrality for the correlation coefficient ρ 2 (τ) either ( Fig 4C). However, strong selection (σ = 10) does lead to substantially larger reductions in accuracy relative to our neutral expectations. In addition, for σ = 10, relative accuracy is sensitive to the detection threshold, with accuracy decreasing faster for the larger detection threshold (Fig 4D).

Discussion
In this work, we devised a theoretical framework to quantify the effect of allelic turnover on the error and accuracy of out-of-sample polygenic scores. Unlike previous theoretical approaches [16,27], we averaged over the evolutionary process governing trait evolution, the GWA study from which a polygenic score model is constructed, and the ancient individual's genotype and phenotype. In doing so, we found explicit expressions for several commonly used metrics that depend on the focal individual's sampling time, as well as the parameters governing the population genetic dynamics and power to detect trait-associated loci in the GWA study. Mathematical properties of the recurrent mutation model at stationarity enabled us to compute analytical expressions for the metrics of interest under neutrality, and approximations thereof.
Our analytical expressions suggest that allelic turnover alone may be responsible for large reductions in accuracy: For small mutation rates, ρ 2 (τ) (and r 2 (τ)) decreases substantially within short time-spans, by about 20 percent in 0.2N generations (corresponding to approximately 120,000 years in humans). In addition, increasing the detection threshold yielded lower polygenic score accuracy, as a locus was less likely to have a non-zero effect. These results are broadly consistent with a concurrent study by Yair and Coop [41], in which the authors used simulations to assess cross-population prediction accuracy, defined as the ratio of the variance of and individual's polygenic score to that of their genetic value, under neutrality and in the presence of stabilizing selection. When Yair and Coop restricted the polygenic score to the top one percent of SNPs, roughly analogous to altering the detection threshold, they similarly found that the accuracy declined in the focal population.
Yet, while the detection threshold influenced the magnitude of the polygenic score accuracy, relative accuracy was insensitive to this parameter. In other words, under neutrality, relative accuracy is insensitive to the magnitude of the per-locus effect and only depends on the underlying allele frequency distribution. In addition, relative accuracy was independent of the size of the mutational target when the constituent loci were iid. Our theory suggests that these results will hold for arbitrary distributions of the true effect β. Consideration of several effect size distributions in a parameter regime consistent with the UK Biobank further supports this conjecture (Section 8 in S1 Text). Although more work is required to fully substantiate this claim.
Selection, however, induces a dependency between an allele's effect and its frequency, and may thereby render relative accuracy sensitive to the detection threshold. Our simulations provide preliminary evidence in support of this claim. For a small mutation rate of a = 4Nμ = 10 −3 and a large per-locus selection coefficient σ = 4Ns = 10, relative accuracy was lower for the larger detection threshold of d = 10 4 compared to d = 10 3 . Yet, the difference between detection thresholds was small relative to that induced by selection, and was negligible for smaller selection coefficients. Indeed, smaller selection coefficients (σ � 1) did not yield appreciable deviations from our neutral expectations for the mse, accuracy, nor relative accuracy. Therefore, excluding strong selection (σ > 1), our neutral expectations for these statistics appear to be good approximations to their true values. Our theoretical results under neutrality thus may prove an accurate description of temporally-resolved polygenic scores when polygenic adaptation is achieved by concurrent small frequency changes at numerous small effect loci-a plausible scenario [28,35]. In addition, the simple patterns revealed by our simulations suggest that it may be possible to derive (approximate) analytic expressions for the given metrics in the presence of strong selection, when loci exhibit selective sweep-like behavior.
It is unclear whether our neutral expectations will hold in the context of more sophisticated polygenic trait modeling. In our simulation study, as in our theoretical work, we focus on dynamics at a single locus. Thus, our results are most relevant to scenarios in which single locus dynamics can be decoupled from the evolution of the mean phenotype and the genetic background [40]. Namely, the effect of an individual locus must be small relative to the mean phenotype [38,40]. Future work will assess polygenic score accuracy under more sophisticated models of polygenic adaptation (e.g., [38,42]).
Of the two bias-inducing processes explored, detection threshold asymmetry and directional selection, the latter induced much larger deviations from our neutral expectation for the bias, i.e., under neutrality bias(τ) = 0 for all ancient sampling times τ. In the presence of detection asymmetry, bias(τ) is approximately proportional to the difference between the one-sided detection probabilities, which in turn is constrained by the shape of the allele frequency distribution. Under neutrality, and for small mutation rates, most alleles are at very low frequencies or fixed, such that changing the detection threshold minimally influences the one-sided detection probabilities. Selection, however, perturbs the underlying allele frequency density. At equilibrium, this density is proportional to e σz z −1 (1 − z) −1 for small a, where σ = 4Ns. Depending on σ, the one-sided detection probabilities may differ markedly, yielding larger values of bias(τ). We thus suspect that detection asymmetry has the potential to further exacerbate any bias induced by selection. These results are interesting in light of those of Chan et al. 2014 [34], who demonstrated that polygenic disease inheritance under the liability threshold model induced differences in the power to detect protective versus susceptible alleles. In Chan et al., this effect was further increased by imbalances in the case and control sample sizes in the GWA study. Additional work is needed to incorporate these features of case-control studies into our modeling framework.
The effects of selection on the bias have implications for assessments of mean differences between ancient polygenic scores from distinct time points. In particular, our results suggest that sufficiently strong positive directional selection will lead to overestimation of the difference between the polygenic scores of ancient individuals sampled before and after the onset of selection. Likewise, in the presence of negative selection, the polygenic score will underestimate this difference. At the same time, as discussed above, estimation error increases (as measured by mse(τ)) and accuracy (as measured by ρ 2 (τ)) decreases as the ancient sampling time increases.
Our results clarify relationships between various commonly used metrics of prediction error and accuracy. For example, we demonstrated an approximate functional relationship between the mean-squared error mse(τ) and the expected additive genetic varianceV A ðtÞ that applies for small ancient sampling times and mutation rates. This shared initial rate emerged despite fundamental differences between these statistics: mse(τ) measures error due to variants near or at fixation in the contemporary sample, which were segregating at intermediate frequencies in the ancient sample. In contrast,V A ðtÞ measures error due to variants segregating in the contemporary sample, which were near or at fixation in the ancient sample. This conceptual result does not rely on any of our population genetic or GWA modeling assumptions, and perhaps could be exploited to learn about the genetic architecture of quantitative traits from multi-population data. In addition, we showed formally that polygenic score accuracy ρ 2 (τ), an approximation to the expectation of the sample correlation coefficient r 2 (τ), is proportional to the ratio ofV A ðtÞ to the total phenotypic variance. We believe that these relations, and their evolutionary and GWA study dependent forms, may facilitate the development of novel, more principled statistical procedures for the analysis of out-of-sample polygenic scores.
At the same time, the simplifying assumptions underlying our results indicate that significant challenges remain. For one, our model does not incorporate the complex demographic processes, such as admixture and population size changes, inherent in human history. This implies that an ancient sampling time of t years in the past likely does not correspond to a sampling time of τ = t/2N in our model, where 2N is the contemporary population size. Indeed, allelic turnover cannot explain all of the reductions in accuracy observed in out-of-sample predictions in humans. For example, our neutral theory predicts an approximately fifty percent reduction in accuracy when F ST between the focal and GWA study populations is comparable to African-European divergence (F ST � 0.1). This more severely overestimates the prediction accuracy of height in a sample of individuals with African ancestry compared to the Wang et al. predictions, which take into account both LD and allele frequency changes (Section 12 in S1 Text). Thus, to achieve the same accuracy reductions observed in both simulated, e.g., [15,16] and empirical, e.g., [14,16,43], studies of cross-population polygenic scores for contemporary humans, allelic turnover under neutrality would require population divergence times that far exceed their estimated values (Fig S7 in S1 Text).
Differences in LD between contemporary human populations may largely explain this discrepancy as most trait-associated loci are likely to be tagging rather than causal sites [12,16]. As with geographically distinct populations, if LD between the genotyped and causal sites differed in the ancient population, then polygenic score accuracy would suffer [1]. We did not model this effect and assumed that the genotyped site was the causal site. This assumption may be justified when ancient sampling or population divergence times are recent, as high marker density in the GWA study may mitigate accuracy losses due to LD decay, but more theoretical work is required to substantiate this claim. While our framework can readily incorporate LD, it is difficult to obtain analytical results when the genotyped marker is not the causal site. In lieu of theoretical results, large-scale simulations in simple population genetic scenarios may provide insight into the relative contributions of LD-which depends on the allele frequencies of the tagging and causal sites-and allelic turnover to declines in polygenic score accuracy.
Furthermore, our assumption of linkage equilibrium between loci roughly equates to assuming that each LD block contains only a single causal site. Thus, our results will be most applicable to traits with relatively sparse genetic architectures for which the distance between any two causal loci is large compared to the scale of LD. In contrast, when the trait architecture is dense, a large number of variants have non-zero effect on the trait. Causal sites in close proximity are necessarily linked, and our assumption of linkage equilibrium would be violated. In addition, under a dense trait architecture, the "prune and threshold" polygenic score described herein may achieve lower accuracy than a best linear unbiased predictor (BLUP) that allows all segregating loci to have non-zero effects. In Section 4 in S1 Text, we speculate on the accuracy of BLUP in the context of our modeling framework when the trait has a dense architecture.
In addition, we assumed that per-locus causal effects were shared by the ancient and contemporary samples. Differences in causal effects across contemporary populations, perhaps due to changes in the environment, epistasis, or gene-by-environment interactions, likely contribute to accuracy reductions [8,12]. Indeed, Cox et al. [18] found that trends in the polygenic scores of temporally disparate ancient samples did not always recapitulate those of the true phenotype. We conjecture that fluctuations in the per-locus effects would increase mse(τ) and decrease accuracy, but not profoundly alter our conclusions. Perhaps, if the fluctuations were asymmetric, e.g., effect sizes tended to increase in time, then bias(τ) may be non-zero under neutrality. Population stratification in the GWA study population may also lead to biased ancient polygenic scores, as has been observed in cross-population predictions in humans [9,10]. Lastly, technical challenges inherent to the extraction and sequencing of ancient DNA often result in noisy estimates of the ancient genotypes. This additional source of randomness is likely to reduce accuracy and increase mse(τ), but otherwise should not substantially alter our conclusions.
Supporting information S1 Text. Extended model, methods, and results. This supplementary text contains detailed derivations and additional analyses. (PDF)