The time and place of European admixture in Ashkenazi Jewish history

The Ashkenazi Jewish (AJ) population is important in genetics due to its high rate of Mendelian disorders. AJ appeared in Europe in the 10th century, and their ancestry is thought to comprise European (EU) and Middle-Eastern (ME) components. However, both the time and place of admixture are subject to debate. Here, we attempt to characterize the AJ admixture history using a careful application of new and existing methods on a large AJ sample. Our main approach was based on local ancestry inference, in which we first classified each AJ genomic segment as EU or ME, and then compared allele frequencies along the EU segments to those of different EU populations. The contribution of each EU source was also estimated using GLOBETROTTER and haplotype sharing. The time of admixture was inferred based on multiple statistics, including ME segment lengths, the total EU ancestry per chromosome, and the correlation of ancestries along the chromosome. The major source of EU ancestry in AJ was found to be Southern Europe (≈60–80% of EU ancestry), with the rest being likely Eastern European. The inferred admixture time was ≈30 generations ago, but multiple lines of evidence suggest that it represents an average over two or more events, pre- and post-dating the founder event experienced by AJ in late medieval times. The time of the pre-bottleneck admixture event, which was likely Southern European, was estimated to ≈25–50 generations ago.

disproportionately retaining segments of Northern European origin, thus wrongly localizing the EU segments even if the true source is Southern Europe. To guarantee the unbiased nature of our pipeline, we therefore did not filter any SNPs in all subsequent analyses.

Section 2: PCAMask
PCAMask is a software tool that performs principal component analysis restricted to the SNPs in each individual that derive from a specific ancestry [5,6]. In theory, such a tool should be able to pinpoint the subcontinental ancestries of admixed individuals, but the utility of PCAMask on admixture between closely related populations was unknown. Running PCAMask on the AJ genomes (along with the reference panels described in the main text), we found that occasionally, the European component of the AJ genomes clustered around Southern Europe and that the Middle-Eastern component of the AJ genomes clustered around the Levant region, in concordance with the results presented in the main text. Nevertheless, we did not include these results due to a number of technical issues (see also [7]). Specifically, we found that in certain situations, the algorithm did not reach convergence and some AJ individuals were localized far away from the main AJ cluster. In addition, we found that the program did not appear to control for the number of admixed individuals: we noticed that increasing the number of AJ individuals led to their inconsistent placement. Finally, we compared the clustering of the reference EU and ME individuals between PCAMask and the commonly used SmartPCA tool [8], and noticed discrepancies in the clustering pattern. We therefore leave a more rigorous interpretation of PCAMask's results to future work.

Confidence intervals for the inferred ancestry proportions
To obtain confidence intervals for the ancestry proportions of each EU region inferred in Figure 2 of the main text, we resampled AJ individuals 1000 times with replacement, and used linear regression in the region near the real AJ data point to obtain the simulated values matching each bootstrap iteration. We used a similar procedure for the admixture time estimated in Figure 3 of the main text. We stress that this procedure accounts only for the sampling noise, but the magnitude of other biases is clearly higher. For example, note the difference in the Southern EU ancestry proportion between panels (A) and (B) of Figure  2 (31% vs 37%, respectively), as well as the results from GLOBETROTTER.

The proportion of Levant ancestry
When generating Figure 2 of the main text, we fixed the Levant ancestry to 50% and varied the amount of the different European ancestries. To determine whether our results are sensitive to the assumed proportion of Levant ancestry, we fixed the proportion of Western and Eastern European ancestry to 8% each, and varied the proportion of Southern European ancestry between 20% and 80% in increments of 5%, with the remaining ancestry being Levantine. The best match to the AJ data (in terms of the proportion of chromosomes classified as Southern European) was obtained for Levant ancestry proportions of 49% (leaving Southern EU with 35%). In another experiment, we fixed the Levantine ancestry to 60% and varied the Southern EU ancestry in increments of 5% (the remaining ancestry being Eastern European). The best match to the AJ data was found at 30% Southern European ancestry, close to the 34% inferred using the original pipeline.

The contribution of Iberia
We removed Iberia from our reference panel at an early stage of the analysis. To directly estimate the Iberia ancestry proportions in AJ, we simulated admixed genomes with Levant ancestry proportions of 50%, Southern EU proportions of 34%, and varying Western/Eastern EU and Iberia ancestry between 0 and 16%. The closest match to the real AJ data (in terms of the proportion of chromosomes classified as Iberia) was obtained when the Iberia component was 2%.
Exclusion of the true ancestral source from the Middle-Eastern or European reference panels Here, we study the robustness of our results to not having available samples from the precise ancestral source of AJ. This is the case if the original source population has gone extinct, experienced strong genetic drift, absorbed migration since the time of admixture, or otherwise was missed from our sample. To investigate the expected effect on our results, we study the case when the true source used for simulating the admixed genomes is removed from the reference panel.
For the case of the Middle-East, we arbitrarily selected Jordanians, and simulated admixed genomes with 50% Jordanian ancestry, along with 34% Southern EU ancestry, 8% Western EU ancestry, and 8% Eastern EU ancestry (the ancestry proportions inferred in the main text, Figure 2). We then excluded Jordanians from the ME reference panel, and attempted to reconstruct the ancestry proportions from each European source (assuming the total EU ancestry was 50%), following precisely the pipeline described in Figure 2 for the real AJ data. Our estimate for the Southern European ancestry proportion was 32.5%, very close to the true simulated proportion (34%). We thus conclude that our inference pipeline is reasonably robust to exclusion of the precise ancestral ME source from the panel.
For the case of Europe, we focused on Western EU, since our sampling there was relatively sparse (and specifically, did not include Germany). We simulated admixed genomes with 50% French and 50% Levant ancestry, and then removed French from the Western EU panel. The proportion of chromosomes whose European component was correctly classified as Western EU was 49%, compared to 33% classified as Southern EU. (In comparison, under a simulation of 50% Southern EU and 50% Levant ancestries, the proportion of chromosomes classified as Southern EU was 53% compared 29% Western EU.) Thus, even in the absence of the specific ancestral source from the Western European reference panel, we were still able to correctly infer its regional affiliation.

The simulated admixture time
In the simulations used in Figure 2 of the main text, we assumed that the admixture time was 30 generations ago, which is the value we estimated both here and previously [9]. To determine the robustness of our results to deviations from this estimate, we repeated the inference procedure, but when setting the simulated admixture time to 20 generations. We also set the admixture time parameter of RFMix to 50 generations. We fixed the Levantine ancestry to 50% and increased the Southern EU ancestry fraction in increments of 2% (the remaining ancestry was 8% Western EU and 8% Eastern EU). We found that the best fit to the AJ data was obtained at Southern EU ancestry of 38%, close to the originally inferred proportions of 34%.

Section 4: The IBD sharing analysis
The coalescence time of an IBD segment The IBD sharing analysis described in the main text assumed that given that a site is in an IBD segment, it coalesces around the time of the bottleneck. The exact posterior distribution of the coalescence time of a segment of length (in Morgans) between m1 and m2 is given by (e.g., [10,11]) where NA is the ancestral population size (in diploids), h(t) is the coalescence probability per generation (more precisely, the inverse of the population size when scaled by 2NA), and the time t is scaled by 2NA. For a bottleneck that has reduced the population size from 10k to 300 individuals 30 generations ago and was followed by a rapid expansion to 1M individuals, as inferred for AJ [9,11], we find that coalescence times of segments of length [3,7]cM are narrowly distributed, with ≈86% of events taking place within [20,40] generations ago. This suggests that the ancestry of IBD segments reflects predominantly the ancestry during the generations close to the bottleneck. Information on deviations from this assumption is encoded in the lengths of the segments and may be modeled in future work.

The dependence of the errors on the ancestry
The IBD sharing analysis relies on the assumption that false positive IBD and/or uninformed LAI are independent of the ancestry. To determine whether IBD detection accuracy varies across ancestries, we plotted, in S7 Figure, the Haploscore for each segment against its ME ancestry, averaged over the four haplotypes involved, and observed that the Haploscore was nearly constant across the entire range of ME ancestries. To investigate how the LAI error varies across ancestries, we simulated admixed genomes with ancestry proportions 50:34:8:8% Levantine, Southern EU, Eastern EU, and Western EU, respectively. We then ran RFMix and determined the proportion of SNPs correctly classified as either ME or EU. The Levant SNPs were classified correctly 70.5% of the time, compared to 68.8% for the EU SNPs, which supports using an ancestry-independent error rate. A subtle caveat is, though, the relatively low classification accuracy of Southern EU SNPs: 62.8%, compared to 80.1% and 82.3% for Eastern and Western EU, respectively. To model the effect of this result on our estimate of the pre-bottleneck ancestry proportions, and assuming that pre-bottleneck gene flow was mostly Southern European ( Figure 5 of the main text), denote the true pre-bottleneck ME ancestry as fME, the estimated pre-bottleneck ME ancestry as gME (58%, see Results in the main text), the probability of correctly classifying a ME SNP as p(ME→ME) (70.5%), and the probability of incorrectly classifying a Southern EU SNP as ME as p(SEU→ME) (37.2%). The following equation then approximately holds: gME=fME•p(ME→ME)+(1-fME)•p(SEU→ME). Solving for fME, we obtain fME=62%, higher than the uncorrected estimate of 58%. Thus, our approach is conservative, in the sense that if biased, it has likely underestimated the evidence for an elevated pre-bottleneck ME ancestry, and consequently, the evidence for post-bottleneck European gene flow.

Comparing EU ancestry proportion estimates between RFMix and GLOBETROTTER
The estimate of the total EU ancestry from the RFMix analysis was 53%, consistently with our previous estimate of ≈50-55% based on whole-genome data [9], the estimate from the f4 analysis (when corrected by simulations), and the experiment mentioned in Supplementary Text S3 above ("The proportion of Levant ancestry"). In contrast, the estimate from GLOBETROTTER [12] was 70%, among which 55% was Southern European. We find that reconciling these estimates is difficult, as evidence exists to support both the LAI-based estimate and the GLOBETROTTER based estimate.
To test GLOBETROTTER, we simulated individuals with ancestry proportions 8% Western EU, 8% Eastern EU, 34% Southern EU, and 50% Levant, with admixture happening 30 generations ago. GLOBETROTTER was able to recover all proportions within ±1% of the simulated ones. For simulations with ancestry proportions 70% Southern EU and 30% Levant, the GLOBETROTTER-inferred EU proportions were slightly overestimated at 73%. Thus, given GLOBETROTTER's estimated 70% EU ancestry in AJ, the implied true EU ancestry in AJ would be 67%. On the other hand, RFMix's inferred proportions were underestimated at 62%. However, the bias for simulated 50% Southern EU and 50% Levant ancestries was much lower, with RFMix's inferred EU proportions at 48%.
In conclusion, there remains some uncertainty regarding the amount of EU ancestry in AJ, to be further investigated in future studies. It seems plausible that the true EU ancestry proportions are midway between RFMix's and GLOBETROTTER's estimates. For most of this paper we assumed the RFMix estimate (≈55%), as (i), it is supported by other lines of evidence; and (ii) the results from the two modes of GLOBETROTTER were discordant (see below).

GLOBETROTTER-inferred admixture parameters on simulated data
We used simulations to test the ability of GLOBETROTTER to jointly infer admixture time and sources [12]. The simulated individuals had 70% Southern EU and 30% Levant ancestries, with admixture occurring 30 generations ago. GLOBETROTTER inferred two sources: the first, comprising of 39% of the total ancestry, was a mixture of 15% Southern European ancestry and 85% Levant ancestry; the second source was 1% Eastern European, 28% Western European, and 71% Southern European. Thus, the true Southern EU ancestry proportions were not properly recovered (inferred 49% vs simulated 70%), although the global EU ancestry was correctly inferred (67% vs simulated 70%). The inferred admixture time was overestimated at 40 generations.

The number of admixture events
GLOBETROTTER is able to infer multiple admixture events, although for AJ, the inferred history included only a single event. This might be at odds with our hypothesis (supported by the IBD analysis) of prebottleneck admixture with Southern Europeans followed by post-bottleneck admixture with (possibly) Eastern Europeans. However, we note that one source of ancestral population inferred by GLOBETROTTER is a mixture of Southern EU and Levant, which may correspond to the earlier event we have identified. It is also possible that the two events are too close to be teased apart, or that the inference of the admixture times is confounded by the severe AJ bottleneck [12].
Bibliography S1 Text section 6: The distribution of ancestry proportions under two-wave admixture 1 The distribution of ancestry proportions under general distributions of segment lengths In the main text, we considered a simple admixture pulse model, under which the distribution of segment lengths in A and B is exponential with rates (1 − q)t and qt, respectively. Under that model, the distribution of ancestry proportions was available in a closed form. Under a more complex admixture history, we assume that the distribution of the length of A and B segments take the general form g A ( ) and g B ( ). We still assume that A and B segments are independent (see below). The process can then be modeled as a two-state process. We start on the left end of the chromosome in state A or B with probabilities p A = A / ( A + B ) and 1 − p A , respectively (where A and B are the mean segment lengths), and draw a random segment length from the selected ancestry. When the first segment terminates, we switch ancestries and draw a segment length from the other ancestry, and so on until we reach the end of the chromosome. The distribution of x, the A ancestry proportion, can be computed in Laplace space by extending renewal theory methods developed in the physics domain (e.g., [1,2]). Let s be the Laplace pair of L (the total chromosome length) and u the Laplace pair of L A = xL (the total chromosome length covered by A segments). We then transform the density f (L A ; L) (from which the density of x can be easily obtained) tof (u; s). After some calculations using renewal theory, we eventually obtain, In the above equation,ĝ A (s) andĝ B (s) are the Laplace transforms ( → s) of g A ( ) and g B ( ). The details of the derivation are somewhat tedious and are therefore omitted. It can be shown, using Eq. (1), that the mean ancestry proportion x approaches p A as L → ∞. It can be also shown that Eq. (1) reduces to Eq. (5) in the main text for the admixture pulse model. 1

Conditions under which consecutive segments are independent
To study complex admixture histories, we use the model developed by Gravel [3] (section General incoming migration in the absence of drift and Figure 3 therein). Gravel proposed that the ancestry along the chromosome could be described by a Markov process, whose states correspond to the identity of the source population (i.e., A or B), combined with the time when each segment entered the admixed population. Gravel then derived the transition rates for any general admixture history. While the extended state space process is Markovian under any history, consecutive A and B segment lengths are no longer independent. However, further examination demonstrates that as long as migration beyond the initial event is limited to just one population, consecutive segment lengths remain independent.

A two-wave admixture model
Consider a model where populations A and B have merged t 1 generations ago, contributing proportions q and 1−q to the admixed population. Then, t 2 (< t 1 ) generations ago, migrants from population A have replaced a proportion µ of the gene pool of the admixed population. No other events then take place until the present. The corresponding Markov process, using Gravel's method [3], has three states: A 1 , A 2 , and B, representing migrant segments from A at time t 1 , from A at time t 2 , and from B (at time t 1 ), respectively. Let us compute the distributions of the lengths of A and B segments. The transition rate is t 1 when at states A 1 and B, and t 2 when at A 2 . It can be shown that once a transition is made, the next state is chosen according to the following transition probability matrix The states are ordered as (A 1 , A 2 , B) and P ij (i, j = 1, 2, 3) is the probability to jump from state i to state j. Note that we neglected the first generation after admixture, during which A and B segments do not yet mix [3].
It is now easy to see that B segment lengths are distributed exponentially with rate t 1 (1 − P B,B ), or This equation was also (implicitly) derived in [4] in a different way. For the A segments, define g A1 ( ) as the distribution of A segment lengths, when the process entered the A states at state A 1 , and similarly for g A2 ( ). Since the process enters A 1 and A 2 from B (with the possible exception at the leftmost end of the chromosome), the distribution of A segments satisfies To find g A1 ( ) and g A2 ( ), we can write integral equations, We solved these equations by Laplace transform ( → s). Using the convolution theorem,ĝ A1 (s) = These are two linear equations in two variables (ĝ A,1 (s) andĝ A,2 (s)), which are easily solved. Then, g A,1 ( ) and g A,2 ( ) are obtained by Laplace transform inversion. We then use Eq. (4) to obtain g A ( ). We carried out these steps in Mathematica, leading to the final result, where and β. Now that we have g A and g B (Eqs. (7) and (3), respectively), we can use Eq.
(1) for the distribution of the ancestry proportions. We invertedf (u; s) with respect to u using Mathematica and then numerically with respect to s to obtain f (x; L).

Simulation results and fitting
We ran simulations of the Markovian Wright-Fisher model described by Gravel [3]. The model assumes 2N haploid individuals (chromosomes). Each chromosome in the current generation is formed as a mixture of the chromosomes of the previous generation. Ancestry changes occur as a Poisson process with rate 1 (per Morgan), and at each ancestry change, the ancestral chromosome is chosen randomly out of all 2N available chromosomes. In the pulse admixture model, each chromosome in the first generation is assigned to population A or B with probabilities q and 1 − q, respectively, and the evolution of the chromosomes is traced for t generations. The two-wave model is the same (with overall time t 1 ), except that at t 2 generations ago, each chromosome is replaced by a whole-A chromosome with probability µ.
Representative simulation results are shown in Supplementary Text Figure  1. It can be seen that our theory matches the empirical data very well. However, the empirical distribution can also be fitted well by a distribution corresponding to an admixture pulse model, with parameter q pulse close to the expected mean (µ + q(1 − µ)) and t pulse intermediate between t 1 and t 2 . This suggests that inference based on the more complex model may not have sufficient evidence, under some conditions, to justify the additional admixture event. We simulated a two-wave admixture model according to a Markovian Wright-Fisher model [3] with N = 2500. The other model parameters are indicated in the title of the figure. We recorded the fraction of each chromosome that descends from the A population, and plotted the histogram of the ancestry proportions (circles). The theory that we developed (Eqs. (1), (3), and (7)) is plotted as a solid (blue) line. We then fitted a pulse admixture model with just two parameters (q and t), by matching the mean and variance of the empirical data. The distribution of the ancestry proportions under the pulse model (Eq. (4) in the main text) is plotted as a dashed (purple) line. The best fit for t was 9.7, intermediate between t 1 and t 2 .