Inferring the ancestry of parents and grandparents from genetic data

Inference of admixture proportions is a classical statistical problem in population genetics. Standard methods implicitly assume that both parents of an individual have the same admixture fraction. However, this is rarely the case in real data. In this paper we show that the distribution of admixture tract lengths in a genome contains information about the admixture proportions of the ancestors of an individual. We develop a Hidden Markov Model (HMM) framework for estimating the admixture proportions of the immediate ancestors of an individual, i.e. a type of decomposition of an individual’s admixture proportions into further subsets of ancestral proportions in the ancestors. Based on a genealogical model for admixture tracts, we develop an efficient algorithm for computing the sampling probability of the genome from a single individual, as a function of the admixture proportions of the ancestors of this individual. This allows us to perform probabilistic inference of admixture proportions of ancestors only using the genome of an extant individual. We perform extensive simulations to quantify the error in the estimation of ancestral admixture proportions under various conditions. To illustrate the utility of the method, we apply it to real genetic data.

1. Delete the SNPs that have zero or close to zero recombination fractions with their immediate neighboring SNPs.
2. Choose an allele frequency difference threshold $d_f$. Delete SNPs whose allele frequency difference between the two ancestral populations is less than $d_f$.
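The two trimming steps above can be sketched as follows. This is a minimal illustration, not PedMix's own code: the function name `frequency_prune`, the cutoff `min_recomb` used for "close to zero", and the convention that `recomb_to_prev[i]` holds the recombination fraction between SNP `i` and SNP `i - 1` are all assumptions made for the example.

```python
def frequency_prune(freq_a, freq_b, recomb_to_prev, d_f=0.5, min_recomb=1e-12):
    """Return the indices of SNPs kept after the two trimming steps.

    freq_a, freq_b  -- allele frequencies in ancestral populations A and B
    recomb_to_prev  -- recombination fraction between SNP i and SNP i - 1
                       (entry 0 is ignored); this layout is an assumption
    d_f             -- allele frequency difference threshold
    min_recomb      -- hypothetical cutoff for "close to zero" recombination
    """
    kept = []
    for i, (fa, fb, r) in enumerate(zip(freq_a, freq_b, recomb_to_prev)):
        # Step 1: drop SNPs with (near-)zero recombination to the neighbor.
        if i > 0 and r <= min_recomb:
            continue
        # Step 2: drop SNPs that differ too little between the populations.
        if abs(fa - fb) < d_f:
            continue
        kept.append(i)
    return kept


example_kept = frequency_prune(
    freq_a=[0.9, 0.5, 0.1, 0.8],
    freq_b=[0.1, 0.4, 0.9, 0.2],
    recomb_to_prev=[0.0, 1e-4, 1e-4, 0.0],
)
```

In this toy input, the second SNP is dropped for its small frequency difference and the fourth for its zero recombination fraction.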

S1.1.2 LD pruning
Before LD pruning, rare variants whose combined minor allele frequency in the two ancestral populations is lower than $f$ are removed. We use the squared correlation coefficient of linkage disequilibrium, $r^2$, in the ancestral populations to measure the level of linkage disequilibrium between two SNP sites. We scan through the SNPs sequentially. If $r^2 > c$ (the default value of $c$ is 0.1) between the current SNP and the previous SNP within a window of length $W = 10$ kbp, in either of the two ancestral populations, then the current SNP is removed.
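A minimal sketch of this sequential scan is given below. It makes several assumptions that the text does not fix: the "previous SNP" is taken to be the most recently kept SNP, the "combined" minor allele frequency is computed by pooling the haplotypes of both populations, and alleles are coded 0/1. The function names are hypothetical.

```python
def r_squared(x, y):
    """Squared LD correlation between two biallelic sites (0/1 alleles)."""
    n = len(x)
    pa = sum(x) / n
    pb = sum(y) / n
    pab = sum(a * b for a, b in zip(x, y)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        return 0.0  # a monomorphic site carries no LD information
    return (pab - pa * pb) ** 2 / denom


def ld_prune(positions, haps_a, haps_b, c=0.1, window=10_000, f=0.05):
    """Return indices of SNPs kept; haps_x[i] lists the alleles at SNP i."""
    kept = []
    for i, pos in enumerate(positions):
        # Remove rare variants first (pooled frequency is an assumption).
        pooled = haps_a[i] + haps_b[i]
        freq = sum(pooled) / len(pooled)
        if min(freq, 1 - freq) < f:
            continue
        if kept:
            j = kept[-1]  # assumption: compare with the last kept SNP
            close = pos - positions[j] <= window
            in_ld = (r_squared(haps_a[i], haps_a[j]) > c
                     or r_squared(haps_b[i], haps_b[j]) > c)
            if close and in_ld:
                continue
        kept.append(i)
    return kept


example_positions = [100, 200, 50_000, 50_100]
example_haps_a = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 0, 0]]
example_haps_b = [[0, 1, 0, 1], [0, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 0]]
example_kept = ld_prune(example_positions, example_haps_a, example_haps_b)
```

Here the second SNP is removed because it is in perfect LD with the first in population A, and the fourth is removed as a rare variant.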

S1.2 Preprocessing for phasing error
Phasing error results in a switch between the two haplotypes, which is similar to how recombination affects haplotypes. The difference is that it only occurs in the current generation. Phasing error adds noise to our model, especially when it occurs much more frequently than recombination. Empirically, the recombination rate in humans is known to be approximately $10^{-8}$ per generation between two adjacent base pairs. In most current data (e.g., haplotypes from the 1000 Genomes Project), phasing error occurs as frequently as once every 50 kb, a rate about three orders of magnitude larger than the recombination rate. It is therefore necessary to reduce the effect of phasing error, and we preprocess the haplotypes for this purpose. The steps we use to remove likely phasing errors from two extant haplotypes are as follows.
1. With the allele frequencies of two ancestral populations at each site, we first make a rough estimate of ancestry for the genotype G. For example, suppose the allele …
2. For each site of genotype G, we assign a "dominating ancestry". A dominating ancestry is the ancestry, among the four possible pairs (A, A), (A, B), (B, A), and (B, B), that appears most frequently within a region of a certain length.
Here we use the estimated number of SNPs between two phasing errors as the region length. We view (A, B) and (B, A) as type 1 ancestry, (A, A) as type 0, and (B, B) as type 2.
3. In a region whose dominating ancestry is of type 1, we switch the two haplotypes wherever the ancestral painting, (A, B) or (B, A), differs from that at the preceding positions.
With the assignment of dominating ancestry, the two haplotypes phased from the genotype G can be viewed as blocks of different dominating ancestry. For example, Figure A shows the ancestry of two haplotypes, with black blocks indicating ancestry A and white blocks indicating ancestry B; Figure A (A) provides an example of the ancestry of two phased haplotypes. Here we divide the whole region into three types of sub-regions: type 2 for (B, B), type 1 for (A, B) or (B, A), and type 0 for (A, A). Note that in type 0 and type 2 regions, it is not obvious how to detect and fix phasing errors (nor is it necessary). In a type 1 region, when we detect switch-overs between ancestry (A, B) and (B, A) within the region, we treat each such switch-over as a phasing error position and switch the suffixes of the two haplotypes from this point to make them consistent. This is because the probability that two recombination events happen at exactly the same place is $10^{-8} \times 10^{-8} = 10^{-16}$, which is much smaller than the phasing error probability of $10^{-5}$.
We note that the three-step strategy described above does not remove all phasing errors. In fact, it may even add switching errors in some rare cases. However, our simulations show that this procedure can remove a substantial share of the obvious phasing errors and helps to reduce the noise in the data (see the Results section). Without preprocessing, the phasing error rate for genotypes is approximately 1 per 50 kb, i.e., $p_p = 0.00002$. Experiments on simulated data show that preprocessing removes approximately 2/3 of the phasing errors for admixed individuals. One can therefore use a smaller phasing error rate, $p_p = 0.0000066$, in PedMix after preprocessing.
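The suffix-switching idea in step 3 can be illustrated with a small sketch. It is a simplification under stated assumptions: the input is a per-site ancestry painting that has already been smoothed to its dominating ancestry (steps 1 and 2 are not implemented), and the function name `fix_switch_errors` is hypothetical rather than part of PedMix.

```python
def fix_switch_errors(painting):
    """Undo (A, B) <-> (B, A) switch-overs inside type 1 regions.

    painting -- list of (ancestry_hap1, ancestry_hap2) pairs, each 'A' or 'B',
                assumed to be already smoothed to the dominating ancestry.
    Returns a new painting in which every detected switch-over has been
    repaired by swapping the two haplotypes from that site onward.
    """
    fixed = []
    swapped = False   # True while the remaining suffix is swapped
    last_het = None   # orientation seen so far in the current type 1 region
    for pair in painting:
        if swapped:
            pair = (pair[1], pair[0])
        if pair[0] == pair[1]:
            # Type 0 or type 2 site: nothing to detect, region ends here.
            last_het = None
        else:
            if last_het is not None and pair != last_het:
                # Switch-over inside a type 1 region: treat it as a phasing
                # error and swap the suffix starting at this site.
                swapped = not swapped
                pair = (pair[1], pair[0])
            last_het = pair
        fixed.append(pair)
    return fixed


example_painting = [("A", "B"), ("A", "B"), ("B", "A"), ("B", "A"),
                    ("A", "A"), ("B", "A")]
example_fixed = fix_switch_errors(example_painting)
```

In the example, the switch-over at the third site is repaired, and because the whole suffix is swapped, the last site also changes orientation.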

S1.3 Expected accuracy by random guess
In order to provide a baseline for the evaluation of the inference accuracy, we use a Bayesian model based random guess for estimating ancestry admixture proportions. Here we assume that the mean of admixture proportions of the ancestors is the admixture proportion of the focal individual. We treat each SNP position independently in the following. Given a genotype of a focal individual, we first sample ancestry for each SNP site based on the allele frequency of the SNP in the two ancestral populations. Ancestry of an allele (of some individual) refers to which of the two ancestral populations this allele originates from. With the sampled ancestry of this focal individual, we sample ancestry for his/her ancestors (parents, grandparents or great grandparents) following the posterior distribution. For example, the posterior probability of the parental ancestry is given in Equation 1.
Here $A_0$ is the sampled ancestry for one haplotype of the focal individual at a SNP site. $A_1$ and $A_2$ are the sampled ancestries of the two haplotypes of a single parent (the parent that provides the allele to the focal individual). By the Mendelian segregation laws, $p(A_1) = p(A_2) = \frac{1}{2}$ are the prior probabilities. The grandparental posterior probability $p(A_1, A_2, A_3, A_4 \mid A_0)$ and the great-grandparental posterior probability $p(A_1, A_2, A_3, A_4, A_5, A_6, A_7, A_8 \mid A_0)$ can be derived similarly. The estimate by the random guess is then computed using the sampled ancestry of each ancestor.
Note that the random guess does not use information from admixture tracts and their lengths. For example, given the focal individual's genotype for pedigree CCCY with no phasing errors (Results section), the sampled ancestries of parents and grandparents all show ∼50% admixture proportions. Given a genotype with no phasing error, the random guess can still collect information for parents but fails to collect useful information for grandparents and great grandparents. When phasing errors are added to the genotype, the random guess performs even worse for parents. For example, given the focal individual's genotype for pedigree CCYY (Results section), the random guess gives ∼50% ancestry for each parent, while the two parents are actually 100% and 0%.

S1.4 Inference of ancestral admixture proportions with composite likelihood
The inference framework in PedMix takes advantage of the distribution of admixture tracts. Here, an admixture tract refers to a segment of the genome where the ancestral origin remains the same (i.e., coming from the same ancestral population). However, parental admixture proportions can be estimated simply from the distribution of genotype frequencies using composite likelihood as follows. Let $M = (m_1, m_2)$ be the admixture proportions of the two ancestors. The composite likelihood is then defined as the product of the likelihoods at the individual sites:
$$p(G \mid M) = \prod_i p(G_i \mid M). \qquad (2)$$
The sampling probability for each site, $p(G_i \mid M)$, is calculated as a product of allele frequencies in the two parents using standard methods as follows. The probability of sampling an allele of type $j$ from a parent with admixture proportions $m$ and $1 - m$ from population A and population B respectively is $m f_A + (1 - m) f_B$, where $f_A$ and $f_B$ are the frequencies of allele $j$ at site $i$ in the two populations. We may infer admixture proportions by maximizing the composite likelihood. This can be done, for example, by performing a grid search over $M$.
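The grid search mentioned above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: genotypes are assumed to be coded as the count (0, 1 or 2) of a reference allele whose frequencies in populations A and B are given per site, one allele is assumed to come from each parent, and all function names are hypothetical.

```python
import math


def allele_prob(m, f_a, f_b):
    """Probability that a parent with admixture proportion m transmits the allele."""
    return m * f_a + (1.0 - m) * f_b


def site_likelihood(g, m1, m2, f_a, f_b):
    """p(G_i | M): sum over both assignments of the two alleles to the parents."""
    q1 = allele_prob(m1, f_a, f_b)
    q2 = allele_prob(m2, f_a, f_b)
    if g == 0:
        return (1 - q1) * (1 - q2)
    if g == 2:
        return q1 * q2
    return q1 * (1 - q2) + (1 - q1) * q2


def composite_log_likelihood(genotypes, freqs_a, freqs_b, m1, m2):
    """Log of the product of per-site likelihoods (sites treated as independent)."""
    total = 0.0
    for g, fa, fb in zip(genotypes, freqs_a, freqs_b):
        # Guard against log(0) when a frequency is exactly 0 or 1.
        total += math.log(max(site_likelihood(g, m1, m2, fa, fb), 1e-300))
    return total


def grid_search(genotypes, freqs_a, freqs_b, steps=10):
    """Maximize over (m1, m2) with m1 <= m2, since parent labels are exchangeable."""
    best = None
    for i in range(steps + 1):
        for j in range(i, steps + 1):
            m1, m2 = i / steps, j / steps
            ll = composite_log_likelihood(genotypes, freqs_a, freqs_b, m1, m2)
            if best is None or ll > best[0]:
                best = (ll, m1, m2)
    return best[1], best[2]


# Toy data: every site is heterozygous and strongly differentiated.
example_estimate = grid_search([1] * 50, [0.9] * 50, [0.1] * 50)
```

Restricting the grid to $m_1 \le m_2$ is a design choice of this sketch: the composite likelihood is symmetric in the two parents, so the two orderings cannot be told apart.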
This composite likelihood based method is computationally much faster than PedMix because it ignores linkage disequilibrium (LD). However, the method does not generalize to grandparents or more ancient ancestors, as such models are not identifiable in the composite likelihood setting. To see this, let $M = (m_1, m_2, m_3, m_4)$ be the admixture proportions of the four grandparents, with $(m_1, m_2)$ being from one grandparental couple and $(m_3, m_4)$ from the other. Let the allele $j$ be one of the two alleles of the genotype $G_i$. Without loss of generality, we further suppose this allele is from the parent (parent 1) descending from the grandparents with admixture proportions $m_1$ and $m_2$. The sampling probability, $p(G_i \mid M)$, is then obtained as a sum of products of terms like $p(j, \text{allele from parent 1} \mid m_1, m_2)$ by summing over both possible assignments of alleles to parents. Equation 3 shows that $m_1$ and $m_2$ appear in $p(j, \text{allele from parent 1} \mid m_1, m_2)$ only as the sum $m_1 + m_2$. For any genotype $G_i$, the composite likelihood $p(G \mid M)$ therefore only contains information about $m_1 + m_2$, not about $m_1$ and $m_2$ individually. Consequently, $m_1$ and $m_2$ are not separately identifiable in the composite likelihood model.
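The algebra behind this non-identifiability can be sketched as follows. This is an illustration under our own assumptions, not necessarily the exact form of Equation 3: parent 1 is assumed to transmit an allele inherited from either of its two parents with probability $\frac{1}{2}$ each, and any constant factor for the event that the allele comes from parent 1 is omitted since it does not involve $m_1$ or $m_2$.

```latex
\begin{aligned}
p(j \mid m_1, m_2)
  &= \tfrac{1}{2}\bigl[m_1 f_A + (1 - m_1) f_B\bigr]
   + \tfrac{1}{2}\bigl[m_2 f_A + (1 - m_2) f_B\bigr] \\
  &= \frac{m_1 + m_2}{2}\, f_A + \Bigl(1 - \frac{m_1 + m_2}{2}\Bigr) f_B .
\end{aligned}
```

Any two pairs $(m_1, m_2)$ and $(m_1', m_2')$ with $m_1 + m_2 = m_1' + m_2'$ thus give the same per-site probability, for example $(0.2, 0.8)$ and $(0.5, 0.5)$.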

S1.5 User guideline for using PedMix
Here is a list of user inputs needed by PedMix.
1. Phased haplotypes for the extant individual for whom we are to infer the admixture proportions of his or her ancestors. Haplotypes can be given in segments (or chromosomes), where segments are assumed to be independent.
2. For each SNP, allele frequencies in the two ancestral populations.
3. Recombination fractions between adjacent SNPs along the haplotypes.
Often the input haplotypes may contain too many SNPs or less informative SNPs. In this case, the user may need to apply various data trimming techniques. We suggest using frequency-based pruning with $d_f = 0.5$.

S1.6 Additional results

S1.6.1 Details in comparison to ADMIXTURE, RFMix and ANCESTOR

S1.6.1.1 Verification
To verify that the average admixture proportion of the ancestors of an individual provides a good estimate of the admixture proportion of the focal individual, we simulate a sample of 100 individuals and compare the average admixture proportions of their ancestors with their own admixture proportions. The sampled individuals are drawn either $g = 10$ generations or $g = 5$ generations after the time of admixture. The absolute difference (the error) between the true admixture proportion of an individual of the current generation and the average admixture proportion of his/her ancestors in the $K$-th generation is computed as $\left| m_0 - \frac{1}{2^K} \sum_{1 \le j \le 2^K} m_j \right|$, and we report the mean and variance of this quantity among individuals as the mean error and the variance in the error (Table A). Here $m_0$ is the true admixture proportion of the sampled individual, and $m_j$ is the true admixture proportion of the sampled individual's ancestor $j$ in the $K$-th generation.
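The error measure defined above is simple to compute; a minimal sketch follows. The function names and the toy numbers are made up for illustration and are not values from Table A.

```python
import statistics


def ancestor_average_error(m0, ancestor_props):
    """|m_0 - (1 / 2^K) * sum_j m_j| for the 2^K ancestors in generation K."""
    n = len(ancestor_props)
    if n == 0 or n & (n - 1):
        raise ValueError("expected 2^K ancestral admixture proportions")
    return abs(m0 - sum(ancestor_props) / n)


def summarize_errors(individuals):
    """Mean and variance of the error over (m0, ancestor_props) pairs."""
    errors = [ancestor_average_error(m0, props) for m0, props in individuals]
    return statistics.mean(errors), statistics.pvariance(errors)


# Hypothetical toy input: two individuals with their four grandparents (K = 2).
toy_individuals = [
    (0.40, [0.50, 0.50, 0.25, 0.25]),
    (0.50, [1.00, 0.00, 0.50, 0.50]),
]
toy_mean, toy_variance = summarize_errors(toy_individuals)
```

The population variance (`statistics.pvariance`) is used here; whether the reported variance is the population or the sample variance is not stated in the text.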
We find (Table A) that the average ancestral admixture proportions indeed approximately match the admixture proportions of the individual, as the mean differences are fairly close to 0. Several aspects of the results in Table A are worth noting. First, the variance in the error increases with more ancestors. Second, the variance tends to be larger for individuals from generations that are closer to the admixture event. This is because, when the time since admixture is short, individuals tend to differ more in their admixture proportions than individuals from a generation that is more distant from the admixture event and is thus well mixed.

S1.6.1.2 ADMIXTURE, RFMix and ANCESTOR setting
We apply ADMIXTURE and RFMix to infer the current-generation admixture proportions on the same datasets. RFMix is run with default parameters under the popphased setting. As suggested by ADMIXTURE, genotypes are preprocessed with LD pruning (see Section S1.1.2) with parameters $c = 0.1$ and $W = 10$ kbp. To achieve the best performance in ADMIXTURE and RFMix, we also provide these two tools with all ancestral genotypes from the two ancestral populations, along with 20 individuals from the admixed population. That is, ADMIXTURE is run in "supervised mode". The number of ancestral populations K is set to 2. The ".bed" file is generated by PLINK. For real data, we use 170 haplotypes from CEU and 176 haplotypes from YRI as the two ancestral populations in ADMIXTURE and RFMix to estimate admixture proportions for 61 genotypes from ASW. Here we compute the Pearson correlation coefficient between the admixture proportion estimates from ADMIXTURE, RFMix, and PedMix (average over parents) for the 61 individuals in the ASW population. The estimates by ADMIXTURE and RFMix show the highest correlation (0.9975, see Table B). The estimates by PedMix (average over parents) also have a high correlation with ADMIXTURE (0.9954) and RFMix (0.9945).
To estimate admixture proportions for a given individual, we use the averages of the individual's local ancestry estimates from RFMix across the genome.
ANCESTOR infers the admixture proportions of parents of a focal individual given the ancestry state of each position in the genome. ANCESTOR allows phasing error in genotypes and can be used for multiple ancestries. In this paper, we use the ancestry inferred by RFMix from the Viterbi decoding as the ancestry states in ANCESTOR.

S1.6.2 Data trimming
To compare the relative effects of frequency-based pruning and LD pruning, we perform a small investigation of both on the same simulated dataset. To compare the two trimming strategies efficiently, we simulate a shorter genome of length $L = 5 \times 10^8$. The simulation settings are chosen to allow a better comparison of the two ways of trimming and to ease the computational burden.
There are ∼2.44M SNPs simulated for the whole genome. Here we examine two cases of data trimming. In both cases, we use a window size of $W = 10$ kbp and $c = 0.1$ for LD pruning, following the procedure described in Section S1.1.2. In the first case of LD pruning, we use $f = 0.05$ to remove rare variants (i.e., SNPs with combined frequency in the two populations smaller than $f$), which leaves 284K SNPs. We compare this with frequency-based pruning with the threshold $d_f = 0.27$, chosen to leave a similar number of SNPs (283K). In the other case, we remove rare variants with $f = 0.2$ and compare with frequency-based pruning with $d_f = 0.5$, resulting in 97K and 94K SNPs respectively (Table C). In both cases, inference improves as more SNPs are removed. Overall, the frequency-based pruning approach is slightly better than the LD pruning approach, at least in this simulation. We have therefore used frequency-based trimming as the default strategy for data trimming.
To investigate the effect of frequency-based pruning, we simulate haplotypes for a small region of length $10^6$ bp with 545,302 SNPs. The parameters used in the simulation are described in the caption of Figure B, and we investigate different values of the previously explained allele frequency difference threshold, $d_f$. As shown in Figure B, trimming with larger values of $d_f$ results in a substantial reduction in mean inference error, especially for inferences of admixture proportions in grandparents, whereas no trimming (i.e., $d_f = 0$) can lead to higher error. However, when the trimming threshold is too large (e.g., $d_f = 0.7$), the mean error increases because too much information is lost by removing SNPs. In this case, the optimal trimming threshold appears to be around $d_f = 0.5$. Note that we use a relatively small genome here ($L = 10^6$). The optimal trimming threshold depends on many parameters, such as the mutation rate and the split time of the ancestral populations, and it is challenging to determine the optimal value of $d_f$ for a specific dataset. As a rule of thumb, it is desirable to have at least 100 SNPs per tract (from one ancestral …). Clearly, more work is needed to identify optimal pruning strategies for real data analyses, both for the current method and for other methods that use HMMs for population genetic inference. However, such investigations are not the main subject of the current paper.

Impact of data amount
We run PedMix on different amounts of data, by sub-sampling 5, 10, 15 and 20 chromosomes, to evaluate the effect of data amount on inference accuracy. The mean error is estimated over 10 samples. From Fig C, we can see a clear linear decrease of mean error for parents, grandparents and great grandparents inference, as more data are added. The highest mean error for five chromosomes is 18.07%, which is still much lower than the random guess (about 35%).

Fig C. Impact of data amount. We use 5, 10, 15, and 20 chromosomes respectively. For great grandparents inference, $d_f = 0.9$ is used due to computational burden.

S1.6.3 Impact of recombination rate on a larger genome
As discussed in the main text, the mean error for parental and grandparental inferences decreases as the recombination rate increases. Results are shown in Figure D. When we simulate longer genomes, an increase in the recombination rate no longer leads to a significant decrease in mean error. This is especially so when the simulated length is long (e.g., a whole genome): the mean error decreases by only about 1% as the recombination rate increases 10 times.

Fig D. Varying recombination rate using 10 chromosomes.

S1.6.4 Results on the 1000 Genomes Project data
We run PedMix on the data from the 1000 Genomes Project. The 1000 Genomes Project recently released phased haplotypes on 22 chromosomes for 1,092 individuals. We analyze data from the CEU (Utah Residents with Northern and Western European Ancestry), YRI (Yoruba in Ibadan, Nigeria) and ASW (Americans of African Ancestry in SW USA) populations. The African-American population tends to have admixed European and African (primarily West-African) ancestry. Here we regard CEU and YRI as the two source populations for ASW and infer the admixture proportions of parents and grandparents of ASW individuals. For these three populations, there are 85 CEU individuals (170 haplotypes), 88 YRI individuals (176 haplotypes) and 61 ASW individuals (122 haplotypes) in total in the 1000 Genomes Project data.
We approximate the allele frequencies in the two hypothetical source populations using the average allele frequencies in the CEU and YRI populations. The original data has 1,060,387 SNPs in total. After applying the frequency-based pruning with $d_f = 0.5$, there are 256,122 SNPs (about 24%) left. The recombination fractions are calculated based on the recombination hotspot map of the 1000 Genomes Project. We sample 7 individuals from the CEU, YRI and ASW populations and apply PedMix to infer admixture proportions in their parents and grandparents.
The inferred admixture proportions of the parents and grandparents of the seven 1000 Genomes individuals are shown in Figure E. We infer the ancestors of the CEU individuals to be 98% of CEU origin on average. Similarly, the ancestors of the YRI individuals are inferred to be 98% of YRI origin on average. The admixture proportions in the African-American ASW population vary considerably among individuals (Figure E). Note that some ancestors of the CEU and YRI individuals have small (but non-zero) inferred admixture proportions. Since these proportions are very small (within the error margin of our method), we cannot determine whether these ancestors are admixed or not.
To further validate our results, we analyze 61 individuals from the ASW population using ADMIXTURE and RFMix. Genotypes are pruned with the default LD pruning setting (see Section S1.1.2 of the Supplementary Material). Genotypes from the CEU and YRI populations are provided as the two ancestral populations in both tools. Using PedMix, we compute the average admixture proportions of parents and grandparents to see whether the results are consistent with those from ADMIXTURE and RFMix. The percentages of CEU origin for five ASW individuals with different levels of admixture (from 20% to 90%) are shown in Table D. Although the true admixture proportions of these individuals are unknown, the results of PedMix are consistent with those from RFMix and ADMIXTURE. Moreover, the estimates by PedMix are highly correlated with those by ADMIXTURE and RFMix: the Pearson correlation coefficients are 0.9954 and 0.9945 respectively (Table B).

S1.6.5 Running Time
We now evaluate the computational efficiency of PedMix. PedMix is written in C++. To make the algorithm run faster, we not only adopt a divide-and-conquer strategy but also support running with multiple threads. Multi-threading is useful when there are multiple chromosomes in the data. The best performance occurs when there are k chromosomes with similar numbers of SNPs and k threads are used in parallel. However, since inference is an optimization problem, the convergence time is uncertain. In general, the running time increases exponentially with the number of generations inferred. Here we report the average running time of grandparent inference for 10 individuals, fixing the number of threads to 1, as we increase the number of SNPs from 5,000 to 550,000 (Figure F). As expected, we observe a clear increase in time when we use more SNPs.
For comparison, ADMIXTURE, RFMix and PedMix are run on the same datasets as in the Results section. To estimate admixture proportions for 20 individuals, ADMIXTURE takes 1.5 min using 20 threads, while RFMix takes 21.5 min using one thread. On average, PedMix takes 3 min for parental inference and 2.5 h for grandparental inference per individual, using 11 threads. ADMIXTURE and RFMix run much faster than PedMix because the computations they perform and the parameters they estimate are very different from those of PedMix. ADMIXTURE is somewhat faster than RFMix in our tests. ANCESTOR, which only allows parental inference, is much slower than PedMix.

S1.6.6 Noise in population genetic data
So far we have mainly focused on cases where the noise level in the population genetic data is assumed to be relatively low. In practice, however, population genetic data contain various kinds of noise. For example, one source of error is the genetic map, which is usually inferred from genetic data and can itself contain noise. Moreover, allele frequencies of ancestral populations are usually estimated from extant populations and may contain noise due to genetic drift. Another possible source of error is that some ancestral population may not be known, in which case a totally wrong ancestral population may be used for inference. In the following, we provide empirical results to evaluate the impact of various kinds of population genetic noise on the performance of PedMix.

S1.6.6.1 Noise in genetic map
PedMix needs the genetic distances of SNPs as input. Genetic distance refers to the distance from the current SNP to the next SNP, in units of centimorgans. The accuracy of the genetic distance data can affect the performance of PedMix. Therefore, it is useful to investigate the impact of noise in the genetic map on the accuracy of PedMix. We now conduct experiments to evaluate the performance of PedMix with a noisy genetic map.
We use 10 children from 10 trio families from the ASW (African ancestry in Southwest USA) population of the HapMap Project as the admixed individuals. The genotypes of these ten children are unphased. We use the program Beagle to phase these genotypes, using the phased genotypes of the parents (available from the HapMap Project) as the reference panel. We then apply the phasing error reduction approach in Section S1.2 to the children's haplotypes. We use the CEU and YRI populations as the reference panels of PedMix. We trim the input data using the frequency-based trimming approach (with the trimming threshold 0.4). Accuracy is evaluated using the same procedure as described in the Results section (for real data) in the main text. That is, the inference results for parents from RFMix are used as the ground truth. After running PedMix with these trimmed data (with the original genetic map in HapMap), the average mean error of the inferred admixture proportions of parents is 3.37%.
We now perturb the genetic map. We experiment with two settings. We first add relatively small noise to the genetic map: 20% error (positive or negative) is added to the genetic distance at each SNP. We run PedMix with the perturbed data. The average mean error is now 3.73%, which is only slightly higher than that from the original genetic map. We then increase the error rate. For this, we draw the added error values from a normal distribution with magnitude 0.1; this step adds a random number within (−0.1, 0.1) to the genetic distance at each SNP. If the distance becomes negative, we assign a very small positive number instead. We run PedMix with this highly perturbed genetic map. The average mean error is now 4.94%. Our results show that significant noise in the genetic map can affect the accuracy of PedMix.

S1.6.6.2 Noise in allele frequencies of ancestral populations
As before, we sample ten children of ten trio families from the HapMap ASW population as the admixed individuals. The CEU and YRI populations are used as the reference panels of PedMix. To test the effect of noise in ancestral allele frequencies, we add random noise to the allele frequencies of one reference population. Here, we add 20% (positive or negative) noise to each allele frequency of the CEU population. With these noisy data, the average mean error is now 4.25%, which is somewhat higher than that for the original data (3.37%). This illustrates that noise in allele frequencies can affect the accuracy of PedMix, although the effect is not very large in our simulations.

S1.6.6.3 Varying reference populations
For the HapMap data analysis, we have used the CEU and YRI populations as the reference panels for families from the ASW population. Here, we assume that CEU and YRI are indeed good proxies for the true ancestral populations.
To evaluate the performance of PedMix when a different reference population is chosen, we first replace YRI with LWK (Luhya in Webuye, Kenya), and then replace YRI with CHB (Han Chinese in Beijing, China) in new experiments. When LWK is used to replace YRI, the average mean error is 2.63%, which is somewhat lower than the original mean error (3.37%). This indicates that LWK may be just as good as YRI as a reference population for these ten ASW families. When CHB is used, however, the average mean error is 53.33%, which is much higher than before. Our results indicate that reference populations should be chosen carefully.

S1.6.7 Comparison with ANCESTOR on real data
In the main text of the paper, we have compared PedMix with ANCESTOR (Zou, et al, 2015) on simulated data. We now compare PedMix and ANCESTOR on real data from the HapMap Project. Here, we use the same 10 children of 10 trio families from the ASW (African ancestry in Southwest USA) population as the admixed individuals as in Section S1.6.6. Recall that the average mean error in the admixture proportions of parents inferred by PedMix is 3.37%.
We now run ANCESTOR on these 10 children. ANCESTOR requires the individual's ancestry at each SNP as input. We use RFMix to infer the local ancestry of the children, which is then provided to ANCESTOR. Our results show that the mean error of ANCESTOR is 3.08%, so ANCESTOR is slightly better than PedMix at inferring the admixture proportions of parents. However, ANCESTOR needs more information than PedMix, and its running time is much longer. For example, in our experiments here, ANCESTOR takes around 20 hours for one individual, while PedMix only needs 60 seconds.

S1.7 More details on likelihood computation on the two-stage perfect pedigree model
For ease of reading, we provide further explanation of the likelihood computation on the two-stage perfect pedigree model.
Equation (1). The sampling probability of an AC, $h(AC_i)$, at site $i$ is equal to the transition probability part (between site $i-1$ and site $i$) times the emission probability $p_e(AC_i)$ (at site $i$). The transition probability part sums over each AC at site $i-1$, denoted $AC_{i-1}$, together with the transition probability from $AC_{i-1}$ to $AC_i$. This is exactly the "forward algorithm" of the standard hidden Markov model.
Equation (2). $T_j$ is the transition probability for founder $j$, whose ancestry (0 or 1, corresponding to A and B respectively) may change within a genomic region of length $d$ (in bp). Here, $P^j_{A \to B}$ is the rate at which founder $j$'s ancestry changes from A to B per bp. By taking the first-order approximation, the probability of the ancestry changing from 0 (A) to 1 (B) is approximately $d P^j_{A \to B}$. This explains the first line of Equation (2). The rest follow similarly.
Equation (3). This equation gives the probability of a phasing change along the genome of the focal individual. That is, $I$ is the probability of a phasing error occurring within a distance of $d$ bp, and $d_p$ is the rate of phasing error per bp. This equation is, again, due to the first-order approximation.
Equation (4). Recall that the binary value $R_i[k]$ indicates which parent (the left or the right) the $k$-th meiosis inherits from at site $i$. Thus, $R_{i-1}[k] \neq R_i[k]$ indicates that a recombination occurred in the $k$-th meiosis between site $i-1$ and site $i$.
Equation (5). This equation means that the overall transition probability between two ACs, $AC_{i-1}$ (at site $i-1$) and $AC_i$, consists of three kinds of events that may occur between site $i-1$ and site $i$ for these two ACs:
1. $I$: does a phasing error occur?
2. $B_k$: recombination may occur at the $k$-th meiosis.
3. $T_j$: ancestry change for founder $j$.
Equation (6). Emission occurs at the focal individual, which has two haplotypes $h_1$, $h_2$. This equation says that the emission probability is equal to the product of the probabilities of emitting the two alleles of these two haplotypes, which are in turn equal to the allele frequencies in the corresponding ancestral populations (as specified in $AC_i$).
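The forward recursion described under Equation (1) and the first-order approximation described under Equation (2) can be illustrated with a generic sketch. This is a textbook scaled forward algorithm on an arbitrary small state space, not PedMix's implementation: the AC state space, the recombination and phasing terms, and the divide-and-conquer strategy are all left out, and the function names are hypothetical.

```python
import math


def first_order_transition(d, rate_ab, rate_ba):
    """2x2 ancestry transition matrix over d bp, first-order approximation.

    P(A -> B within d bp) is approximated by d * rate_ab (capped at 1).
    """
    p_ab = min(1.0, d * rate_ab)
    p_ba = min(1.0, d * rate_ba)
    return [[1.0 - p_ab, p_ab],
            [p_ba, 1.0 - p_ba]]


def forward_log_likelihood(init, transitions, emissions):
    """Scaled forward algorithm for a hidden Markov model.

    init         -- init[s]: prior probability of state s at the first site
    transitions  -- transitions[i][s_prev][s]: between site i and site i + 1
    emissions    -- emissions[i][s]: probability of the data at site i given s
    Returns the log of the sampling probability of the whole sequence.
    """
    n_states = len(init)
    h = [init[s] * emissions[0][s] for s in range(n_states)]
    log_lik = 0.0
    for i in range(len(emissions)):
        if i > 0:
            t = transitions[i - 1]
            # Sum over the previous state, then multiply by the emission.
            h = [sum(h[sp] * t[sp][s] for sp in range(n_states)) * emissions[i][s]
                 for s in range(n_states)]
        # Rescale to avoid numerical underflow on long sequences.
        scale = sum(h)
        log_lik += math.log(scale)
        h = [v / scale for v in h]
    return log_lik


# Toy example: two sites 1000 bp apart, with made-up rates and emissions.
toy_transition = first_order_transition(1000, 1e-8, 1e-8)
toy_log_lik = forward_log_likelihood(
    init=[0.5, 0.5],
    transitions=[toy_transition],
    emissions=[[0.2, 0.4], [0.5, 0.1]],
)
```

Rescaling the forward vector at every site and accumulating the logarithms of the scale factors is a common way to keep the computation stable when there are many SNPs.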