Inferring the ancestry of parents and grandparents from genetic data

doi:10.1371/journal.pcbi.1008065

Fig 1.

The perfect pedigree model for g = 3 for a single site.

H: extant haplotype. The pedigree is haplotype-based and models how ancestry changes along the genome. The ancestry origins of the 8 founders are listed above the perfect pedigree. At one site, A and B indicate which of the two ancestral populations each founder’s haplotype originates from. The combined vector of these values is C. Here, C = (ABBAAABB). Arrows: the recombination vector R. Here, R = 0111101, where the meioses are ordered in reverse time order and also from left to right. Population A, shown in red, is the ancestry of H as traced back through the recombination setting. Arrows can change direction at the next site. Founder ancestry is given at specific sites (say s₁, s₂, s₃, …). Note that the ancestry of the founders of the pedigree remains the same at different genomic positions: these founders are the founding members of the admixed population and are not admixed themselves.
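The tracing described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `trace_ancestry` is hypothetical, the meioses in R are taken to be listed generation by generation (most recent first) and left to right within a generation, and a bit of 0 is taken to mean "follow the left parent" and 1 "follow the right parent". The caption does not fix the 0/1 convention; for the values shown here, either choice ends at a founder with ancestry A.

```python
def trace_ancestry(C, R, g):
    """Follow the recombination bits from H up to one founder.

    C: string of founder ancestries, length 2**g (e.g. "ABBAAABB").
    R: string of recombination bits, length 2**g - 1 (e.g. "0111101").
    g: number of generations in the perfect pedigree.
    Assumed convention (not stated in the caption): 0 = left, 1 = right.
    """
    if len(C) != 2 ** g or len(R) != 2 ** g - 1:
        raise ValueError("C must have 2**g entries and R must have 2**g - 1 bits")
    node = 0    # position within the current generation, left to right
    offset = 0  # index in R of the first meiosis of the current generation
    for level in range(g):
        bit = int(R[offset + node])  # the meiosis that produced the current haplotype
        offset += 2 ** level         # skip past this generation's meioses
        node = 2 * node + bit        # move up to the chosen parent
    return C[node]


# Values taken from the caption (g = 3).
print(trace_ancestry("ABBAAABB", "0111101", 3))  # prints "A"
```

With the caption's values the path visits meioses 0, 1 and 4 and stops at the fourth founder, whose ancestry is A, matching the population highlighted in red.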

Fig 2.

The perfect pedigree model for genotype G = (H₁, H₂) at a site, and K = 2.

Two source populations: A and B. (A) Outline pedigree in black: the perfect pedigree for genotype G. The two pedigrees embedded in red are for haplotypes H₁ and H₂, respectively. Ancestral settings and recombination settings with the same label have the same meaning. (B) The simplified perfect pedigree for genotype G. Ancestral vector C: (ABAB). The unlabeled arrows define the recombination vector R. Each Rⁱ corresponds to a meiosis in the pedigree. P: the phasing error setting. Note that, unlike in Fig 1, the founders in this pedigree can themselves be admixed (i.e., the ancestry of these founders can change along the genome) under the two-stage Markovian pedigree model.

Fig 3.

An example of -based HMM with 128 ACs as states for K = 2 generations (i.e. grandparents).

Arrows: possible transitions along the Markov chain from site i − 1 to i. The vectors under each pedigree provide the binary representations of P, C, and R, respectively, for the pedigree. The two top thick arrows and the lower thick arrow indicate the settings of R and P, respectively.
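The count of 128 states is consistent with a 7-bit binary representation of each AC. The sketch below enumerates such a state space, assuming (from Fig 2(B) and the description of the thick arrows above) one phasing bit P, four founder-ancestry bits C, and two recombination bits R. The function name `enumerate_acs` and the bit ordering within each state are illustrative assumptions, not the authors' encoding.

```python
from itertools import product


def enumerate_acs(n_p=1, n_c=4, n_r=2):
    """List every ancestral configuration as a (P, C, R) triple of bit tuples.

    The default sizes (1, 4, 2) are an assumption read off the captions
    for the grandparent case; they give 2**7 = 128 states.
    """
    acs = []
    for bits in product((0, 1), repeat=n_p + n_c + n_r):
        P = bits[:n_p]               # phasing error setting
        C = bits[n_p:n_p + n_c]      # founder ancestries (0/1 standing for A/B)
        R = bits[n_p + n_c:]         # recombination settings
        acs.append((P, C, R))
    return acs


states = enumerate_acs()
print(len(states))  # prints 128
```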

Fig 4.

Faster calculation of the probabilities of ACs.

Red lines break the transition probability matrix into four smaller pieces (for ACs with length 2). The probability vector at the previous site is broken into two pieces. Multiplication of the matrix and the vector is faster due to shared parts between these pieces.
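One way such block sharing can arise is when the transition matrix factorizes over the bits of an AC. The sketch below is an illustration under that assumption only and is not claimed to be the PedMix implementation: it takes the full matrix to be a Kronecker product of one 2 × 2 matrix per bit, so the four blocks of the matrix are scalar multiples of the same smaller matrix and the two sub-products with the halves of the vector are computed once and reused. The names `kron_matvec` and `dense_kron` are hypothetical.

```python
def kron_matvec(factors, v):
    """Compute (F1 ⊗ F2 ⊗ ... ⊗ Fn) v without building the full matrix.

    factors: list of 2x2 matrices (nested lists), one per AC bit.
    v: vector of length 2**len(factors).
    """
    if not factors:
        return list(v)
    (a, b), (c, d) = factors[0]
    half = len(v) // 2
    # The smaller matrix is applied once to each half of the vector;
    # both results are shared by the upper and lower output halves.
    top = kron_matvec(factors[1:], v[:half])
    bot = kron_matvec(factors[1:], v[half:])
    upper = [a * x + b * y for x, y in zip(top, bot)]
    lower = [c * x + d * y for x, y in zip(top, bot)]
    return upper + lower


def dense_kron(factors):
    """Build the full Kronecker product explicitly (for checking only)."""
    M = [[1.0]]
    for F in factors:
        M = [[m * f for m in row_m for f in row_f] for row_m in M for row_f in F]
    return M


# ACs of length 2: a 4 x 4 matrix split into four 2 x 2 blocks.
factors = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.4, 0.6]]]
v = [0.1, 0.2, 0.3, 0.4]
fast = kron_matvec(factors, v)
full = dense_kron(factors)
slow = [sum(full[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
print(all(abs(x - y) < 1e-12 for x, y in zip(fast, slow)))
```

Under this assumption, a vector of length 2ⁿ is multiplied with on the order of n · 2ⁿ operations instead of the 4ⁿ needed for a dense product.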

Table 1.

A list of parameters and their default values used in the simulation.

Fig 5.

Comparison between the accuracy of PedMix and a random guess for parents, grandparents and great-grandparents.

About 570,000 SNPs are used for the parent and grandparent simulations. About 26,000 SNPs are used for the great-grandparent simulations, since this case needs more computing resources.

Table 2.

Mean and standard deviation of the error in the estimate of admixture proportions for ADMIXTURE, RFMix and PedMix (in units of %).

Note that the PedMix results are averages of the estimated admixture proportions of the parents (denoted as par.) or the grandparents (denoted as grandpar.).

Table 3.

Mean and standard deviation of the error in the estimate of admixture proportions of parents from ANCESTOR and PedMix (in units of %).

Fig 6.

Mean error for different simulation parameter settings.

(A) Varying mutation rates. L = 5 × 10⁸. (B) Varying recombination rates. L = 5 × 10⁸. (C) Varying the time since admixture. (D) Ancestral population split time. For t = 0.01, there are no SNPs left after using the d_f = 0.5 cut-off. As a result, we use d_f = 0.2, leaving 74,227 SNPs for analysis. Default parameters are used except for the variable indicated by the X axis of each plot.

Fig 7.

Inference error vs phasing error: Comparison among three different datasets.

The three are: datasets without phasing errors, datasets with phasing errors, and preprocessed datasets. (A) Samples from an admixed population. (B) Samples from an unadmixed population.

Fig 8.

Six pedigrees used in the simulation scheme for semi-simulated data.

The percentage of CEU origin is shown in the pedigrees. Admixture proportions in black are estimates by RFMix using the ancestors’ genotypes directly and are assumed to be the true proportions. Admixture proportions in blue and red are estimates by PedMix using genotypes of the focal descendant individual with (red) or without (blue) phasing error. Mean error is the average over two parents and four grandparents.

Fig 9.

Admixture proportion inference error for parents and grandparents for ten HapMap trios from the ASW population.

Left: inference error for the ten HapMap trios, individually and on average. Right: inference error with different amounts of data. X-axis: the number of chromosomes used. Y-axis: parental/grandparental inference error. Parental: the difference between PedMix’s parental inference for the child and RFMix’s inference for the parents’ haplotypes. Grandparental: the difference between PedMix’s grandparental inference for the child and PedMix’s parental inference for the parents’ haplotypes in the trio.
