Fig 1.
Problem statement illustration.
Squares represent males and circles represent females. Horizontal lines create couples and show sibling relationships. Parents and offspring are connected by vertical lines. Filled-in symbols represent individuals who have been genotyped. Our aim is to reconstruct all ungenotyped individuals (orange question marks) who have genotyped descendants.
Fig 2.
In the first two steps we identify IBD segments and compile a list of potential sources for each one. In the iterative phase, we alternate between choosing sources for each IBD and grouping the IBDs that are placed within each individual. If the IBDs assigned to an individual can be arranged into two haplotypes meeting thresholds for coverage defined in Methods, then those haplotypes are considered strong. IBD segments that conflict with strong haplotypes are rejected and must be assigned a different source. When we are no longer building more haplotypes, we return the reconstructed chromosomes.
Table 1.
Individuals per generation, with the founders assigned generation 1.
The generation of a non-founder is defined to be one more than the maximum of the generations of their parents. Married-in individuals carry the generation of their spouse. The total number of individuals is 1338, with genotype information for 394 (largely in generations 9-12).
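The generation rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `parents` and `spouse` dictionaries and the toy pedigree are hypothetical, and a "married-in" individual is assumed to be someone with no recorded parents whose spouse does have recorded parents.

```python
def assign_generations(parents, spouse):
    """Assign a generation number to every individual.

    parents: dict mapping person -> tuple of parent ids (empty for founders).
    spouse:  dict mapping person -> spouse id (hypothetical input format).
    """
    gen = {}

    def generation(person):
        if person in gen:
            return gen[person]
        pars = parents.get(person, ())
        if pars:
            # Non-founder: one more than the maximum parental generation.
            g = 1 + max(generation(p) for p in pars)
        else:
            partner = spouse.get(person)
            if partner is not None and parents.get(partner):
                # Married-in (assumed definition): inherit the spouse's generation.
                g = generation(partner)
            else:
                # True founder.
                g = 1
        gen[person] = g
        return g

    for person in set(parents) | set(spouse):
        generation(person)
    return gen


# Toy pedigree: a and b are founders, c is their child,
# m marries in as c's spouse, and d is the child of c and m.
toy_parents = {"a": (), "b": (), "c": ("a", "b"), "m": (), "d": ("c", "m")}
toy_spouse = {"a": "b", "b": "a", "c": "m", "m": "c"}
toy_generations = assign_generations(toy_parents, toy_spouse)
```

In this toy example the married-in individual `m` receives generation 2 (that of `c`), so their child `d` is assigned generation 3.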
Fig 3.
Genetic similarity vs. kinship.
For each pair of genotyped individuals, we compute their genetic similarity (counting genotyped SNPs only) and plot this on the y-axis against their kinship coefficient on the x-axis. The expected linear trend is apparent, with an average sequence similarity of 72.5%. The minimum similarity out of all pairs was 70.5%, and the maximum was 99.999% (twins).
Fig 4.
(A) Let individuals 1–8 be the genotyped individuals of this pedigree. Let C = {1, 2, 5, 7, 8} (orange individuals) be the cohort sharing IBD segment I. Note that this pedigree contains two loops, since c and f share recent ancestors p and q, and d and e share recent ancestor ℓ. The multiset Mp for each ancestral individual p is shown below the node name. Mp is formed by concatenating the multisets of p’s children, and it represents the number of paths from ancestor p to each member of the cohort. (B) After trimming redundant ancestors and merging couples, we obtain a set of putative sources for the IBD segment. In this case, we have three potential sources: S = {gh, ℓ, pq}. We begin the iterative phase by selecting the source with the fewest descendance paths, which in this case is gh (starred). We place the IBD segment in individuals that are on all paths from gh to the cohort. In this case we would add the IBD segment to individuals b, c, and d (light orange).
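The multiset construction in panel (A) can be illustrated with a short Python sketch. It assumes a hypothetical `children` dictionary (ancestor to list of children) and a toy pedigree that is not the one drawn in the figure; trimming redundant ancestors and merging couples are not shown.

```python
from collections import Counter


def path_multisets(children, cohort):
    """Return M, where M[p][m] is the number of downward paths from p to cohort member m.

    M[p] is built by concatenating (summing) the multisets of p's children;
    a cohort member contributes one copy of itself.
    """
    memo = {}

    def visit(p):
        if p in memo:
            return memo[p]
        m = Counter()
        if p in cohort:
            m[p] += 1
        for child in children.get(p, ()):
            m.update(visit(child))  # Counter.update adds counts
        memo[p] = m
        return m

    for p in children:
        visit(p)
    return memo


def covers_cohort(multiset, cohort):
    """An ancestor is a candidate source only if it reaches every cohort member."""
    return all(multiset[m] > 0 for m in cohort)


# Toy pedigree (hypothetical): g has children b and c; b and c share child x,
# which creates a loop; x has child "1"; c also has child "2".
toy_children = {"g": ["b", "c"], "b": ["x"], "c": ["x", "2"], "x": ["1"]}
toy_cohort = {"1", "2"}
M = path_multisets(toy_children, toy_cohort)
candidates = sorted(p for p in toy_children if covers_cohort(M[p], toy_cohort))
```

Here `M["g"]` records two paths to individual 1 (through b and through c) and one path to individual 2, and only `c` and `g` reach the whole cohort.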
Fig 5.
Given a cohort of five individuals sharing an IBD segment (orange), we often obtain multiple sources (blue nodes) and multiple descendance paths (blue lines) from each source. In this example we have 11 total paths from three sources. After we choose a source, we assign the IBD segment to ancestors along all descendance paths (light orange). (A) One path from source ℓ. (B)-(C) Two different descendance paths from the same source pq. We would not assign the IBD to d and e since they are not on all paths from this source.
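One way to read "on all paths from this source" is that an ancestor receives the segment when every path from the source to at least one cohort member passes through that ancestor. The sketch below implements that reading with path counts; the interpretation, the `children` input format, and the toy pedigree are assumptions rather than the authors' implementation.

```python
from functools import lru_cache


def individuals_on_all_paths(children, source, cohort):
    """Return ancestors v such that, for some cohort member m, every path
    from source to m passes through v (assumed interpretation)."""

    @lru_cache(maxsize=None)
    def n_paths(u, v):
        # Number of downward paths from u to v in the pedigree DAG.
        if u == v:
            return 1
        return sum(n_paths(c, v) for c in children.get(u, ()))

    nodes = set(children) | {c for kids in children.values() for c in kids}
    placed = set()
    for v in nodes:
        if v == source or v in cohort:
            continue
        for m in cohort:
            total = n_paths(source, m)
            # v lies on every source-to-m path exactly when the paths
            # through v account for all of them.
            if total > 0 and n_paths(source, v) * n_paths(v, m) == total:
                placed.add(v)
                break
    return placed


# Toy pedigree (hypothetical): individual "2" descends from b via both d and e,
# so neither d nor e is on all paths, while a, b, and c are.
toy_children = {
    "s": ["a", "b"],
    "a": ["c"],
    "c": ["1"],
    "b": ["d", "e"],
    "d": ["2"],
    "e": ["2"],
}
placed = individuals_on_all_paths(toy_children, "s", {"1", "2"})
```

As in panels (B) and (C), the two intermediate individuals that lie on only one of the alternative paths (`d` and `e` in the toy pedigree) are left unassigned.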
Fig 6.
Grouping algorithm illustration.
Each horizontal line represents one IBD segment that we placed within a specific individual (highlighted in the pedigree inset). Each vertical line indicates a difference (heterozygous site) between groups. In this case, the orange IBD segment conflicts with both the blue and green groups, so we would reject its source and attempt to find a new one in the next iteration.
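The conflict check illustrated here can be sketched as follows. This is a simplified greedy stand-in for the grouping algorithm described in Methods, not a reproduction of it: each segment is represented as a hypothetical dictionary from SNP position to allele, and an individual is assumed to hold at most two groups (haplotypes).

```python
def conflicts(group, segment):
    """Two sequences conflict if they disagree at any position they both cover."""
    shared = group.keys() & segment.keys()
    return any(group[pos] != segment[pos] for pos in shared)


def place_segment(groups, segment, max_groups=2):
    """Try to add a segment to an individual's groups.

    Returns True if the segment joined an existing group or started a new one,
    and False if it conflicts with every group (its source would be rejected).
    """
    for group in groups:
        if not conflicts(group, segment):
            group.update(segment)  # extend the haplotype with the new sites
            return True
    if len(groups) < max_groups:
        groups.append(dict(segment))
        return True
    return False


# Toy example: the third segment disagrees with both groups at position 2.
groups = []
first = place_segment(groups, {1: "A", 2: "C"})
second = place_segment(groups, {2: "G", 3: "T"})
third = place_segment(groups, {2: "T"})
```

In the toy example the first two segments form two groups, and the third is rejected because it differs from both at a shared site, mirroring the orange segment in the figure.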
Fig 7.
Example of the grouping algorithm on a genotyped individual.
Each horizontal line represents one IBD segment shared with a cohort of other genotyped individuals. IBD segments of the same color represent haplotypes, and have a consistent sequence along the chromosome. Small vertical lines represent heterozygous sites between the two haplotypes. (A) Chrom 8: very occasionally we merge groups incorrectly and obtain three groups. (B) Chrom 21: we almost always see two clear haplotypes (here we also see a large stretch of homozygosity).
Table 2.
For each of chromosomes 18-22, we left out one genotyped individual in turn and attempted to reconstruct their haplotypes. The second row shows how many individuals (out of 89) met our criteria for reconstruction. The third row shows the average sequence identity of the individuals we were able to reconstruct, measured against their true sequences.
Table 3.
Whole-genome ancestral reconstruction results: Simulated data.
The second column shows the total number of IBD pairs identified between genotyped individuals. The third column shows the number of unique IBD segments per chromosome. The fourth column shows how many iterations the algorithm needed to converge. The fifth column shows the number of ancestral (ungenotyped) individuals we were able to successfully reconstruct. The sixth column shows the average sequence similarity of the individuals we were able to reconstruct, as compared to their true genomes. The rightmost two columns show the number of individuals that we predicted would be very well reconstructed, along with their average accuracies.
Fig 8.
Individual results: Simulated data.
The same results as Table 3, but shown at the individual level. The top set of figures shows reconstruction completeness as measured by the number of reconstructed chromosomes. The bottom set of figures shows reconstruction accuracy as measured by sequence identity averaged over the reconstructed chromosomes. These two metrics are plotted against three statistics about each individual: the generation number (lower is more ancient), the number of genotyped direct descendants (children and grandchildren), and the inbreeding coefficient as calculated by PedHunter using the entire AGDB, which comprises more than 500,000 individuals. Correlation coefficients are shown for each relationship.
Fig 9.
Varying population size and source-finding approach: Simulated data.
The top panel shows the average reconstruction accuracy of chromosome 21 as a function of pre-migration population size. The bottom panel shows the number of reconstructed individuals for the corresponding scenarios. The greedy source-identification algorithm is denoted “min path” and the probabilistic algorithm is denoted “max prob”. There is a clear tradeoff between accuracy and the number of individuals reconstructed.
Table 4.
Whole-genome ancestral reconstruction results: Amish data.
The second column shows the total number of IBD pairs identified between genotyped individuals. The third column shows the number of unique IBD segments per chromosome. The fourth column shows how many iterations the algorithm needed to converge. The fifth column shows the number of ancestral (ungenotyped) individuals we were able to successfully reconstruct. We require a successfully reconstructed chromosome to have two haplotypes that cover at least half the chromosome, with sufficient IBD support for each haplotype. Finally, the last column shows the runtime.
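The coverage part of this criterion can be sketched as below. The sketch assumes each haplotype is summarized by the hypothetical `(start, end)` intervals of its IBD segments in base pairs; the "sufficient IBD support" requirement is defined in Methods and is not modeled here.

```python
def coverage_fraction(intervals, chrom_length):
    """Fraction of the chromosome covered by the union of (start, end) intervals."""
    covered = 0
    current_end = 0
    for start, end in sorted(intervals):
        start = max(start, current_end)  # skip the part already counted
        if end > start:
            covered += end - start
            current_end = end
    return covered / chrom_length


def meets_coverage_criterion(hap_a, hap_b, chrom_length, min_coverage=0.5):
    """Both haplotypes must cover at least half the chromosome (coverage check only)."""
    return (
        coverage_fraction(hap_a, chrom_length) >= min_coverage
        and coverage_fraction(hap_b, chrom_length) >= min_coverage
    )


# Toy example on a chromosome of length 100 (arbitrary units).
hap_a = [(0, 30), (20, 50), (80, 100)]  # union covers 70 units
hap_b = [(10, 40)]                      # covers 30 units
cov_a = coverage_fraction(hap_a, 100)
ok = meets_coverage_criterion(hap_a, hap_b, 100)
```

Overlapping segments are counted once, so the first toy haplotype covers 70% of the chromosome; the pair still fails the check because the second haplotype covers only 30%.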
Fig 10.
Left: IBD length distribution for the real data for chromosome 21. Right: IBD length distribution for the simulated data for chromosome 21. The x-axis units are 10 Mbp.
Fig 11.
The blue and green groups are removed, since they are weaker than the cyan and red groups. In the next iteration, we retain only strong groups and consider the individual reconstructed. IBDs that are newly sourced after this point may not conflict with these reconstructed haplotypes.
Fig 12.
Successful ancestral reconstructions.
Ancestral reconstructions of ungenotyped individuals, from a variety of chromosomes and generations. As we go back in time, we generally have fewer IBD segments to group.
Fig 13.
Each node represents a nuclear family (parents and children). When a child of one family becomes the parent of another, we draw an edge. Black nodes have at least 80% of the family genotyped. Gray nodes have at least 80% of the family without genotyped descendants. Colors from yellow (fewer) to red (more) represent the average number of chromosomes reconstructed for the individuals in the family.