Ancestral haplotype reconstruction in endogamous populations using identity-by-descent

In this work we develop a novel algorithm for reconstructing the genomes of ancestral individuals, given genotype or sequence data from contemporary individuals and an extended pedigree of family relationships. A pedigree with complete genomes for every individual enables the study of allele frequency dynamics and haplotype diversity across generations, including deviations from neutrality such as transmission distortion. When studying heritable diseases, ancestral haplotypes can be used to augment genome-wide association studies and to track disease inheritance patterns. The building blocks of our reconstruction algorithm are segments of identity-by-descent (IBD) shared between two or more genotyped individuals. The method alternates between identifying a source for each IBD segment and assembling the IBD segments placed within each ancestral individual. Unlike previous approaches, our method is able to accommodate complex pedigree structures with hundreds of individuals genotyped at millions of SNPs. We apply our method to an Old Order Amish pedigree from Lancaster, Pennsylvania, whose founders came to North America from Europe during the early 18th century. The pedigree includes 1338 individuals from the past 12 generations, 394 of whom have genotype data. The motivation for reconstruction is to understand the genetic basis of diseases segregating in the family by tracking haplotype transmission over time. Using our algorithm, thread, we are able to reconstruct an average of 224 ancestral individuals per chromosome, and for these ancestral individuals we reconstruct on average 79% of their haplotypes. We also identify a region on chromosome 16 that is difficult to reconstruct; we find that this region harbors a short Amish-specific copy number variant and the gene HYDIN. thread was developed for endogamous populations, but it can be applied to any extended pedigree in which the recent generations are genotyped.
We anticipate that this type of practical ancestral reconstruction will become more common and necessary to understand rare and complex heritable diseases in extended families.


Algorithm 1: Overview
Input: G = genotyped individuals, NG = non-genotyped individuals, P = pedigree tree relating all individuals in G and NG
Output: R = reconstructed individuals, G_p = groups for each individual p ∈ R

find IBDs shared between G using GERMLINE
for I_k ∈ IBDs do
    C_k = cohort of individuals from G sharing I_k
    S_k = sources of C_k (Algorithm 2)
    d_k(s) = number of descendance paths for each s ∈ S_k (Algorithm 2)
end
R = G
IS = list of IBDs to source
while R not changing and IS not empty do    // this loop tracks iterations
    for I_k ∈ IS do
        while assignment unsuccessful and S_k is not empty do
            selected source s* = arg min_s d_k(s)
            if d_k(s*) > path threshold then
                ignore I_k
            end
            else
                D_k(s*) = all individuals lying on each path from s* to C_k
                assign I_k to all individuals in D_k(s*)
                if I_k conflicts with a reconstructed individual in D_k(s*) then
                    remove I_k from all of D_k(s*)
                    remove s* from S_k
                    assignment is unsuccessful
                end
            end
        end
    end
    reset IS to empty list
    for individual p ∈ NG do
        G_p = reconstructed haplotype groups (Algorithm 3)
        if exactly 2 strong groups in G_p then
            add p to R
        end
        if 2 strong groups and one or more weak groups in G_p then
            remove weak groups from G_p
            add all IBDs from weak groups to IS
            add p to R
        end
    end
end
return R, G_p for each p ∈ R
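The source-selection loop at the heart of Algorithm 1 can be sketched in Python. This is a minimal sketch under simplifying assumptions, not the thread implementation: candidate sources arrive as precomputed path counts d_k(s), and the conflict test against already-reconstructed individuals is replaced by a hypothetical precomputed set of (IBD, source) pairs.

```python
def assign_ibds(ibd_sources, conflicts, path_threshold):
    """Greedy source assignment as in Algorithm 1's inner loop.

    ibd_sources: {ibd_id: {source_id: n_descendance_paths}}; the inner
        dicts are consumed as candidate sources are rejected.
    conflicts: set of (ibd_id, source_id) pairs standing in for the
        conflict test against reconstructed individuals.
    Returns {ibd_id: source_id} for the IBDs placed in this pass.
    """
    placed = {}
    for ibd, sources in ibd_sources.items():
        while sources:
            s = min(sources, key=sources.get)   # s* = arg min_s d_k(s)
            if sources[s] > path_threshold:
                break                           # ignore I_k this pass
            if (ibd, s) in conflicts:
                del sources[s]                  # assignment unsuccessful
                continue                        # try the next-best source
            placed[ibd] = s
            break
    return placed
```

Sources with too many descendance paths are skipped entirely, mirroring the `path threshold` test; a conflicting source is removed from S_k so later passes try the next-smallest d_k(s).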

Algorithm 2: Source and Descendance Path Finding
Input: C = a cohort of individuals sharing a single IBD, P = pedigree tree containing relationships between individuals
Output: S = a list of possible non-redundant sources for cohort C

Algorithm 3: Reconstructing haplotype groups
for individual p ∈ A do
    for IBD I ∈ I_p do
        add I to one or both groups in G_p depending on zygosity
    end
end
for individual p ∈ A do
    find any homozygous groups G
    remove all IBDs from S_p that were used to build groups in G_p
    for pairs of groups G_i, G_j ∈ G_p and remaining IBD I ∈ I_p do
        if I overlaps G_i and G_j sufficiently then
            merge G_j into G_i and delete G_j
        end
    end
    for pairs of groups G_i, G_j ∈ G_p do
        if G_i and G_j overlap or "line up" then
            merge G_j into G_i and delete G_j
        end
    end
end

Complexity Analysis
The worst-case time complexity analysis involves several incomparable quantities. We use the following observations. Each row of GERMLINE output reports two individuals sharing one IBD, so the size of the GERMLINE output file is O(|I||G|^2), which we simplify to O(|I||P|^2). These rows must be processed to assign to each individual its corresponding IBDs and to compile the list of individuals sharing each IBD. The two gathering tasks can be implemented to take O(|I||P|^2) time, and in a memory-efficient manner, if we assume that the GERMLINE output is sorted so that all rows for the same IBD are consecutive. However, the current implementation takes O(|I||P|^3) time because it uses a general, inefficient implementation of union-find to determine the set of individuals carrying each IBD. In this context the sets are finite and we know that the final result will be a single set for each IBD, so union-find can be implemented in linear O(|P|) time using bit arrays.
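Assuming sorted GERMLINE-style rows of the form (IBD id, individual a, individual b), the linear-time alternative described above can be sketched as follows, using an arbitrary-precision Python int as the bit array (bit p is set iff individual p carries the IBD).

```python
from itertools import groupby

def cohorts_from_sorted_rows(rows, n_individuals):
    """Gather the cohort for each IBD without general union-find.

    rows: iterable of (ibd_id, indiv_a, indiv_b) tuples, sorted so that
        all rows for the same IBD are consecutive (the assumption made
        in the text). Individuals are integer ids < n_individuals.
    Returns {ibd_id: sorted list of individuals carrying that IBD}.
    """
    cohorts = {}
    for ibd_id, grp in groupby(rows, key=lambda r: r[0]):
        bits = 0
        for _, a, b in grp:
            bits |= (1 << a) | (1 << b)         # O(1) set insertion
        cohorts[ibd_id] = [p for p in range(n_individuals)
                           if bits >> p & 1]
    return cohorts
```

Because every row for an IBD is consecutive, each row is touched once and each bit array is decoded once, giving the O(|I||P|^2)-time, memory-efficient behavior claimed above.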
The time-consuming steps in Algorithm 3 compare IBDs for overlap and conflicts. In the worst case one might have to compare every SNP in one IBD to every SNP in another IBD. Algorithm 3 is run for every ungenotyped individual, so the worst-case running time for all uses of Algorithm 3 in one iteration is O(|NG||J|^2), which we simplify to O(|J|^2|P|). The dominant time is to build all the paths from one source to one target, which takes O(|P|^2 · 2^|A|), where |A| is the number of generations. Overall, this is O(|P|^2 · 2^|A|) per iteration.
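To see where the 2^|A| factor comes from, consider enumerating descent paths in the pedigree: whenever two lineages from the same source reconverge, as happens routinely in an endogamous pedigree, the number of source-to-target paths can double with each such generation. A minimal sketch, with the pedigree given as a hypothetical parent-to-children adjacency dict:

```python
def descendance_paths(children, source, target):
    """Enumerate all descent paths from `source` down to `target`.

    children: {parent_id: [child_ids]} adjacency for the pedigree.
    Returns a list of paths, each a list of individual ids. In an
    endogamous pedigree the number of paths can grow as 2^(generations),
    which is the exponential 2^|A| factor in the bound above.
    """
    if source == target:
        return [[source]]
    paths = []
    for child in children.get(source, []):
        for tail in descendance_paths(children, child, target):
            paths.append([source] + tail)
    return paths
```

A single "diamond" (source → two children → shared grandchild) already yields two paths; stacking k such diamonds yields 2^k paths, so explicit enumeration is exponential in the number of generations even though each path has length O(|P|).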

There are three time-consuming parts of Algorithm 1 with different complexities. The assignment steps require O(I_max|I||P|) time. The cost of identifying conflicts and overlaps to make the groups for each individual is O(|H||P|) with a careful implementation of these two tests. As indicated above, the cost of all the calls to Algorithm 3 is O(|J|^2|P|). Putting the three terms together, we get O((I_max|I| + |H| + |J|^2)|P|). For this data set, and for any interesting endogamous data set, we would expect the O(|J|^2|P|) term to dominate.
Since the cost of Algorithm 3 is subsumed in Algorithm 1, we do not include it in the final tally. Algorithms 1 and 2 are used for r iterations, while the processing of GERMLINE output is done only once. Putting together the costs of preprocessing and Algorithms 1 and 2, we arrive at a complexity of O(|I||P|^2 + r(I_max|I| + |H| + |J|^2)|P| + r|P|^2 · 2^|A|). In a typical dataset we would expect |P| to be far smaller than |H|, |I|, or |J|, and that is true for this dataset. Thus, we expect the dominant terms to be O(r|J|^2|P|) or the incomparable exponential term O(r|P|^2 · 2^|A|). Since these two terms are incomparable, we performed runtime profiling with the Python module cProfile. These experiments validated that the procedures taking O(r|J|^2|P|) and O(r|P|^2 · 2^|A|) time dominate. For the pedigree structures tested, on the shortest chromosomes (21 and 22) these two parts of thread take similar amounts of time, within a multiplicative factor of 2. On most of the (longer) chromosomes the O(r|J|^2|P|) term dominates because |J| increases with the length of the chromosome, whereas the cost of finding paths from sources, O(r|P|^2 · 2^|A|), does not depend substantially on chromosome length; on longer chromosomes one might need to consider more potential sources, but this does not affect the asymptotic upper bound on running time.

Probabilistic source identification
Given an IBD segment I and associated cohort C (the genotyped individuals that carry I), we wish to approximate the probability that each potential source (s_1, s_2, …, s_k) is the true origin of I. At a high level, we compute this by compiling the probabilities of transmitting the IBD from the source to each member of the cohort.
Let the genetic distance of I be d cM. Then the probability of a recombination event within the IBD segment during one meiosis is

r = (1 − e^(−2d/100)) / 2

First we will compute the probability that a child receives 0, 1, or 2 copies of the IBD from its parents. Let [f_0, f_1, f_2] be the probabilities that the first parent has 0, 1, or 2 copies of the IBD, and similarly [m_0, m_1, m_2] for the second parent. Then we can compute the child probabilities as
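The recombination probability above is the Haldane map function evaluated at d/100 Morgans; as a sanity check it can be computed directly:

```python
import math

def recomb_prob(d_cm):
    """Probability of a recombination within an IBD segment of genetic
    length d centimorgans during one meiosis, per the formula above:
    r = (1 - exp(-2d/100)) / 2 (the Haldane map function)."""
    return (1.0 - math.exp(-2.0 * d_cm / 100.0)) / 2.0
```

As expected, r is 0 for a segment of zero genetic length and approaches 1/2 (free recombination) as d grows.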