PRIMAL: Fast and Accurate Pedigree-based Imputation from Sequence Data in a Founder Population

Founder populations and large pedigrees offer many well-known advantages for genetic mapping studies, including cost-efficient study designs. Here, we describe PRIMAL (PedigRee IMputation ALgorithm), a fast and accurate pedigree-based phasing and imputation algorithm for founder populations. PRIMAL incorporates both existing and original ideas, such as a novel indexing strategy of Identity-By-Descent (IBD) segments based on clique graphs. We were able to impute the genomes of 1,317 South Dakota Hutterites, who had genome-wide genotypes for ~300,000 common single nucleotide variants (SNVs), from 98 whole genome sequences. Using a combination of pedigree-based and LD-based imputation, we were able to assign 87% of genotypes with >99% accuracy over the full range of allele frequencies. Using the IBD cliques we were also able to infer the parental origin of 83% of alleles, and genotypes of deceased recent ancestors for whom no genotype information was available. This imputed data set will enable us to better study the relative contribution of rare and common variants on human phenotypes, as well as parental origin effect of disease risk alleles in >1,000 individuals at minimal cost.


Phasing
Our phasing method is similar to the long-range phasing algorithm by Kong et al. [9] and to our earlier approach for phasing Hutterite genotype data [15], but introduces several key improvements that boost its quality. In our method, phasing consists of six steps ( Figure S3) that incrementally phase more single nucleotide variants (SNVs) and including a larger sample, starting with individuals then moving to parent-offspring trios to nuclear families and eventually to the entire pedigree. The later steps are more computationally costly.
In the first step we apply Mendelian logic to phase all SNVs for which the proband ( Figure S3, Step 1) or one of his/her parents is homozygous (Step 2).

Step 3: Parent-Child Pairs in Nuclear Families
Next, we consider parent-child pairs, in whom > 50% of the parent's heterozygous SNVs are already phased so that his/her two haplotypes can be delineated ( Figure S3, Step 3). The corresponding child's chromosome is a recombinant of these haplotypes. We search for long (≥ 400 SNVs) identity-by-state (IBS) segments between each parental haplotype and the child's haplotype, allowing for localized differences due to potential genotyping errors. First, in each window of 5 consecutive SNVs, if there exists only one non-IBS SNV, it is ignored. Long IBS segments are then declared IBD, and used to phase the parent and child. This allows us to phase triply-heterozygous SNVs that were phased in a parent (because the parent is itself a child in another trio). Our segments are shorter than in previous works [9,6] (400 SNVs vs. 1000 SNVs) because we only rely on IBS to identify IBD segments in first-degree relatives, where the mean IBD segment length is longer (25cM [4]); for more distant relatives, we employ a more precise Hidden Markov Model (HMM) strategy, cf. Sec. IBD Estimation. Moreover, IBS segments may not cover some short regions around recombination sites because not all SNVs are informative (either the parent is homozygous, or the child or parent has not been phased yet). After all parent-child pairs are processed, we apply the same procedure to nuclear families to identify all IBD haplotype segments between parents and their children. At each SNV, we thereby determine which family members are IBD, then phase the unphased members using a majority vote of the phased members ( Figure S4), as follows. Given a parent P (the father in this example) who is phased at most heterozygous SNVs, call his partially-known haplotypes P 1 and P 2 . Let C 1 , . . . , C 4 be the partially known paternal haplotypes of his four children. We first determine IBD segments between P and each child using long IBS stretches ( Figure S4, step 1), from which we can identify child recombination sites at segment ends (denoted by vertical dashed blue lines; step 2). In the interval between each two sites, we determine the two sets of children that are IBD with P 1 and P 2 , respectively (step 3). By transitivity, those children are also IBD among themselves (dashed lines). For each SNV in the interval and each IBD set, if n 1 phased members of the set have one allele (say, the A-allele in P 1 , C 1 ) and n 2 have the other (say, the T-allele in C 3 ) with n 2 < n 1 , then the n 2 alleles are flagged as errors and corrected to allele A (for instance, the T-allele in C 3 is corrected to A), and the majority allele is copied to unphased set members (C 4 ). If n 2 = n 1 , all members of the set are zeroed out and flagged as errors, since there is no majority vote. These sets are special cases of IBD cliques described in the main text section "IBD Segment Indexing into Cliques": in both cases we build an IBD-sharing graph between several haplotypes; by transitivity, this graph must be a clique.

Step 4: Children Comparison in Nuclear Families
If a parent is not phased in > 50% of his/her heterozygous sites, but has K ≥ 3 children who are > 90% phased at all sites, we instead use the most-phased child haplotype as a template [6], and examine the IBS state between the template haplotype and the haplotypes of the other K − 1 children ( Figure S3, Step 4). If a single child changes IBS state with respect to the template at a SNV, we infer that a recombination occurred in that child; if K − 1 children change state at a SNV, it is likely a recombination in the template child (because otherwise K − 1 children would have a recombination at the same SNV, an event of negligible probability). Knowledge of the template child's haplotype and recombination sites allows us to reconstruct the two parental haplotypes, as well as phase more SNVs in the children as in Step 3. Note that the parental origin of the parent's haplotypes cannot be deduced from their children.

Step 5: IBD between Children in Nuclear Families
In families in which both parents are not genotyped, some of the (quasi-founder) children might now be phased using information on their siblings, but we do not know which of their haplotypes are paternal and which are maternal in origin. To accomplish this, we first classify the haplotypes of phased children into two groups (those inherited from one parent, and those inherited from the other, although we do not yet know their parental origin): 1. We identify IBD segments between each two haplotypes A and B (belonging to different children or to the same child) using a standard HMM described in detail in Sec. IBD Estimation in a Segment, Haplotype vs. Haplotype. Rather than explicitly modeling LD as in IBDLD [7], our strategy is to prune the framework SNVs used in the HMM for LD using a greedy strategy that maximizes the number of SNVs while keeping their remaining pairwise r 2 < 0.3 (we pick the first available SNV, remove all SNVs that are in LD with it, then add the closest SNV to the frame, and repeat). The remaining SNVs are treated as probabilistically independent, i.e., the observed genotype at SNV i is assumed to be independent of the IBD states at SNVs 1, . . . , i − 1. The input to the HMM consists of • The genotypes of A and B at all framework SNVs.
• The allele frequencies at each SNV (estimated from the genotypes).
• Estimates ∆ 1 , . . . , ∆ 9 of the prior probabilities for each condensed identity state that depend only on the known pedigree (∆ 1 is the probability that all four alleles of A and B at a SNV are IBD; etc. [7, Fig. 1]), and an estimate of the transition rate parameter λ, all of which have been previously computed for all the Hutterites individuals in this sample [7].
The output of the HMM is the Viterbi path with the values of the hidden state IBD ∈ {0, 1}, indicating whether the haplotypes are IBD or not.
2. Let r 0 , . . . , r n be the set of recombination points, that is, the set of all segment endpoints, sorted by location, and let m the number of phased children (the chromosome endpoints are included as r 0 , r n ). Order haplotypes so that 2i − 1 and 2i correspond to child i, i = 1, . . . , n.
3. We "color" haplotypes with at most 4 colors representing the four parental haplotypes 1, . . . , 4. That is, we determine an (2m) × m matrix P with integer entries 0 ≤ p i,j ≤ 4 (0 denotes an unknown region not covered by any paternal haplotype): • Start with the initial guess p i,j = j; we identify cliques in each P -column (sets C of rows i that are IBD in the chromosomal region [r j , r j+1 ]) as in Step 3, and within each clique C replace p i,j by min i∈C p i,j . This reduces the number of colors.
• Recode the colors so that 1 is the most abundant, 2 is the second most abundant, and so on.
• For each j = 1, . . . , n, swap color 0 with a color k ≤ 4 in the jth P -column (i.e., update all zero entries p i,j to be k) that minimizes (for j = 1, only the second sum is included; for j = m, only the first sum), if there exists a k for which D(k) < D(0).
• Recalculate cliques in each P -column and swap color 0 with a color k > 0 whenever both belong to the same clique. If multiple k's exist, choose the one that minimizes D(k).

4.
There are three possible assignments of the 4 paternal haplotypes to two parents A and B. Let A i be the fraction of genetic distance of haplotype i covered by A's paternal haplotypes; similarly B i , for i = 1, . . . , 2n. Let We call M the child's separation measure. M ranges between −1 and 1; when M is far from 0, one parental haplotype (color) covers most of a child's haplotype while the other haplotype is mostly covered by a haplotype of the other parent. We choose the assignment that maximizes the minimum separation M := min i |M (i)|.
If M > 0.25, we can assign the parental origin of all child haplotype to either A or B based on M (i)'s signs. The parental origin provides useful information in itself, but also allows us to phase the unphased children next.
If parental origin assignment is successful (M > 0.25), we use an analgous HMM based on genotypes (see Sec. IBD Estimation in a Segment, Genotype vs. Genotype) to identify IBD segments between an unphased child and a phased child. The output of the HMM is the Viterbi path with the values of the hidden state IBD ∈ {0, 1, 2}, indicating whether A and B share 0, 1 or 2 alleles IBD. Segments of length ≥ 1 cM in the Viterbi path (Sec. Single-Locus IBD Estimation, Haplotype vs. Haplotype) with hidden state IBD = 1 are identified; in each segment, we determine which of the two haplotypes of B fits with A's genotype. When different haplotypes fit different sub-segments of an IBD segment, subsegments of length ≥ 0.4 cM are output as the final IBD segments and used to phase the unphased child. Haplotype classification into the two parental groups is essential to patching sub-segments into the correct long-range phasing result.

Step 6: IBD between Proband and Surrogate Parents
Finally, for each individual that are < 95% phased, we search the pedigree for phased surrogate parents [9,15], first within close relatives and gradually considering more distant relatives up to seven meioses away, invoke the HMM to identify IBD segments, and subsequently phase the proband ( Figure S3, Step 6).

IBD Estimation Approach
Our goal is an Identity-by-Descent (IBD) estimation approach that strikes a balance between accuracy and complexity. On one extreme, the paper [15] estimates the IBD of a segment assuming that the IBD state at a SNV is independent of all other SNVs. This single-SNV model is simple and fast, but very crude. On the other extreme lie Hidden Markov Models (HMMs) such as IBDLD [7], which allows the IBD state to change from SNV to SNV and models LD more explicitly.
The key points of our approach are: • We distinguish between two IBD ascertainment scenarios: genotype vs. genotype (GG) (used during phasing and analogous to [1]), and haplotype vs. haplotype (HH) (used to obtain the full IBD dictionary after phasing). In each case, our goal is to calculate P (IBD|data).
• We first remove SNVs with missing data in either of the compared subjects.
• Estimation is performed in a frame, a subset of pruned SNVs whose pairwise LD r 2 < 0.3. Frame genotypes are assumed to be independent. (In practice, the frame with the least amount of missing data is used; IBD segments obtained from different frames can be compared for validation and robustness analysis, but frames cannot be directly combined, as that would violate the independence assumption.) • We first derive single-locus posterior IBD probability. In the HH case, we improve upon Uricchio et al. [15] by taking into account which allele is shared (Sec. Single-Locus IBD Estimation, Haplotype vs. Haplotype). The GG case is identical to Abney's calculation [1].
• To determine whether a prospective segment is IBD, we need the to define the prior IBD probability for every possible combination of IBD states at the frame SNVs. Uricchio et al. assumed a constant state over the frame, which yields a simple formula, yet this is not a valid assumption in either the HH or the GG case. We also model genotype errors here. Standard HMM tools [12] are applied to calculate the posterior IBD at each marker given the segment data in linear time. The final segments comprise of SNVs whose hidden IBD state is the same.
Our IBD estimation methods are based on Bayes' theorem [8]. In this section, the variable IBD refers to the number of alleles shared between the two individuals compared and can be 0, 1 or 2.

Single-Locus IBD Estimation
Consider a single biallelic SNV with alleles 0, 1 whose frequencies are p, q, respectively (frequencies are typically estimated from the sample; however, we neglect finite-sample effects here). We write a ≡ b if the alleles a and b are IBD, and a = b if they are IBS.

.1 Haplotype vs. Haplotype
For a single locus, the term "haplotype" is synonymous with "allele". Let H A , H B be two alleles randomly drawn from subjects A, B, respectively (or randomly drawn from the same subject with replacement). Suppose the observed haplotypes are IBS, namely, where OR : (6) and (7) are identical (a) when the two alleles are equally prevalent (p = 0.5) -by symmetry; (b) when this is the only allele (p = 1), in which case IBS provides no additional information, and P (S|H A = H B ) = P (S) = f ; or (c) when f → 1, i.e., the two haplotypes are always identical regardless of the allele. On the other hand, for p < 0.5, (6) becomes much larger as f → 0, as a rare allele occurring twice in otherwise-unrelated subjects is a strong evidence for IBD. Finally, (6) is smaller than (7) when 1 is the major allele, since IBS is less informative when both alleles tend to be equal in all subjects ( Figure S11).
We define three IBD states: 0,1 and 2. Of the nine possible states S 1 , . . . , S 9 [10, Table 5.3], 1, 7 are sub-cases of IBD = 2, 3, 5, 8 are sub-cases of IBD = 1, and 2, 4, 6, 9 are sub-cases of IBD = 0. In Table S2, second column, the top two dots represent A's alleles and the bottom two represent B's alleles, where the maternal and paternal origins of alleles are ignored. Dots are linked if the corresponding alleles are IBD. For instance, S = 3 indicates that both of A's alleles are IBD, one of B's alleles is IBD with them, while the other is not. The prior probability of each condensed identity state is given by the vector ∆ := (∆ 1 , . . . , ∆ 9 ) T of identity coefficients ∆ i := P (S i ), i = 1, . . . , 9, which is assumed to be known, where T denotes the transpose operator. Note that while the kinship coefficient can be inferred from identity coefficients [10, p. 85], the individual ∆ i 's cannot be derived from f . Similarly, let IBS = 0, 1, 2 be the number of alleles shared by G A , G B . Of the nine possible observed genotype states for G, seven indicate IBS ≥ 1 and two indicate IBS = 0 (Table S2). Our objective is to compute P (IBD ≥ 1|G), expanded to [1, p. 1582], for each genotype pair j : depend on the allele frequencies and are given in Table S2. This a special case of [1, Table 1] with no missing data. For instance, because we can draw a single allele from A's alleles, which determines A's other allele and one of B's. This allele equals 0 probability p. The other B-allele is 1 with probability q. Substituting Table S2 into (9), with M (j) := (2 − j A , 2 − j B ) (see Table S3).
The genotypes corresponding the HH case with shared 0-allele are j = (0, 0), (0, 1), (1, 0), and (1, 1). Figure S12 illustrates that IBS between two genotype subjects (GG) is usually a stronger evidence for IBD than a pair of haplotypes (HH). This happens because we have assumed that the haplotypes are randomly drawn from the subjects, which provides no extra information over genotypes; furthermore, we have implicitly assumed that we do not have genotypes available. Had we used the genotypes, HH and GG would have been identical. In particular, P (IBD|G = (1, 1)) → 1 as p → 1 (two hets with one rare allele each are always IBD for the rare allele) while in the HH case the posterior probability tends to f . On the other hand, all posterior probabilities are always equal for p = 0, 1, because IBS for a very rare allele always implies IBD, while IBS for a very common allele has no effect on our IBD belief. The HH evidence would have been stronger than GG had we also assumed the knowledge of which parent each haplotype was inherited from, and used the corresponding detailed identity coefficient [10, Fig. 5.2] to express P (IBD). Unfortunately, the uncondensed coefficients are not yet available for the Hutterites data set.

IBD Estimation in a Segment
While a single IBS SNV generally provides some evidence of IBD, as evidenced by Figure S12, a sufficiently long, nearly uninterrupted segment of IBS SNVs makes a much stronger evidence. Starting from the set of all SNVs in the chromosome, we remove SNVs in which either subject has a missing genotype, and further restrict ourselves to a frame of nearly-independent SNVs.
Let n be the the frame size; number the SNVs in the frames from 1 to n with genetic positions x 1 , . . . , x n . Let p k,l be the frequency of allele l at SNV k, for l = 1, 2.

.1 Haplotype vs. Haplotype
We consider two haplotypes randomly drawn from A and B.
, and S k = 0 means "not-IBD". Our goal is to calculate the posterior IBD probability at each marker, P (S k = 1|O).
We assume that frame SNVs are independent; thus We approximate the IBD state vector along the genome as Markov [2, p. 924]. Thus To define an HMM, we define the transition probabilities a k i,j := P (S k+1 = j|S k = i). In analogy to [2,7], the transition matrix is an infinitesimal rate matrix model  λ(A, B) in a pair of subjects was available to us for all pairs of Hutterites [7]. In particular, if A and B have children, the genotype of each child C comprises of two randomly drawn haplotypes (generally, two recombinant haplotypes) from the four parent haplotypes. Thus λ can be approximated by the mean λ(C, C) over all children. Since all children share the same inbreeding coefficient f , we further approximate λ to be a function of f only ( Figure S13b). Considering all subjects, we calculate the mean λ over bins of similar f values, obtaining a lookup table for λ(f ) ( Figure S13a, red line), and linearly interpolating to any f -values in between. This table only includes small f values, since spouses are typically at least four meioses apart, but the standard deviation in λ decreases as f increases, and the mean curve is smooth enough to allow a reasonable extrapolation to larger f 's. Note that λ is also a decreasing function of f , as distantly-related spouses result in shorter IBD segments., i.e., a higher reomcbination rate.
Order the possible observed haplotypes as the vector O := ((0, 0), (0, 1), (1, 0), (1, 1)) and hidden states by i = 0, 1. Let b (k) i,j := P (O k = j|S k = i). We assume a two-step Markov process S k → H k → O k . Therefore, the emission probability matrix is the product where are the conditional probabilities of the observed haplotypes given the true ones, is a genotype error model that assumes errors occur in the alleles independently of each other, δ t,s is Kronecker's delta, and ε = 0.01 is the expected genotype error rate. The initial state distribution vector is the unconditional probability of IBD, f . Note that the state transition from SNV k to k + 1 is a function of the genetic distance between the markers, while the emission at SNV k depends on the allele frequencies thereof. Notwithstanding, the same forward-back algorithm [12] can still be applied to calculate The values {γ k (1)} n k=1 are the desired IBD posterior at the markers. In practice, deciding on IBD via thresholding the posterior frequently breaks a longer IBD segments into shorter streteches, if the posterior fluctuates about the threshold. Thus, we instead use the Viterbi path (computed using the Viterbi algorithm [12]) to detect IBD segments. The path seems to be more robust to changes in λ and does not require a threshold. However, within the IBD segment, we quantify the local certainty of IBD at SNV k by γ k (1), which is subsequently used in deciding which segment is used for phasing and imputation.

.2 Genotype vs. Genotype
We define the analogous HMM to the HH case, which is similar to IBDLD without LD modeling [7]. Here O l k , H l k have three possible values, 0, 1 or 2, and S k is the condensed identity state.
Uricchio et al. [15] assumed a constant state S over the entire frame to simplify the posterior estimation. However, it is important to allow local state transitions, because S k may change even within a true (IBD ≥ 1)-segment in the presence of recombinations, as illustrated by Figure S14.
While this cannot occur in the HH case, the joint evidence provided by all SNVs is always more accurate than single SNVs and will likely produce better IBD estimation (at an extra cost, of course).
The initial state distribution is ∆ := (∆ 1 , . . . , ∆ 9 ). The transition matrix is again defined by the infinitesimal transition rate stationary distribution model [7] A(x) = ∆1 T + e −λx ( where 1 = (1, . . . , 1) 9 T and I n is the n × n identity matrix. The parameter λ ≥ 0 is specific to the subjects A, B. We use the value 0.4λ, where λ is the quantity computed for the Hutterites subjects by Abney et al. as the best-fit in Monte-Carlo gene dropping simulations [7]. This seems to reduce undesired state transitions for short SNV stretches within true IBD segments between close relatives. The emission probabilities are given by (16), where now {b i,j } i,j are the values in Table S2; the error model is still (18b), but here j l , i l range over 0, 1, 2 for each l = A, B, and {h s,t } s,t are given by Table S4 [7, Table S2].
The Viterbi path is used as the indicator of IBD (i.e., if the state is 1,3,5,7, or 8, we infer that the SNV is IBD ≥ 1, otherwise IBD = 0). Once an IBD segment is determined, we do quantify the local IBD certainty at each SNV by which is computed by the forward-backward algorithm (19). Figure S15 illustrates both indicators.

IBD Cliques at a SNV
After phasing, we apply the HMM of Sec. IBD Estimation in a Segment, Haplotype vs. Haplotype to each pair of haplotypes among the 1415 Hutterites and identify pairwise IBD segments. We then organize segments in an IBD segment index data structure, which consists of a set of IBD cliques at each SNV. In this section we describe a method for defining IBD cliques from pairwise IBD segment at a certain SNV k.
We build a weighted, undirected pairwise IBD graph G = (N , E, w), where N is the set of n = 2830 haplotypes and m := |E| edges. An edge (u, v) ∈ E exists if and only if haplotypes u and v are IBD at the SNV. The weight function w : E → [0, 1] is the posterior HMM IBD probability γ k (1) (19c).
Because IBD is a transitive relation, G must be a union of disjoint cliques (i.e., fully connected sub-graphs), one for each ancestral haplotype present in the population. In practice, G is a perturbation of a clique union due to very low HMM certainty near segment ends and genotyping errors, and we would like to recover a "reasonable" set of cliques from it. Cluster editing methods (see for instance [13], but the literature is quite rich) solve the problem of finding the minimum number of edges (or total edge weight) that need to be added or removed to transform G to a clique union. This is an NPhard problem, and practical heuristic-based algorithms run in superlinear time (O(m 2 ) or slower). We choose a different heuristic inspired by the graph algebraic multigrid literature [14,11,3] that resulted in good imputation cross-validation accuracy and requires only O(m) time and storage. in step 3. We chose them based on specific SNV data in which we manually identified the cliques from the pedigree structure, and our ultimate performance measure was the imputation cross-validation accuracy. In general, the threshold should depend on the level of smoothness of the test vectors (i.e., on K and ν) but should apply to the entire class of IBD graphs that have a two-scale structure.

Algorithm
Step

Variants with rs Numbers
Variants without rs Numbers Figure S1: Generalized Mendelian error rate (the total number of discordances divided by the total number of IBD2 segments) vs. call rate (fraction of called genotypes among the 98 whole genome sequences) for non-singleton variants, shown by variant type (SNVs, insertions, deletions) and novelty (with or without rs numbers). Figure S2: Distribution of variants present in 98 Hutterite genome sequences by frequency and functional class. Top: all variants. Bottom: variants in coding regions. Figure S3: Phasing algorithm. First, we phase homozygous SNVs (1) and SNVs that are homozygous in an individual's parent (2). In trios with phased parents, we phase more SNVs using parent-child IBD segments (3). In large families with genotyped parents, we compare offspring haplotypes to obtain child-child segments and phase any unphased sibling. For the remaining subjects, we use a Hidden Markov Model (HMM) to identify IBD segments between the individual's genotype and a surrogate parent's haplotype (5-6). Figure S4: Phasing Step 3. Given a father P who is phased at most heterozygous SNVs, call his partially-known haplotypes P 1 and P 2 and C 1 , . . . , C 4 be the partially known paternal haplotypes of his four children. We first determine IBD segments between P and each child using long IBS stretches (1), identify child recombination sites (vertical dashed blue lines) (2), and determine which set of children is IBD with P 1 and P 2 in each interval between recombination sites (3). Finally, at each SNV, we apply a majority vote (4) within each IBD haplotype clique to phase unphased members of the clique (C 2 and C 4 in this case) and correct errors in others (such as in C 3 ). Figure S5: Untyped ancestor imputation. (1) Using IBD segments discovered among quasi-founder siblings, we color their haplotypes with at most four colors representing the four parental haplotypes, so that the number of recombinations is minimized. (2) Of the three possible assignments of color pairs to parents, we pick the one that yields the clearest separation of offspring haplotypes into paternal and maternal. This induces the consistent ordering of offspring haplotypes ("PO phase alignment") as paternal first, maternal second, for example (3). (4) Segments of the same color are patched to reconstruct each parental haplotype. (5) Parent gender is determined by the PO clique method ( Figure 9). Finally, the same methodology can be applied iteratively to impute members of earlier generations (6). 1000Genomes EUR subjects. Figure S9: IMPUTE2-Pedigree-based concordance. After filtering variants by the criterion het concordance ≥ 98% and MAF ≥ 1.5%, the LD-based imputation call rate is 56% out of the remaining 23% not imputed by PRIMAL, providing a combined call rate of 89% at the same accuracy of the pedigree-based method (≥ 99%).  Figure S14: An IBS segment that is partially IBD in the paternal haplotype and partially IBD in the maternal haplotype, causing a transitions in the identity state. A line indicates that the corresponding event occurs in that SNV range. Figure S15: HMM results for a pair of Hutterite subjects. Only states i with significant γ · (i) values are included. The Viterbi path is recoded from states 1, 2, . . . , 9 to the values 1/9, 2/9, . . . , 1, so that it is more clearly overlayed over the same range. Figure S16: Scatter plots of relationship between concordance or call rates and empirical kinship metrics are shown. Panels A and C show the call rates for the 1317 imputed Hutterites and two measures of relatedness with the 98 sequenced individuals: average empirical (from observed IBD sharing) kinship (in panel A) and average of the closest five (with respect to kinship) sequenced subjects (in panel C). Panels B and D show the concordance rates for the 14 subjects who were validated using sequencing with an independent platform. As expected, Hutterites who are more related to the sequenced individuals tend to have higher call rates and better imputation accuracy.