Leveraging Identity-by-Descent for Accurate Genotype Inference in Family Sequencing Data

Sequencing family DNA samples provides an attractive alternative to population based designs to identify rare variants associated with human disease due to the enrichment of causal variants in pedigrees. Previous studies showed that genotype calling accuracy can be improved by modeling family relatedness compared to standard calling algorithms. Current family-based variant calling methods use sequencing data on single variants and ignore the identity-by-descent (IBD) sharing along the genome. In this study we describe a new computational framework to accurately estimate the IBD sharing from the sequencing data, and to utilize the inferred IBD among family members to jointly call genotypes in pedigrees. Through simulations and application to real data, we showed that IBD can be reliably estimated across the genome, even at very low coverage (e.g. 2X), and genotype accuracy can be dramatically improved. Moreover, the improvement is more pronounced for variants with low frequencies, especially at low to intermediate coverage (e.g. 10X to 20X), making our approach effective in studying rare variants in cost-effective whole genome sequencing in pedigrees. We hope that our tool is useful to the research community for identifying rare variants for human disease through family-based sequencing.


Author Summary
To identify disease variants that occur less frequently in population, sequencing families in which multiple individuals are affected is more powerful due to the enrichment of causal variants. An important step in such studies is to infer individual genotypes from sequencing data. Existing methods do not utilize full familial transmission information and therefore result in reduced accuracy of inferred genotypes. In this study we describe a new Introduction DNA sequencing is being routinely carried out to identify genetic factors, rare variants in particular, associated with human disease. It has been successful in identifying causal variants for Mendelian disease [1,2], and continues to be a powerful approach to uncovering the genetic basis of rare disease [3]. For complex traits, however, detecting rare variant associations is challenging due to reduced power of statistical tests when the allele frequency is low [4,5]. Although large-scale sequencing of unrelated individuals has identified associated rare variants for some complex traits, such as lipid traits [6], this approach often revealed greater challenges in finding causal genes for complex traits [7,8]. Family sequencing provides a promising alternative for identifying rare variant associations due to the enrichment of causal variants in pedigrees. Recent studies demonstrated the effectiveness of sequencing families and identified associated rare variants for a variety of traits, including schizophrenia [9], Alzheimer [10], and hypertriglyceridemia [11]. These lines of evidence show that family studies are emerging as a powerful approach in the sequencing era to localize genetic factors for human disease, and will play a key role as a complementary approach to the population based design to help understand the genetic basis of complex traits.
A critical step in genetic analysis of family sequence data is to infer genotypes of individuals in pedigrees. For next-generation sequencing data, this is challenging due to base call error, alignment artifacts, possible allele dropout during library preparation and sequencing, especially at low coverage, among others. Although variant calling algorithms developed for unrelated individuals can be applied to family sequencing data, the accuracy is compromised due to the ignorance of family relatedness. Family-aware calling algorithms, e.g. Polymutt [12] and Fam-Seq [13], have improved accuracy over the standard methods but assume the same pedigree correlation structure for all sites and therefore ignore the actual identity-by-descent (IBD) sharing. For example, in a nuclear family with two siblings and their parents, the IBD sharing between the two siblings can be 0, 1 or 2 at a particular genomic region, with the probabilities being 0.25, 0.5 and 0.25 respectively, a priori. Polymutt and FamSeq assume such a priori probabilities for all genomics regions and thus inefficiently model the data when the actual IBD can be inferred. As a concrete example, assuming we know that at a particular position the two siblings share 2 alleles IBD, then their genotypes are identical at this locus and can be inferred with improved accuracy by merging their sequencing data, essentially doubling the sequencing depth. In general, knowledge of IBD sharing helps confine genotypes to be compatible to the IBD patterns in pedigrees and a variant calling framework that models IBD is expected to deliver improved performance over existing methods. Such a framework makes it feasible to design studies with reduced coverage, since data from shared haplotypes in a pedigree are efficiently combined to make reliable genotype calls. This will be particularly beneficial for whole genome sequencing, which is still prohibitively expensive for large-scale sequencing studies. Linkagedisequilibrium (LD)-based methods such as Beagle4 [14], Thunder [15] and SHAPEIT [16] have been extensively used for inferring genotypes from low-depth sequencing utilizing extensive LD among variants. Due to much reduced LD (in terms of r 2 ) among rare variants as well as between rare and common variants, however, LD-based methods are expected to have reduced accuracy for rare than for common variants.
In this study, we develop and implement a variant calling framework that infers the IBD sharing (through the inheritance vector, see Methods) among family members directly from sequencing data, and utilizes the IBD sharing to jointly infer individual genotypes. The new software, Polymutt2, provides a complementary tool to our prior work (Polymutt) with improved performance for small to moderate pedigrees. By directly modeling the sequencing data, the IBD can be reliably inferred, even for extremely low coverage (e.g. 2X or below), making it a robust tool for sequencing data. In addition to unphased genotypes, when parental data are available haplotypes can be directly constructed from the sequencing data based on the best inferred IBD sharing with little compromise of accuracy compared to unphased genotypes obtained by incorporating the uncertainties of IBD inference. Mendelian error is extremely rare for unphased genotypes, and is completely eliminated for haplotype calls. Through both simulations and real data, we show that Polymutt2 significantly outperforms other tools, including GATK, Beagle4 and Polymutt, for genotype calling on pedigree data, especially for rare variants in low coverage data.

Methods
The input to Polymutt2 is a variant calling format (VCF) [17] file, which contains candidate variant sites and genotype likelihood (GL) values, defined as the probability of observing the reads given a specific underlying genotype (see [12,18] for details). For family members that were not sequenced, all GL values are set to 1 for all underlying genotypes. Standard approaches to variant calling are likelihood-based methods, both in unrelated individuals [19][20][21] and pedigrees [12,13]. All these methods calculate GL values and call an initial set of variant sites and individual genotypes. Our framework first infers IBD sharing in pedigrees along the genome based on the GL values, and then uses the inferred IBD sharing to assess variant quality, to refine individual genotypes, and to generate haplotypes along the genome. In the current implementation, we assume that all variants are bi-allelic and that the two alleles are known.

Inference of the distribution of inheritance vectors
Suppose we have a pedigree with f founders and n non-founders, with sequencing data on M variants across a chromosome. Without loss of generality, we also arrange the pedigree such that the first f members in the pedigree are founders. Define a binary inheritance vector [22] at variant j as I j = (p 1 ,m 1 ,. . .,p n ,m n ) for the n non-founders in this pedigree. Each of the entries describes the transmission of the paternal (p i ) or the maternal (m i ) allele, with 0 (or 1) indicating the grand-paternal (or the grand-maternal) allele being transmitted. Therefore an inheritance vector completely determines which of the 2f founder alleles were inherited by each nonfounder. There are N = 2 n possibility inheritance vectors and let v k , k = 1,..,N, denote individual vectors. Let R denote data of all members in a family across all M variants, R j be all reads at variant j, and R ij be the reads in family member i at variant j. Similarly, let G j denote the vector of genotypes at variant j and its ith entry G ij = (A 1 , A 2 ) be the ordered genotype of the ith member in the pedigree, where A 1 and A 2 represent the paternally and maternally transmitted alleles respectively. Assuming that recombination events are independent between all chromosome intervals, i.e. no crossover interference, the likelihood can be framed as a Hidden Markov Model [23], similarly as that in the Lander-Green algorithm [22]. Specifically, the likelihood of reads across all M variants in the pedigree can be calculated as The initial probability, P(I 1 ), is assumed uniform across all N = 2 n inheritance vectors. The transition between adjacent inheritance vectors, P(I j |I j-1 ), is calculated according to the recombination rate between the jth and j-1th variant, which can be calculated using the HapMap Phase II [24] genetic map. For variants not in the HapMap genetic map linear extrapolation will be used to approximate the genetic distance. The emission probability, P(R j |I j ), which is the probability of reads in all family members at locus j given the inheritance vector I j , can be calculated as Here we assume as in other methods that the sequencing reads depend only on the underlying genotype so that P(R j |G j ) can be factorized into the product of individual genotype likelihoods.
Since an inheritance vector specifies precisely how the alleles were transmitted from founders to non-founders, the genotypes of the entire pedigree are determined by the ordered founder genotypes when the inheritance vector is known. Therefore the emission probability involves the summation of only the ordered founder genotypes, whose prior probabilities P(G ij ) can be either obtained from the external sources, e.g. the 1000 Genome Project [25], or estimated based on the pedigree data using for example Polymutt [12], assuming Hardy-Weinberg equilibrium.
Since the inheritance vectors usually cannot be determined unambiguously, the goal here is to infer the posterior distribution of the inheritance vectors at each variant using the sequencing data from all M variants; that is, we aim to calculate P(I j |R). This can be achieved efficiently using the forward-backward procedure in HMM [23]. Let α j (k) denote the forward variable at variant j for v k and β j (k) be the corresponding backward variable [23]. Then the posterior probability of v k at variant j is From the marginal distribution, the inheritance vector with the maximum posterior probability at variant j, denoted as I j marg , can be used to represent the inferred inheritance vector for each variant; the IBD sharing can be directly derived from I j marg for any pair of family members.
However, since I j marg only maximizes the likelihood marginally at variant j, we infer a global optimal path of inheritance vectors along the genome through the Viterbi algorithm [23]; we use I j best to denote the optimal inheritance vector at variant j.

Selection of variants to construct the genetic map
The Lander-Green algorithm requires that variants are independent, i.e. not in linkage disequilibrium (LD). For sequencing data variants are usually correlated. Since only a limited number of recombination events are expected in a pedigree, it is neither feasible nor necessary to use all data. We built a companion tool to automatically select a subset of independent variants by LD pruning, a similar approach used in Plink and others [26,27]. In addition, we filtered variants in genomic regions that are prone to alignment artifacts, including segmental duplications, simple repeats and low complexity regions, and 50bp up and downstream of known insertions and deletions; these data were downloaded from UCSC genome browser (http://www.genome. ucsc.edu) and the 1000 Genomes Project. The final set of selected variants is used to contrast a sparse genetic map with high-quality variants for the inference of inheritance vectors, and the genetic distances of these variants are linearly extrapolated based on the HapMap Phase II genetic map [24]. The overall strategy is to build a sparse scaffold of inheritance vectors along the genome using the selected set of variants and utilize the scaffold to boost the variant calling accuracy for all variants. We use the term "scaffold variants" to refer to the sparse set of variants in the map file used to construct the inheritance vectors.

Variant calling-Refining variant sites
We evaluate for each variant the evidence supporting the alternative allele in the data by calculating the posterior probability of being polymorphic. Specifically for each variant we calculate two probabilities, P(poly|R j ,R) and P(mono|R j ,R), representing the likelihood of polymorphism and monomorphism respectively given the data at the jth site and scaffold variants. We assume that for poly the two alleles are A ref and A alt and for mono only A ref is present in the data. The posterior probability of polymorphism given the data is calculated as The term P(R j |I j, poly) is calculated based on Eq (1), and P(R j |I j, mono) is simply the product of genotype likelihoods of homozygous reference allele across all family members at variant j. The prior probability of polymorphism, P(poly), is calculated as in Polymutt [12] and P (mono) = 1-P(poly). Briefly, in a sample with N founders in the absence of natural selection, according to coalescent theory the prior probability that a site includes non-reference alleles is y where θ is the population scaled mutation rate per site and is set to 1/1000 in this study. When variant j is one of the scaffold variants, P(I j | R j , R) = P(I j | R), which was obtained in (2). Then the Phred-scaled variant quality is calculated as VQ = -10 Ã log 10 (1-P(poly|R j ,R)). By construction, only a sparse subset of variants is included in the map file, and the vast majority of variants are located in the intervals of scaffold variants. In an interval within which a crossover occurred, the inheritance vectors on the two sides of the recombination point are different. Assigning wrong inheritance vectors to variants will not only produce wrong IBD sharing among family members but also greatly reduce variant calling accuracy. However it is unknown a priori in which intervals crossovers occurred and where exactly the breakpoint is if a crossover occurred. To address this issue, Polymutt2 calculates for each variant in scaffold intervals the posterior probabilities using the inheritance vectors on the left and right boundary separately, and takes the maximum value, P max (poly| R j ,R), as the posterior probability of polymorphism. The Phred-scaled variant quality is calculated as VQ = -10 Ã log 10 (1-P max (poly|R j , R)). The inheritance vectors are accordingly assigned to each of the variants in scaffold intervals. This assumes that there is at most one crossover event in any interval, which is reasonable given the limited number of expected recombination events per generation. As a result, the crossovers can be precisely located in intervals in which crossovers occurred.

Variant calling-Inferring genotypes
After quantifying the variant quality as described above, the most likely inheritance vectors are assigned to each of the variants. The posterior probability of the genotypes for individual i for variant j can be calculated as For a specific genotype G ij = g, the term P(G ij ,R j |I j ) can be calculated using Eq (1) by considering only the terms where G ij = g. For variants in intervals of scaffold variants, P(I j |R j ,R) was obtained in calculating the variant quality as described in the previous section. The genotype with the maximum posterior probability P max (G ij |R j ,R) is assigned to the individual, and the corresponding genotype quality is calculated as GQ = -10log 10 (1-P max (G ij |R j ,R)).
Since the calculation is repeated for all individuals in the pedigree, the computation can be intensive for larger pedigrees. One remedy is to use inheritance vectors with the largest posterior probabilities in the calculation. Specifically, the top inheritance vectors with cumulative probabilities greater than a cutoff, e.g. 0.99, can be used in (3). At the extreme, a single best inheritance vector, I j best or I j marg , can used to minimize the computation. Given the high accuracy of the inheritance vector inference (see Results), the increase of speed greatly outweighs the negligible loss of accuracy.

Variant calling-Inferring haplotypes
When parental data available, we generate haplotypes along a chromosome by reconstructing the optimal ordered genotypes jointly for all family members at each position assuming that the inheritance vector is known. We use I j best as the optimal inheritance vector for variant j inferred using the Viterbi algorithm as we described before. The posterior probability of each configuration of ordered genotypes at variant j given the sequencing data and I j best is calculated as Here P(R ij |G ij ) is the genotype likelihood calculated before, and the term PðG founder j Þ since given an inheritance vector the probability depends only on the ordered genotypes of founders. The terms PðG founder j Þ and PðR j jI best j Þ were calculated in Eq (1). The goal is to obtain the posterior probability of each configuration of founder ordered genotypes, and assign the configuration with the maximum posterior to founders as well as nonfounders according to I j best . This is repeated for all positions and the haplotypes are automatically constructed by stitching the paternal and maternal alleles at each position along a chromosome. By construction, Mendelian error in haplotype calling is completely eliminated due to the Mendelian transmission dictated by the inheritance vector.
Note that the construction of haplotypes is based on the transmission of alleles from parents to offspring. When parental data are not available, such as in sibships, it is not possible to deduce the parental origin of the alleles and therefore haplotypes cannot be reconstructed.

Simulations
We utilized the 1000 Genome Project [25] data to effectively capture the sequencing and mapping error. We generated the founders' genomes by randomly selecting the CEU phased genotypes (March 2012 Phase 1 release). For non-founders, we simulated cross-overs in the parental haplotypes based on the genetic map in the Phase II HapMap data, and then generated offspring genotypes by randomly selecting one haplotype from each parent. To simulate realistic reads, we first generated paired-end 100bp fragments according to Poisson distribution on the genome, with the insert size following a Gaussian distribution with a mean of 400bp and a standard deviation of 50bp, and then simulated reads based on these fragments assuming a sequencing error rate of 0.01 per base. We used BWA [28] to align simulated reads to the reference of hg19 and carried out standard procedures for variant calling, including Indel-realignment and base quality recalibration using GATK and duplication removal using Picard (http:// picard.sourceforge.net). The list of known Indels from the 1000 Genomes Project was provided to GATK for Indel re-alignment. We used GATK UnifiedGenotyper to infer variants and genotypes from sequencing. We then applied Polymutt, Polymutt2 and Beagle4 on the GATK-generated VCF files to refine the genotypes utilizing the GL values calculated by GATK and stored in the VCF file.
Pedigrees we investigated in this study include sibships of size 2 (Sib2), 4 (Sib4) and 6 (Sib6), nuclear families with 4 (Nuc4) and 6 (Nuc6) members, and an extended pedigree with 10 individuals, which is the same as the pedigree investigated in Polymutt [12]. For each pedigree structure, we simulated 20 families at coverage ranging from 2X to 30X. For the Nuc6 we simulated additional 50 and 100 pedigrees to investigate the trend of the genotype calling accuracy of rare variants for increasing numbers of sequenced families. Genotype calling was performed using GATK, Polymutt, Polymutt2 and Beagle4 for each simulated dataset. Note that for trios Polymutt2 and Polymutt are equivalent, and therefore we omitted the investigation of trios in this study.

Performance evaluation metrics
We used two metrics to measure the accuracy of genotype calling. The first is the false negative rate (FNR), defined as the percentage of true genotypes that are called into incorrect genotypes; this measures the sensitivity of the calling and is equal to 1-sensitivity. The second metric is the false discovery rate (FDR), defined as the percentage of called genotypes that are different from the true genotypes; this measures the specificity of the calling; this measures the specificity of the calling and corresponds to 1-precision. A good algorithm is expected to have low values of both FNR and FDR. We used GQ to filter low quality genotype calls and specifically we used GQ = 3 for Polymutt2 and Polymutt, GQ = 5 for Beagle4 and GQ = 10 for GATK; due to different calculations of GQ in these algorithms we found that these filtering criteria have reasonable FNR and FDR values. These criteria were used for all simulated data. Note that when no filtering is used FNR and FDR are the same for overall genotypes, and the difference is due to differential filtering based on GQ cutoffs. For heterozygotes, which are of particular interest in studying rare variants, however, both FNR and FDR are critical metrics to evaluate, as FNR can be made artificially low by aggressive calling of heterozygotes, which will results in high FDR, and conversely conservative calling of heterozygotes can lead to low FDR and high FNR.

Accuracy of inferred inheritance vectors
We derived the IBD sharing at each position between a pair of family members in a pedigree based on the inferred inheritance vector obtained via the Viterbi algorithm. Fig 1A) shows the simulated true IBD sharing of the two siblings in the Nuc4 pedigree along chromosome 1 and Fig 1B and 1C and 1D) show the inferred IBD sharing of the same two siblings at coverage of 30X, 15X and 2X, respectively. From the comparison, we can see that the inferred IBD is extremely close to the true IBD at various coverage, indicating the high accuracy of the inference of inheritance vectors based on sequencing data. Interestingly, for coverage as low as 2X, the accuracy of inferred IBD is not jeopardized (Fig 1D). The high accuracy of inheritance vector inference warrants the increased accuracy of genotype calling when the IBD sharing is utilized to infer genotypes. Genotype calling accuracy-Overall ) on overall genotypes for various pedigrees and sequencing coverage. Polymutt2 significantly outperforms Polymutt and GATK, in terms of both FNR and FDR, and the advantages are more pronounced when pedigree members are more related or coverage is low (Fig 2). For example, at 10X, the FNR values for Polymutt2 for sibships of size 2, 4 and 6 are 0.71%, 0.32% and 0.17%, respectively, while the FNR values for GATK are similar across all pedigree types with a mean value of 1.65% (Fig 2A). The FDR follows the same patterns (Fig 2D). On the other hand, the relative performance of Polymutt2 vs. Beagle4 depends on pedigree types, and for pedigrees with limited IBD sharing Beagle4 outperformed Polymutt2. For example for Sib2 Beagle4 calls have smaller FNR and FDR for all sequencing coverage investigated (Fig 2). For pedigrees with increased IBD sharing, Polymutt2 has either comparable (e.g. for Sib4, Nuc4 and Ext10) or better (e.g. for Nuc6 and Sib6) genotype calling accuracy, and the advantage of Polymutt2 over Beagle4 becomes more manifest with increasing IBD sharing in pedigrees such as Sib6 and Nuc6 (Fig 2). If we compare callers without Polymutt2, Beagle4 consistently outperformed GATK and Polymutt in terms of both FNR and FDR for all pedigrees and sequencing coverage (Fig 2). It is worth noting that although Beagle4 does not explicitly model family inheritance the algorithm is able to leverage the IBD sharing implicitly so that the genotype accuracy is improved for pedigrees with more IBD sharing. For example, the error rates of Beagle4 calls in Nuc6 are lower than those in Nuc4 calls (Fig 2). For all algorithms it is clear that sequencing coverage is the key factor influencing the calling accuracy (Fig 2), and for coverage of 30X the genotype calls are accurate to an extent that the differences among all callers become noncritical (Fig 2C and 2F). In the following sections we only presented results on 10X and 20X simulated data representing intermediate sequencing coverage to investigate the gain of explicit modeling of IBD sharing for genotype calling in such settings.

Genotype calling accuracy-Heterozygotes
We next investigated the accuracy of the heterozygous genotypes, which are of particular interest for rare variants. S1 Fig shows the error rates for various pedigrees at different coverage. Consistent with the accuracy of overall genotypes (Fig 2), Polymutt2 and Beagle4 dramatically reduce error rates across all pedigrees and coverage, compared to both GATK and Polymutt, and the reduction is more dramatic when more related individuals are sequenced (S1 Fig). For example, the FNR at 10X is 1% and 0.8% for Polymutt2 and Beagle4 respectively, and is increased to 1.8% for Polymutt and 4.4% for GATK (S1A Fig). The same magnitudes were observed for FDR at 10X (S1D Fig). Consistent with the overall genotypes, both Polymutt2 and Beagle4 achieved better accuracy for pedigrees with more IBD sharing (S1 Fig). Polymutt2 outperformed Beagle4 on pedigrees of Sib6 and Nuc6 due to explicit modeling of the extensive IBD sharing in such pedigrees (S1 Fig).

Genotype calling accuracy-Heterozygotes of rare variants by allele frequency
The major interest of sequencing studies, especially in family designs, is to identify rare variants associated with disease. Accurate heterozygote calling is of particular importance due to the challenges associated with rare variant inference from sequencing. We specifically investigated the heterozygote accuracy across different bins of alternative allele frequencies, in the range of Polymutt2 has lowest error rates in terms of both FNR and FDR across all bins for variants with frequency below 0.1 (Fig 3 and S2 Fig). For Sib2, which is the simplest pedigree with limited IBD sharing, although Beagle4 achieved better accuracy on overall genotypes and heterozygotes (Fig 2 and S1 Fig), Polymutt2 outperformed Beagle4 for variants with frequencies below 0.05 (Fig 3A and S2 Fig). Consistent with overall genotypes, the advantage of Polymutt2 increases for pedigrees with more IBD sharing (Fig 3 and S2 Fig).
To investigate the effect of increasing numbers of sequenced families on rare variant calling, we simulated additional 50 and 100 Nuc6 families at 10X coverage and carried out genotype calling for both Polymutt2 and Beagle4. It is evident that the accuracy of Beagle4 heterozygous calls improves with increasing numbers of families for variants with MAF<0.02 (Fig 4). The improvements, however, do not seem to be dramatic, probably due to the limited LD among rare variants even for data with 100 families. In comparison, Polymutt2 achieved superior accuracy than Beagle4 for heterozygotes with MAF<0.02, for both FNR and FDR (Fig 4), indicating the advantages of Polymutt2 over Beagle4 for calling rare variants.

Genotype calling accuracy-Phased genotypes
The genotype accuracy of phased genotypes (haplotypes) is similar to the unphased genotypes, although on average the error rates are slightly higher for phased genotypes. For example, for Nuc4 pedigrees at 15X, the FNR of overall genotypes is 0.30% for phased genotypes and is  0.28% for unphased genotypes; the corresponding FDR is 0.09% and 0.10% respectively. For heterozygotes, the FNR is 0.53% and 0.51% for phased and unphased genotypes respectively, with corresponding FDR values being 0.10% and 0.08%.

Mendelian inconsistency
We calculated the Mendelian inconsistency (MI) rate as the percentage of parent-offspring trios in which the genotypes violate the Mendelian transmission law. Pedigrees were divided into individual trios for the calculation. We used the minimum GQ of the genotypes in a trio as the filtering criteria to calculate MI rates on relatively high quality genotype calls. When either GQ 5 or 10 was used, both GATK and Beagle4 calls showed considerable Mendelian inconsistencies across various sequencing coverage (Fig 5). For example, at minimum GQ of 5, the MI rate is 0.76% for GATK at 10X, and 0.15% when coverage was increased to 20X. Although Bea-gle4 achieved reduced MI rates than GATK, there are still noticeable Mendelian inconsistencies in Beagle4 calls (Fig 5). When the minimum GQ of 10 was used there are still appreciable Mendelian inconsistencies in both GATK and Beagle4 calls (Fig 5). On the other hand, the MI rates for Polymutt were extremely low, e.g. <10 -6 for all scenarios shown above, consistent with the previous report [12]. Strikingly, no Mendelian inconsistencies were observed in Polymutt2 calls in the same settings.

Application to real data
We downloaded the whole genome sequencing data in CEPH pedigree 1463 generated on the Illumina HiSeq platform (http://www.illumina.com/platinumgenomes/). We selected a Nuc6 sub-pedigree for the analysis, which consists of four siblings (NA12879, NA12880, NA12881 and NA12882) and their parents (NA12877 and NA12878). The sequencing coverage of these samples is~50X. We followed the best-practice procedure for variant calling as we did for simulated data. Since there is only a single family with a few individuals, we used the 1000 Genomes Project reference panel when running Beagle4 (downloaded from Beagle4 website) on this pedigree to leverage the extensive LD in the panel. To have a fair comparison of Polymutt2 with Beagle4, we ran Polymutt2 using the allele frequencies derived from the same reference panel. Based on simulation results it is clear that at high coverage over 30X the accuracy measures of all callers are satisfactory. Our major goal here is to investigate to what extent genotype calls from various callers with a subset of data can recover the original high depth sequencing data. We first created a gold-standard callset from the original high-depth data by taking the consensus of genotype calls from GATK, Polymutt, Polymutt2 and Beagle4; this call set contains genotypes that are agreed by all 4 callers. GATK and Polymutt infer allele frequencies from the sequence data only, and due to the small sample size of the pedigree the estimates are not reliable. Here we focused only on the comparison of Polymutt2 and Beagle4, two competing methods based on simulated data. Specifically, we randomly extracted 30% and 15% of the reads from the original alignment, corresponding to~15X and~7.5X of coverage, and carried out variant calling using both Polymutt2 and Beagle4. For each of the two callers we calculated FNR and FDR using the gold-standard callset. We also compared their performance stratified by allele frequencies, which were calculated based on the same reference panel used in Beagle4.
Since genotype filtering has a strong impact on FNR and FDR, e.g. aggressive filtering results in low FDR and high FNR and vice versa, we calculated the two measurements using GQ values from 3 to 30 and plotted them in FNR-FDR curves to represent genotype accuracy with a wide range of filtering. This is an objective way of comparing genotype accuracy and a curve completely underneath the other indicates consistent high accuracy of genotype calls for all GQ cutoffs in the range of 3 to 30.
First we evaluated the inference of IBD using the full and partial data. Fig 6A shows the IBD of NA12879 and NA12889 across chromosome 1 using full data, and Fig 6B shows the corresponding IBD when 30% data were used. The IBD sharing is very similar using full and partial data with only a few discrepancies (Fig 6A and 6B), indicating the robustness of the inheritance vector inference.
For overall genotypes with 30% of the data (~15X), Polymutt2 calls achieve greater concordance with the gold standard callset than Beagle4, as manifested by the reduced error rates in the FNR-FDR curves (Fig 7A). When we focused on variants with low frequencies, the advantage of Polymutt2 over Beagle4 is more pronounced (Fig 7B and 7C). For example, with allele frequency <0.1, the FNR-FDR curve of Polymutt2 is more separated from that of Beagle4, and with allele frequency <0.05 we observe further decreasing error rates in Polymutt2 calls than in Beagle4 calls. Interestingly, when 15% of data (~7.5X) were used, Beagle4 calls have better overall accuracy than Polymutt2 (Fig 7D), probably due to the increased contribution of LD relative to sequencing data on the genotype calls. However, when we focused on low frequencies variants with allele frequency <0.1 and <0.05, Polymutt2 still greatly outperformed Beagle4 ( Fig  7E and 7F).
When we focused on heterozygotes, Polymutt2 and Beagle4 calls have similar accuracy when all variants were considered with both 30% and 15% of the data (S3A and S3D Fig).  We observed considerable MI rates in Beagle4 calls with both~15X and~7.5X data. For example, at~15X the MI rates are 0.14% and 0.09% when the minimum GQ was set to 5 and 10, respectively. The corresponding MI rates at~7.5X are 0.15% and 0.1%. When we focused on low frequency variants, the MI rates in Beagle4 calls are noticeably increased. For example, for variants with allele frequency <0.1 at~15X, the MI rates are 0.27% and 0.21% for GQ cutoffs of 5 and 10, respectively, indicating that increased genotype error rates associated with low allele frequencies in Beagle4 calls resulted in higher MI rates. On the other hand, as a direct comparison, we did not observe Mendelian error in Polymutt2 calls in all of these scenarios investigated, indicating the extremely low Mendelian error rate in Polymutt2 calling in real data.

Discussion
Sequencing pedigrees has shown its effectiveness in identifying rare variants associated with human disease, and is expected to continue in gene mapping for complex traits in complement to population-based designs. In addition, family designs are not prone to population stratification, which may be more challenging to control for rare variants [29]. In this study we developed a new tool, Polymutt2, for accurate inference of inheritance vectors and genotype calling for pedigree sequencing data. Through both simulations and application to real data, the new tool achieves markedly improvement of genotype calling accuracy compared to the standard method (GATK) and a family-aware algorithm (Polymutt), as well as an LD-based caller (Bea-gle4), especially for low frequency variants. The advantages are mainly due to the explicit modeling of the IBD among family members and then the incorporation of the IBD information in genotype calling. This framework efficiently utilizes the relatedness by combining sequencing data from shared haplotypes among all family members across the genome. For the inference of inheritance vectors, which is critical for genotype and haplotype calling, we directly model the sequencing data in an effort to increase the robustness via the incorporation of sequencing error and depth of coverage in the likelihood calculation. Additional increase in performance comes from the careful selection of the scaffold variants in modeling the inheritance vectors. We plan to refine the selection of scaffold variants to further minimize the inadvertent effect of alignment artifacts on the inference of the inheritance vector, e.g. by exploring the alignment files to filter sites with nearby Indels, homopolymers, allelic imbalances, strand and cycle bias, among others.
Compared to Polymutt and GATK, Polymutt2 has increased accuracy of genotype calling from all aspects. This is rather unsurprising given that Polymutt2 uses extra information than the other two callers. On the other hand, Polymutt2 and Beagle4 use orthogonal information, i.e. the explicit modeling of IBD sharing in Polymutt2 vs. the utilization of LD among variants in Beagle4 for variant calling. Since the LD (r 2 in this context) between rare variants and between rare and common variants is low, the effectiveness of LD-based calling for rare variants is reduced. Although for pedigrees with limited IBD sharing (e.g. sibpairs) Beagle4 outperformed Polymutt2 when considering all genotypes, Polymutt2 still achieved increased accuracy in calling rare variants. In addition, Mendelian inconsistency in LD-based calls, especially for rare variants, which are usually analyzed in groups, may have inadvertently impact on association analysis since the effect of Mendelian error in individual variants may be aggregated and amplified. As the major focus in sequencing is to identify rare variants we hope that Polymutt2 is useful for gene mapping of rare variants for complex disease.
Although most current studies focus on exome sequencing, multiple lines of evidence indicate the need for whole genome sequencing to identify risk factors for complex disease. Given the current cost, it is still not practical to carry out large-scale high coverage whole genome sequencing studies. Our tool makes it feasible for whole genome sequencing of pedigrees with reduced coverage. On the other hand, Polymutt2 is equally effective in targeted sequencing of small genomic regions, such as peaks revealed in linkage analysis, since the inheritance vectors are expected to be reliably inferred by modeling the shallow off-target sequences across the genome.
Since the haplotype calling in Polymutt2 is based on inheritance vectors only, the phase cannot be determined for some variants in which parents and offspring are heterozygotes. In such case, the phases are randomly assigned and should not be used without further information. Although LD can be used to phase such variants in trios [30], the simultaneous modeling of LD and inheritance vectors in complex pedigrees is computationally challenging. On the other hand, this limitation in Polymutt2 has little impact on the analysis since for rare variants, which are the major focus of sequencing studies, such situations are extremely uncommon. Note that for such variants only phasing is affected but the accuracy for both phased and unphased genotypes benefits equally from IBD modeling as other variants.
Since the Lander-Green algorithm is the major component for the inference of the inheritance vectors, the computation is linear with respect to the number of variants but can be explosive when pedigrees get large. For a pedigree with f founders and n nonfounders, the possible number of inheritance vectors is 2 2n . Due to the lack of phasing information of founder alleles, these inheritance vectors are organized into 2 f equivalent classes so that only 2 2n-f inheritance vectors are required to model, a factor of 2 f reduction in terms of computation and storage [31]. Furthermore, we implemented the Fast Fourier Transformation in the Lander-Green algorithm [32], which reduces the computation from O(N 2 ) to O(NlogN)in the HMM, where N is the number of inheritance vectors. Even with these speedup techniques, however, the computation can be still very challenging. To further mitigate the problem, we implemented the software using multi-threads so that the computation can be parallelized when possible. The current implementation can handle simple pedigrees efficiently. For example, for sibpairs and sibships of size 4, and nuclear families of size 4 and 6, the average time per family using 8 threads for chromosome 1 whole genome sequencing is on the scale of minutes. For sibships of size 6 the time is significant increased and it took over an hour to finish variant calling per family. For pedigree of Ext10 the time is even further increased to over 10 hours to get marginal calls. If computing is limited an option is to use only inheritance vectors with highest posterior probabilities for such pedigrees; for example using the single best inheritance vector the computing is a few minutes. For pedigrees beyond the exact calculation of the likelihoods, Monte Carlo approaches [33][34][35] are necessary, which is beyond the scope of the current study and will be explored in the future.
In the inference of inheritance vectors, we selected the scaffold variants by LD pruning. The results reported in the article were based on the maximum correlation coefficient of R 2 = 0.2. We also investigated other thresholds to evaluate the sensitivity of the results to the LD pruning. Specifically we used cutoffs of 0.1 and 0.5 and observed similar results as 0.2, with the difference below 0.01% for most of pedigrees and coverage investigated in Figs 1 and 2, indicating the robustness of the framework to LD. This robustness makes it flexible to select scaffold makers without comprising the genotype calling accuracy.
With the comprehensive catalog generated by the 1000 Genomes Project, identifying known variants in study sample is generally very accurate. However calling novel variants for pedigrees is usually of particular interest. This remains challenging due to potential alignment artifacts. Unannotated structural variants are a major source of alignment artifacts, and when such artifacts do not follow Mendelian transmission laws the variant quality is expected to be dramatically reduced for such sites when the IBD sharing is imposed in the calculation of the likelihood. We believe that Polymutt2 is effective in filtering false novel variant candidates given its efficient use of allele sharing.
In our framework the increased accuracy of variant calling is due to the efficient use of the Mendelian inheritance. De novo mutations, however, violate the rule and make the inference of inheritance vector inaccurate. Although it is unlikely to include de novo mutations in scaffold variants, accidental inclusion of such variants makes the results not reliable. To avoid this situation, Polymutt2 internally checks the likelihood of de novo mutations during the calculation and if a strong violation of Mendelian inheritance is detected the algorithm ignores these variants so that the inheritance vectors can be robustly inferred. The current version of Polymutt2 is not designed to call de novo mutations and other methods (e.g. Polymutt and DeNovoGear [36]) should be used for that purpose.
Our tools were implemented in C++. The source code and company resources can be downloaded from the authors' website (https://medschool.vanderbilt.edu/cgg). We hope that our user-friendly software packages are useful to the research community for pedigree sequencing studies to facilitate the identification of rare variants for human disease.