A Bayesian method to estimate variant-induced disease penetrance

A major challenge emerging in genomic medicine is how to assess best disease risk from rare or novel variants found in disease-related genes. The expanding volume of data generated by very large phenotyping efforts coupled to DNA sequence data presents an opportunity to reinterpret genetic liability of disease risk. Here we propose a framework to estimate the probability of disease given the presence of a genetic variant conditioned on features of that variant. We refer to this as the penetrance, the fraction of all variant heterozygotes that will present with disease. We demonstrate this methodology using a well-established disease-gene pair, the cardiac sodium channel gene SCN5A and the heart arrhythmia Brugada syndrome. From a review of 756 publications, we developed a pattern mixture algorithm, based on a Bayesian Beta-Binomial model, to generate SCN5A penetrance probabilities for the Brugada syndrome conditioned on variant-specific attributes. These probabilities are determined from variant-specific features (e.g. function, structural context, and sequence conservation) and from observations of affected and unaffected heterozygotes. Variant functional perturbation and structural context prove most predictive of Brugada syndrome penetrance.

Introduction A major barrier to integrating genotype information into clinical care is accurately linking genetic variants to disease risk. As cheap whole genome, exome, and gene panel sequencing becomes more widely used, the genetics community frequently observes novel, ultra-rare variants-ones carried by a single or few (often related) individuals. Indeed, most variants found in large population genome sequencing efforts are novel or ultra rare [1][2][3][4]. The number of possible single nucleotide variants in the human genome is in the billions; the number of variants becomes uncountable if insertion and/or deletions (indels) are included. The majority of these discovered variants will never be observed in a sufficient number of heterozygotes to ascertain a causal link with disease. In addition to finding rare variants, large-scale genetic sequencing efforts taking place around the world are identifying greater numbers of individuals, ostensibly unaffected, who carry variants previously thought to be disease-inducing [5,6]. As a consequence of insufficient heterozygote counts and conflicting annotations, many diagnostic laboratories annotate such variants as "Variants of Uncertain Significance" (VUS), despite more confident past assessments of "Likely Pathogenic" or "Pathogenic" [7][8][9][10].
To help assess the impact of genetic variants, the American College of Medical Genetics and Genomics (ACMG) suggests integrating multiple sources of information including population, functional, computational, and segregation data to classify variants [11,12]. This is consistent with a continuous, Bayesian framework where each additional satisfied classification criterion modifies the probability a variant is causative for disease (pathogenic) or not (benign) [12]. Given the resulting probabilities, a final classification can be made into one of the five categories commonly used to distinguish variants-benign, likely benign, variant of uncertain significance, likely pathogenic, or pathogenic. However, a remaining challenge even after classification is that the clinical implications for definitively pathogenic variants can vary strikingly across individuals, including variable expressivity and incomplete penetrance [13]. We attempt here to address one aspect of this clinical variability by developing a method to estimate variant-induced disease risk.
In this study, we sought to develop a method to estimate the probability of disease given variant-specific information-which we refer to as the penetrance of a variant-and we also provide the uncertainty for that estimate. The pathogenicity of a variant for a specific individual at a given point in time is binary but unknown. This pathogenicity may have a time dependence such as for diseases which present later in life. Penetrance is one metric that captures the degree to which the pathogenicity will manifest as a human phenotype such as a disease or a trait. We provide posterior probability estimates of the penetrance, asymptotic with respect to age, which can be thought of as the positive predictive value of disease given the known variant information. We also provide a 95% credible interval that represents the uncertainty in that estimate. Our method relies on "borrowing strength" or sharing information across variants to produce variant-specific, quantitative penetrance estimates even in the absence of a large number of heterozygotes. These estimates can be especially informative for interpreting rare and novel variants.
We illustrate our approach using the rare cardiac arrhythmia disorder Brugada Syndrome (BrS1 [MIM: 601144]), which is linked to rare loss-of-function variants in the cardiac sodium channel SCN5A [14]. These variants most commonly act by altering peak sodium current, a parameter of sodium channel function that is readily assessed using in vitro methods. By quantitatively integrating multiple features, including in vitro functional experiments, information about the three-dimensional protein structure, and previously published variant-classifiers, such as PolyPhen-2 and PROVEAN, we estimate the BrS1 penetrance attributable to individual SCN5A variants. The resulting priors, imputed from these predictive features, can be readily interpreted as hypothetical observations of unaffected and affected heterozygotes.

Results/Discussion
Variants in SCN5A have been associated with BrS1 since 1998, [15] some variants affecting almost all known heterozygous individuals, some variants conferring only modestly increased risk, and others have no influence on arrhythmia presentation [14,16,17]. SCN5A variants that do not influence the gene in any way do not predispose or protect against BrS1, e.g. many synonymous variants. These variants therefore have a relatively low penetrance of the arrhythmia, similar to the general population. SCN5A variants that produce no sodium current result in a higher fraction of heterozygotes presenting with BrS1, much higher than in the general population [18]. However, BrS1 presentation, as for nearly all inherited diseases, is not homogeneous even amongst heterozygotes of SCN5A haploinsufficiency alleles. In fact, even highly penetrant variants such as p.Glu1784Lys and p.Glu1784Lys still leave some heterozygotes unaffected: 100% penetrance is extremely rare [18].
Our hypothesis is that variant-specific features (e.g. variant-induced changes in function and location in structure) contain information equivalent to clinically phenotyping heterozygotes and can therefore be used to inform the prior distribution in a Bayesian framework. This prior distribution is combined directly with clinically phenotyped heterozygotes (the likelihood function) to produce more accurate estimates of disease risk probability (posterior penetrance; Fig 1) via Bayes theorem. To demonstrate this approach, we developed an expectation maximization approach (EM), detailed in the Materials and Methods section, and applied it to a previously generated dataset of SCN5A features and BrS1 phenotype counts [18] (supplemented with reports published within the last year) to estimate BrS1 penetrance using SCN5A variant-specific features. This process yielded a total of 1,439 unique variants with at least 1 observed heterozygote, BrS1 was diagnosable in 857 individuals heterozygous for 387 unique variants (S1-S3 Figs). BrS1 penetrance priors informed by the predictive features listed in S1 Table adjust

Precision and accuracy of BrS1 penetrance priors
To evaluate performance over the distribution of BrS1 prior penetrances (S5 Fig), we plotted the difference between prior mean and posterior mean BrS1 penetrance as a function of the average between the two estimates (Fig 2). The resulting Bland-Altman difference plot seen in Fig 2 indicates scatter evenly distributed with under and over predicted BrS1 penetrance as a function of prior mean penetrance. This suggests the predictive priors are reasonably calibrated and have no systematic biases in the range of BrS1 mean penetrance estimated. We additionally compared linear regression models trained on a limited subset of features/covariates with the BrS1 mean posterior, BrS1 casesþa prior total heterozygotesþa prior þb prior (where α prior and β prior are the tuning parameters for the beta-binomial distribution and are set equivalent to the number of affected and unaffected individual heterozygotes in the prior), as the dependent variable; both empirical and EM priors were evaluated as indicated in Table 1. Peak current and penetrance density (a modification of a structure-derived feature we developed previously [19]; see S1 Text) contain orthogonal information as can been seen by the differences in coefficient of determination, R 2 , for models built using each or both predictors ( Table 1). The relatively small improvement in R 2 when all predictors are included suggests most information contained in the sequence-based predictive features is recapitulated by both peak current and penetrance density.

Inclusion of individuals from gnomAD
Individuals in gnomAD are mostly unaffected, given the rarity of BrS; however, the data available from that resource could be contaminated with individuals presenting with BrS, though likely at or near the rate in the general public. To test the sensitivity of our results to this type of misclassification, we randomly switched individuals from unaffected (gnomAD) to BrS cases for each variant and examined the change in penetrance due to misclassification. We did this with 24 and 240 misclassified cases. With 24 misclassifications, the median rate of penetrance change is 0.4% and the expected number of variants with a penetrance change is 6. The average mean absolute difference in penetrance change is 0.02% (first quartile of 0.0014% and

PLOS GENETICS
third quartile of 0.02%). With 240 misclassifications, the median rate of penetrance change is 2%, and the expected number of variants with a penetrance change is 28. The average mean absolute difference in penetrance change is 0.2% (first quartile of 0.1% and third quartile of 0.3%). These results suggest minimal influence of small or modest misclassification rates on penetrance estimates.

Fig 2. Bland-Altman plot between EM prior and EM posterior mean penetrances for all SCN5A variants.
To assess the performance of the EM prior, we used a Bland-Altman plot to compare the mean BrS1 penetrance estimated from the EM prior and from the EM posterior, the y-axis is the difference between the two and the x-axis is the average between the two. For each plotted point, both color and radius indicate the log 10 of the total number of heterozygotes present in the dataset. The relatively consistent scatter about y = 0 suggests no systematic biases present in the EM prior mean BrS1 estimates.
https://doi.org/10.1371/journal.pgen.1008862.g002 Table 1. Weighted R 2 from EM prior means to Empirical/EM posterior means. Models trained with displayed subsets of features using the same subset of variants, where covariates listed in S1

Structure and peak current improve prediction of penetrance
The resulting prior BrS1 mean penetrance estimates reflect the known topology of Na V 1.5 (protein product of SCN5A; Fig 3), with the sodium channel pore and selectivity filter inducing a greater disease burden as previously observed [18,20].  Prior mean BrS1 penetrance reflects the protein topology of Na V 1.5. The predicted mean BrS1 penetrance from the converged expectation maximization (EM) algorithm. The line across the plot is a predicted mean BrS1 penetrance averaged over 30 neighboring variants. Topology diagram is shown above with transmembrane helices indicated by yellow lines and membrane indicated as a grey rectangle. Note the four largest, distinct peaks correspond to the four structured, transmembrane domains of the channel, with an especially steep peak at the selectivity filter and pore. Though estimated distances in three-dimensional space between residues is used to construct the BrS1 penetrance density, structural data are not explicitly used in the BrS1 penetrance prior and so the recapitulation of the structure is not assured. https://doi.org/10.1371/journal.pgen.1008862.g003

PLOS GENETICS
clinically phenotyping individuals heterozygous for these variants. The interchangeability of this information was additionally demonstrated recently by taking the reverse approach: functionally characterizing variants with different estimates of BrS1 penetrance [21]. In these experiments, Glazer et al. found that variants with a higher estimated BrS1 penetrance had a higher probability of producing a variant-induced loss-of-function protein phenotype (Fig 3A in reference 21). Left: SCN5A variants with more than one heterozygote in our dataset are plotted with prior 95% credible intervals (colored bars) and mean posteriors (black rectangles) with posterior 95% credible intervals (black lines). Right: a model of the SCN5A protein product, Na V 1.5, is shown with the regions highlighted in blue, green, gold, and red, corresponding to the colors of the variant prior 95% credible intervals shown to the left, which are analogous to the penetrance probability distributions shown on the y-axes in Fig 1. Variants near the D-III pore selectivity filter have a much higher prior and posterior BrS1 penetrance compared to residues near the D-III/D-IV linker. This is expected since the selectivity filter pore helices contain the most compacted region of the protein and also are responsible for the ion conduction and are therefore most sensitive to substitution. In fact, the highest density of variants with non-zero BrS1 penetrance lie at this depth in the membrane (S6

A modified Bayesian approach to estimate BrS penetrance
A typical Empirical Bayes approach combines information across all variants to estimate a single prior distribution and estimate a variant-specific posterior penetrance from that prior. These estimates assume all variant effects have the same prior and therefore shrink towards a global mean across all variants. Here we put forward a method to model the penetrance for each variant using variant-specific predictive features. The resulting penetrance and uncertainty estimates yield a posterior that can be re-used as variant-specific prior (interpretable as equivalent to hypothetical observations of affected and unaffected heterozygotes) in a classical Bayesian updating scheme. This information is accessible before clinically phenotyping a single heterozygote; example estimates of high BrS1 penetrance

Comparison between penetrance prediction and ACMG variant classification
We put forward a method to estimate the probability that an SCN5A variant will manifest in BrS1 for a given patient (our 'risk score'), and uncertainty for that score, conditioned on variant attributes. We are not assessing the causality of the variant and its attributes on the manifestation of disease, but rather their association. Hence, our framework diverges from that of the ACMG, quantitated by Tavtigian et al. 2018. For example, in our formulation, a VUS with many affected heterozygotes would have the same probability distribution as a pathogenic variant with many affected heterozygotes [provided the number of observations of cases and controls is the same and the other predictive covariates (variant attributes) are the same]. If there are comparatively few heterozygotes of the VUS, given the same predictive covariates, greater uncertainty would be reflected by a wider distribution of penetrance probability (Fig 1). In addition, our calculation is agnostic to origin, de novo or inherited, and therefore does not consider this evidence (though this information may additionally inform an estimate of penetrance and therefore warrants further investigation). We also do not treat null variants here. For our purposes of building variant-specific, data-driven penetrance priors, null variants have relatively little variance in the predictive covariates and therefore contribute less to our analysis. In future work we will additionally attempt to include these features.

Prospects for applications of this method
Our approach provides a risk score for disease, in this case, for BrS1. However, Brugada syndrome has degrees of electrophysiologic phenotypes and symptoms. We envision being able to predict these degrees of clinical phenotype from variant-specific properties in the future by integrating electronic health records with linked genetic data. However, at present, these granular electrophysiologic and symptom data are not available for a number of unique heterozygotes and unique variants sufficient for statistical analysis. Beyond SCN5A and BrS1, a reasonable next step would involve the 59 genes for which the ACMG recommends clinical diagnostic laboratories report secondary variant discovery. Of these, 36 have greater than or equal to 20 missense "pathogenic"/"likely pathogenic" variants in ClinVar, [22] suggesting that many variants are described in the literature and can be curated in a similar manner to SCN5A. It is also important to note that the penetrance estimates derived in our approach are not static and will continue to be refined as additional data become available, i.e. phenotype data from case reports and large biobank projects, additional in vitro functional studies, and improved computational and structural predictors [13,[23][24][25][26].

Limitations
Our approach provides a risk score for disease, in this case BrS1, analogous to a diagnostic test (might patient X develop BrS1 given they have variant Y). If we know patient X already has BrS1, we can use their data to inform other individuals' risk scores, but we cannot use our approach to absolutely determine the role of variant Y manifesting disease. One application of our approach is that we can examine the ratio P(BrS1|SCN5A Variant X)/P(BrS1|wild-type SCN5A) to see if the data better support that variant X is on the causal pathway to disease. But we caution that this approach is imperfect; it does not allow for variants to interact, for example. Additionally, while clinical evidence affirms a strong relationship between SCN5A variants and BrS1, many genetic and environmental factors influence the ultimate presentation of BrS1 in an individual [13,27,28]. Not accounting for additional demographic, genetic, or environmental factors certainly increased the noise in our analysis. To counter this as best as possible, we included the maximum number of carriers for the maximum number of unique variants. Finally, we recognize the likely bias intrinsic to compiling a list of affected and unaffected heterozygotes in the manner outlined in the methods section above; however, the most probable manifestation of these biases would be the loss of an observable relationship between the predictive features and penetrance, not the creation of a spurious relationship.

Conclusions
We advance a method to estimate a degree of clinical heterogeneity in variant impact, incomplete penetrance. Here we have demonstrated how BrS1 penetrance can be estimated with high accuracy and precision. Using a Bayesian framework to estimate penetrance allows us to quantitatively integrate clinical phenotypic data with variant-specific functional measurements, variant classifiers, and sequence-and structure-based features to accurately estimate penetrance. This method can be extended to other genes and disorders in order to enable quantitative interpretation of variants probabilistically and quantitatively [24,29].

Materials and methods
These analyses focus on the SCN5A gene, where individual variants are known to influence the clinical presentation of the autosomal dominant arrhythmia Brugada Syndrome (BrS1) [16,17]. We define cases as individuals with either a spontaneous or drug-induced ECG BrS1 pattern, ST-segment abnormalities, as reported in each publication [18,30]. Penetrance is defined as the fraction of individuals who carry a variant that also present with a disease. This can be extracted from literature reports when multiple variant heterozygotes have been reported. We do not observe the actual penetrance for any given variant; however, we can estimate BrS1 penetrance for each variant as the average posterior penetrance denoted as the following: Where α is the number of variant heterozygotes diagnosed with BrS1 (or BrS1 cases) and β is the number of unaffected heterozygotes of the same variant (or controls). As the total number of observed heterozygotes increases, the estimated penetrance converges to the traditional PLOS GENETICS definition. The mean posterior penetrance can be thought of as a shrunken estimate of the observed penetrance [31], especially for variants with small numbers of known heterozygotes.
To generate priors from our available data, we use a variation of the expectation maximization (EM) algorithm [32]. Our modified EM algorithm is an iterative technique composed of three steps: 1) calculate the expected penetrance from an empirical Bayes penetrance model, 2) fit a regression model of our estimated penetrance on variant-specific characteristics by maximum likelihood (Eq 2, below) and 3) revise our estimate of the BrS1 penetrance prior using the fit from step 2 then iterate steps 2-3 until convergence criteria are satisfied (S7 Fig).
Here peak current is an in vitro measurement of the maximum current through a channel (normalized to wild type), penetrance density is a structure-based metric [19] detailed in the S1 Text, and in silico variant-classifiers is a vector populated with commonly used variant classification servers such as PROVEAN and PolyPhen (see below); all predictors used are continuous, not categorical or binary (S1 Table). The fitted model is then used to generate an updated prior distribution and, by addition of observed cases and controls for each variant, a subsequent posterior expected penetrance. The updated posterior penetrance is then used to build a new fitted model and further refine the posterior expected penetrance. This procedure is iterated until it converges to the maximum likelihood solution (S7 Fig). Using a beta-binomial model to estimate penetrance, the prior parameters (α prior, EM and β prior, EM , both functions of the features listed in S1 Table) are identifiable from a predicted penetrance point estimate and its associated variance. For comparison, we generated predicted penetrance values using a standard empirical Bayes method which generated a single empirical prior for all variants, α prior, empirical and β prior, empirical equal to 0.45 and 2.73, respectively (called empirical prior throughout the text, S8 Fig). To test our predictions, we compare our EM penetrance priors, a prior;EM a prior;EM þb prior;EM , to the posterior mean penetrance derived by adding BrS1 cases and controls for each variant to the empirical prior, BrS1 casesþ a prior;empirical Total heterozygotesþ a prior;empirical þb prior;empirical , or the EM prior, BrS1 casesþ a prior;EM Total heterozygotesþ a prior;EM þb prior;EM .

Collection of the SCN5A variant dataset
The dataset was curated from 711 papers in a previous publication [18], to which we added an additional 45 papers on SCN5A that had been published since the previous dataset was constructed. Briefly, we searched publications for the number of heterozygotes of each variant mentioned, the number of unaffected and affected individuals with diagnosed BrS1, and variant-induced changes in channel function, if reported; all recorded values of channel function were normalized to wild-type values reported in the same publications. We supplemented this dataset with all SCN5A variants in the gnomAD database of population variation (http:// gnomad.broadinstitute.org/; release 2.0) [33]. Due to the rarity of BrS1 (~1 in 10,000) [34], all heterozygotes found in gnomAD were counted as unaffected. An interactive version of the dataset, the SCN5A Variant Browser, is available at https://oates.app.vumc.org/vancart/ SCN5A/. We further collected in silico pathogenicity predictions from three commonly used servers: SIFT [35], Polyphen-2 [36], and PROVEAN [37]. We also include basic local alignment search tool position-specific scoring matrix (BLAST-PSSM) [38] for SCN5A and the per residue evolutionary rate [39], previously shown to have predictive value for predicting functional perturbation for the cardiac potassium channel gene KCNQ1 [40], and point accepted mutation score (PAM) [41]. Additionally, we leveraged structures of the SCN5A protein product and derived a penetrance density as previously described (see S1 Text for details) [19]. In-frame indels are treated as missense variants. We include these variants as variations at a residue where the indel starts, and only note whether they are an insertion or deletion. Some of these variants have functional data available and their penetrance densities are calculated from the residue starting the indel. These are simplifications to enable an analysis of as many variants and heterozygote individuals as possible. For these variants, we did not include in silico pathogenicity predictions. We included compound heterozygotes (individuals with more than one SCN5A variant) as separate records when these data are available, though these were very rare. Additionally, our inclusion criteria are not modified by relatedness. We did not include intronic variants in our analysis. The dataset is available in S2 Table.

Initial Empirical Bayes beta-binomial prior penetrance calculation
Using the data from the aforementioned literature curation [18], we estimated the penetrance for each observed variant using a beta-binomial empirical Bayes model. To calculate the empirical BrS1 penetrance prior, we calculated α prior, empirical and β prior, empirical by finding the weighted mean penetrance over all variants in the dataset and estimating the variance. Weighting was done using the following equation: Eq 3 ensures variants with a greater number of total heterozygotes (and therefore higher confidence in penetrance estimate) had a greater weight in the preliminary analysis. We then estimated the variance in penetrance as the mean squared error (MSE) between the estimated penetrance mean and the observed penetrance from Eq 1 with α prior and β prior equal to zero. With these estimated mean and MSE-derived variance, the empirical prior penetrance was calculated to be an α prior and β prior equal to 0.45 and 2.73, respectively. The variant-specific empirical posterior for each variant was then calculated by adding observed heterozygote counts of affected (BrS1 cases) and unaffected to α prior, empirical and β prior, empirical , respectively, and the resulting posterior mean penetrance was used as the dependent variable of the subsequent regression model (Eq 2). The inverse variance of the estimated posterior beta distributions capped at the ninth decile determined in this step were used to weight subsequent regression models and Pearson R 2 calculations.

Expectation maximization Bayesian beta-binomial penetrance predictions
To deal with missing data in a prediction model, we followed the approach outlined in Mercaldo and Blume [42] which avoids multiple imputation but guarantees maximum predictive accuracy across missing data patterns. In short, for every missing data pattern, we estimate a separate prediction model. For example, p.His558Arg, where penetrance density, in silico predictors, and functional data are all available, the estimate of penetrance is regressed on all other variants where all of these covariates are available (n = 238). For p.Try1449Cys, however, only penetrance density and in silico predictors are available, so only those covariates are used in the regression (n = 1,382; much higher since functional data have been collected for relatively few variants). The models were built with a linear regression pattern-mixture algorithm, updating posterior mean penetrances iteratively until the resulting estimated mean penetrance, m ¼ a prior;EM a prior;EM þb prior;EM , changed by < 0.01% from the previous iteration. This process PLOS GENETICS typically converged within eight iterations. For variant, i, the variance was estimated from this converged EM mean penetrance according to (Eq 4): We then adjusted ν, equivalent to the number hypothetical observations of clinically phenotyped heterozygotes, to balance overcoverage of variants with low to moderate BrS1 penetrance and poorer coverage of variants with high estimated mean penetrance, resulting in a range of ν, from approximately 15 to 20 (see S2 Text for details; S9-S12 Figs). All analyses were performed using the datasets provided in S2 Table and   Rate of variants with high BrS1 penetrance (>20%, blue) or low BrS1 penetrance (<10%, red) in a model of the SCN5A protein product. Each bar represents a histogram of variants associated with each disease within a 5Å slice within the membrane (divided by the total number of residues within the slice), boxes at each of the four corners represent residues not modeled (only 33 residues were not modeled in the extracellular loops). There is a relative paucity of low BrS1 penetrance variants within the structured transmembrane region and the relative abundance of high BrS1 penetrance in the same region. The rate of high BrS1 penetrance variants is higher in the extracellular half of the protein molecule likely due to more compacting of residues in the top half of the pore domain as well as proximity to the ion selective element (selectivity filter). Amino acid substitutions in these regions therefore more often have a disruptive influence. (PNG)

S7 Fig. Generation of empirical and EM priors.
The modified EM algorithm is an iterative technique composed of two steps: 1) calculate the expected penetrance from an empirical Bayes penetrance model and 2) fit regression of our estimated penetrance on variant-specific characteristics by maximum likelihood. The fitted model is then used to generate an updated, imputed prior and subsequent posterior expected penetrance and this process is iterated until it converges to the maximum likelihood solution, when the new mean penetrance changed by less than 1% from the previous iteration. The variance is then estimated according to Eq 4 as explained above. Coverage rate was calculated as defined above. Color and radius indicate the log 10 of the total number of heterozygotes present in the dataset. The tuning parameter Eq 4 was set to ν = 7. There is overcoverage (greater than 95%) for variants with high and low BrS1 penetrance indicating an overestimate of the variance. (PNG) S10 Fig. Estimated coverage rates for each SCN5A variant versus sampled true penetrance. Coverage rate was calculated as defined above. Color and radius indicate the log 10 of the total number of heterozygotes present in the dataset. The tuning parameter Eq 4 was set to ν = 14. There is overcoverage for the majority of variants, though some variants are now outside the 95% credible interval. Coverage rate was calculated as defined above. Color and radius indicate the log 10 of the total number of heterozygotes present in the dataset. The tuning parameter Eq 4 was set to ν = 19. Overcoverage is reduced especially for residues with very low or very high BrS1 penetrance, indicating an appropriate estimate of variance. Coverage rate was calculated as defined above. Color and radius indicate the log 10 of the total number of heterozygotes present in the dataset. The tuning parameter Eq 4 was set to ν = 99. Variant undercoverage is much more prevalent and distributed evenly across variants with low to high BrS1 penetrance indicating an overestimate of variance. (PNG)