A Bayesian framework for efficient and accurate variant prediction

There is a growing need to develop variant prediction tools capable of assessing a wide spectrum of evidence. We present a Bayesian framework that aggregates pathogenicity data across multiple in silico scores on a gene-by-gene basis and across multiple evidence statistics in both quantitative and qualitative forms, and that performs 5-tiered variant classification based on the resulting probability credible interval. When evaluated on 1,161 missense variants, our gene-specific in silico model-based meta-predictor yielded an area under the curve (AUC) of 96.0% and outperformed all other in silico predictors tested. Multifactorial model analysis incorporating all available evidence yielded 99.7% AUC, with 22.8% of variants predicted as variants of uncertain significance (VUS). Use of only 3 auto-computed evidence statistics yielded 98.6% AUC with 56.0% predicted as VUS, representing sufficient accuracy to rapidly assign a significant portion of VUS to clinically meaningful classifications. Collectively, our findings support the use of this framework to conduct large-scale variant prioritization using in silico predictors, followed by variant prediction and classification with a high degree of predictive accuracy.


Introduction
The recent surge of sequencing-based clinical genetic testing has put a spotlight on associated challenges in data interpretation. While advances in genomics allow for the development of new genetic tests at an unprecedented pace and complexity, the interpretation of results has remained a largely manual and time-consuming process that is not scalable to the volume and diversity of available data [1]. In particular, the rich evidence available for well-classified variants is not effectively incorporated into classification schemes that rely on manual processing of large-scale information.
The pathogenicity of a genetic variant can be assessed by various evolutionary, functional and structural in silico scores and a range of evidence from clinical, family history, co-occurrence and co-segregation data, as well as the published findings of case-control, cohort or family-based studies [2,3]. One approach to variant classification is to follow a rule-based scoring system, in which each line of evidence is converted into a score and the summary score from all available evidence is used to determine the classification [4–6]. This rule-based approach assumes variants have strong and consistent evidence that is equally applicable across the genome. It also makes use of only qualitative evidence and does not leverage existing knowledge pertaining to other well-classified variants. Another approach to variant classification is the multifactorial likelihood method, in which the prior probability of pathogenicity is obtained by using a genome-wide in silico (S1 Table) or ensemble (S2 Table) prediction method and the posterior probability is derived by aggregating the prior probability over multiple quantitative evidence statistics [7,8]. While this multifactorial method is computationally simple, it does not account for gene-specific differences or the variability of the estimated prior probability. A third approach uses allele frequencies and multiple in silico predictors in a Bayesian logistic regression model that includes priors based on case-control proportions of carriers reported in the literature or public databases, a categorical covariate for gene-specific effects, and fixed model terms for each gene tested [9]. Such models improved prediction accuracy over alternative approaches, but did not allow all parameters to be freely estimated for each gene or make use of the wide range of available evidence.
To efficiently and accurately determine the pathogenicity of genomic variants, there is a growing need to develop data-driven tools that are capable of assessing a wide array of evidence associated with each variant while leveraging information that is readily available for well-classified variants. To address this need, we developed a Bayesian framework for variant prediction that aggregates multiple in silico scores and evidence statistics in both quantitative and qualitative forms, and validated the models in genes associated with hereditary cancer syndromes. Our approach improves upon existing methods by leveraging the vast information available for classified variants, quantifying gene-specific in silico effects while incorporating both quantitative and qualitative evidence, and predicting the pathogenicity of each variant using a probability distribution to account for uncertainty.

Analytical framework
Our Bayesian framework is a data-driven model-based tool for variant prediction and classification analysis (Fig 1). Initially, a set of gene-specific in silico predictors is selected from publicly available in silico scores. An in silico variant prediction (IVP) model is used to predict a preliminary level of variant pathogenicity based on a training dataset that contains variants with known classification and their accompanying 16 in silico predictors. Next, the prior probability distribution of pathogenicity for each variant is estimated from a corresponding IVP model followed by a rescaled transformation, and the posterior probability distribution is estimated from a multifactorial variant prediction (MVP) model that aggregates the prior distribution over multiple evidence predictors (Fig 1A). Finally, variant classification is assigned to 5-tiered classes of benign, variant of likely benign (VLB), variant of uncertain significance (VUS), variant of likely pathogenic (VLP) and pathogenic based on the probability distribution of pathogenicity (Fig 1B).

In silico model analysis
For each of 10 genes that collectively constitute the multigene panel test (MGPT) data containing missense variants with known ClinVar consensus classification outcomes of benign, VLB, VLP and pathogenic, we constructed a training dataset and derived an IVP model that retained 1 to 4 in silico predictors from the 16 candidate standalone scores tested (S3 Table). When these gene-specific IVP models were evaluated in all 1,161 class-known variants using a leave-one-out cross validation (LOOCV) method and the 5-tiered classification scheme, 360 (31.0%) were predicted to be concordant with their known classes, 277 (23.9%) were categorized one level above or below their known classes (e.g., benign as VLB, pathogenic as VLP, etc.), 520 (44.8%) were classified as VUSs, and 4 (0.3%) were discordant (e.g., pathogenic/VLP classified as benign/VLB or vice versa) and therefore noted as false negatives or false positives (Fig 2A; S4 Table). Thus, while the IVP model yielded high positive predictive value (PPV), negative predictive value (NPV) and accuracy (100.0%, 99.2% and 99.4%, respectively), sensitivity and specificity were modest (35.7% and 65.5%, respectively) (Table 1, lower section).
We next compared the predictive performance of the IVP model to that of each individual in silico predictor. Using the same 1,161 class-known variants, the gene-specific IVP models had the highest sensitivity (35.7%), PPV (100.0%), NPV (99.2%), accuracy (99.4%), and AUC (96.0%), and the lowest proportion of VUS (44.8%) compared to each of the 16 standalone and 6 meta-predictor models (Table 1). Among all 23 in silico models, the AUC statistics were highest for the IVP model, followed by REVEL and MetaSVM (Fig 3A and 3B; Table 1).

Table 1 notes: a The in silico variant prediction analyses were evaluated with gene-specific prediction models from each of 16 standalone predictors and 7 meta-predictors, respectively, in which the IVP predictor was derived from the 16 standalone predictors. Each analysis was evaluated in MGPT data containing 1,161 missense variants (747 negatives, 414 positives) using LOOCV. Results are listed in descending order of AUC values among models using standalone predictors and meta-predictors, respectively. b Predicted outcomes were derived from the predicted positive/negative categories and the known ClinVar consensus classes. TN = true negative, TP = true positive, FN = false negative, FP = false positive. c Performance statistics are reported as Sen = sensitivity, Spe = specificity, PPV, NPV, Acc = accuracy, AUC, and P_VUS = proportion of variants classified as VUS. NA = not able to calculate. The best performance statistics among the compared in silico prediction methods are highlighted in bold.

Multifactorial model analysis
Among the 1,161 variants included in this analysis, 1,016 (87.5%) had data available for at least one evidence statistic from qualitative and/or quantitative sources. When applying our MVP model analysis to these variants, predictive performance improved as expected when compared to that of the IVP models (Table 2, upper section vs. lower section). The proportions of variants classified as concordant with their known classes, categorized one level above or below their known classes, as VUSs, and discordant are detailed in Table 2 (upper section) and S4 Table.

Given that some evidence may lend itself to automated computation, while other evidence may require manual examination or adjustment [10], we also applied MVP model analysis using a subset of auto-computed predictors for which data were readily available: co-occurrence, family history and mutation hotspot. Among the 1,161 variants initially included, 873 (75.2%) had data for at least one of these 3 predictors. Results of the constrained MVP model for these variants were highly accurate, with 98.6% AUC (Table 2, lower section); the proportions of variants classified as concordant with their known classes, categorized one level above or below their known classes, as VUSs, and discordant between benign/VLB and pathogenic/VLP were 260 (29.8%), 122 (14.0%), 489 (56.0%) and 2 (0.2%), respectively (Table 2, lower section and S4 Table). Overall, 32.9% (382/1,161) of variants were appropriately categorized as benign/VLB or pathogenic/VLP. Among variants with strong evidence for a benign or pathogenic classification (total LR statistic <0.01 or >100, respectively), 100.0% (182/182) of variants were correctly classified as benign/VLB or pathogenic/VLP (Table 2, lower section).

Discussion
We present a Bayesian framework consisting of IVP models that assess variant pathogenicity using a subset of gene-specific in silico predictors, and an MVP model that aggregates this result with information from a variety of qualitative and quantitative evidence sources to accurately and robustly predict the pathogenicity classes of missense variants. The performance of MVP model analysis demonstrates that this approach is capable of leveraging a vast constellation of pathogenicity information available in large-volume data, and has important implications for increased clinical utility given its high predictive accuracy, which was cross validated in over 1,000 classified missense variants.
The IVP and MVP model analyses have several distinct features that allow for improved prediction accuracy and practical utility. First, IVP model prediction is conducted with data-driven model terms derived from gene-specific training data, and is supported by a data expansion procedure to accommodate a sparse data scenario in which the training data contains an insufficient number of class-known variants. Notably, our IVP model quantifies the pathogenicity of each variant with a probability distribution, instead of a probability value, which allows for improved prediction robustness and accuracy by accounting for estimation uncertainty. Second, an exact and fast Bayesian sampling procedure using data-independent Pólya-Gamma distributions is adapted for model estimation [11,12], which facilitates analysis without the estimation of data-dependent priors on a gene-by-gene basis. Third, the incorporation of evidence statistics in qualitative forms, which are typically collected for rule-based classification schemes, makes the MVP model analysis capable of incorporating broader types of evidence statistics for improved practical utility, although the MVP model is not limited by the use of these data. We demonstrated that MVP model performance is highly accurate even when only three auto-computed quantitative evidence predictors are included in the model. Lastly, the use of the 95% probability credible interval (PCI) is designed to reduce potential misclassification between benign/VLB and pathogenic/VLP by accounting for the variability in the predicted probability of variant classification.
When applied to the 10 genes with varying numbers of class-known missense variants, the gene-specific IVP models outperformed 16 standalone and 6 meta-predictors, each based on a genome-wide in silico score. These results support the notion that universal prediction models tend to have varied performance across different genes, and that gene-specific classifiers incorporating phenotype data and established disease-causing evidence can improve prediction accuracy [13–15]. For multifactorial variant classification, the MVP model correctly assigned 77% of variants to their precise or closely matched classes and misclassified only 0.2% of variants between benign/VLB and pathogenic/VLP. In addition, we show that using auto-computed evidence statistics derived from commonly collected and readily available phenotype and genotype data, 33% of evaluated variants can be correctly classified to their precise or closely matched classes. These results highlight the practical utility of applying IVP model analysis for large-scale variant prioritization and MVP model analysis for variant classification.
Despite the fact that gene-specific models outperform those based on genome-wide information, IVP models can be unstable when the gene-specific training data is of small sample size and/or contains many ambiguously classified variants. Other approaches, such as a weighted average of gene-specific and panel-specific models might improve model robustness and prediction accuracy, and remain to be investigated. The IVP model presented here is limited to continuous in silico predictors, whereas categorical features such as domain effects and variant types may also be informative [9]. Availability of evidence statistics is also a limiting aspect of MVP model analysis; although the Bayesian MVP model we presented classified a large proportion of variants with a high degree of accuracy, the remaining VUS were due to either no (145/1,161 = 12.5%) or insufficient (232/1,161 = 20.0%) evidence statistics. In addition, models that account for correlations among evidence statistics [16], incorporate interaction and non-linear effects [9], and/or integrate LR statistics using distribution-based sampling approaches have the potential for improved variant prediction and classification.
Our Bayesian IVP and MVP model analyses form a data-driven framework for variant prediction and classification in aggregating a broad spectrum of pathogenic information. These model-based approaches are adaptive to the complexity of large-scale data, and are applicable to a wide variety of genes and phenotype conditions, provided that suitable training data are available. Importantly, these models afford an opportunity to accurately and efficiently reclassify VUS, and as such have the potential to improve the information on which clinical decisions are based.

Data
To assess prediction performance, we obtained 1,161 classified missense variants identified in 10 MGPT genes (BRCA1, BRCA2, CDH1, PALB2, PTEN, TP53, MLH1, MSH2, MSH6 and PMS2) from ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/, downloaded in April 2017), compiled in silico predictor information from the dbNSFP (https://sites.google.com/site/jpopgen/dbNSFP, downloaded in June 2017) and AGVGD (http://agvgd.hci.utah.edu/, released Sept 2014) databases, and collected variant-specific evidence statistics from Ambry Genetics databases (S1 Data: MGPT data with all relevant variables). Variants obtained from ClinVar were those with classifications established by expert panel review or deposited by any of the 6 submitters that consistently provide assertion criteria: Ambry Genetics, Emory, GeneDx, InSiGHT, InVitae and SCRP. For each variant, we defined its consensus class per the following hierarchy: 1) we selected the most supported category among negative, positive and VUS, where negative and positive are designated as benign/VLB and pathogenic/VLP, respectively (i.e., positive if N_positive > max(N_negative, N_VUS), or negative if N_negative > max(N_positive, N_VUS)); and 2) if the number of submitters was tied and there was no conflict between the positive and negative classes, we assigned positive or negative as appropriate (i.e., positive if N_positive = N_VUS and N_negative = 0, or negative if N_negative = N_VUS and N_positive = 0). Variants with conflicting classes (i.e., N_positive = N_negative) or classified as VUS by all submitters were excluded from analysis. The number of variants with ClinVar consensus classes varied from 22 to 385 per gene (S5 Table, upper section).
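The consensus-class hierarchy above can be sketched as a small decision function. The rules follow the text; the function and argument names are illustrative only.

```python
def consensus_class(n_positive, n_negative, n_vus):
    """Assign a ClinVar consensus class from submitter counts.

    A sketch of the hierarchy described in the text: 'positive' covers
    pathogenic/VLP submissions and 'negative' covers benign/VLB.
    """
    # Rule 1: the most supported category wins outright.
    if n_positive > max(n_negative, n_vus):
        return "positive"
    if n_negative > max(n_positive, n_vus):
        return "negative"
    # Rule 2: a tie with VUS is resolved only when there is no
    # conflicting evidence from the opposite class.
    if n_positive == n_vus and n_positive > 0 and n_negative == 0:
        return "positive"
    if n_negative == n_vus and n_negative > 0 and n_positive == 0:
        return "excluded" if False else "negative"
    # Conflicting (n_positive == n_negative > 0) or all-VUS variants
    # are excluded from analysis.
    return "excluded"
```

For example, a variant with 3 pathogenic/VLP submitters, 1 benign/VLB submitter and 1 VUS submitter resolves to positive, while a 1-vs-1 positive/negative conflict is excluded.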

Predictors and their derivation
The dataset used for IVP model training contained a set of class-known variants, designated benign, VLB, VLP or pathogenic. The response variable y is a binary outcome for pathogenicity, with 1 for pathogenic or VLP and 0 for benign or VLB.
The predictor variables for MVP model analysis included 7 evidence predictors quantified as LR statistics for a) frequency and association, b) co-occurrence, c) co-segregation, d) family history, e) functional evidence, f) structural evidence, and g) other supporting data (descriptions in S6 Table). For each of the 7 evidence predictors, we extracted qualitative evidence from variant-level classification records characterized by two parameters, p_cut and p_frac. Here p_cut is a threshold probability of 0.001, 0.1, 0.9 or 0.99 for assigning a variant to a targeted class of benign, VLB, VLP or pathogenic, respectively, as recommended in the American College of Medical Genetics and Genomics (ACMG) guidelines [2]. Also, p_frac is the fraction of evidence needed to reach a targeted class (e.g., 1, 0.5 and 0.25 are example values for evidence predictors in qualitative form; see S6 Table for a numerical illustration). Applying the Bayes rule equation under a null probability of p_null = 0.5, the LR statistic for qualitative pathogenicity evidence is

LR = (p_cut / (1 − p_cut))^p_frac.

For auto-computed predictors, we derived LR statistics directly from real-time in-house subject-level genotype and phenotype data. The LR statistic of co-occurrence evidence was estimated from a binomial likelihood model [35] under the rationale that a pathogenic variant is less likely to co-occur in trans with a known pathogenic mutation. The LR statistic of family history evidence was derived from a history weighting score model [36] based on the premise that pathogenic mutations are more likely to occur in high-risk individuals, while the presence of benign variants is unrelated to personal and family history. The mutation hotspot evidence was assigned by filtering for the existence of at least one pathogenic variant at the same amino acid residue.
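Under these definitions, the qualitative LR reduces to a simple power of the posterior odds. A minimal sketch, assuming the fractional-exponent form (p_cut/(1 − p_cut))^p_frac, which follows from applying Bayes rule with prior odds of 1 when p_null = 0.5:

```python
def qualitative_lr(p_cut, p_frac, p_null=0.5):
    """LR statistic for a qualitative evidence record (sketch).

    With p_null = 0.5 the prior odds equal 1, so the full-evidence LR
    is the posterior odds p_cut/(1 - p_cut); the fraction of evidence
    p_frac enters as an exponent for partial evidence.
    """
    prior_odds = p_null / (1.0 - p_null)
    posterior_odds = p_cut / (1.0 - p_cut)
    return (posterior_odds / prior_odds) ** p_frac
```

For example, full evidence toward pathogenic (p_cut = 0.99, p_frac = 1) gives LR = 99, while half of the evidence needed for VLP (p_cut = 0.9, p_frac = 0.5) gives LR = 3.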
For evidence statistics derived from two sources, denoted LR_1 and LR_2, the combined evidence of the two correlated LR statistics was quantified as a single combined LR statistic. Thus, the available pathogenicity evidence for each variant was summarized into 7 LR statistics, where the LR statistic was set to 1 for each missing evidence predictor.

Training data for IVP model
To build a gene-specific IVP model for gene G, we constructed a training data A containing all class-known variants in that gene under a minimal sample size requirement of n_negative ≥ 5 and n_positive ≥ 5. Here we define n_negative = n_benign + 0.5 × n_VLB and n_positive = n_pathogenic + 0.5 × n_VLP, where each n_c represents the number of class-c variants in training data A. The variant classes of benign, VLB, VLP and pathogenic are based on ClinVar consensus classifications, as previously described. For a sparse data scenario in which the variants in training data A satisfy n_negative + n_positive ≥ 5 and either n_negative < 5 or n_positive < 5, we implemented a data expansion procedure based on the assumption that variants with similar in silico scores residing in two genes known to influence the same phenotype should have a more similar degree of pathogenicity than those from two randomly selected genes. This procedure resulted in a training data A for gene G that includes additional variants from other, similar genes, and was implemented as follows:

1. Computed the mean distance for all variant pairs between gene G and each other gene H ∈ G*, where G* is the set of all genes from the same panel or pathway except gene G. The distance between variants g ∈ G and h ∈ H was defined as the Euclidean distance d_gh = sqrt(Σ_{i=1,...,I} (x_gi − x_hi)²), where the vectors (x_g1, ..., x_gI) and (x_h1, ..., x_hI) are the I = 16 unit-scaled individual in silico predictor values of variants g and h, respectively.
2. Chose the most similar gene G′ ∈ G*, that is, the gene showing the minimal mean distance to the analysis gene G, and merged the negative and/or positive variants one by one from G′ into A in descending order of between-variant distance until both n_negative ≥ 5 and n_positive ≥ 5 were met or all variants in G′ were merged into the training data A.
3. If n_negative < 5 or n_positive < 5 remained true, repeated step (2) by choosing another gene G″ ∈ (G* − G′) and merging variants from G″ into A until n_negative ≥ 5 and n_positive ≥ 5 were met.
Thus, this data expansion procedure provided a means by which gene-specific training data may be constructed for genes that contain only a few classified variants.
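As a sketch of step (1) of the expansion procedure, the mean between-gene distance can be computed by broadcasting over unit-scaled score matrices. The array layout (one row per variant, I = 16 columns of in silico scores) is an assumed representation for illustration:

```python
import numpy as np

def mean_pair_distance(X_g, X_h):
    """Mean Euclidean distance over all variant pairs between two genes.

    X_g, X_h: arrays of shape (n_variants, 16) holding unit-scaled
    in silico predictor values (a hypothetical layout for illustration).
    """
    # Pairwise differences via broadcasting: (n_g, 1, 16) - (1, n_h, 16)
    diffs = X_g[:, None, :] - X_h[None, :, :]
    # d_gh = sqrt(sum_i (x_gi - x_hi)^2) for every pair (g, h)
    d = np.sqrt((diffs ** 2).sum(axis=2))
    return d.mean()
```

The gene G′ minimizing this quantity over all candidates H ∈ G* would then be selected as the donor of additional training variants.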

Identification of IVP model
We derived a gene-specific IVP model by selecting a subset of in silico predictors using stepwise logistic regression (SLR) and quantified the probability distribution of pathogenicity using Bayesian logistic regression with coefficients sampled from data-independent Pólya-Gamma distributions. The set of predictors yielding the minimal cross validation error rate among penalty coefficients (ranging from 2 to 8) was retained. The derived IVP model took the form

logit(y) = b_0 + b_1 z_1 + ... + b_K z_K,

where Z = (z_1, ..., z_K) is a subset of gene-specific in silico predictors retained from SLR analysis, B = (b_0, b_1, ..., b_K) are regression coefficients for intercept and slopes, and logit(y) is the logit transformation of the response variable y for negative and positive variants. The distributions of the regression coefficients B of an IVP model were estimated by Bayesian Markov chain Monte Carlo updated from Pólya-Gamma distributions [11], as implemented by the logit function in R package BayesLogit version 0.5.1. For variant prediction purposes, the distributions of the K + 1 regression coefficients, denoted B = (b_kn; 0 ≤ k ≤ K, 1 ≤ n ≤ N), are estimated jointly using N = 1,000 Gibbs samples after a burn-in period of 20,000 samples. Thus, the IVP probability distribution of each variant was estimated by its logistic transformation, denoted Ṗ = (ṗ_n; 1 ≤ n ≤ N), where ṗ_n = 1 / (1 + exp(−(b_0n + b_1n z_1 + ... + b_Kn z_K))).
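Given posterior coefficient samples (however obtained; the text uses Pólya-Gamma Gibbs sampling via the BayesLogit R package), the IVP probability distribution is simply the elementwise logistic transform of the per-sample linear predictors. A minimal sketch, with the array shapes as assumptions:

```python
import numpy as np

def ivp_probability_distribution(B_samples, z):
    """Logistic transform of posterior coefficient samples (sketch).

    B_samples: (N, K+1) array of Gibbs samples, intercept in column 0.
    z: length-K vector of a variant's retained in silico scores.
    Returns the N-sample IVP probability distribution for that variant.
    """
    # Linear predictor b_0n + b_1n*z_1 + ... + b_Kn*z_K for each sample n
    eta = B_samples[:, 0] + B_samples[:, 1:] @ z
    # Elementwise logistic transformation
    return 1.0 / (1.0 + np.exp(-eta))
```

Because the whole sample vector is retained rather than a single point estimate, downstream steps (prior rescaling, posterior aggregation, PCI-based classification) can propagate the estimation uncertainty.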

Estimation of IVP prior probability distribution
As a principle of multifactorial variant classification analysis, at least two lines of evidence are required for assigning a variant class of benign, VLB, VLP or pathogenic (i.e., classification based on the probability distribution from in silico analysis alone is not possible) [2]. To accommodate this framework, the IVP prior distribution of each variant was derived from a rescaled IVP probability distribution in the range of 0.1 to 0.9, with standard deviation proportional to the expected value. Specifically, the IVP prior distribution of a variant, denoted P = (p_n; 1 ≤ n ≤ N), was quantified by the 2-sided shifted probability function

p_n = p̈_med + (p̈_n − p̈_med) × sd(p̈_med) / sd(ṗ_med),   (4)

where P̈ = (p̈_n; 1 ≤ n ≤ N | p̈_n = 0.8 × ṗ_n + 0.1) is the linearly shifted probability distribution in the range of 0.1 to 0.9, ṗ_med and p̈_med are the medians of the IVP distribution Ṗ and its shifted distribution P̈, respectively, and sd(ṗ_med) = sqrt(ṗ_med × (1 − ṗ_med)) and sd(p̈_med) = sqrt(p̈_med × (1 − p̈_med)) are the expected standard deviations of ṗ_med and p̈_med, respectively. The use of a rescaled prior distribution, as quantified by Eq (4), ensures the variability of the IVP distribution is maintained in the prior distribution.

MVP model analysis
For each variant, the Bayesian MVP model analysis employed a distribution-based formulation of Bayes rule to quantify the posterior probability distribution of pathogenicity. Such a posterior distribution, denoted Q = (q_n; 1 ≤ n ≤ N), was computed by aggregating the prior probability distribution P from an IVP model analysis and the available LR statistics from the 7 evidence predictors through the equation

q_n = (p_n × LR_total) / (p_n × LR_total + 1 − p_n),

where p_n is a sample value from the prior distribution, and LR_total = LR_FAA × LR_COC × LR_CSG × LR_FHX × LR_FUN × LR_STR × LR_OTH is the total LR statistic calculated under the assumption that the 7 evidence predictors are statistically independent (see S6 Table for the definition of each LR statistic).
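The posterior aggregation can be sketched directly from the odds form of Bayes rule. The function below assumes each missing evidence predictor has already been set to an LR of 1, per the text:

```python
import numpy as np

def mvp_posterior(prior_samples, lr_stats):
    """Aggregate an IVP prior distribution with evidence LR statistics.

    prior_samples: samples p_n drawn from the prior distribution P.
    lr_stats: iterable of LR statistics (missing evidence set to 1).
    Returns the posterior distribution Q via the odds form of Bayes rule:
        q_n = p_n * LR_total / (p_n * LR_total + 1 - p_n)
    """
    # Total LR under the independence assumption across predictors
    lr_total = float(np.prod(list(lr_stats)))
    p = np.asarray(prior_samples, dtype=float)
    return p * lr_total / (p * lr_total + 1.0 - p)
```

For instance, a neutral prior sample of 0.5 combined with a total LR of 99 yields a posterior probability of 0.99, while a total LR of 1 leaves the prior unchanged.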

5-tiered variant classification scheme
We employed a 5-tiered variant classification scheme to assign the predicted classes of benign, VLB, VUS, VLP and pathogenic at the targeted probability thresholds of 0.001, 0.1, 0.9 and 0.99, following the ACMG guidelines (Fig 1A) [2,37]. The predicted class of each variant was assigned based on the 95% probability credible interval (PCI) of pathogenicity obtained from an IVP model or MVP model (Fig 1B). Here the 95% PCI is defined as the 2-sided 95% range of the probability distribution estimated from a corresponding prediction model using Bayesian sampling from Pólya-Gamma distributions [11]. The use of the 95% PCI, instead of a point estimate of the probability value, was designed to control the occurrence of false events by accounting for the variability of the probability distribution.
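A minimal sketch of the 5-tiered assignment from a sampled probability distribution. The mapping used here, in which a non-VUS class is assigned only when the entire 95% PCI falls within the corresponding ACMG probability interval, is an illustrative assumption (the authors' exact mapping is given in Fig 1B):

```python
import numpy as np

def classify_5_tier(prob_samples, lo=0.025, hi=0.975):
    """Assign a 5-tiered class from the 95% PCI of a sampled
    probability distribution (thresholds 0.001, 0.1, 0.9, 0.99)."""
    lower, upper = np.quantile(prob_samples, [lo, hi])
    if upper < 0.001:       # entire PCI in the benign region
        return "benign"
    if upper < 0.1:         # entire PCI below the VLB threshold
        return "VLB"
    if lower > 0.99:        # entire PCI in the pathogenic region
        return "pathogenic"
    if lower > 0.9:         # entire PCI above the VLP threshold
        return "VLP"
    return "VUS"            # PCI spans a threshold: uncertain
```

Under this mapping, a distribution whose PCI straddles a threshold is conservatively left as VUS, which is how interval-based classification controls false benign/pathogenic calls.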

Performance evaluation
To evaluate the performance of variant prediction, we assessed predicted outcomes including true positives (TP; predicted pathogenic/VLP concordant with known class), true negatives (TN; predicted benign/VLB concordant with known class), false positives (FP; benign/VLB predicted to be pathogenic/VLP), and false negatives (FN; pathogenic/VLP predicted to be benign/VLB). Performance statistics such as sensitivity, specificity, PPV, NPV, accuracy, AUC and proportion of predicted variants of uncertain significance (P VUS ) were assessed [38] (S7 Table). All performance statistics of IVP and MVP model analyses were evaluated using LOOCV to control model overfitting.
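The reported statistics can be reproduced from predicted-outcome counts. The sketch below assumes sensitivity and specificity are computed against all known positives/negatives, so variants left as VUS count against them but not against PPV/NPV; this assumption is consistent with the high-PPV, modest-sensitivity pattern reported for the IVP model:

```python
def performance_stats(tp, tn, fp, fn, n_pos, n_neg):
    """Prediction statistics in the presence of VUS calls (sketch).

    n_pos, n_neg: total numbers of known positive/negative variants.
    Variants not called positive or negative are treated as VUS.
    """
    n_vus = (n_pos + n_neg) - (tp + tn + fp + fn)
    return {
        "sensitivity": tp / n_pos,                      # VUS reduce sensitivity
        "specificity": tn / n_neg,                      # VUS reduce specificity
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "npv": tn / (tn + fn) if tn + fn else float("nan"),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),    # among non-VUS calls
        "p_vus": n_vus / (n_pos + n_neg),
    }
```

With counts consistent with the reported IVP percentages (148 TP, 489 TN, 0 FP, 4 FN among 414 positives and 747 negatives), this reproduces sensitivity of about 35.7%, accuracy of about 99.4% and P_VUS of about 44.8%.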