Abstract
Recent advances in experimental and computational protein structure determination have provided access to high-quality structures for most human proteins and mutants thereof. However, linking changes in structure in protein mutants to functional impact remains an active area of method development. If successful, such methods can ultimately assist physicians in making appropriate treatment decisions. This work presents three artificial neural network (ANN)-based predictive models that classify four key functional parameters of KCNQ1 variants as normal or dysfunctional using PSSM-based evolutionary and/or biophysical descriptors. Recent advances in predicting protein structure and variant properties with artificial intelligence (AI) rely heavily on the availability of evolutionary features and thus fail to directly assess the biophysical underpinnings of a change in structure and/or function. The central goal of this work was to develop an ANN model based on structure and physicochemical properties of KCNQ1 potassium channels that performs comparably to or better than algorithms relying only on PSSM-based evolutionary features. These biophysical features highlight the structure-function relationships that govern protein stability, function, and regulation. The input sensitivity algorithm captures the influence of hydrophobicity, polarizability, and functional densities on key functional parameters of the KCNQ1 channel. Inclusion of the biophysical features outperforms exclusive use of PSSM-based evolutionary features in predicting activation voltage dependence and deactivation time. As AI is increasingly applied to problems in biology, biophysical understanding will be critical with respect to ‘explainable AI’, i.e., understanding the relation of sequence, structure, and function of proteins. Our model is available at www.kcnq1predict.org.
Author summary
Heartbeat is maintained by electrical impulses generated by ion-conducting channel proteins in the heart such as the KCNQ1 potassium channel. Pathogenic variants in KCNQ1 can lead to channel loss-of-function and predisposition to life-threatening irregularities of heart rhythm (arrhythmia). Machine learning methods that can predict the outcome of a mutation on KCNQ1 structure and function would be of great value in helping to assess the risk of a heart rhythm disorder. Recently, machine learning has made great progress in predicting the structures of proteins from their sequences. However, few studies link the effect of a mutation and the resulting change in protein structure with its function. This work presents the development of neural network models designed to predict mutation-induced changes in KCNQ1 functional parameters such as peak current density and voltage dependence of activation. We compare the predictive ability of features extracted from sequence, structure, and physicochemical properties of KCNQ1. Moreover, input sensitivity analysis connects biophysical features with specific functional parameters, which provides insight into underlying molecular mechanisms for KCNQ1 channels. The best performing neural network model is publicly available as a webserver, called Q1VarPredBio, that delivers predictions about the functional phenotype of KCNQ1 variants.
Citation: Phul S, Kuenze G, Vanoye CG, Sanders CR, George AL Jr, Meiler J (2022) Predicting the functional impact of KCNQ1 variants with artificial neural networks. PLoS Comput Biol 18(4): e1010038. https://doi.org/10.1371/journal.pcbi.1010038
Editor: Joanna Slusky, University of Kansas, UNITED STATES
Received: December 7, 2021; Accepted: March 18, 2022; Published: April 20, 2022
Copyright: © 2022 Phul et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Training data is available in S1 Data. All additional materials are available at https://github.com/sakshamphul/KCNQ1_ML_Model. Our best performing model is available at www.kcnq1predict.org.
Funding: ALG, CS, and JM received National Institutes of Health Research Project Grants (https://grants.nih.gov/grants/funding/r01.htm) under grant numbers NIH R01 HL122010 and NIH R01 GM080403. Additionally, work in the Meiler laboratory (JM) was supported by the National Institutes of Health S10 Instrumentation Program under grant numbers NIH S10 OD016216 and NIH S10 OD020154 (https://orip.nih.gov/construction-and-instruments/s10-instrumentation-programs) and by National Institutes of Health Research Project Grants under grant numbers NIH R01 DA046138 and NIH R01 GM129261 (https://grants.nih.gov/grants/funding/r01.htm). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Congenital long QT syndrome (LQTS) is a genetic disorder of heart rhythm caused by mutations in cardiac ion channel genes [1,2] that predisposes patients to potentially life-threatening cardiac arrhythmia. It is among the most common genetic disorders, afflicting 1:2500 people [3]. The most prevalent subtype, LQT1, is associated with genetic variants in the KCNQ1 gene [4,5] that encodes the pore-forming subunit of the voltage-gated K+ channel KV7.1 (referred to as KCNQ1) [6]. In the heart, KCNQ1 forms a channel complex with KCNE1 to generate the slow delayed rectifier current, IKs, which is an essential driver of myocardial repolarization during the cardiac action potential [7,8]. Pathogenic variants that cause KCNQ1 loss-of-function (LOF) lead to diminished IKs and impaired repolarization that manifests as prolongation of the QT interval on surface electrocardiograms [9].
LQT1 is among the most common inherited disorders [10]. More than 1000 genetic variants of KCNQ1 have been identified [11,12], but for many variants there are insufficient data to classify each as either pathogenic or benign. Correlating these variants of uncertain significance (VUS) to their clinical outcomes and determining the risk of LQTS remain major challenges [13,14]. Large-scale functional characterization of KCNQ1 variants has been made feasible by automated patch-clamp recording [15], and this strategy helped reclassify variants with previously conflicting or unknown interpretations according to the ClinVar database [11]. Moreover, the mechanistic basis underlying mutation-induced KCNQ1 dysfunction has been investigated [16–20]. For instance, Huang et al. [19] studied the impact of mutations in the KCNQ1 voltage-sensing domain (VSD) on protein cell surface expression, trafficking, protein folding, and structure. More than half of the LOF mutations examined were found to destabilize the VSD structure, resulting in impaired trafficking and lower cell surface expression. This observation underscores the growing notion that mutation-induced destabilization and mis-trafficking of the KCNQ1 protein are common disease mechanisms in LQT1. However, this study also identified LOF variants that did not exhibit trafficking and folding defects, indicating heterogeneity in the molecular mechanisms responsible for KCNQ1 LOF that cause LQT1. The molecular function of many variants in other regions of the KCNQ1 channel has yet to be characterized, and it is expected that these investigations will reveal additional pathogenic mechanisms [21].
Despite this progress in functional characterization of KCNQ1 variants, experimental assays remain labor-intensive, and this limits their applicability in a clinical setting. Computational approaches can support experimental testing and have the potential to help elucidate the molecular function of KCNQ1 variants and predict associated clinical outcomes [22–25]. Computational methods trained on information from genome-wide genetic variation data are commonly used for protein variant effect prediction [26–30], but these tools have limited applicability for KCNQ1 (see Table S4 in reference [23]). Prediction accuracy of genome-wide methods is low and varies between targets. This reflects the fact that these methods were developed on heterogeneous datasets including a wide range of proteins with diverse functions and associated diseases. Furthermore, these methods fail to establish the precise effect of a variant on KCNQ1 function parameters [23]. To overcome these difficulties, specific machine learning models tailored to predict the functional effects of KCNQ1 variants were developed [23,24]. Similar approaches have also been applied to other cardiac ion channels [31,32].
KCNQ1 is most often associated with autosomal dominant LQTS, and rarely with recessive LQTS. Dominant-negative loss-of-function (LOF) mechanisms have been implicated in autosomal dominant LQT1. Heterozygous mutations are associated with autosomal dominant LQTS (Romano–Ward syndrome), whereas patients with homozygous or compound heterozygous KCNQ1 mutations have a more severe clinical outcome associated with recessive LQTS (Jervell–Lange–Nielsen syndrome) [33]. Even for the common heterozygous forms of LQTS, it is valuable to predict the function of a variant in the homozygous state, mainly for discriminating benign from pathogenic variants. This knowledge contributes to determining the disease-causing propensity of variants found in recessive LQTS and identifies variants with potential to cause autosomal dominant LQTS.
Q1VarPred [23] is a KCNQ1-specific channel function predictor. Criteria for dysfunction were calibrated by examining experimentally determined electrophysiology parameters for KCNQ1 (i.e., peak current density, voltage of half-maximal activation) that were then used to train a neural network with input features derived from protein sequence. Q1VarPred achieved greater accuracy than genome-wide tools, which perform poorly for membrane proteins [34]. Additional predictive power may be gained by analyzing the spatial clustering of variants in the 3-dimensional structure of KCNQ1 [31]. Functionally critical channel regions, such as the ion selectivity filter and cytosolic gate in the pore domain (PD) and the S4 helix in the VSD, are “hotspots” for variants causing the greatest perturbations in peak current density and voltage of activation, respectively. This suggests that protein structure features can aid variant prediction.
In this study, we used machine learning to develop a KCNQ1 variant prediction tool called Q1VarPredBio (www.kcnq1predict.org). The functional classification categories of Q1VarPred were expanded by a scheme that predicts variant-specific changes in four electrophysiological KCNQ1 parameters: peak current density, voltage of half-maximal activation (V1/2), and activation and deactivation time constants (τact, τdeact). We evaluated the performance of artificial neural networks (ANNs) trained on evolutionary and biophysical features for KCNQ1 and observed that a combination of both feature types produced the model with the best predictive accuracy. Our machine learning approach can be useful to obtain insights into basic sequence-structure-function relationships for the KCNQ1 channel. Moreover, Q1VarPredBio may help differentiate potentially pathogenic, dysfunctional KCNQ1 variants from those with normal channel function.
Results
We developed three types of ANN models: one trained with only evolutionary features, one trained with only biophysical features, and a third one trained with both evolutionary and biophysical features. All three models were trained to predict the mutation-induced change in four functional parameters of the KCNQ1 channel (peak current density, voltage of half-maximal activation, and activation and deactivation time constants) as either a normal (label 0) or dysfunctional (label 1) phenotype. All ANN models had an input layer, two hidden layers, and an output layer with four neurons. A schematic representation of our model development workflow is shown in Fig 1. These multitask ANN models were trained to improve the accuracy of predicting all parameters combined. All models predicted four outputs between 0 and 1. A decision boundary was identified at the best possible accuracy and served as our threshold to classify a variant parameter as being either normal or dysfunctional. Accurate prediction based on this decision boundary was a criterion for determining model performance. To measure the performance of the model, we adopted a 25-fold cross-validation technique wherein performance was evaluated on a variant test set that was not included in model training and monitoring. Further details on ANN architecture, training, and model performance are described in the Materials and Methods section.
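The architecture described above (two hidden layers, four outputs between 0 and 1, one decision boundary per parameter) can be sketched as a plain NumPy forward pass. This is only an illustration, not the trained Q1VarPredBio network: the hidden layer sizes, the sigmoid activations, the random weights, and the 0.5 boundaries are all assumptions, and the names `init_params`, `forward`, and `classify` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    """Logistic function; maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def init_params(n_inputs, hidden_sizes=(8, 8), n_outputs=4, seed=0):
    """Random weights for input -> hidden -> hidden -> output.
    Hidden sizes are placeholders, not the values used in the paper."""
    rng = np.random.default_rng(seed)
    sizes = [n_inputs, *hidden_sizes, n_outputs]
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Propagate a batch of feature vectors through all layers."""
    a = np.asarray(x, dtype=float)
    for w, b in params:
        a = sigmoid(a @ w + b)
    return a  # shape (n_variants, 4), each value in (0, 1)

def classify(outputs, boundaries):
    """Label 1 (dysfunctional) when the output reaches the boundary, else 0."""
    return (np.asarray(outputs) >= np.asarray(boundaries)).astype(int)

# 12 inputs: eleven biophysical features plus one evolutionary feature.
params = init_params(n_inputs=12)
x = np.zeros((3, 12))                       # three dummy variants
scores = forward(params, x)                 # IKs, V1/2, tau_act, tau_deact
labels = classify(scores, boundaries=[0.5, 0.5, 0.5, 0.5])
```

Sharing the hidden layers across the four outputs is what makes the model multitask: labels for one parameter can shape the representation used for the others.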
Dataset and criteria
The dataset for this work contained electrophysiological data for 125 KCNQ1 variants that were generated, tested, and functionally analyzed using the same approach [15,35]. The KCNQ1 variants were tested in the homozygous state, transiently co-expressed with wild-type KCNE1 in CHO cells. The electrophysiological data measured for each variant consist of four biophysical parameters: peak current density (designated as IKs), voltage of half-maximal channel activation (V1/2), activation time constant (τact), and deactivation time constant (τdeact). In order to compare the functional properties of variants tested across many months, the values for each parameter were normalized to the values obtained from cells expressing the wild-type channel that were transfected and tested in parallel. Normalized values equal to 1 (or 100% WT) were considered wild-type-like. A parameter phenotype was classified as dysfunctional if it satisfied the criteria in Table 1. These thresholds were derived collectively from values defined in Li et al. [23], from disease-causing variants in the literature, and by evaluating model performance at different thresholds.
For training and testing of our model, both gain-of-function and loss-of-function (according to Table 1) were classified as dysfunctional. Biophysical parameters that could not be determined for some variants (e.g., voltage-insensitive [no V1/2] variants that do not deactivate [no τdeact]) were defined as dysfunctional. For variants with peak current density ≤ 17% WT, all four biophysical parameters were considered dysfunctional.
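The labeling rules above can be sketched as a small function. The normal windows in `EXAMPLE_WINDOWS` are placeholders, since Table 1 is not reproduced here; only the 17% WT peak-current rule and the 70%–170% window for τact are taken from the text. The function name and dictionary keys are hypothetical.

```python
# Normal windows in % of WT. Only the tau_act window comes from the text;
# the other three are placeholders standing in for the Table 1 criteria.
EXAMPLE_WINDOWS = {
    "iks": (50.0, 150.0),        # placeholder
    "v_half": (80.0, 120.0),     # placeholder
    "tau_act": (70.0, 170.0),    # from the text
    "tau_deact": (80.0, 120.0),  # placeholder
}

def label_variant(values, windows, severe_iks_cutoff=17.0):
    """Return 0 (normal) or 1 (dysfunctional) for each parameter.

    values maps parameter name -> % of WT, or None when the parameter
    could not be determined experimentally.
    """
    iks = values.get("iks")
    # Peak current density <= 17% WT: all four parameters are dysfunctional.
    if iks is not None and iks <= severe_iks_cutoff:
        return {name: 1 for name in windows}
    labels = {}
    for name, (low, high) in windows.items():
        v = values.get(name)
        if v is None:
            labels[name] = 1  # undetermined parameters count as dysfunctional
        else:
            # Values below (loss) or above (gain) the window are dysfunctional.
            labels[name] = 0 if low <= v <= high else 1
    return labels

severe = label_variant(
    {"iks": 10.0, "v_half": 100.0, "tau_act": 100.0, "tau_deact": 100.0},
    EXAMPLE_WINDOWS)
undetermined = label_variant(
    {"iks": 100.0, "v_half": None, "tau_act": 100.0, "tau_deact": None},
    EXAMPLE_WINDOWS)
gain = label_variant(
    {"iks": 200.0, "v_half": 100.0, "tau_act": 100.0, "tau_deact": 100.0},
    EXAMPLE_WINDOWS)
```

Treating both directions as label 1 keeps each output binary, at the cost of not distinguishing gain-of-function from loss-of-function at prediction time.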
Due to the scarcity of the functional data from certain protein regions, we introduced 345 ‘non-perturbing’ variants, one for each of the 345 amino acids included in the KCNQ1 structural model [36] thereby increasing the size of our dataset to 470 (see S1 Data). All four functional parameters for the non-perturbing variants were considered WT. These non-perturbing variants expose ANNs to all the structural regions of the protein during training. This helps the model to recognize changes in structure and physicochemical properties at the site of mutation for all neighborhoods that exist in the structure. The extent of these changes helps ANNs to classify the phenotype of a mutation (i.e., benign or pathogenic). These non-perturbing variants create a ‘baseline’ for the protein region where data was limited allowing the ANNs to train on a greater number of instances. In summary, there were 345 non-perturbing, 39 benign and 86 pathogenic variants based on the peak current criteria given in Table 1.
Identification of biophysical features
In total, we used 14 structural and physicochemical properties of KCNQ1 to develop an ANN model based solely on biophysical features. These features were extracted by importing the KCNQ1 structure into the Biochemical Library (BCL) software. The KCNQ1 cryo-EM structural model used for this work had bound calmodulin (CAM), no PIP2, and represented a decoupled state with an activated voltage sensor domain and a closed pore domain [36]. We explored different biophysical features based on the existing understanding of molecular mechanisms underlying KCNQ1 function and on the location of regions in the KCNQ1 structure that are critical for protein stability and channel gating. These biophysical features describe the amino acid local environment, exposure to solvent, burial in the membrane, change in amino acid physicochemical properties at the mutation site, steric hindrance near the α-carbon atom, and the mutation-induced change in water-membrane transfer free energy. Distance from the KCNQ1 channel pore axis was used to help the model distinguish variants in the channel pore domain from those in the voltage sensor domain (see S1 Fig). Furthermore, it is more likely that a variant residue buried in the membrane will negatively impact protein function. Thus, the degree of burial of a variant in the membrane was assessed by using a three-layered membrane model and calculating a membrane-depth-dependent weight with a distribution function (S1 Eq 1). A significant weight was given to variants embedded inside the membrane (see S2 Fig). Steric hindrance near the α-carbon atom was examined using a steric parameter (see S1 Table), which is a graph shape index that encodes the complexity, branching, and symmetry of the amino acid side chain [37].
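As an illustration of the pore-axis feature, the distance of a residue from the channel axis is a point-to-line distance. The sketch below assumes the axis is given by a point and a direction vector in the same coordinate frame as the structure; how the axis was actually defined in the BCL workflow is not stated here, and the function name is hypothetical.

```python
import numpy as np

def distance_from_axis(point, axis_point, axis_direction):
    """Perpendicular distance from a 3D point to a line (the pore axis)."""
    p = np.asarray(point, dtype=float) - np.asarray(axis_point, dtype=float)
    d = np.asarray(axis_direction, dtype=float)
    d = d / np.linalg.norm(d)               # unit vector along the axis
    perpendicular = p - np.dot(p, d) * d    # remove the component along the axis
    return float(np.linalg.norm(perpendicular))

# Example: axis along z through the origin, residue at (3, 4, 10) Angstrom.
example_distance = distance_from_axis([3.0, 4.0, 10.0],
                                      [0.0, 0.0, 0.0],
                                      [0.0, 0.0, 1.0])
```

Because the component along the axis is projected out, the feature depends only on radial position, which is what separates the central pore domain from the peripheral voltage sensor domains.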
Transfer free energy of an amino acid between a hydrophilic and a hydrophobic environment plays a crucial role in protein folding and stability. Thus, we used the hydrophobicity of native and variant amino acids to investigate the mutation-induced changes in the free energy for transfer into the membrane (see S3 Fig). The mutation-induced change in water-membrane transfer free energy at an amino acid site was examined using the hydrophobicity scale reported by Koehler et al. [38]. Highly hydrophobic or hydrophilic amino acids are usually surrounded by similarly hydrophobic or hydrophilic amino acids. Thus, we calculated the polarizability and hydrophobicity of amino acids at and around the variant site by functional density [31]. Functional density is based on the k-nearest neighbors algorithm, wherein the average physicochemical property around the variant site is weighted by the inverse of the distance from the site of mutation (see S4 Fig). These features examined the hydrophobicity and polarizability of the neighborhood around the variant site. We found a correlation between high-polarizability regions in the protein and dysfunctional peak current density (see S5 Fig). More details on the implementation of these biophysical features are described in the S1 Text.
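A minimal sketch of a distance-weighted neighborhood average in the spirit of functional density follows. The exact formula is given in the S1 Text and is not reproduced here; this sketch assumes one coordinate per residue, a hard radius cutoff, and a smoothed inverse-distance weight of 1 / (1 + d), chosen so that the variant site itself (d = 0) has a finite weight. All names are hypothetical.

```python
import numpy as np

def functional_density(site_index, coords, prop, radius):
    """Weighted average of a per-residue property around one site.

    coords: (n_residues, 3) coordinates in Angstrom (e.g. one atom per residue)
    prop:   (n_residues,) property values (e.g. polarizability)
    radius: neighborhood size in Angstrom
    """
    coords = np.asarray(coords, dtype=float)
    prop = np.asarray(prop, dtype=float)
    dist = np.linalg.norm(coords - coords[site_index], axis=1)
    mask = dist <= radius                 # residues inside the neighborhood
    weights = 1.0 / (1.0 + dist[mask])    # assumed smoothed inverse distance
    return float(np.sum(weights * prop[mask]) / np.sum(weights))

# Toy example with three residues on a line.
coords = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [20.0, 0.0, 0.0]]
prop = [1.0, 0.0, 5.0]
fd_small = functional_density(0, coords, prop, radius=1.0)   # only the site itself
fd_medium = functional_density(0, coords, prop, radius=6.5)  # site + one neighbor
```

With a small radius the value reduces to the property of the site itself, while larger radii blend in the environment, which is why several neighborhood sizes can carry complementary information.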
We also introduced changes in the physicochemical properties of the amino acid at the site of mutation (number of hydrogen bond donor sites, number of hydrogen bond acceptor sites, and van der Waals volume [37]) to help the ANN model learn whether an amino acid substitution represents a missense or a non-perturbing mutation. These properties also improved the functional outcome predictions for missense mutations.
Exposure of variant sites to solvent was quantified using the neighbor vector method [39] (see S6 Fig). Neighbor vector [39] improved the predictions especially for peak current density whereas neighbor count [39] did not. Backbone conformation (Phi (φ), Psi (ψ), and Omega (ω) angles) for the native amino acid and other descriptors like the location of a mutation on a helix, mutation-caused change in amino acid polarizability as well as change in hydrophobicity and solvent accessible surface area for amino acids did not improve prediction accuracy.
Evolutionary feature: PSSM-based amino acid substitution score
We used PSI-BLAST searches to calculate position-specific scoring matrices (PSSMs), which measure the likelihood of an amino acid substitution at a mutation site [40]. PSSMs were created by searching the UniRef50 [41] and the NCBI non-redundant sequence databases [42]. The differences in PSSM scores between the variant and WT amino acid from these two databases were used as evolutionary features. We found that these evolutionary features alone were sufficient for predicting the functional properties of non-perturbing mutations. More details can be found in the S1 Text.
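The evolutionary feature is a difference of two entries in one PSSM row. A minimal sketch, assuming the row for a position is available as a mapping from one-letter amino acid code to score; the scores below are invented for illustration and do not come from an actual PSI-BLAST run, and the function name is hypothetical.

```python
def delta_pssm(pssm_row, wt_aa, variant_aa):
    """PSSM score of the variant residue minus that of the WT residue."""
    return pssm_row[variant_aa] - pssm_row[wt_aa]

# Invented scores for a single, strongly conserved arginine position.
row = {"R": 6, "Q": 1, "P": -3}

d_rq = delta_pssm(row, "R", "Q")   # disfavored substitution: negative value
d_rp = delta_pssm(row, "R", "P")   # more strongly disfavored
d_rr = delta_pssm(row, "R", "R")   # non-perturbing variant: exactly zero
```

A non-perturbing variant always yields a difference of zero, which is consistent with the observation that this feature alone suffices to recognize such variants.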
Biophysical features outperform PSSM-based evolutionary features in predicting activation V1/2 and τdeact
Model accuracy was evaluated using the Matthews correlation coefficient (MCC) and receiver operating characteristic (ROC) plots by testing a variant set that was omitted from model training and monitoring. A decision boundary was identified at the best possible accuracy (MCC value), whereas ROC plots were independent of the decision boundary. The MCC values and ROC plots for the different ANN models are reported in Figs 2 and 3, respectively. More details on MCC and ROC are described in the Materials and Methods section.
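For reference, MCC is computed from the confusion matrix as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below implements that standard definition and scans candidate decision boundaries for the one with the highest MCC. The grid of candidates and the convention that a score at or above the boundary means dysfunctional are assumptions; the authors' exact search procedure is described in their Materials and Methods.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def best_decision_boundary(y_true, scores, candidates=None):
    """Return (boundary, mcc) for the candidate boundary with the highest MCC."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # assumed grid
    def score_at(c):
        return mcc(y_true, [int(s >= c) for s in scores])
    best = max(candidates, key=score_at)
    return best, score_at(best)

# Toy example: two normal and two dysfunctional variants, perfectly separable.
boundary, best_mcc = best_decision_boundary([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
```

Unlike plain accuracy, MCC uses all four confusion-matrix cells, so it remains informative when the normal and dysfunctional classes are unbalanced.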
The error bar represents 1x standard deviation.
Different color scheme is used to depict performance of the three feature sets.
We were able to model V1/2 and τdeact using PSSM-based evolutionary features, biophysical features, and both feature sets combined. MCCs for these feature sets are reported in Fig 2. Although evolutionary features achieved satisfactory performance in predicting V1/2 with an MCC of 0.56, biophysical features performed better, attaining an MCC greater than 0.62. The area under the curve (AUC) for V1/2 determined for biophysical features was greater than that for PSSM-based evolutionary features (Fig 3). Similarly, although evolutionary features achieved satisfactory performance in predicting τdeact with an MCC of 0.50, biophysical features performed better, attaining an MCC greater than 0.56. The AUC for τdeact determined for biophysical features was also greater than that for evolutionary features (Fig 3). Biophysical features clearly dominate in predicting V1/2 and τdeact, suggesting that the ANN model could determine and distinguish structure-activity relationships underlying the voltage dependence of KCNQ1 activation and the kinetics of KCNQ1 deactivation.
Biophysical features can predict current density but do not outperform PSSM-based evolutionary features
We were able to model peak current density (IKs) using evolutionary and biophysical features. For IKs, biophysical features attained an MCC close to 0.38 whereas PSSM-based evolutionary features achieved an MCC ≥ 0.43, suggesting that PSSM-based amino acid substitution scores perform better in predicting peak current density. Moreover, the AUC for biophysical features is 0.68 whereas that for evolutionary features is 0.69 (see Fig 3). This suggests that biophysical features are comparable with PSSM-based evolutionary features in distinguishing between normal and dysfunctional variants. BLAST-based PSSM-derived amino acid substitution scores, which constitute our evolutionary feature, have a significant association with peak current density (IKs), as previously reported by Kroncke et al. [31].
PSSM-based evolutionary features predict τact better than biophysical features
We modeled τact using both biophysical and evolutionary features. Evolutionary features yielded an MCC of 0.44 whereas biophysical features yielded an MCC of 0.40, suggesting that evolutionary features perform better than biophysical features. Similarly, the AUC for evolutionary features was close to 0.75, better than that for biophysical features (AUC: 0.73). We observed that the ANN model makes better predictions for peak current when activation time labels are simultaneously present in the training dataset.
Performance of biophysical features is comparable with PSSM-based evolutionary features
To compare models using a common performance metric, the MCCs calculated for the four functional parameters were averaged (MCCaverage). The MCCaverage for biophysical features was 0.49 whereas the MCCaverage for evolutionary features was 0.48 (Fig 2). Thus, the two models are comparable in combined performance, irrespective of their performance for the individual functional parameters. The average AUC for biophysical features was 0.77 whereas that for evolutionary features was 0.75, suggesting that biophysical features are comparable with PSSM-based evolutionary features.
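As a quick consistency check, averaging the per-parameter MCCs quoted in the preceding sections reproduces the two MCCaverage values. Several of those per-parameter values are reported only approximately in the text ("close to", "greater than", "≥"), so the numbers below are the quoted bounds rather than exact results.

```python
# Per-parameter MCCs as quoted in the text (some are approximate bounds).
mcc_biophysical = {"iks": 0.38, "v_half": 0.62, "tau_act": 0.40, "tau_deact": 0.56}
mcc_evolutionary = {"iks": 0.43, "v_half": 0.56, "tau_act": 0.44, "tau_deact": 0.50}

def mcc_average(per_parameter):
    """Unweighted mean of the MCCs over the four functional parameters."""
    return sum(per_parameter.values()) / len(per_parameter)

avg_bio = mcc_average(mcc_biophysical)    # 1.96 / 4
avg_evo = mcc_average(mcc_evolutionary)   # 1.93 / 4
print(round(avg_bio, 2), round(avg_evo, 2))  # prints 0.49 0.48
```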
Most accurate predictions are achieved by combining biophysical and PSSM-based evolutionary features
For training the ANN model with both biophysical and evolutionary features, we used eleven biophysical features and the difference in PSSM score determined with the NCBI non-redundant sequence database (ΔPSSM(NR)). We found that three of the 14 features used for the biophysical model (neighbor vector, mutant steric parameter, and native steric parameter) and the UniRef50-based PSSM scores failed to improve the predictions on the unseen dataset; therefore, these features were excluded (see S9 Fig). This could be due to redundancy in the information carried by these three biophysical features and the evolutionary features. The eleven biophysical features included were: hydrophobicity of mutant amino acid, polarizability of mutant amino acid, functional density of amino acid polarizability with neighborhood sizes of 6.5 Å and 12 Å, functional density of amino acid hydrophobicity with neighborhood sizes of 1 Å and 6.5 Å, change in number of hydrogen donor sites, change in number of hydrogen acceptor sites, change in van der Waals volume, distance from the pore axis, and depth of the mutation site in the membrane.
An MCC > 0.44 and an AUC > 0.75 for peak current density suggest that, although combining evolutionary and biophysical features does not improve prediction accuracy (MCC) for peak current density, it does improve the spread (AUC) between normal and dysfunctional variants predicted by the model. Moreover, a significant improvement in performance was observed for V1/2 and τdeact when evolutionary and biophysical features were combined. MCC increases by 15% for V1/2 and by 10% for τdeact when compared with the best single-feature-set model, i.e., the biophysical model. The results also suggest that biophysical features have a significant association with τdeact and V1/2, making the performance of biophysical features for these parameters better than that of evolutionary features. For τact, we observed no improvement in MCC or AUC (Figs 2 and 3) for the combined model when compared with evolutionary features, indicating that evolutionary and biophysical features are redundant in capturing sequence-structure-function relationships for KCNQ1 τact. Overall, in the combined-features ANN model, MCCaverage increases to 0.54, which corresponds to a 12% improvement in MCCaverage relative to the best single-feature model.
Wild-type-like variants were slow to activate
We observed that variants H105N, T118S, V129I, and E146G had peak current density, V1/2, and τdeact parameters close to WT values but significantly larger activation time constants. Additionally, it was difficult to model τact with any of the ANN models using a threshold of 80%–120% of normal. Based on those observations, we adjusted our threshold for normal τact to 70%–170% of wild type. This change improved the MCC for PSSM-based evolutionary features from 0.22 to 0.44 and yielded a 10% improvement in average MCCs across all models.
Performance of ANN models on non-perturbing variants
The MCC and AUC reported in Figs 2 and 3 were evaluated on the 125 experimentally validated variants. The correct classification of non-perturbing mutations signifies that ANNs can recognize that no change in protein structure results in a benign variant. This is an important feature because the wild-type sequence is not directly input into the ANN but is encoded as changes in evolutionary and biophysical parameters. Thus, these non-perturbing mutations ensure that a change of zero in all parameters is understood as wild-type. All models were able to classify roughly 98% of the 345 non-perturbing variants accurately as benign. Features like change in number of hydrogen donor sites, change in number of hydrogen acceptor sites, and change in amino acid volume were required for an ANN model trained with only biophysical features to learn non-perturbing mutations. When both feature sets were combined, evolutionary features were found to be sufficient for correctly predicting non-perturbing mutations. The accuracy of the combined ANN model for non-perturbing, benign, and pathogenic variants is shown in S7 Fig. This figure shows that the ANN can recognize these three types of variants and predicts the majority of them in different regions between 0 and 1. Additionally, to quantify the separation between non-perturbing and benign mutations, we calculated entropy using 0.05 as the decision boundary to separate non-perturbing from benign variants. The lower the entropy (best~0, worst~1), the better the separation of the two classes. Entropy for all the functional parameters was less than 0.35, suggesting that the model can distinguish between non-perturbing and benign variants.
Predictive ability based on the function of the variants
Using peak current density to define variant classes, there were 71 LOF, 15 GOF, and 39 WT-like variants in the functional dataset. All ANN models predict LOF better than GOF or WT-like (see S8 Fig). The model based on biophysical features predicts GOF better than the other feature sets, whereas the model based on evolutionary features predicts LOF better than the rest. Combining biophysical and evolutionary features yields a significant improvement in the predictive ability for WT-like variants.
Input sensitivity highlights feature importance and their association with KCNQ1 functional parameters
To study the contribution of different features in our ANN model, we examined the sensitivity of the output labels to the input features. Since using the magnitude of input sensitivity as a measure of feature importance can be misleading because of how the input features are rescaled [43], we considered the sign of the input sensitivity with respect to the output label. Input sensitivity is defined as zero when half of the variant instances predict a positive change with respect to the output label and the other half predict a negative change. Similarly, an input sensitivity close to one signifies that the input feature strongly correlates with the output label. More details on input sensitivity analysis can be found in the Materials and Methods section.
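One measure that behaves as described (zero when the signs are evenly split, one when they all agree) is the absolute mean of the signs of the per-variant sensitivities. The sketch below assumes the per-variant sensitivities, for example the derivative of one output with respect to one input, are already available as numbers; the authors' exact formula is given in their Materials and Methods and may differ, and the function name is hypothetical.

```python
import numpy as np

def sign_consistency(sensitivities):
    """Absolute mean sign of per-variant sensitivities, in [0, 1].

    0 -> half of the variants push the output up and half push it down
    1 -> every variant pushes the output in the same direction
    """
    signs = np.sign(np.asarray(sensitivities, dtype=float))
    return float(abs(signs.mean()))

mixed = sign_consistency([0.3, -0.1, 0.2, -0.5])   # two positive, two negative
consistent = sign_consistency([0.1, 0.2, 0.3])     # all positive
mostly = sign_consistency([0.4, 0.1, 0.2, -0.3])   # three positive, one negative
```

Because only signs enter the calculation, rescaling an input feature by a positive constant leaves the result unchanged, which avoids the rescaling issue noted above.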
The input sensitivities for our best predictive model (eleven biophysical features and one evolutionary feature) are averaged across 320 models with each model having 25 instances of input sensitivity using 25 different monitoring data subsets. These averaged input sensitivities are reported in Fig 4.
Biophysical features are shown in orange and evolutionary features in olive color.
The PSSM-based amino acid substitution score (ΔPSSM(NR)) was found to be the most sensitive feature for all the functional parameters, suggesting that sequence-based features are of high quality. For peak current density (IKs), four biophysical features were highly sensitive: hydrophobicity of the mutant amino acid, hydrophobicity around the mutation site within radii of 1 Å and 6.5 Å, and distance of the mutation site from the channel pore axis. For predicting V1/2, the most sensitive biophysical features were burial of the mutation site in the membrane, mutant polarizability, and neighborhood polarizability around the mutation site within a 6.5 Å radius. For τact, the most sensitive biophysical properties were neighborhood polarizability around the mutation site within a 6.5 Å radius, distance from the channel axis, and change in volume of the amino acid. For predicting τdeact, the most sensitive features were burial of the mutation site in the membrane, distance of the mutation site from the channel pore axis, and neighborhood hydrophobicity around the mutation site within a 1 Å radius.
We also report the input sensitivity for an ANN model trained solely with the 14 biophysical features. These input sensitivities are also averaged across 320 models, with each model having 25 instances of input sensitivity from 25 different monitoring data subsets. The averaged input sensitivities for the biophysical-features-only model are reported in Fig 5. We observed that all functional parameters are sensitive to all biophysical features, but some features had a stronger influence on specific functional parameters than others. The distance from the channel pore axis, change in number of hydrogen donor sites, and hydrophobicity at a neighborhood size of 1 Å were the most sensitive biophysical features for IKs. For V1/2, the most sensitive features were mutant amino acid hydrophobicity, neighbor vector, and burial in the membrane. For τact, they were change in number of hydrogen donor sites and mutant amino acid hydrophobicity, and for τdeact, they were burial in the membrane and neighborhood polarizability within a 12 Å radius.
Discussion
A note on training the ANNs on functional data tested in the homozygous state
Huang et al. (Science Advances 2018; 4) provide evidence that training models on experimental data collected on homozygous cells is relevant for KCNQ1-related diseases [19]. Specifically, this group conducted experiments in which WT KCNQ1 was co-expressed with a mutant of interest and the total trafficking of KCNQ1 was quantitated (see Figure 3 in reference [19]). They found that the total and cell surface expression of total KCNQ1 (WT + mutant) was usually in between the results for WT-only and mutant-only. However, there were a few exceptions in which the WT protein appeared to rescue trafficking of the mutant or in which the mutant protein impeded trafficking of the WT protein. These results suggest that studies of the mutant-only condition are usually a good predictor for the corresponding WT/mutant heterozygous conditions, but there are exceptions.
Biophysical features perform as well as PSSM-based evolutionary features in predicting KCNQ1 variant function
Biophysical features performed as well as evolutionary features in predicting the functional outcomes of KCNQ1 variants. This suggests that the ANN trained with only biophysical features recognized relationships between KCNQ1 structure, function, and mutation-induced dysfunction. Moreover, biophysical features outperformed the PSSM-based amino acid substitution score in predicting V1/2 and τdeact. This could be linked to the prevalence of variants located in the VSD among all variants in the training dataset, which allows the network to learn effectively about this channel domain. We found that the presence of V1/2 labels in the training dataset improves the τdeact predictions and that biophysical features that were highly sensitive to V1/2 were also sensitive to τdeact (Figs 4 and 5). These observations could indicate that similar molecular determinants are important for voltage-dependent channel activation and for the kinetics of channel deactivation. Biophysical features like burial on the membrane and distance from the channel pore axis were most important for the ANN to learn the phenotype of variants in the VSD. Other features like the polarizability of the mutant amino acid and the functional density of amino acid polarizability within radii of 6.5 Å and 12 Å around the mutation site were also sensitive to V1/2 and τact (see Fig 4). This could indicate that high polarizability of amino acids in the VSD confers the sensitivity to transmembrane voltage that is required for KCNQ1 activation. For instance, at R195 the functional density of polarizability is 0.19 for a radius of 6.5 Å. Mutation at this site to Q or P decreases site polarizability. R195Q and R195P exhibit LOF, possibly because the changed amino acid polarizability affects KCNQ1 activation. This is in line with the high sensitivity of V1/2 prediction for polarizability, as shown in Fig 5.
Similarly, R195W is a GOF mutant, and this correlates with an increase in site polarizability. However, not all sites in the VSD actively participate in activation; some exist to stabilize the protein fold of the VSD [19]. L114P, E115G, Y125D, G189A, S225L, L236P, and L236R are variants that fail to activate due to protein misfolding [19]. These sites have similar native hydrophobicity and neighborhood hydrophobicity. Mutations at these sites, however, change the hydrophobicity, suggesting that the amino acid side chains of these variants are located in an energetically unfavorable environment. Therefore, network predictions have high sensitivity for mutant hydrophobicity and for the functional density of hydrophobicity, suggesting that protein stability is a crucial aspect in interpreting the functional phenotype of V1/2 and τdeact. The exposure of a site to the solvent environment (neighbor vector) is also sensitive to the functional phenotype of V1/2 and τdeact.
We were able to model peak current density and τact using only biophysical features; however, the performance of this model was not as good as that obtained with only PSSM-based evolutionary features. It was challenging to find biophysical features that significantly improve the peak current density and τact predictions over PSSM amino acid substitution scores. Interestingly, we observed some interdependency of peak current density and τact predictions, similar to that of V1/2 and τdeact. Owing to the multi-task classification scheme, the network can learn from the other predicted labels, which increases prediction accuracy compared to a single-task classification network. In our dataset, V1/2, τact and τdeact are more likely to be dysfunctional if peak current is also dysfunctional. This is the case for 42 variants in the training set, for which IKs is less than 17% of WT, giving rise to dysfunctional V1/2, τact and τdeact. In summary, the biophysical model performs well in predicting variant phenotype when all four parameters are dysfunctional or normal, but it does not perform well when only one or two parameters are impaired.
Combination of PSSM-based evolutionary and biophysical features improves functional phenotype predictions
When both types of features are combined, the ANN model predicts functional parameters with better average accuracy than models trained with only biophysical or only evolutionary features. For the voltage dependence of activation, V1/2, prediction accuracy markedly improves when both feature sets are combined. The input sensitivities in Fig 4 show that the functional density of polarizability within a 6.5 Å radius, burial on the membrane, and mutant polarizability are the biophysical features with the highest impact on V1/2 prediction. The movement of helices in the VSD under the influence of an electric field is due to sites with high-polarizability residues and neighborhoods with high polarizability. Therefore, mutant polarizability and functional density of polarizability are biophysical features with high sensitivity and can be linked to the KCNQ1 activation mechanism. Burial on the membrane indicates a higher prominence of mutations that are inside the membrane and in close proximity to the VSD and PD regions. For example, variants R259H, V280E, A300E, A300T, F340L, A344V and others are located in membrane-embedded regions of KCNQ1 near the pore domain and cause LOF. On the other hand, site A287 is in close proximity to the pore but lies outside the membrane. This could explain why the A287E and A287V variants are WT-like for all functional parameters. Proximity to the pore and location in the membrane appear to classify sites that are in functionally critical regions of the protein.
Prediction accuracy for τdeact increases when biophysical and evolutionary features are combined: MCC improves to 0.60, and AUC increases to 0.83. Based on input sensitivity (see Fig 4), burial on the membrane, distance from the channel pore, mutant hydrophobicity, and functional density of hydrophobicity within a radius of 1 Å were highly sensitive biophysical features for τdeact prediction. The high sensitivity for hydrophobicity-based features indicates that stability at the site of the mutation affects the phenotype of τdeact. The improvement in MCC and AUC scores for τdeact may be due to the improved ability of the model to identify the VSD region by using biophysical features like burial on the membrane and distance from the channel pore axis. We also observed that τdeact predictions improved when V1/2 labels were present, pointing to a relation between V1/2 and τdeact.
For peak current density, there was a small improvement in MCC, suggesting that for this parameter both feature types carry similar information content. Based on our input sensitivity analysis for peak current density shown in Fig 4, we deduce that evolutionary features (i.e., the PSSM score) carry valuable information about protein folding stability. Even though hydrophobicity-based features are highly sensitive features for peak current density, adding them to the network training process did not improve performance. This means that PSSM scores and hydrophobicity features carry redundant information for predicting peak current density. The fact that hydrophobicity features like mutant hydrophobicity and functional density of hydrophobicity are among the most sensitive features highlights the relationship of peak current density with protein structure stability. It is possible that, owing to protein structure instability, the KCNQ1 protein tends to misfold, resulting in dysfunction. Huang et al. observed that many variants in the KCNQ1 VSD negatively impact protein folding stability, leading to a trafficking defect and consequently low peak current density [19]. Considering that VSD variants predominate in our database and that hydrophobicity-based features have high input sensitivity, we also conclude that altered protein stability is a major effect of mutations in the VSD.
Combining biophysical and evolutionary features did not improve the MCC for τact, whereas AUC increased from 0.81 to 0.83. Among the different biophysical features used for predicting τact, we found that the change in the number of hydrogen bond donor sites due to amino acid substitution significantly improved the predictions for τact. Likewise, the change in the number of hydrogen bond donor sites was also found to be sensitive to τact. There were 43 variants in our dataset that lost donor sites due to amino acid substitution. Possible explanations for the association between the number of hydrogen bond donor sites and τact are that τact depends on hydrogen bonds that assist the activation mechanism in the VSD region of the protein, or that these hydrogen bonds interact with the PIP2 complex in sending a signal to the pore domain. Sun et al. found that residues near the interface of the S4 helix and the S4-S5 linker helix interact with PIP2 [36]. PIP2 has a charge of -3, -4 or -5 depending on the pH of the surroundings, and the presence of the negative charge makes PIP2 a proton acceptor. Interestingly, the majority of the sites, such as R116, R195, K196, Y184, K183, R181, and R249, are located within 4 Å of the PIP2-interacting sites at the S2-S3 linker, S3 helix, S4 helix, and S4-S5 linker; these sites are proton donors and have high polarizability. Thus, mutations in this protein region could impair the function of the KCNQ1 protein through the loss of hydrogen donor sites, impacting regulation by PIP2. The loss of hydrogen donor sites, and consequently of hydrogen bonds, can also be linked to stability at the mutation site. The high sensitivity of τact for mutant hydrophobicity (see Figs 4 and 5) indicates protein destabilization as a likely cause for impairment. Furthermore, the high sensitivity to the change in amino acid side chain volume highlights the structural impact on the local environment near the site of mutation.
τact was also found to be sensitive to polarizability-based features, highlighting the role that this feature plays as a driving force (the response of amino acids to an electric field) in ion channel activation.
Significance of this study
This work demonstrates the capability of ANN models and biophysical features to predict the phenotype of four KCNQ1 functional parameters. Recently, explainable AI has become an important consideration for the use of AI in medicine [44,45]. We argue that ANNs trained on biophysical rather than PSSM-based evolutionary features enable understanding of the determinants of function in a more transparent way. Moving forward with AI in biology, this understanding will be critical with respect to explainable AI, i.e., understanding the relation of sequence, structure, and function of proteins. Moreover, input sensitivity analysis links biophysical features with functional parameters, providing insights into underlying molecular mechanisms.
Limitations and future directions
The primary limitation of this study is the size of the dataset and the overrepresentation of variants from the VSD. A substantial amount of functional data is available in the literature; however, those results were generated and analyzed using different expression and testing systems, which may complicate model training. Our models were trained on data obtained with the same experimental approach and measurement protocols. Although we were able to train the models with 125 variants, there is a need for more functional data, especially for variants in the pore domain (PD). Another limitation of this study is that variants were tested in the homozygous state. This work is sufficient for evaluating the disease-causing propensity of variants found in severe cases with recessive LQTS but might not address all cases of dominant LQTS. Moreover, the incomplete KCNQ1 structure restricts our models to predictions for only about half of the amino acid sites (< 350 sites), mostly localized to the S1-S6 region.
An interesting question is whether analyzing multiple possible functional conformations of the KCNQ1 protein, modeled by Monte Carlo or molecular dynamics simulations, can provide an orthogonal set of information that can then be used to further improve the ANN model. There is experimental evidence that the KCNQ1 VSD can exist in one of three states (resting, intermediate, activated), which are coupled to the pore domain and influence opening. Experimental and model structures are available for these states [46,47]. By incorporating structural and biophysical data about those states, our ANN model could learn molecular properties that underlie ion channel gating and how these properties are changed by variants. This extra information could allow the model to predict KCNQ1 gating parameters with higher accuracy than ANNs trained on a single, static structure and to address the effect of side chain packing on protein stability. Furthermore, it will be valuable to investigate whether the dynamical properties of the KCNQ1 protein determined in molecular dynamics simulations can be used for interpreting the effects of variants.
Conclusion
We developed a model using biophysical features that can predict the functional consequence of KCNQ1 variants with accuracy comparable to that of a model using PSSM-based evolutionary features. We found that combining evolutionary and biophysical features gave the best model performance. We used biophysical features derived from a three-dimensional structure of KCNQ1 and demonstrated that these features can be employed to develop a functional prediction method, highlighting vital structure-function relationships. Moreover, Q1VarPredBio will be a helpful tool to evaluate variants of uncertain significance and improve the accuracy of genetic diagnoses for LQTS. Q1VarPredBio is publicly available as a webserver at www.kcnq1predict.org.
Method and material
Neural network architecture and training
A fully connected multitask feed-forward ANN with leaky rectified linear units was utilized for all models. The number of nodes in the input layer was equal to the number of predictive features, i.e., 14 for biophysical features, two for evolutionary features, and 12 for biophysical and evolutionary features together. The output layer of each network had four neurons, one for each phenotype prediction for the functional parameters of KCNQ1. ANNs were trained with dropout to prevent overfitting. For instance, the introduction of dropout in the input layer and hidden layers improved the average MCC for the biophysical model from 0.38 to 0.49. A first hidden layer with 32 neurons and a dropout rate of 33% was found optimal for all three models. For the networks trained with evolutionary features and with biophysical plus evolutionary features, twelve neurons in the second hidden layer with a dropout rate of 33% were found optimal, whereas for the model trained with biophysical features, eight neurons in the second hidden layer without dropout performed better. Additionally, the biophysical model had a 20% dropout rate in the input layer, whereas a 5% dropout rate was sufficient to prevent overfitting for the evolutionary model and the evolutionary plus biophysical model. These three networks were trained on binary labels (1 for dysfunctional and 0 for normal) for each phenotype with backpropagation of errors. Based on these errors, weights were updated for 1200 iterations with a learning rate of 0.001 and a momentum of 0.5. We utilized accuracy at 0.5 as our objective function during training. All neurons utilized a leaky rectifier transfer function
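$$f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases} \tag{1}$$
where x is the total input to a neuron and α is the small positive slope applied to negative inputs. We observed that the introduction of a second hidden layer improved the performance of the models on non-perturbing mutations with no effect on the prediction accuracy for functional variants.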
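As an illustration of the leaky rectifier transfer function described above, the following minimal Python sketch applies it inside one fully connected layer. The slope `alpha=0.05` and the helper names `leaky_rectifier` and `dense_layer` are assumptions made for this example; the slope used in the actual models is not stated in this section, and the models themselves were built with the BCL rather than with this code.

```python
def leaky_rectifier(x, alpha=0.05):
    """Leaky rectifier: identity for positive input, small slope otherwise.

    alpha=0.05 is an illustrative assumption, not a value from the paper.
    """
    return x if x > 0 else alpha * x


def dense_layer(inputs, weights, biases, alpha=0.05):
    """One fully connected layer followed by the leaky rectifier.

    weights holds one list of input weights per neuron; biases holds one
    bias per neuron. Both are hypothetical values supplied by the caller.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        # Total input to the neuron: weighted sum of inputs plus bias.
        total_input = sum(w * v for w, v in zip(neuron_weights, inputs)) + bias
        outputs.append(leaky_rectifier(total_input, alpha))
    return outputs
```

Stacking such layers with 32 neurons, then 12 or 8 neurons, then four output neurons would mirror the layer sizes given in the text; dropout and training are not shown in this sketch.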
For better generalizability and to balance the different classes within the training, monitoring, and independent subsets, we adopted a 25-fold cross-validation strategy, wherein 23 subsets (typically 432 variants) were utilized for training, one subset for monitoring (19 variants), and one subset for prediction (19 variants). To further balance the different classes and remove any effect of a biased subset used for prediction and monitoring, we randomly shuffled our data before dividing them into 25 subsets. We observed that decreasing the size of these subsets hampers performance, suggesting that the dataset is incomplete for training the model. Moreover, the experimental data (n = 125) were over-sampled at a ratio of 3:1 during training to keep the model from overtraining on non-perturbing mutations (n = 345).
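The splitting scheme described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the choice of the "next" subset as the monitoring subset is an assumption (the text does not say how monitoring subsets are paired with prediction subsets), and over-sampling is not shown.

```python
import random


def make_subsets(n_items, n_subsets=25, seed=0):
    """Shuffle item indices and deal them into n_subsets near-equal subsets."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Round-robin dealing keeps subset sizes within one item of each other.
    return [indices[i::n_subsets] for i in range(n_subsets)]


def cross_validation_splits(n_items, n_subsets=25, seed=0):
    """Yield (train, monitor, predict) index lists, one triple per fold.

    Assumption: fold i predicts on subset i and monitors on subset i + 1
    (wrapping around); the remaining 23 subsets are used for training.
    """
    subsets = make_subsets(n_items, n_subsets, seed)
    for i in range(n_subsets):
        j = (i + 1) % n_subsets
        train = [idx for k, subset in enumerate(subsets)
                 if k not in (i, j) for idx in subset]
        yield train, subsets[j], subsets[i]
```

With 470 variants (125 experimental plus 345 non-perturbing), each subset holds 18 or 19 variants and each training set holds about 432, in line with the counts quoted in the text.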
Performance metrics
The presence of size-imbalanced classes in the dataset led us to adopt the Matthews correlation coefficient (MCC), which has been reported to be more informative than accuracy and F1 score, especially when classes in the data are imbalanced [48]. It considers all four entries of the confusion matrix: the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), as shown in Eq (2).
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{2}$$
An MCC value of 1 signifies perfect classification, a value of 0 indicates random classification, and a value of -1 means opposite classification. MCC measures the correlation of the predicted value with the observed value at a specific threshold. MCC values were computed by using the BCL (see S1 Protocol Capture), wherein the threshold was adjusted individually for IKs, V1/2, τact and τdeact from 0 to 1 to achieve the best MCC for each phenotype. This threshold fluctuated by roughly 20% between different instances of the model.
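To make the MCC calculation and the per-phenotype threshold adjustment concrete, here is a minimal Python sketch. It is not the BCL implementation: the function names, the grid of 101 thresholds, the rule that a score at or above the threshold counts as dysfunctional, and the convention of returning 0 when the denominator vanishes are all assumptions made for illustration.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column is empty
    return (tp * tn - fp * fn) / denominator


def confusion_counts(scores, labels, threshold):
    """Count TP, TN, FP, FN; a score >= threshold is called positive (1)."""
    tp = tn = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        elif predicted == 1 and label == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def best_threshold(scores, labels, steps=100):
    """Scan thresholds from 0 to 1 and return (threshold, best MCC)."""
    best = (0.0, -1.0)
    for k in range(steps + 1):
        threshold = k / steps
        value = mcc(*confusion_counts(scores, labels, threshold))
        if value > best[1]:
            best = (threshold, value)
    return best
```

In a multi-task setting such as the one described here, `best_threshold` would be called once per output (IKs, V1/2, τact, τdeact) on that output's scores and labels.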
Moreover, to measure the robustness of our predictive models independently of this threshold value, we utilized receiver operating characteristic (ROC) curves, which summarize the performance of different feature sets on the positive class. In ROC plots, the x-axis indicates the false positive rate (FPR), and the y-axis indicates the true positive rate (TPR). The area under the curve (AUC) in a ROC plot quantifies the performance of the model and can be used to compare different models. The higher the AUC, the better the model is at distinguishing between normal (negative class) and dysfunctional (positive class) phenotypes. An AUC value greater than 0.5 signifies that the classifier is better than a random classifier (AUC = 0.5) at distinguishing dysfunctional variants from normal variants. Similarly, an AUC value less than or equal to 0.5 indicates that the classifier is unable to distinguish between the positive and negative classes.
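The threshold-independent nature of the AUC can be illustrated with a short Python sketch. It uses the standard equivalence between the area under the ROC curve and the probability that a randomly chosen positive (dysfunctional) example receives a higher score than a randomly chosen negative (normal) example, with ties counted as one half. The function name is hypothetical, and this pairwise form is chosen for clarity rather than speed; it is not the procedure used to produce the figures.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    labels are 1 for the positive class and 0 for the negative class.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present to compute AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))
```

A classifier that assigns the same score to every variant gets an AUC of exactly 0.5 under this definition, which matches the random-classifier baseline mentioned above.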
We evaluated the ability of evolutionary features, biophysical features, and the combination of both to correctly classify dysfunctional versus normal variants by plotting ROC curves, which compare feature sets by their binary classification capability. The ROC curves for the 25-fold cross-validated models are shown in Fig 3, with the shaded region depicting the 99% confidence interval.
Input sensitivity analysis
The predicted phenotypes for the functional parameters are averages over the predictions for 25 independent subsets obtained with 25 monitoring subsets. We can analyze the average effect of input features on these 25 prediction datasets using the concept of input sensitivity. However, as noted by Brown et al. [43], the magnitude of input sensitivity cannot be used meaningfully as a measure of feature importance because of the problem of rescaling the input features. We therefore follow the consistency method adopted by Brown et al. [43] to evaluate the consistency of feature perturbation on our four result labels across the cross-validation models. Here, we iterate across all input features of the training set and change each feature value by a small amount to record the movement of the result label. For each feature with a corresponding label, we count the number of models whose prediction will improve upon a change in the descriptor. The net consistency is defined as zero when half of the variants predict a positive change with respect to the result label and the other half predict a negative change. The sensitivity reported in the Results section is averaged across all 320 models generated by shuffling the dataset 320 times, with each model individually averaged over its 25 cross-validation models and over the instances in the training dataset, for each individual feature and result label.
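A simplified reading of the consistency idea described above can be sketched in Python. This is not the BCL algorithm or the exact method of Brown et al. [43]: the function name, the perturbation size `delta`, the use of the sign of the output change, and the normalization to the range -1 to 1 are all assumptions chosen so that the score is zero when positive and negative responses are balanced.

```python
def net_consistency(models, samples, feature_index, label_index, delta=0.01):
    """Signed agreement of output changes when one input feature is nudged.

    models  : list of callables mapping a feature list to a list of outputs
              (hypothetical stand-ins for the cross-validation networks).
    samples : list of feature lists (for example, training-set variants).
    Returns a value in [-1, 1]; 0 means increases and decreases balance out.
    """
    n_up = 0
    n_down = 0
    for model in models:
        for sample in samples:
            perturbed = list(sample)
            perturbed[feature_index] += delta  # small change to one feature
            change = model(perturbed)[label_index] - model(sample)[label_index]
            if change > 0:
                n_up += 1
            elif change < 0:
                n_down += 1
    total = n_up + n_down
    if total == 0:
        return 0.0  # the feature never moved the output
    return (n_up - n_down) / total
```

Averaging such scores over repeated shuffles of the dataset would correspond to the final averaging step described in the text.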
Supporting information
S2 Fig. Depth of the site of mutation on the membrane.
https://doi.org/10.1371/journal.pcbi.1010038.s002
(DOCX)
S3 Fig. Definition of three regions of hydrophobicity utilized in this work.
https://doi.org/10.1371/journal.pcbi.1010038.s003
(DOCX)
S4 Fig. This figure depicts polarizability distribution with clusters of high, medium, and low polarizability.
This also captures the concept of functional density by quantifying these clusters of polarizabilities for different neighborhood size.
https://doi.org/10.1371/journal.pcbi.1010038.s004
(DOCX)
S5 Fig. Correlation of peak current with polarizability at different pockets in the protein structure.
https://doi.org/10.1371/journal.pcbi.1010038.s005
(DOCX)
S6 Fig. A definition of neighbor that includes a smooth transition function used in the neighbor vector algorithm with lower bound at 3.3 Å and upper bound at 11.4 Å.
https://doi.org/10.1371/journal.pcbi.1010038.s006
(DOCX)
S7 Fig. Distribution of Prediction by ANN for non-perturbing, benign, and pathogenic variants depicting that ANN can distinguish these variants by predicting in three different regions between 0 and 1.
Decision threshold is between benign and pathogenic variants.
https://doi.org/10.1371/journal.pcbi.1010038.s007
(DOCX)
S8 Fig. Percentage of accurate predictions for GOF, LOF and WT-like based on peak current density by the three ANNs models considered in this study.
https://doi.org/10.1371/journal.pcbi.1010038.s008
(DOCX)
S9 Fig. Exclusion of 3 biophysical features and 1 evolutionary feature does not affect the performance when evolutionary and biophysical features are combined.
https://doi.org/10.1371/journal.pcbi.1010038.s009
(DOCX)
S1 Protocol Capture. Protocol capture using BioChemical Library (BCL).
https://doi.org/10.1371/journal.pcbi.1010038.s011
(DOCX)
S1 Text. Text on extraction of evolutionary and biophysical features.
https://doi.org/10.1371/journal.pcbi.1010038.s012
(DOCX)
References
- 1. Schwartz PJ, Crotti L, Insolia R. Long-QT Syndrome: From Genetics to Management. Circulation: Arrhythmia and Electrophysiology 2012; 5: 868–877.
- 2. Goldenberg I, Moss AJ. Long QT Syndrome. Journal of the American College of Cardiology. 2008; 51: 2291–2300. pmid:18549912
- 3. Apgar TL, Sanders CR. Compendium of causative genes and their encoded proteins for common monogenic disorders. Protein science: a publication of the Protein Society 2022; 31: 75–91. pmid:34515378
- 4. Schwartz PJ, Stramba-Badiale M, Crotti L, Pedrazzini M, Besana A, Bosi G et al. Prevalence of the congenital long-qt syndrome. Circulation 2009; 120. pmid:19841298
- 5. Kapplinger JD, Tester DJ, Salisbury BA, Carr JL, Harris-Kerr C, Pollevick GD et al. Spectrum and prevalence of mutations from the first 2,500 consecutive unrelated patients referred for the FAMILION® long QT syndrome genetic test. Heart Rhythm 2009; 6: 1297–1303. pmid:19716085
- 6. Wang Q, Curran ME, Splawski I, Burn TC, Millholland JM, Vanraay TJ et al. Positional cloning of a novel potassium channel gene: KVLQT1 mutations cause cardiac arrhythmias. Nature Genetics 1996; 12: 17–23.
- 7. Sanguinetti MC, Curran ME, Zou A, Shen J, Spector PS, Atkinson DL, Keating MT. Coassembly of KvLQT1 and minK (IsK) proteins to form cardiac IKs potassium channel. Nature 1996; 384: 80–83.
- 8. Barhanin J, Lesage F, Guillemare E, Fink M, Lazdunski M, Romey G. K(V)LQT1 and IsK (minK) proteins associate to form the I(Ks) cardiac potassium current. Nature 1996; 384: 78–80. pmid:8900282
- 9. Wu J, Ding W-G, Horie M. Molecular pathogenesis of long QT syndrome type 1. pmid:27761162
- 10. Apgar TL, Sanders CR. Compendium of causative genes and their encoded proteins for common monogenic disorders. Protein science: a publication of the Protein Society 2021. pmid:34515378
- 11. Landrum MJ, Lee JM, Benson M, Brown G, Chao C, Chitipiralla S et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Research 2016; 44. pmid:26582918
- 12. Stenson PD, Mort M, Ball EV, Evans K, Hayden M, Heywood S et al. The Human Gene Mutation Database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next-generation sequencing studies. Hum Genet 2017; 136: 665–677. pmid:28349240
- 13. Ackerman MJ. Genetic purgatory and the cardiac channelopathies: Exposing the variants of uncertain/unknown significance issue. Heart Rhythm 2015; 12: 2325–31. pmid:26144349
- 14. Giudicessi JR, Ackerman MJ. Genetic testing in heritable cardiac arrhythmia syndromes: differentiating pathogenic mutations from background genetic noise. 2012. pmid:23128497
- 15. Vanoye CG, Desai RR, Fabre KL, Gallagher SL, Potet F, DeKeyser J-M et al. High-Throughput Functional Evaluation of KCNQ1 Decrypts Variants of Unknown Significance. Circulation: Genomic and Precision Medicine 2018; 11: e002345. pmid:30571187
- 16. Yang T, Chung S-K, Zhang W, Mullins JGL, Mcculley CH, Crawford J et al. Biophysical Properties of 9 KCNQ1 Mutations Associated With Long-QT Syndrome. 2009. pmid:19808498
- 17. Restier L, Cheng L, Sanguinetti MC. Mechanisms by which atrial fibrillation-associated mutations in the S1 domain of KCNQ1 slow deactivation of I Ks channels. J Physiol 2008; 586: 4179–4191. pmid:18599533
- 18. Eldstrom J, Wang Z, Werry D, Wong N, Fedida D. Microscopic mechanisms for long QT syndrome type 1 revealed by single-channel analysis of IKs with S3 domain mutations in KCNQ1. Heart Rhythm 2015; 12: 386–394. pmid:25444851
- 19. Huang H, Kuenze G, Smith JA, Taylor KC, Duran AM, Hadziselimovic A et al. Mechanisms of KCNQ1 channel dysfunction in long QT syndrome involving voltage sensor domain mutations. Science Advances 2018; 4. pmid:29532034
- 20. Aromolaran AS, Subramanyam P, Chang DD, Kobertz WR, Colecraft HM. LQT1 mutations in KCNQ1 C-terminus assembly domain suppress I Ks using different mechanisms. pmid:25344363
- 21. Huang H, Chamness LM, Vanoye CG, Kuenze G, Meiler J, George AL et al. Disease-linked supertrafficking of a potassium channel. The Journal of biological chemistry 2021; 296. pmid:33600800
- 22. Bhuiyan ZA. Silent mutation in long QT syndrome: Pathogenicity prediction by computer simulation. Heart Rhythm 2012; 9: 283–284. pmid:22001705
- 23. Li B, Mendenhall JL, Kroncke BM, Taylor KC, Huang H, Smith DK et al. Predicting the Functional Impact of KCNQ1 Variants of Unknown Significance. Circulation: Cardiovascular Genetics 2017; 10. pmid:29021305
- 24. Kernik DC, Yang P-C, Kurokawa J, Wu JC, Clancy CE. A computational model of induced pluripotent stem-cell derived cardiomyocytes for high throughput risk stratification of KCNQ1 genetic variants. PLOS Computational Biology 2020; 16: e1008109. pmid:32797034
- 25. Giudicessi JR. Machine Learning and Rare Variant Adjudication in Type 1 Long QT Syndrome. Circulation: Cardiovascular Genetics 2017; 10. pmid:29021308
- 26. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P et al. A method and server for predicting damaging missense mutations. Nature Methods 2010; 7: 248–249. pmid:20354512
- 27. Kumar P, Henikoff S, Ng PC. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nature Protocols 2009; 4: 1073–1081. pmid:19561590
- 28. Davydov EV, Goode DL, Sirota M, Cooper GM, Sidow A, Batzoglou S. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS computational biology 2010; 6. pmid:21152010
- 29. Rentzsch P, Witten D, Cooper GM, Shendure J, Kircher M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Research 2019; 47: D886. pmid:30371827
- 30. Ioannidis NM, Rothstein JH, Pejaver V, Middha S, McDonnell SK, Baheti S et al. REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. The American Journal of Human Genetics 2016; 99: 877–885. pmid:27666373
- 31. Kroncke BM, Mendenhall J, Smith DK, Sanders CR, Capra JA, George AL et al. Protein structure aids predicting functional perturbation of missense variants in SCN5A and KCNQ1. Computational and Structural Biotechnology Journal 2019; 17: 206–214. pmid:30828412
- 32. Sallah SR, Sergouniotis PI, Barton S, Ramsden S, Taylor RL, Safadi A et al. Using an integrative machine learning approach utilising homology modelling to clinically interpret genetic variants: CACNA1F as an exemplar. European Journal of Human Genetics 2020; 28: 1274–1282. pmid:32313206
- 33. Schulze-Bahr E, Haverkamp W, Wedekind H, Rubie C, Hördt M, Borggrefe M et al. Autosomal recessive long-QT syndrome (Jervell Lange-Nielsen syndrome) is genetically heterogeneous. Human Genetics 1997; 100: 573–576. pmid:9341873
- 34. Kroncke BM, Duran AM, Mendenhall JL, Meiler J, Blume JD, Sanders CR. Documentation of an Imperative To Improve Methods for Predicting Membrane Protein Stability. Biochemistry 2016; 55: 5002–5009. pmid:27564391
- 35. Vanoye CG, Thompson CH, Desai RR, DeKeyser JM, Chen L, Rasmussen-Torvik LJ et al. Functional evaluation of human ion channel variants using automated electrophysiology. Methods in enzymology 2021; 654: 383–405. pmid:34120723
- 36. Sun J, MacKinnon R. Structural basis of human KCNQ1 modulation and gating. Cell 2020; 180: 340. pmid:31883792
- 37. Fauchère J-L, Charton M, Kier LB, Verloop A, Pliska V. Amino acid side chain parameters for correlation studies in biology and pharmacology. International journal of peptide and protein research 1988; 32: 269–278. pmid:3209351
- 38. Koehler J, Woetzel N, Staritzbichler R, Sanders CR, Meiler J. A Unified Hydrophobicity Scale for Multi-Span Membrane Proteins. Proteins 2009; 76: 13. pmid:19089980
- 39. Durham E, Dorr B, Woetzel N, Staritzbichler R, Meiler J. Solvent accessible surface area approximations for rapid and accurate protein structure prediction. Journal of Molecular Modeling 2009; 15: 1093. pmid:19234730
- 40. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 1997; 25: 3389–3402. pmid:9254694
- 41. Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 2007; 23: 1282–1288. pmid:17379688
- 42. Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research 2007; 35: D61. pmid:17130148
- 43. Brown BP, Mendenhall J, Geanes AR, Meiler J. General Purpose Structure-Based Drug Discovery Neural Network Score Functions with Human-Interpretable Pharmacophore Maps. Journal of Chemical Information and Modeling 2021; 61: 603–620. pmid:33496578
- 44. Kundu S. AI in medicine must be explainable. Nature medicine 2021; 27: 1328. pmid:34326551
- 45. Holzinger A, Langs G, Denk H, Zatloukal K, Müller H. Causability and explainability of artificial intelligence in medicine. Wiley interdisciplinary reviews Data mining and knowledge discovery 2019; 9. pmid:32089788
- 46. Kuenze G, Duran AM, Woods H, Brewer KR, McDonald EF, Vanoye CG et al. Upgraded molecular models of the human KCNQ1 potassium channel. PloS one 2019; 14. pmid:31518351
- 47. Taylor KC, Kang PW, Hou P, du Yang N, Kuenze G, Smith JA et al. Structure and physiological function of the human KCNQ1 channel voltage sensor intermediate state. eLife 2020; 9. pmid:32096762
- 48. Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 2020; 21: 1–13. pmid:31898477