Proteome-scale prediction of molecular mechanisms underlying dominant genetic diseases

Mihaly Badonyi; Joseph A. Marsh

doi:10.1371/journal.pone.0307312

Abstract

Many dominant genetic disorders result from protein-altering mutations, acting primarily through dominant-negative (DN), gain-of-function (GOF), and loss-of-function (LOF) mechanisms. Deciphering the mechanisms by which dominant diseases exert their effects is often experimentally challenging and resource intensive, but is essential for developing appropriate therapeutic approaches. Diseases that arise via a LOF mechanism are more amenable to be treated by conventional gene therapy, whereas DN and GOF mechanisms may require gene editing or targeting by small molecules. Moreover, pathogenic missense mutations that act via DN and GOF mechanisms are more difficult to identify than those that act via LOF using nearly all currently available variant effect predictors. Here, we introduce a tripartite statistical model made up of support vector machine binary classifiers trained to predict whether human protein coding genes are likely to be associated with DN, GOF, or LOF molecular disease mechanisms. We test the utility of the predictions by examining biologically and clinically meaningful properties known to be associated with the mechanisms. Our results strongly support that the models are able to generalise on unseen data and offer insight into the functional attributes of proteins associated with different mechanisms. We hope that our predictions will serve as a springboard for researchers studying novel variants and those of uncertain clinical significance, guiding variant interpretation strategies and experimental characterisation. Predictions for the human UniProt reference proteome are available at https://osf.io/z4dcp/.

Citation: Badonyi M, Marsh JA (2024) Proteome-scale prediction of molecular mechanisms underlying dominant genetic diseases. PLoS ONE 19(8): e0307312. https://doi.org/10.1371/journal.pone.0307312

Editor: Muhammad Farooq, The Islamia University of Bahawalpur, PAKISTAN

Received: March 7, 2024; Accepted: June 26, 2024; Published: August 22, 2024

Copyright: © 2024 Badonyi, Marsh. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All data and code to build the models and reproduce the analyses have been deposited on the Open Science Framework at https://osf.io/z4dcp/.

Funding: European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 101001169) Biotechnology and Biological Sciences Research Council EASTBIO Doctoral Training Programme (BB/M010996/1) The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Genetic diseases are often caused by protein-altering mutations that act via different protein-level mechanisms. One such mechanism is loss of function (LOF), wherein the mutation completely ablates the function of the protein. While LOF-associated disease is often due to protein null mutations, in which little or no protein is produced due to premature termination codons and nonsense mediated decay, it can also be caused by protein-altering mutations, e.g., missense mutations that destabilise protein structure. Nearly all recessive disorders are associated with LOF, with only a few rare cases of alternative mechanisms identified to date [1]. By contrast, although LOF mutations often cause dominant disease via haploinsufficiency when the lost function cannot be compensated by the remaining wild-type allele [2], many dominant disorders are caused by non-LOF mechanisms. These include gain of function (GOF), which is characterized by an altered or newly appeared function in the mutant protein, and the dominant-negative (DN) effect, whereby the mutant allele can directly or indirectly disrupt the function of the wild type [1]. Understanding the molecular mechanisms through which protein mutations exert their effects has great potential for improving diagnoses and developing adequate treatments of genetic disorders. Recent years have seen major improvements in experimental methods capable of probing a large number of variants at once. In particular, deep mutational scanning and its modifications hold huge promise for deciphering the effect of variants and, by extension, their mechanisms [3]. For now, relative to LOF, our understanding of the properties of variants that act via DN or GOF (collectively, non-LOF) mechanisms remains limited.

In many cases, the different dominant molecular mechanisms are not mutually exclusive in a gene. For example, a type of cardiomyopathy can be influenced by both titin haploinsufficiency (LOF) and truncated titin peptides that seem to “poison” wild-type complexes (DN) [4]. Although molecular mechanisms are indisputably variant-level phenomena, it has become evident that mutations in many genes are more likely to exhibit a specific mechanism [5]. This concept has shed light on certain structural and functional properties associated with non-LOF proteins at the systems level [5, 6]. Moreover, these analyses have called attention to a limitation of current variant effect predictors (VEPs), as they struggle to accurately predict the pathogenicity of non-LOF variants [5, 7, 8]. This is a critical problem to address, as otherwise there is a strong possibility that we could miss variants on the account of difficulties in computationally predicting their effects. A gene-level predictor of molecular mechanisms could potentially be very useful for highlighting genes where VEPs could perform poorly and miss pathogenic variants. Given the limitations of current methods that primarily focus on LOF mechanisms, such as haploinsufficiency predictors [9–11], we aimed to develop a method for more directly differentiating between dominant molecular mechanisms.

Previously, we built a simple model to predict genes most likely to be associated with non-LOF mechanisms [6]. Predictions from this model were incomplete for the proteome due to limited availability of protein structural and functional features, in particular its reliance on quaternary structure information that comes from experimentally determined structures. Additionally, DN and GOF mechanisms were combined into a single non-LOF category because the genes associated with them tend to have similar attributes such as evolutionary constraints, functional niches, and protein complex topologies. To address these problems, here we train three support vector machine binary classifiers using surrogate structural and functional features complete for the proteome. Each classifier is tasked with identifying one of the molecular mechanisms over another unique class (DN vs LOF and GOF vs LOF) or pooled classes (LOF vs non-LOF). This design maximizes the number of cases available for training and enables a flexible classification regime. Analysis of an unbiased set of disease-linked proteins with predicted mechanisms suggests that their properties, independent from model features, are consistent with known characteristics of the mechanisms, thus strongly supporting the utility of this approach.

To facilitate a community-wide effort in piecing together the human variant effect map, we make predictions for all human protein-coding genes. These predictions can pinpoint the likely mechanism of a variant in a dominant gene with no previous association to a molecular mechanism and offer valuable insight when assessing novel variants in a gene with no prior disease linkage. Additionally, they could help prioritise genes in clinical and population genetics data for laboratory studies and allow researchers to explore the functional, structural, and evolutionary properties inherent to the mechanisms.

Methods

Workflow

Our strategy to predict genes associated with different dominant molecular mechanisms is illustrated in Fig 1 and detailed in the following sections.

Download:

Fig 1. Data acquisition and model building workflow.

1) Identification of reported dominant molecular mechanisms in the literature and clinical databases. 2) Collection of diverse gene-level features that are known to discriminate between molecular mechanisms. 3) Training of 3 semi-independent support vector machine classifiers to predict genes most likely to be associated with DN, GOF and LOF molecular mechanisms. 4) Validation of predictions via conventional cross-validation, as well as through the examination of functional and structural properties of an independent test set.

https://doi.org/10.1371/journal.pone.0307312.g001

Software and databases

In this study, we rely on a number of properties derived from the AlphaFold-predicted structures of human proteins [12]. Solvent accessible surface area is calculated at residue level with AREAIMOL from the CCP4 programme suite [13]. FoldX version 5.0 [14] is run to determine the predicted Gibbs free energy of folding (ΔΔG) of missense substitutions by first calling the RepairPDB command followed by BuildModel. Residue-level SCRIBER scores [15] for the human proteome are extracted from the 9606_database.json file provided by the DescribePROT database [16]. Protein-level UniProt ProtNLM embeddings [17] are obtained via the UniProt FTP for the human proteome (2023_02 release). All 1024 embedding dimensions are extracted from the per-protein.h5 file using the h5dump command from the HDF5 command line tool. Clustal Omega version 1.2.4 [18] is used to create a distance matrix of protein sequences for homology filtering. Scores from VEPs ESM-1v [19], EVE [20], MetaRNN [21], and VARITY_R [22] are gathered using an in-house pipeline described in [23]. Machine learning methods are implemented in R version 4.3.0 [24] using the tidymodels framework with extension packages themis, probably, and DALEX. Statistical tests are performed with the rstatix package. We consider an alpha level of 0.05 to be statistically significant. Bootstrap confidence intervals are obtained from 1,000 resamples with the percentile method [25].

ClinVar and gnomAD mapping to UniProt reference proteome

Genomic coordinates of pathogenic and likely pathogenic (hereinafter “pathogenic”) missense variants are extracted from the ClinVar [26] variant calling file (VCF) (accessed on 2023-06-15) with BCFtools [27]. Putatively benign variants (hereinafter “benign”) with a “PASS” filter are extracted from gnomAD v2.1.1 [28] using the exomes all chromosomes VCF file. Mapping is performed with Ensembl VEP 107 [29] using the—uniprot and—canonical flags. For each variant, we select either the canonical transcript or the first UniProt isoform on condition that the mapped amino acid is identical to that in the sequence.

Data acquisition

Evidence for DN, GOF, or LOF molecular disease mechanism observed in a gene was collected using a combination of text-mining and semi-manual curation, as previously described [6]. We updated the gene set using newly added autosomal dominant genes from the Online Inheritance of Man (OMIM) database [30] (accessed via the API on 2023-05-25) and developmental disorder genes with monoallelic requirement from the Deciphering Developmental Disorders Genotype-to-Phenotype (DDG2P) database (https://www.deciphergenomics.org/ddd/ddgenes/, accessed on 2023-05-25). The updated list contains 1,270 genes with one or more molecular mechanism class based upon available evidence. Of the genes, 874 are assigned to a unique class, 318 are assigned to two classes, and 78 genes intersect with all three classes. We provide the set of manually curated genes, their class labels and evidence lines along with their PubMed identifiers in S1 Table.

Feature selection and engineering

We use gene and protein-level features previously described in [6], notably excluding subunit interface size, which is derived from experimentally determined structures of protein complexes, and the PANTHER functional classification of proteins [31], which is incomplete for the human proteome. We sought to rationally select and design surrogate features that make up for the loss of protein structural and functional information. To replace interface size, we calculate the median SCRIBER score of residues that have >5% relative surface accessible surface area (RSA) in the corresponding AlphaFold predicted structure. RSA is defined as the ratio of the maximum solvent accessible surface area determined from Gly-X-Gly tripeptides (where X denotes the amino acid) [32] and that of the residue in the context of the monomeric protein. To substitute protein functional classification, we use UniProt ProtNLM embeddings that capture information about Pfam domains and functional sequence features and transfer those to unannotated and manually unreviewed entries. In each model, to keep the ratio of primary cases to number of predictors high and avoid overfitting on a large number of uncorrelated features, we select the top 20 ProtNLM embeddings by Wilcoxon effect size between the relevant binary outcomes. In addition, we introduce a pair of features related to the likelihood of randomly drawing a missense substitution with the characteristics of a particular type of molecular mechanism. This heuristic metric is derived as the median of the ratio of ESM-1v score and the FoldX-computed ΔΔG of all missense mutations based on possible amino acid substitutions in the canonical sequence. We use both raw ΔΔG and its absolute value (|ΔΔG|), which has been shown to improve the performance of pathogenic variant prediction [33]. Finally, we complement existing population genetics constraint metrics with s_het [34], which relates the frequency of protein null mutations in a gene to the strength of selection against them, with higher values indicating stronger selection and lower tolerance to LOF mutations. We provide the list of all features used in the models and their detailed descriptions in S2 Table.

Model design and data preprocessing

We opted for three separate binary classifiers rather than one multiclass predictor for several reasons. First, a multiclass approach would require seven classes to account for all mixed class combinations. The case numbers of the mixed classes are not sufficient for modelling, as they prohibit cross-validation and necessitate the use of fewer features to avoid overfitting [35]. Second, the prevalence of different mechanisms within the same gene is often unknown, making it difficult to weight them appropriately in a multiclass setting. Since many genes with a prevailing mechanism can have rare occurrences of other mechanisms, our current mixed-class groups are likely subject to a level of study bias, which would be unfavourably reflected in the model output. Third, binary models allow genes with mixed classes to contribute to more than one model, delegating the task of weighting mechanism propensities to the models. Thus, multiple binary models can essentially function like a multiclass predictor, without the data scarcity problems introduced by attempting a direct multiclass prediction [36].

In each model, we prioritise the first event level of the binary outcome. For example, in the DN vs LOF model, we treat any gene with a DN class annotation as DN even if it has other class assignments; this strategy is likewise applied to the GOF vs LOF and the LOF vs non-LOF models (where non-LOF represents pooled DN and GOF gene sets). We next remove features that have a Spearman correlation >0.9 with another, which eliminates the gnomAD oe_lof metric (observed over expected ratio for predicted LOF variants in transcript) in all three models due to its high correspondence with s_het. The features are normalised and missing values are imputed based upon all other features with five nearest neighbours [37]. We use a distance matrix of the protein sequences to create a non-redundant dataset with proteins sharing <50% sequence identity within each outcome, e.g., within DN or LOF genes, but not across, as it may be important to learn differences between homologues assigned to different classes. To improve the signal-to-noise ratio, when possible, instead of randomly removing one protein from a pair of proteins above the sequence identity cutoff, we remove the one that overlaps with another class. This procedure results in the following class proportions across the models (protein count, class percentage): DN (498, 52%) vs LOF (451, 48%), GOF (479, 51%) vs LOF (463, 49%) and LOF (610, 52%) vs non-LOF (559, 48%).

Initial model screen

To assess which statistical learning approach suits the data and the classification problem the best, we carried out a model screen. We chose five diverse modelling methods (tuned hyperparameters): lasso logistic regression (penalty), multilayer perceptron (penalty, hidden units, epochs), polynomial support vector machine (SVM) (cost, degree, scale factor, margin), and two tree-based methods, random forest (trees, mtry [minimum number of randomly sampled features at each split], min_n [minimum number of data points a node must contain]) and the LightGBM algorithm (trees, tree depth, mtry, min_n, loss reduction) [38]. Maintaining class ratios, we split the data into 75% training and 25% test sets and ran a 10-fold cross-validation, similarly divided into 75% analysis and 25% assessment sets. An efficient grid search was performed via the ANOVA race tuning method [39] on a random grid of size k = 10×(number of hyperparameters), which eliminated configurations unlikely to have been the best after 3 resamples using repeated measures ANOVA. We compared the models by both the area under the receiver operating characteristics (AUROC) curve of the ANOVA race-winning configuration measured on the cross-validation folds as well as the AUROC of the k random grid configurations of each model on the test set (Fig 2). By the former, the top-ranking models were LightGBM for DN vs LOF and LOF vs non-LOF models and polynomial SVM for the GOF vs LOF model. However, because the k random grid configurations of the polynomial SVM consistently gave the best performance on the test sets, suggesting it performs well on unseen data without hyperparameter tuning, it was chosen to reduce the risk of overfitting and to maintain a similar prediction profile across the three models.

Download:

Fig 2. Results of the initial model screen.

Plots on the left show the AUROC of the best hyperparameter set of each model on the cross-validation folds. Plots on the right show the AUROC of k randomly generated parameter combinations as measured on the test sets, where k = 10×(number of hyperparameters). Boxes denote data within 25^th and 75^th percentiles and the middle line represents the median. Whiskers extend from the box to 1.5× the interquartile range.

https://doi.org/10.1371/journal.pone.0307312.g002

Model building

The final DN vs LOF, GOF vs LOF, and LOF vs non-LOF models are built using an SVM with a polynomial kernel. The data are divided into 75% training and 25% test sets with 3-times repeated 10-fold cross-validation, which itself is split into 75% analysis and 25% assessment sets. We perform Bayesian hyperparameter optimisation on the tunable parameters of the model: cost, degree, scale factor, and margin. Initially, a random grid of size 10 is created for the Gaussian process model and a maximum of 20 search iterations are performed, with an early stop after 10 cycles if no improvement is observed. The models are finalised based upon the parameters that achieve the highest AUROC. Exact values of the SVM hyperparameters used in the models are provided in S2 Table.

Global and local model feature importance assessment

Global feature importance is estimated by loss in AUROC upon removal of the feature in 10 permutations. Local influence of features on the predicted outcome for a protein is estimated via computing the features’ Shapley values [40] in 20 permutations.

Model evaluation

We compute threshold-agnostic performance measures for the models based upon the test sets, including AUROC and area under the precision recall curves (AUPRC) (Fig 3). The consistency of AUROCs measured on the test sets relative to the cross-validation folds suggest the models are not overfitted (DN vs LOF: AUROC_CV = 0.718, AUROC_test = 0.71; GOF vs LOF: AUROC_CV = 0.73, AUROC_test = 0.763; LOF vs non-LOF: AUROC_CV = 0.739, AUROC_test = 0.763). To assess the models in further analyses, we determine the probability thresholds at which the models reach 50% sensitivity on their respective test sets. At these thresholds, which we refer to as t₅₀, the models are able to correctly identify half of the positive cases and approximately 80% of the negative cases. Threshold-dependent performance measures at t₅₀, including accuracy, Matthews correlation coefficient (MCC) and the F1 score, are summarised in Table 1.

Download:

Fig 3. ROC and PR curves of the models measured on the test sets.

Numbers in bold denote point estimates and those in brackets represent 90% bootstrap confidence intervals.

https://doi.org/10.1371/journal.pone.0307312.g003

Download:

Table 1.

https://doi.org/10.1371/journal.pone.0307312.t001

Biologically interpretable evaluation is performed by examining the models’ capacity to identify emergent properties known to be associated with dominant molecular disease mechanisms [5, 6]. Specifically, we evaluate 3 properties:

The energetic impact of pathogenic missense mutations in the structure, approximated by the FoldX-predicted Gibbs free energy of folding (ΔΔG). Missense mutations that map to residues with a predicted local distance difference test (pLDDT) value [41] less than 70 are excluded. This is because residues with low pLDDT are less likely to be ordered and thus ΔΔG is less interpretable for them.
The degree to which pathogenic missense mutations spatially cluster in the structure, approximated by the extent of disease clustering (EDC) metric. EDC is calculated as previously described [5], however, alpha carbon atoms with a pLDDT<70 are excluded from the calculation and only proteins with at least 5 pathogenic variants after this procedure are used for the analysis. This is because pathogenic missense mutations are enriched in structured regions of proteins [42], hence AlphaFold structures with high disorder could exhibit pseudo-clustering.
The performance of pathogenic missense variant prediction in a binary classification task, using two unsupervised (ESM1-v and EVE) and two supervised (MetaRNN and VARITY_R) VEPs, measured by AUROC. Only proteins with at least 5 pathogenic and 5 benign variants, and all variants mutually shared across the VEPs (that is, have predicted values), are considered for the analysis to increase per-protein AUROC confidence.

Statistical overrepresentation test for molecular function

Functional enrichment of proteins that are subsets of predicted DN or GOF proteins is performed by comparing the protein lists against each other via PANTHER [43] using Fisher’s exact test and the Gene Ontology release 2023-05-10. We consider functions with a false discovery rate <1%, a sample size (number of proteins) >50, and a fold-enrichment of >1.5. To avoid redundancy, we select a single term with the largest sample size for groups of similar terms.

Results and discussion

Global and local feature importance evaluation

In total, 21 interpretable features were used in the models, including properties derived from protein sequences, structures, networks, and gene mutational constraints [6]. Additionally, 20 language model-based embeddings were also included, which are thought to represent protein function in their latent space [44]. As a measure of feature importance, we calculated the loss in AUROC relative to the full model (Fig 4). While we cannot draw conclusions on why a particular ProtNLM embedding is useful for a given model, their relatively high ranks in the GOF vs LOF model may reflect that those genes whose dominant disease mutations act via GOF are functionally more dissimilar or diverse than DN or LOF genes. In terms of the interpretable features, s_het, a metric that relates the frequency of LOF mutations in a gene to the strength of selection against them, ranks 1^st and 3^rd in the DN vs LOF and LOF vs non-LOF models, respectively. This emphasizes the importance of eliminating length-bias from selective constraint metrics to improve their power [34]. Moreover, the number of paralogues a protein has in the genome seems similarly important for the latter two models, ranking 3^rd and 4^th, respectively, consistent with our previous results [6]. Other notable features are gnomAD metrics pLI, the probability of the gene being intolerant to loss of function, and pRec, the probability of the gene being associated with recessive disease, which rank consistently high. This is unsurprising given that genes with high pLI should be enriched in genuinely haploinsufficient genes and those with non-LOF disease mechanisms will tend to have higher pRec values, i.e. are more “recessive-like”. Interestingly, the median ESM-1v/|ΔΔG| metric ranks 4^th in the DN vs LOF model, suggesting that DN genes may possess a missense variant repertoire that is biased for functionally important but structurally less damaging effects.

Download:

Fig 4. Feature importance of the models.

Feature importance is estimated by loss in AUROC upon removal of the feature from the model in 10 permutations. Boxes denote data within 25^th and 75^th percentiles and the middle line represents the median. Whiskers extend from the box to 1.5× the interquartile range. Dashed line is at AUROC loss = 0.

https://doi.org/10.1371/journal.pone.0307312.g004

We next looked at the worst and best predicted genes from the training sets of each model and calculated Shapley values for the features. These values indicate the average contributions of different feature orderings, with positive values suggesting a tendency to predict a protein towards the primary outcome and vice versa. In the training set of the DN vs LOF model, the DNA-binding protein SATB2 has the lowest probability of belonging to the DN class (pDN), while the type I cytoskeletal keratin 14 (KRT14) has the highest (Fig 5). Shapely values for SATB2 suggest that primarily s_het, but also, for example, pLI and cotranslational assembly have contributed to its low pDN. Indeed, SATB2 has a particularly high s_het (0.652, which is 5.2 standard deviations above the proteome mean), a pLI of 1, and has been found to cotranslationally assemble in a recent study [45], consistent with a decreased likelihood of observing its role in disease through a DN mechanism [6]. In our source data, evidence for SATB2 dominant negativity came from a report that describes a frameshift mutation in the gene likely to escape nonsense-mediated decay, whose clinical phenotype is more severe than glass syndrome, attributed to SATB2 haploinsufficiency [46]. However, SATB2 is recognised by the ClinGen review panel to have sufficient evidence for haploinsufficiency [47]. Thus, it is possible that SATB2 either represents a false positive case or, due to regression to the mean, the model fails to identify it as its features are too consistent with the secondary outcome. In contrast, Shapely values for KRT14 indicate congruence with known characteristics of DN proteins, e.g., the high number of paralogues (68 in the proteome, only 1 in the training set). Furthermore, keratin disorders are considered a classical group of DN diseases [48], reinforcing the prediction of the model.

Download:

Fig 5. Local interpretations of model class probabilities.

Plots show the worst (left) and best (right) predicted genes of the training sets and Shapley values of the top 10 most influential features in 20 permutations. Bars show the mean, boxes denote data within 25^th and 75^th percentiles and the middle line is the median. Whiskers extend from the box to 1.5x the interquartile range.

https://doi.org/10.1371/journal.pone.0307312.g005

Similar conclusions can be drawn for the other two models. AF4/FMR2 family member 4 (AFF4), implicated in CHOPS syndrome, is a GOF case in the source data with the lowest pGOF value among genes of the training set (Fig 5). Three heterozygous missense mutations in its gene were recently found to cause GOF due to decreased clearance of the protein by the ubuiquitin proteasomal system, leading to transcriptional overactivation [49]. Since transcription factors are frequently haploinsufficient [50], their gene-level features in our data may generally align more with a LOF mechanism, as supported by the Shapley values of pLI and s_het for AFF4, which could hinder the identification of similar edge-cases. Interestingly, the worst predicted positive case by the LOF vs non-LOF model is apolipoprotein E (APOE) (Fig 5), which does lack sufficient evidence for dosage sensitivity in ClinGen. On further review, evidence in our source data revealed that a heterozygous null mutation of APOE had been linked to protection against a type of amyloidosis observed in Alzheimer’s disease, rather than being implicated in disease causation [51]. Hence, the model is able to correctly down-weight false positive cases.

Proteome-scale molecular mechanism prediction

From the test sets we derived probability threshold t₅₀ for each model, which represents a classification performance where approximately half of the cases belonging to the primary outcome and 80% of the secondary outcome are correctly predicted (Table 1). This threshold constitutes a relatively good balance between sensitivity and specificity while being more stringent than what is commonly considered the optimal threshold, e.g., minimum distance from the [0,1] corner of the ROC curve (dashed vertical line in Fig 6A), and being more lenient than the maximum positive predictive value. To allow the specification of different thresholds, we provide estimates of common threshold metrics in S2 Table for every 0.01 increase in class probability. In Fig 6B, we show t₅₀ in the context of model probability distributions for unique and overlapping cases in the test sets. Intriguingly, both the DN vs LOF (pDN) and the GOF vs LOF (pGOF) models assign higher probabilities to mixed LOF/GOF and LOF/DN classes, respectively, relative to the unique secondary outcomes, despite the fact that the models are blind to the overlapping classification of these cases. This suggests that these genes may have intermediary characteristics or exhibit a higher false positive rate for the secondary class in the ground truth data.

Download:

Fig 6. Threshold plots and test set class probabilities.

(A) Commonly used threshold metrics derived from the test sets for pDN, pGOF, and pLOF. NPV and PPV denote negative and positive predictive values, respectively. Estimates for these metrics are provided in S2 Table for every 0.01 increase in class probability. (B) Model probabilities mapped to the test sets, classified by their ground truth classes, both unique and overlapping. Sample sizes indicate the number of proteins. Boxes denote data within 25^th and 75^th percentiles and the middle line represents the median. Whiskers extend from the box to 1.5x the interquartile range and the notch contains the 95% confidence interval of the median. Dashed lines run at the t₅₀ probability thresholds.

https://doi.org/10.1371/journal.pone.0307312.g006

We chose a distribution-agnostic approach to classify the human proteome into a single molecular mechanism class. By ranking the class probabilities rather than comparing their raw values, the approach is less sensitive to differences in the properties of the models’ class probability distributions. First, class probabilities are computed and ranked for all proteins and then each protein is assigned to the mechanism with the highest rank relative to the proteome, on condition that its class probability is above the t₅₀ value. Based upon this strategy, in the 2023_02 UniProt reference proteome, 6,058 proteins are predicted to be DN, 5,287 GOF, and 2,580 LOF, with 6,440 proteins lacking classification.

It is important to emphasize that the models do not predict disease involvement, i.e. the above proportions should not be misconstrued as implying a twofold prevalence of DN over LOF proteins contributing to disease. Only dominant disease genes were included in training, so the models were not designed to distinguish disease-associated from non-disease proteins. Instead, they can be interpreted as representing the most likely molecular mechanism by which mutations in a given gene could cause disease, if it was a dominant disease gene. The high number of predicted DN and GOF proteins may be ascribed to a higher number of paralogous proteins in these mechanism classes relative to LOF [6], which results in a larger fraction of the proteome aligning with their characteristics, regardless of whether or not they actually cause disease when mutated. When we consider only known dominant disease genes, the proportion of proteins predicted to be DN, GOF, and LOF are 27%, 28%, and 44%, respectively.

In practical scenarios, researchers are often concerned with anticipating a LOF or an alternative molecular mechanism when encountering missense variants in a gene. In such a situation, an increased pDN or pGOF score would suggest that the gene’s characteristics deviate from those typically linked to LOF genes, a notion that would be further supported by a low pLOF score. Although the models do not excel at identifying positive cases at an apparent sensitivity level of 50%, they are much better at identifying negative cases. Therefore when the researcher is confident that the dominant missense variants are linked to a particular phenotype, the higher-ranking molecular mechanism is still more likely to be correct.

Importantly, however, relative to haploinsufficiency predictors introduced in previous studies [9–11] (comparisons with pLOF in S1 Fig), our models are unique in that they were trained on strictly autosomal dominant genes and specifically designed to tell apart alternative molecular mechanisms from that of LOF. Thus, we expect them to be much more useful than haploinsufficiency predictors for the interpretation of missense variant effects, as they more accurately represent the properties of the underlying mechanisms. We provide the class probabilities, the rank-based classification results with and without t₅₀ inclusion, and the labels for training set genes in S3 Table for all human proteins (n = 20,365).

Biologically and clinically relevant validation of the models

We wished to test whether the models are able to recapitulate known properties of dominant molecular mechanisms [5]. These properties include the energetic impact of missense mutations in the protein structure, their degree of spatial clustering, and their current predictability by state-of-the-art VEPs (see Methods). None of these properties were explicitly used for training, thus making them simple and powerful means to assess the usefulness of the predictions. We created an unbiased analysis set where we removed genes used for training and those associated with autosomal recessive inheritance. We first looked at the predicted ΔΔG of pathogenic missense mutations across the different predicted classes in this dataset (Fig 7A). The results suggest mutations in DN and GOF proteins are significantly less damaging than in LOF proteins. This can be explained by the idea that destabilisation is one of the signature mechanisms of LOF mutations [5]. By contrast, GOF mutations should not be too damaging in order to alter a function, and most DN and many GOF genes are in fact assembly-mediated [1]. Thus, the effect of their mutations should not preclude protein complex assembly, which makes them necessarily less damaging.

Download:

Fig 7. Validation of the models through model-independent metrics on an unbiased analysis set.

(A) FoldX-predicted ΔΔG of pathogenic missense mutations. Numbers below classes denote the number of mutations. Holm-Bonferroni corrected p-values above DN and GOF boxes are relative to the LOF group and were determined by one-sided Wilcoxon rank-sum test. Sample sizes indicate the number of variants. (B) Class probabilities of the analysis set vs EDC. Sample sizes indicate the number of proteins in each class. Holm-Bonferroni corrected p-values above DN and GOF boxes are relative to the LOF group and were determined by one-sided Wilcoxon rank-sum test. (C) Aggregated AUROC analysis of pathogenic vs benign variants in predicted molecular mechanism classes. Labels indicate the number of proteins in each class. Boxes denote data within 25^th and 75^th percentiles, the middle line represents the median and the notch contains the 95% confidence interval of the median. Violins show area-normalized distributions.

https://doi.org/10.1371/journal.pone.0307312.g007

Next, we looked at how much pathogenic missense mutations cluster in the structures of proteins predicted to be DN or GOF. Our prior expectation is that both classes should exhibit higher degree of clustering than LOF, as measured by the EDC metric. We found this assumption to hold up, with both DN and GOF proteins having significantly higher clustering values than LOF proteins (Fig 7B). This observation agrees with the concept that LOF mutations are generally more sparsely distributed in the protein structure, but non-LOF mutations tend to be concentrated at protein interfaces and functional sites [5].

Lastly, we took advantage of an important bottleneck of contemporary VEPs, which is that they less well predict pathogenic missense mutations associated with DN and GOF genes [5]. In Fig 7C, we show a per-protein aggregated AUROC analysis of pathogenic vs benign missense variants, evaluated by two unsupervised (ESM1-v and EVE) and two supervised (MetaRNN and VARITY_R) VEPs, which we recently showed to be the top performing VEPs of their category [23]. Expectedly, missense mutations in proteins classified as LOF are much better predicted than those in DN or GOF proteins.

Importantly, the above trends are observed with the raw rank-based classification results without the t₅₀ probability cutoff (S2 Fig), demonstrating that they are not an artefact of a careful threshold selection. We conclude that the models effectively reproduce biologically and clinically relevant properties of dominant molecular mechanisms.

The functional landscape of predicted DN and GOF proteins

Finally, given that pDN and pGOF class probabilities are positively correlated by design (Fig 8), we decided to explore the molecular functions associated with proteins predicted exclusively as DN or GOF using statistical overrepresentation tests. It is well-established that DN effects are common in homomers [6, 52], which are proteins that form complexes with copies of themselves. Correspondingly, we identified functions closely related to homomeric symmetry groups in exclusively DN proteins, including “organic cyclic compound binding”, “oxidoreductase activity”, and “hydrolase activity” [53]. Interestingly, a strongly enriched, large group of proteins possess “nucleic acid binding” function, potentially indicating a higher prevalence of DN effects among transcription factors and other DNA/RNA-binding proteins. This underscores the importance, and difficulty, of distinguishing between mutations that act through DN effects versus haploinsufficiency. In contrast, exclusively GOF proteins exhibit functions less directly associated with protein complexes but more susceptible to the impact of overactivation events. These functions include “molecular transducer activity”, “signaling receptor activity”, “kinase activity”, “enzyme regulator activity”, and “phosphotransferase activity”.

Download:

Fig 8. The functional landscape of proteins predicted to be exclusively DN or GOF.

Middle: pDN vs pGOF scatter plot coloured by pLOF. Correlation triangle on the right shows Spearman correlations between class probabilities. Top: The lollipop chart shows the most enriched molecular functions for genes predicted to be GOF below the t₅₀ threshold for pDN (exclusively GOF). Bottom: Likewise, the chart shows the most enriched functions for DN genes below the t₅₀ threshold for pGOF (exclusively DN). The x-axis shows the negative log₁₀ false discovery rate.

https://doi.org/10.1371/journal.pone.0307312.g008

Altogether, these findings reinforce that the predictions are consistent with known and expected properties of non-LOF molecular mechanisms. Thus, they can be used to test hypotheses about the functional attributes of the proteins associated with the molecular mechanisms, which could facilitate the discovery of novel features for the next generation of VEPs.

Conclusion

We constructed three binary classifiers to predict DN, GOF, and LOF dominant molecular disease mechanisms by substituting structural and functional features of protein-coding genes with limited coverage. These models have a similar performance to our previous non-LOF vs LOF classifier, which was based upon such limited but interpretable features. We combined the models’ output into a single class prediction using a proteome rank-based approach and examined the properties of proteins assigned to the different molecular mechanisms. Our analyses provide strong evidence that the models reproduce the properties linked to molecular mechanisms at systems level. Notably, our findings underscore the predictability of pathogenic mutations in DN and GOF proteins–a problem recognised to be more challenging than in LOF proteins. This observation was consistently upheld in our unbiased analysis set, further stressing the need for diversifying the predictive capabilities of future VEPs.

Interestingly, by examining the mechanism probabilities assigned to genes with mixed mechanism classes, we found some indication that the models have learnt the properties of genes in which multiple disease mechanisms can arise. However, our tripartite model does not directly account for different molecular mechanisms that may coexist within a gene. Therefore, considering the properties of missense variants linked to distinct diseases may improve prediction accuracy at both the phenotype and gene levels. Enhancing these predictions will be particularly beneficial for clinicians, as it will enable them to focus on variants associated with a given phenotype and obtain more precise, context-aware mechanism predictions.

We have made available the predictions for the human reference proteome, and we hope researchers will use the data to prioritise novel or challenging variants and test hypotheses about the functional roles fulfilled by these proteins. It is important to emphasize that the models do not predict the likelihood of disease involvement by a gene. Instead, a non-disease associated gene with a predicted mechanism implies that the protein’s properties are most consistent with the given mechanism, which we expect will be particularly useful for researchers seeking to determine the molecular mechanisms underlying novel disease mutations.

Supporting information

S1 Fig. Comparison of pLOF with haploinsufficiency predictors.

(A) Scatter plots of haploinsufficiency scores from the indicated studies vs pLOF, assessed on the LOF vs non-LOF test set. pLOF has the highest agreement with the Shihab et al. predictions (Pearson’s r = 0.54, p = 4.5×10^-24) and the lowest with those of Steinberg et al. (r = 0.35, p = 6.9×10^-19). (B) Pearson correlation triangle showing mutually shared values of all four metrics on human genes (n = 15,046). Although considerable variation exists even among haploinsufficiency predictors, low correlation with pLOF is expected, because haploinsufficiency predictors were not exclusively trained on dominant genes and their negative class comprised of haplo-sufficient genes of mostly no disease relevance. By contrast, pLOF values from the LOF vs non-LOF model were obtained by training the model to recognise haploinsufficient genes against a background of strictly dominant genes with molecular mechanisms other than simple LOF.

https://doi.org/10.1371/journal.pone.0307312.s001

(TIF)

S2 Fig. Supplemental analysis to Fig 6: Validation of the models through model-independent metrics on an unbiased analysis set, using rank-based classification without t₅₀ cutoff.

(A) FoldX-predicted ΔΔG of pathogenic missense mutations. Numbers below classes denote the number of mutations. Holm-Bonferroni corrected p-values above DN and GOF boxes are relative to the LOF group and were determined by one-sided Wilcoxon rank-sum test. Sample sizes indicate the number of variants. (B) Class probabilities of the analysis set vs EDC. Sample sizes indicate the number of proteins in each class. Holm-Bonferroni corrected p-values above DN and GOF boxes are relative to the LOF group and were determined by one-sided Wilcoxon rank-sum test. (C) Aggregated AUROC analysis of pathogenic vs benign variants in predicted molecular mechanism classes. Labels indicate the number of proteins in each class. Boxes denote data within 25^th and 75^th percentiles, the middle line represents the median and the notch contains the 95% confidence interval of the median. Violins show area-normalized distributions.

https://doi.org/10.1371/journal.pone.0307312.s002

(TIF)

S1 Table. Molecular disease mechanisms collected from literature reports.

This CSV format table contains the molecular mechanisms associated with dominant genes collected from the literature. The columns are as follows: gene–gene name; class–one of “dn”, “gof”, or “lof” representing dominant-negative, gain-of-function, and loss-of-function mechanisms, respectively; pmid–the relevant PubMed identifier; evidence_line–a supporting statement for the mechanism in the published work.

https://doi.org/10.1371/journal.pone.0307312.s003

(CSV)

S2 Table. Model features, final hyperparameters, and common threshold metrics.

A collection of XLSX sheets containing information on the features used in the models, the final hyperparameters of the models, and for all models the estimates of common threshold metrics for every 0.01 increase in class probability.

https://doi.org/10.1371/journal.pone.0307312.s004

(XLSX)

S3 Table. Predicted molecular mechanisms for the human proteome.

The first sheet of this XLSX file contains column descriptions for the second sheet. The second sheet contains the molecular mechanism predictions.

https://doi.org/10.1371/journal.pone.0307312.s005

(XLSX)

Acknowledgments

We thank Benjamin J. Livesey for generating the VEP scores and providing helpful comments on the manuscript.

References

1. Backwell L, Marsh JA. Diverse Molecular Mechanisms Underlying Pathogenic Protein Mutations: Beyond the Loss-of-Function Paradigm. Annual review of genomics and human genetics. 2022;23. pmid:35395171
- View Article
- PubMed/NCBI
- Google Scholar
2. Veitia RA. Exploring the etiology of haploinsufficiency. BioEssays. 2002;24: 175–184. pmid:11835282
- View Article
- PubMed/NCBI
- Google Scholar
3. Fowler DM, Adams DJ, Gloyn AL, Hahn WC, Marks DS, Muffley LA, et al. An Atlas of Variant Effects to understand the genome at nucleotide resolution. Genome Biology. 2023;24: 147. pmid:37394429
- View Article
- PubMed/NCBI
- Google Scholar
4. Fomin A, Gärtner A, Cyganek L, Tiburcy M, Tuleta I, Wellers L, et al. Truncated titin proteins and titin haploinsufficiency are targets for functional recovery in human cardiomyopathy due to TTN mutations. Science Translational Medicine. 2021;13: eabd3079. pmid:34731013
- View Article
- PubMed/NCBI
- Google Scholar
5. Gerasimavicius L, Livesey BJ, Marsh JA. Loss-of-function, gain-of-function and dominant-negative mutations have profoundly different effects on protein structure. Nature Communications 2022 13:1. 2022;13: 1–15. pmid:35794153
- View Article
- PubMed/NCBI
- Google Scholar
6. Badonyi M, Marsh JA. Buffering of genetic dominance by allele-specific protein complex assembly. Science Advances. 2023;9: eadf9845. pmid:37256959
- View Article
- PubMed/NCBI
- Google Scholar
7. Bayrak CS, Stein D, Jain A, Chaudhary K, Nadkarni GN, Vleck TTV, et al. Identification of discriminative gene-level and protein-level features associated with pathogenic gain-of-function and loss-of-function variants. The American Journal of Human Genetics. 2021;108: 2301–2318. pmid:34762822
- View Article
- PubMed/NCBI
- Google Scholar
8. Flanagan SE, Patch A-M, Ellard S. Using SIFT and PolyPhen to Predict Loss-of-Function and Gain-of-Function Mutations. Genetic Testing and Molecular Biomarkers. 2010;14: 533–537. pmid:20642364
- View Article
- PubMed/NCBI
- Google Scholar
9. Huang N, Lee I, Marcotte EM, Hurles ME. Characterising and Predicting Haploinsufficiency in the Human Genome. Schierup MH, editor. PLoS Genet. 2010;6: e1001154. pmid:20976243
- View Article
- PubMed/NCBI
- Google Scholar
10. Shihab HA, Rogers MF, Campbell C, Gaunt TR. HIPred: an integrative approach to predicting haploinsufficient genes. Bioinformatics. 2017;33: 1751–1757. pmid:28137713
- View Article
- PubMed/NCBI
- Google Scholar
11. Steinberg J, Honti F, Meader S, Webber C. Haploinsufficiency predictions without study bias. Nucleic Acids Research. 2015;43: e101. pmid:26001969
- View Article
- PubMed/NCBI
- Google Scholar
12. Tunyasuvunakool K, Adler J, Wu Z, Green T, Zielinski M, Žídek A, et al. Highly accurate protein structure prediction for the human proteome. Nature. 2021;596: 590–596. pmid:34293799
- View Article
- PubMed/NCBI
- Google Scholar
13. Agirre J, Atanasova M, Bagdonas H, Ballard CB, Baslé A, Beilsten-Edmands J, et al. The CCP4 suite: integrative software for macromolecular crystallography. Acta Cryst D. 2023;79: 449–461. pmid:37259835
- View Article
- PubMed/NCBI
- Google Scholar
14. Delgado J, Radusky LG, Cianferoni D, Serrano L. FoldX 5.0: Working with RNA, small molecules and a new graphical interface. Bioinformatics. 2019;35: 4168–4169. pmid:30874800
- View Article
- PubMed/NCBI
- Google Scholar
15. Zhang J, Kurgan L. SCRIBER: accurate and partner type-specific prediction of protein-binding residues from proteins sequences. Bioinformatics. 2019;35: i343–i353. pmid:31510679
- View Article
- PubMed/NCBI
- Google Scholar
16. Zhao B, Katuwawala A, Oldfield CJ, Dunker AK, Faraggi E, Gsponer J, et al. DescribePROT: database of amino acid-level protein structure and function predictions. Nucleic Acids Research. 2021;49: D298–D308. pmid:33119734
- View Article
- PubMed/NCBI
- Google Scholar
17. Gane, A., Bileschi, M.L., Dohan, D., Speretta, E., Héliou, A., Meng-Papaxanthos, L., et al. ProtNLM: Model-based Natural Language Protein Annotation. 2022. https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/protnlm_preprint_draft.pdf
18. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology. 2011;7: 539. pmid:21988835
- View Article
- PubMed/NCBI
- Google Scholar
19. Meier J, Rao R, Verkuil R, Liu J, Sercu T, Rives A. Language models enable zero-shot prediction of the effects of mutations on protein function. bioRxiv; 2021. p. 2021.07.09.450648.
- View Article
- Google Scholar
20. Frazer J, Notin P, Dias M, Gomez A, Min JK, Brock K, et al. Disease variant prediction with deep generative models of evolutionary data. Nature. 2021;599: 91–95. pmid:34707284
- View Article
- PubMed/NCBI
- Google Scholar
21. Li C, Zhi D, Wang K, Liu X. MetaRNN: differentiating rare pathogenic and rare benign missense SNVs and InDels using deep learning. Genome Medicine. 2022;14: 115. pmid:36209109
- View Article
- PubMed/NCBI
- Google Scholar
22. Wu Y, Liu H, Li R, Sun S, Weile J, Roth FP. Improved pathogenicity prediction for rare human missense variants. The American Journal of Human Genetics. 2021;108: 1891–1906. pmid:34551312
- View Article
- PubMed/NCBI
- Google Scholar
23. Livesey BJ, Marsh JA. Updated benchmarking of variant effect predictors using deep mutational scanning. Molecular Systems Biology. 2023;19: e11474. pmid:37310135
- View Article
- PubMed/NCBI
- Google Scholar
24. R Core Team. R core team. R: A language and environment for statistical computing R Foundation for Statistical Computing, Vienna, Austria URL http://www.R-project.org. 2024.
25. Jung K, Lee J, Gupta V, Cho G. Comparison of Bootstrap Confidence Interval Methods for GSCA Using a Monte Carlo Simulation. Frontiers in Psychology. 2019;10. Available: https://www.frontiersin.org/articles/10.3389/fpsyg.2019.02215 pmid:31681066
- View Article
- PubMed/NCBI
- Google Scholar
26. Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, et al. ClinVar: Improving access to variant interpretations and supporting evidence. Nucleic Acids Research. 2018;46: D1062–D1067. pmid:29165669
- View Article
- PubMed/NCBI
- Google Scholar
27. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10: giab008. pmid:33590861
- View Article
- PubMed/NCBI
- Google Scholar
28. Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581: 434–443. pmid:32461654
- View Article
- PubMed/NCBI
- Google Scholar
29. McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GRS, Thormann A, et al. The Ensembl Variant Effect Predictor. Genome Biology. 2016;17: 122. pmid:27268795
- View Article
- PubMed/NCBI
- Google Scholar
30. Amberger JS, Bocchini CA, Schiettecatte F, Scott AF, Hamosh A. OMIM.org: Online Mendelian Inheritance in Man (OMIM^®), an Online catalog of human genes and genetic disorders. Nucleic Acids Research. 2015;43: D789–D798. pmid:25428349
- View Article
- PubMed/NCBI
- Google Scholar
31. Mi H, Ebert D, Muruganujan A, Mills C, Albou L-P, Mushayamaha T, et al. PANTHER version 16: a revised family classification, tree-based classification tool, enhancer regions and extensive API. Nucleic Acids Research. 2021;49: D394–D403. pmid:33290554
- View Article
- PubMed/NCBI
- Google Scholar
32. Miller S, Janin J, Lesk AM, Chothia C. Interior and surface of monomeric proteins. Journal of Molecular Biology. 1987;196: 641–656. pmid:3681970
- View Article
- PubMed/NCBI
- Google Scholar
33. Gerasimavicius L, Liu X, Marsh JA. Identification of pathogenic missense mutations using protein stability predictors. Sci Rep. 2020;10: 15387. pmid:32958805
- View Article
- PubMed/NCBI
- Google Scholar
34. Zeng T, Spence JP, Mostafavi H, Pritchard JK. Bayesian estimation of gene constraint from an evolutionary model with gene features. bioRxiv; 2023. p. 2023.05.19.541520. pmid:37398424
- View Article
- PubMed/NCBI
- Google Scholar
35. Riley RD, Ensor J, Snell KIE, Harrell FE, Martin GP, Reitsma JB, et al. Calculating the sample size required for developing a clinical prediction model. BMJ. 2020;368: m441. pmid:32188600
- View Article
- PubMed/NCBI
- Google Scholar
36. Lorena AC, de Carvalho ACPLF, Gama JMP. A review on the combination of binary classifiers in multiclass problems. Artif Intell Rev. 2009;30: 19.
- View Article
- Google Scholar
37. Gower JC. A General Coefficient of Similarity and Some of Its Properties. Biometrics. 1971;27: 857–871.
- View Article
- Google Scholar
38. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2017. https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html
39. Kuhn M. Futility Analysis in the Cross-Validation of Machine Learning Models. arXiv; 2014.
40. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc.; 2017. pp. 4768–4777.
41. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596: 583–589. pmid:34265844
- View Article
- PubMed/NCBI
- Google Scholar
42. Porta-Pardo E, Ruiz-Serra V, Valentini S, Valencia A. The structural coverage of the human proteome before and after AlphaFold. PLOS Computational Biology. 2022;18: e1009818. pmid:35073311
- View Article
- PubMed/NCBI
- Google Scholar
43. Mi H, Muruganujan A, Huang X, Ebert D, Mills C, Guo X, et al. Protocol Update for Large-scale genome and gene function analysis with PANTHER Classification System (v.14.0). Nat Protoc. 2019;14: 703–721. pmid:30804569
- View Article
- PubMed/NCBI
- Google Scholar
44. Ziegler C, Martin J, Sinner C, Morcos F. Latent generative landscapes as maps of functional diversity in protein sequence space. Nat Commun. 2023;14: 2222. pmid:37076519
- View Article
- PubMed/NCBI
- Google Scholar
45. Bertolini M, Fenzl K, Kats I, Wruck F, Tippmann F, Schmitt J, et al. Interactions between nascent proteins translated by adjacent ribosomes drive homomer assembly. Science. 2021;371: 57–64. pmid:33384371
- View Article
- PubMed/NCBI
- Google Scholar
46. Boone PM, Chan YM, Hunter JV, Pottkotter LE, Davino NA, Yang Y, et al. Increased bone turnover, osteoporosis, progressive tibial bowing, fractures, and scoliosis in a patient with a final-exon SATB2 frameshift mutation. American Journal of Medical Genetics Part A. 2016;170: 3028–3032. pmid:27409069
- View Article
- PubMed/NCBI
- Google Scholar
47. Rehm HL, Berg JS, Brooks LD, Bustamante CD, Evans JP, Landrum MJ, et al. ClinGen—The Clinical Genome Resource. New England Journal of Medicine. 2015;372: 2235–2242. pmid:26014595
- View Article
- PubMed/NCBI
- Google Scholar
48. McLean WHI, Moore CBT. Keratin disorders: from gene to therapy. Human Molecular Genetics. 2011;20: R189–R197. pmid:21890491
- View Article
- PubMed/NCBI
- Google Scholar
49. Izumi K, Nakato R, Zhang Z, Edmondson AC, Noon S, Dulik MC, et al. Germline gain-of-function mutations in AFF4 cause a developmental syndrome functionally linking the super elongation complex and cohesin. Nat Genet. 2015;47: 338–344. pmid:25730767
- View Article
- PubMed/NCBI
- Google Scholar
50. Seidman JG, Seidman C. Transcription factor haploinsufficiency: when half a loaf is not enough. J Clin Invest. 2002;109: 451–455. pmid:11854316
- View Article
- PubMed/NCBI
- Google Scholar
51. Kim J, Jiang H, Park S, Eltorai AEM, Stewart FR, Yoon H, et al. Haploinsufficiency of Human APOE Reduces Amyloid Deposition in a Mouse Model of Amyloid-β Amyloidosis. J Neurosci. 2011;31: 18007–18012. pmid:22159114
- View Article
- PubMed/NCBI
- Google Scholar
52. Veitia RA. Exploring the molecular etiology of dominant-negative mutations. Plant Cell. 2007;19: 3843–3851. pmid:18083908
- View Article
- PubMed/NCBI
- Google Scholar
53. Bergendahl LT, Marsh JA. Functional determinants of protein assembly into homomeric complexes. Sci Rep. 2017;7: 4932. pmid:28694495
- View Article
- PubMed/NCBI
- Google Scholar

[ref1] 1. Backwell L, Marsh JA. Diverse Molecular Mechanisms Underlying Pathogenic Protein Mutations: Beyond the Loss-of-Function Paradigm. Annual review of genomics and human genetics. 2022;23. pmid:35395171
View Article
PubMed/NCBI
Google Scholar

[2] View Article

[3] PubMed/NCBI

[4] Google Scholar

[ref2] 2. Veitia RA. Exploring the etiology of haploinsufficiency. BioEssays. 2002;24: 175–184. pmid:11835282
View Article
PubMed/NCBI
Google Scholar

[6] View Article

[7] PubMed/NCBI

[8] Google Scholar

[ref3] 3. Fowler DM, Adams DJ, Gloyn AL, Hahn WC, Marks DS, Muffley LA, et al. An Atlas of Variant Effects to understand the genome at nucleotide resolution. Genome Biology. 2023;24: 147. pmid:37394429
View Article
PubMed/NCBI
Google Scholar

[10] View Article

[11] PubMed/NCBI

[12] Google Scholar

[ref4] 4. Fomin A, Gärtner A, Cyganek L, Tiburcy M, Tuleta I, Wellers L, et al. Truncated titin proteins and titin haploinsufficiency are targets for functional recovery in human cardiomyopathy due to TTN mutations. Science Translational Medicine. 2021;13: eabd3079. pmid:34731013
View Article
PubMed/NCBI
Google Scholar

[14] View Article

[15] PubMed/NCBI

[16] Google Scholar

[ref5] 5. Gerasimavicius L, Livesey BJ, Marsh JA. Loss-of-function, gain-of-function and dominant-negative mutations have profoundly different effects on protein structure. Nature Communications 2022 13:1. 2022;13: 1–15. pmid:35794153
View Article
PubMed/NCBI
Google Scholar

[18] View Article

[19] PubMed/NCBI

[20] Google Scholar

[ref6] 6. Badonyi M, Marsh JA. Buffering of genetic dominance by allele-specific protein complex assembly. Science Advances. 2023;9: eadf9845. pmid:37256959
View Article
PubMed/NCBI
Google Scholar

[22] View Article

[23] PubMed/NCBI

[24] Google Scholar

[ref7] 7. Bayrak CS, Stein D, Jain A, Chaudhary K, Nadkarni GN, Vleck TTV, et al. Identification of discriminative gene-level and protein-level features associated with pathogenic gain-of-function and loss-of-function variants. The American Journal of Human Genetics. 2021;108: 2301–2318. pmid:34762822
View Article
PubMed/NCBI
Google Scholar

[26] View Article

[27] PubMed/NCBI

[28] Google Scholar

[ref8] 8. Flanagan SE, Patch A-M, Ellard S. Using SIFT and PolyPhen to Predict Loss-of-Function and Gain-of-Function Mutations. Genetic Testing and Molecular Biomarkers. 2010;14: 533–537. pmid:20642364
View Article
PubMed/NCBI
Google Scholar

[30] View Article

[31] PubMed/NCBI

[32] Google Scholar

[ref9] 9. Huang N, Lee I, Marcotte EM, Hurles ME. Characterising and Predicting Haploinsufficiency in the Human Genome. Schierup MH, editor. PLoS Genet. 2010;6: e1001154. pmid:20976243
View Article
PubMed/NCBI
Google Scholar

[34] View Article

[35] PubMed/NCBI

[36] Google Scholar

[ref10] 10. Shihab HA, Rogers MF, Campbell C, Gaunt TR. HIPred: an integrative approach to predicting haploinsufficient genes. Bioinformatics. 2017;33: 1751–1757. pmid:28137713
View Article
PubMed/NCBI
Google Scholar

[38] View Article

[39] PubMed/NCBI

[40] Google Scholar

[ref11] 11. Steinberg J, Honti F, Meader S, Webber C. Haploinsufficiency predictions without study bias. Nucleic Acids Research. 2015;43: e101. pmid:26001969
View Article
PubMed/NCBI
Google Scholar

[42] View Article

[43] PubMed/NCBI

[44] Google Scholar

[ref12] 12. Tunyasuvunakool K, Adler J, Wu Z, Green T, Zielinski M, Žídek A, et al. Highly accurate protein structure prediction for the human proteome. Nature. 2021;596: 590–596. pmid:34293799
View Article
PubMed/NCBI
Google Scholar

[46] View Article

[47] PubMed/NCBI

[48] Google Scholar

[ref13] 13. Agirre J, Atanasova M, Bagdonas H, Ballard CB, Baslé A, Beilsten-Edmands J, et al. The CCP4 suite: integrative software for macromolecular crystallography. Acta Cryst D. 2023;79: 449–461. pmid:37259835
View Article
PubMed/NCBI
Google Scholar

[50] View Article

[51] PubMed/NCBI

[52] Google Scholar

[ref14] 14. Delgado J, Radusky LG, Cianferoni D, Serrano L. FoldX 5.0: Working with RNA, small molecules and a new graphical interface. Bioinformatics. 2019;35: 4168–4169. pmid:30874800
View Article
PubMed/NCBI
Google Scholar

[54] View Article

[55] PubMed/NCBI

[56] Google Scholar

[ref15] 15. Zhang J, Kurgan L. SCRIBER: accurate and partner type-specific prediction of protein-binding residues from proteins sequences. Bioinformatics. 2019;35: i343–i353. pmid:31510679
View Article
PubMed/NCBI
Google Scholar

[58] View Article

[59] PubMed/NCBI

[60] Google Scholar

[ref16] 16. Zhao B, Katuwawala A, Oldfield CJ, Dunker AK, Faraggi E, Gsponer J, et al. DescribePROT: database of amino acid-level protein structure and function predictions. Nucleic Acids Research. 2021;49: D298–D308. pmid:33119734
View Article
PubMed/NCBI
Google Scholar

[62] View Article

[63] PubMed/NCBI

[64] Google Scholar

[ref17] 17. Gane, A., Bileschi, M.L., Dohan, D., Speretta, E., Héliou, A., Meng-Papaxanthos, L., et al. ProtNLM: Model-based Natural Language Protein Annotation. 2022. https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/protnlm_preprint_draft.pdf

[ref18] 18. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology. 2011;7: 539. pmid:21988835
View Article
PubMed/NCBI
Google Scholar

[67] View Article

[68] PubMed/NCBI

[69] Google Scholar

[ref19] 19. Meier J, Rao R, Verkuil R, Liu J, Sercu T, Rives A. Language models enable zero-shot prediction of the effects of mutations on protein function. bioRxiv; 2021. p. 2021.07.09.450648.
View Article
Google Scholar

[71] View Article

[72] Google Scholar

[ref20] 20. Frazer J, Notin P, Dias M, Gomez A, Min JK, Brock K, et al. Disease variant prediction with deep generative models of evolutionary data. Nature. 2021;599: 91–95. pmid:34707284
View Article
PubMed/NCBI
Google Scholar

[74] View Article

[75] PubMed/NCBI

[76] Google Scholar

[ref21] 21. Li C, Zhi D, Wang K, Liu X. MetaRNN: differentiating rare pathogenic and rare benign missense SNVs and InDels using deep learning. Genome Medicine. 2022;14: 115. pmid:36209109
View Article
PubMed/NCBI
Google Scholar

[78] View Article

[79] PubMed/NCBI

[80] Google Scholar

[ref22] 22. Wu Y, Liu H, Li R, Sun S, Weile J, Roth FP. Improved pathogenicity prediction for rare human missense variants. The American Journal of Human Genetics. 2021;108: 1891–1906. pmid:34551312
View Article
PubMed/NCBI
Google Scholar

[82] View Article

[83] PubMed/NCBI

[84] Google Scholar

[ref23] 23. Livesey BJ, Marsh JA. Updated benchmarking of variant effect predictors using deep mutational scanning. Molecular Systems Biology. 2023;19: e11474. pmid:37310135
View Article
PubMed/NCBI
Google Scholar

[86] View Article

[87] PubMed/NCBI

[88] Google Scholar

[ref24] 24. R Core Team. R core team. R: A language and environment for statistical computing R Foundation for Statistical Computing, Vienna, Austria URL http://www.R-project.org. 2024.

[ref25] 25. Jung K, Lee J, Gupta V, Cho G. Comparison of Bootstrap Confidence Interval Methods for GSCA Using a Monte Carlo Simulation. Frontiers in Psychology. 2019;10. Available: https://www.frontiersin.org/articles/10.3389/fpsyg.2019.02215 pmid:31681066
View Article
PubMed/NCBI
Google Scholar

[91] View Article

[92] PubMed/NCBI

[93] Google Scholar

[ref26] 26. Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, et al. ClinVar: Improving access to variant interpretations and supporting evidence. Nucleic Acids Research. 2018;46: D1062–D1067. pmid:29165669
View Article
PubMed/NCBI
Google Scholar

[95] View Article

[96] PubMed/NCBI

[97] Google Scholar

[ref27] 27. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10: giab008. pmid:33590861
View Article
PubMed/NCBI
Google Scholar

[99] View Article

[100] PubMed/NCBI

[101] Google Scholar

[ref28] 28. Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581: 434–443. pmid:32461654
View Article
PubMed/NCBI
Google Scholar

[103] View Article

[104] PubMed/NCBI

[105] Google Scholar

[ref29] 29. McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GRS, Thormann A, et al. The Ensembl Variant Effect Predictor. Genome Biology. 2016;17: 122. pmid:27268795
View Article
PubMed/NCBI
Google Scholar

[107] View Article

[108] PubMed/NCBI

[109] Google Scholar

[ref30] 30. Amberger JS, Bocchini CA, Schiettecatte F, Scott AF, Hamosh A. OMIM.org: Online Mendelian Inheritance in Man (OMIM^®), an Online catalog of human genes and genetic disorders. Nucleic Acids Research. 2015;43: D789–D798. pmid:25428349
View Article
PubMed/NCBI
Google Scholar

[111] View Article

[112] PubMed/NCBI

[113] Google Scholar

[ref31] 31. Mi H, Ebert D, Muruganujan A, Mills C, Albou L-P, Mushayamaha T, et al. PANTHER version 16: a revised family classification, tree-based classification tool, enhancer regions and extensive API. Nucleic Acids Research. 2021;49: D394–D403. pmid:33290554
View Article
PubMed/NCBI
Google Scholar

[115] View Article

[116] PubMed/NCBI

[117] Google Scholar

[ref32] 32. Miller S, Janin J, Lesk AM, Chothia C. Interior and surface of monomeric proteins. Journal of Molecular Biology. 1987;196: 641–656. pmid:3681970
View Article
PubMed/NCBI
Google Scholar

[119] View Article

[120] PubMed/NCBI

[121] Google Scholar

[ref33] 33. Gerasimavicius L, Liu X, Marsh JA. Identification of pathogenic missense mutations using protein stability predictors. Sci Rep. 2020;10: 15387. pmid:32958805
View Article
PubMed/NCBI
Google Scholar

[123] View Article

[124] PubMed/NCBI

[125] Google Scholar

[ref34] 34. Zeng T, Spence JP, Mostafavi H, Pritchard JK. Bayesian estimation of gene constraint from an evolutionary model with gene features. bioRxiv; 2023. p. 2023.05.19.541520. pmid:37398424
View Article
PubMed/NCBI
Google Scholar

[127] View Article

[128] PubMed/NCBI

[129] Google Scholar

[ref35] 35. Riley RD, Ensor J, Snell KIE, Harrell FE, Martin GP, Reitsma JB, et al. Calculating the sample size required for developing a clinical prediction model. BMJ. 2020;368: m441. pmid:32188600
View Article
PubMed/NCBI
Google Scholar

[131] View Article

[132] PubMed/NCBI

[133] Google Scholar

[ref36] 36. Lorena AC, de Carvalho ACPLF, Gama JMP. A review on the combination of binary classifiers in multiclass problems. Artif Intell Rev. 2009;30: 19.
View Article
Google Scholar

[135] View Article

[136] Google Scholar

[ref37] 37. Gower JC. A General Coefficient of Similarity and Some of Its Properties. Biometrics. 1971;27: 857–871.
View Article
Google Scholar

[138] View Article

[139] Google Scholar

[ref38] 38. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2017. https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html

[ref39] 39. Kuhn M. Futility Analysis in the Cross-Validation of Machine Learning Models. arXiv; 2014.

[ref40] 40. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc.; 2017. pp. 4768–4777.

[ref41] 41. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596: 583–589. pmid:34265844
View Article
PubMed/NCBI
Google Scholar

[144] View Article

[145] PubMed/NCBI

[146] Google Scholar

[ref42] 42. Porta-Pardo E, Ruiz-Serra V, Valentini S, Valencia A. The structural coverage of the human proteome before and after AlphaFold. PLOS Computational Biology. 2022;18: e1009818. pmid:35073311
View Article
PubMed/NCBI
Google Scholar

[148] View Article

[149] PubMed/NCBI

[150] Google Scholar

[ref43] 43. Mi H, Muruganujan A, Huang X, Ebert D, Mills C, Guo X, et al. Protocol Update for Large-scale genome and gene function analysis with PANTHER Classification System (v.14.0). Nat Protoc. 2019;14: 703–721. pmid:30804569
View Article
PubMed/NCBI
Google Scholar

[152] View Article

[153] PubMed/NCBI

[154] Google Scholar

[ref44] 44. Ziegler C, Martin J, Sinner C, Morcos F. Latent generative landscapes as maps of functional diversity in protein sequence space. Nat Commun. 2023;14: 2222. pmid:37076519
View Article
PubMed/NCBI
Google Scholar

[156] View Article

[157] PubMed/NCBI

[158] Google Scholar

[ref45] 45. Bertolini M, Fenzl K, Kats I, Wruck F, Tippmann F, Schmitt J, et al. Interactions between nascent proteins translated by adjacent ribosomes drive homomer assembly. Science. 2021;371: 57–64. pmid:33384371
View Article
PubMed/NCBI
Google Scholar

[160] View Article

[161] PubMed/NCBI

[162] Google Scholar

[ref46] 46. Boone PM, Chan YM, Hunter JV, Pottkotter LE, Davino NA, Yang Y, et al. Increased bone turnover, osteoporosis, progressive tibial bowing, fractures, and scoliosis in a patient with a final-exon SATB2 frameshift mutation. American Journal of Medical Genetics Part A. 2016;170: 3028–3032. pmid:27409069
View Article
PubMed/NCBI
Google Scholar

[164] View Article

[165] PubMed/NCBI

[166] Google Scholar

[ref47] 47. Rehm HL, Berg JS, Brooks LD, Bustamante CD, Evans JP, Landrum MJ, et al. ClinGen—The Clinical Genome Resource. New England Journal of Medicine. 2015;372: 2235–2242. pmid:26014595
View Article
PubMed/NCBI
Google Scholar

[168] View Article

[169] PubMed/NCBI

[170] Google Scholar

[ref48] 48. McLean WHI, Moore CBT. Keratin disorders: from gene to therapy. Human Molecular Genetics. 2011;20: R189–R197. pmid:21890491
View Article
PubMed/NCBI
Google Scholar

[172] View Article

[173] PubMed/NCBI

[174] Google Scholar

[ref49] 49. Izumi K, Nakato R, Zhang Z, Edmondson AC, Noon S, Dulik MC, et al. Germline gain-of-function mutations in AFF4 cause a developmental syndrome functionally linking the super elongation complex and cohesin. Nat Genet. 2015;47: 338–344. pmid:25730767
View Article
PubMed/NCBI
Google Scholar

[176] View Article

[177] PubMed/NCBI

[178] Google Scholar

[ref50] 50. Seidman JG, Seidman C. Transcription factor haploinsufficiency: when half a loaf is not enough. J Clin Invest. 2002;109: 451–455. pmid:11854316
View Article
PubMed/NCBI
Google Scholar

[180] View Article

[181] PubMed/NCBI

[182] Google Scholar

[ref51] 51. Kim J, Jiang H, Park S, Eltorai AEM, Stewart FR, Yoon H, et al. Haploinsufficiency of Human APOE Reduces Amyloid Deposition in a Mouse Model of Amyloid-β Amyloidosis. J Neurosci. 2011;31: 18007–18012. pmid:22159114
View Article
PubMed/NCBI
Google Scholar

[184] View Article

[185] PubMed/NCBI

[186] Google Scholar

[ref52] 52. Veitia RA. Exploring the molecular etiology of dominant-negative mutations. Plant Cell. 2007;19: 3843–3851. pmid:18083908
View Article
PubMed/NCBI
Google Scholar

[188] View Article

[189] PubMed/NCBI

[190] Google Scholar

[ref53] 53. Bergendahl LT, Marsh JA. Functional determinants of protein assembly into homomeric complexes. Sci Rep. 2017;7: 4932. pmid:28694495
View Article
PubMed/NCBI
Google Scholar

[192] View Article

[193] PubMed/NCBI

[194] Google Scholar

Figures

Abstract

Introduction

Methods

Workflow

Software and databases

ClinVar and gnomAD mapping to UniProt reference proteome

Data acquisition

Feature selection and engineering

Model design and data preprocessing

Initial model screen

Model building

Global and local model feature importance assessment

Model evaluation

Statistical overrepresentation test for molecular function

Results and discussion

Global and local feature importance evaluation

Proteome-scale molecular mechanism prediction

Biologically and clinically relevant validation of the models

The functional landscape of predicted DN and GOF proteins

Conclusion

Supporting information

S1 Fig. Comparison of pLOF with haploinsufficiency predictors.

S2 Fig. Supplemental analysis to Fig 6: Validation of the models through model-independent metrics on an unbiased analysis set, using rank-based classification without t50 cutoff.

S1 Table. Molecular disease mechanisms collected from literature reports.

S2 Table. Model features, final hyperparameters, and common threshold metrics.

S3 Table. Predicted molecular mechanisms for the human proteome.

Acknowledgments

References

S2 Fig. Supplemental analysis to Fig 6: Validation of the models through model-independent metrics on an unbiased analysis set, using rank-based classification without t₅₀ cutoff.