
PON-Del predictor for sequence retaining protein deletions


This is an uncorrected proof.

Abstract

Protein deletions are frequent among both disease-causing and tolerated variants. Several mechanisms at the DNA, RNA and protein levels can lead to deletions. Many deletions are misclassified in the literature and databases, especially when the mRNA is degraded by the cellular quality-control mechanism. We developed a novel predictor for sequence retaining protein deletions, i.e., variants that do not alter the sequence downstream of the deletion site. We collected an extensive dataset of verified protein deletions, each described by a comprehensive set of context, content, position, and gene-based features. We evaluated both statistical and deep learning algorithms and selected a gradient boosting–based approach to develop the PON-Del predictor for short, 1–10 amino acid, sequence-retaining deletions. Variants are typically classified into two categories: either pathogenic or benign. However, there is always a third class of variants: variants of uncertain significance (VUSs), which have been ignored by all previous methods. PON-Del is the first deletion interpretation method that includes VUSs. It provides two outputs, binary and three-state prediction with VUSs. The performance of PON-Del was superior to that of previous methods. The tool is freely available at https://structure.bmc.lu.se/pon_del/.

Author summary

Protein deletions are frequent among both disease-causing and tolerated variants, and are caused by several mechanisms at the DNA, RNA and protein levels. The reliable prediction of the effects of deletions is challenging. We developed a predictor for sequence retaining protein deletions, variants that do not alter the sequence beyond the deletion site. We collected an extensive dataset of verified protein deletions, and a comprehensive set of features to describe them. We evaluated seven algorithms and selected a gradient boosting–based approach to develop the PON-Del predictor for short, 1–10 amino acid, sequence-retaining deletions. Variants have typically been classified as pathogenic or benign. This practice misses the third category: variants of uncertain significance (VUSs). PON-Del is the first deletion interpretation method that includes VUSs. The performance of PON-Del was superior to that of previous methods. The tool is freely available at https://structure.bmc.lu.se/pon_del/.

Introduction

Protein deletions are frequent variations. Among the close to 4.0 million variants in ClinVar [1], 5.0% are deletions (December 2025). Of these, 60% are (likely) pathogenic, and 19% are (likely) benign. Many protein deletions are tolerated, and natural length isoforms are common. Statistical analysis of natural and disease-related protein deletions revealed clear differences between the groups [2], for example, in the functions of deletion-containing proteins and in the sequence context of deletions. Further, pathogenic and benign variants differed, e.g., in size distribution, location relative to duplicated genes, domains, and protein termini, as well as the sequence context of deletions.

In the Variation Ontology [3], deletions are classified as sequence retaining or amphigoric. The former does not change the amino acid sequence after the deletion. The latter are due to mRNA frameshift variations and alter the C-terminal sequences of the encoded proteins. Many mRNA frameshift variations are misclassified in databases and literature at the protein level, when no protein is produced due to mRNA quality control mechanisms, such as nonsense-mediated decay (NMD) [4].

Depending on the structural location, deletions can be tolerated or harmful. N- and C-terminal truncations shorten the polypeptide chain from one end. The third category comprises internal deletions. Many mechanisms cause deletions at the DNA, RNA, and protein levels. For a classification, see [5]. DNA deletions affect the coded protein unless the mRNA is degraded by NMD [6,7] or related mechanisms. The effect of NMD has been neglected in many articles and databases, leading to unreliable classification of deletions [4].

Deletions can also arise during transcription and translation. Alternative initiation and termination, along with alternative splicing, are common RNA changes that lead to deletions [8]. Mutually exclusive exons are a special case of alternative splicing [9]. Information for natural protein length variations can be obtained, e.g., from UniProtKB [10].

The effects of most RNA-level frameshift variations are obvious, as no protein is produced. For these, no prediction methods are needed. Sequence retaining deletions (in-frame variants at the RNA level) can be associated with disease or tolerated. Prediction methods have been developed for the disease relevance of deletions. The tools for sequence retaining deletions include CAPICE [11], DDIG-in [12], FATHMM-indel [13], INDELpred [14], KD4i [15], MutPred-Indel [16], PROVEAN [17], SHINE [18], and SIFT Indel [19]. VEST-Indel [20] is for both amphigoric and sequence-retaining deletions. The performance of the tools has been benchmarked with variants from gnomAD, ClinVar, and the DDD study [21]. The methods displayed a wide range of performances. The Matthews correlation coefficient (MCC) for the best methods was 0.68 [21], far from perfect.

These tools differ in both feature sets and algorithms. Various tree-based models have been the most popular, including decision trees (SIFT Indel), random forests (VEST-Indel), and gradient boosting methods (CAPICE, INDELpred). Other machine learning approaches have been support vector machines (DDIG-in), neural networks (MutPred-Indel), hidden Markov models (FATHMM-indel), transfer learning (SHINE), and symbolic learning based on inductive logic programming (KD4i). PROVEAN relies on evolutionary conservation scores derived from sequence alignments. In addition to the algorithm and datasets, the methods differ regarding the features used for training.

We collected an extensive verified dataset of protein deletions and used it to train a predictor for short, 1–10 amino acid, sequence-retaining deletions. We evaluated both statistical and deep learning (DL) algorithms. Gradient boosting was chosen to train PON-Del. The performance was superior to that of previous methods. The method can be used to predict deletions in any protein by using a freely available web service. Variant effects are typically classified into two categories: either pathogenic or benign. Recently, we showed that not all substitution variants of uncertain significance (VUSs) can be classified as pathogenic or benign, even with additional data and functional studies, due to natural biological heterogeneity [22]. Thus, VUSs must also be included in other types of variation interpretation tasks, including the classification of deletions. Therefore, PON-Del provides two outputs, binary and three-state prediction with VUSs.

Design and implementation

The workflow of PON-Del was as follows (Fig 1). Briefly, data and features were collected from multiple sources. Deletions were obtained from ClinVar [1], LOVD [23], dbSNP [24], and UniProtKB [10]. Data cleaning was performed to remove duplicates. The retained deletions were mapped to the Matched Annotation from NCBI and EMBL-EBI (MANE)-selected transcripts and consisted solely of short deletions. The data were then split into training and test sets to ensure a balanced representation of pathogenic and benign deletions. Feature selection was performed on the training data by removing low-frequency, zero-variance, and highly correlated features. Then, seven statistical and deep learning frameworks were evaluated using cross-validation to identify the best-performing model. Finally, the best framework was optimised for PON-Del through feature set refinement and hyperparameter tuning. To identify cases for which a binary classifier should abstain from prediction due to low confidence, we repeated the train–test split four additional times and retrained the model, resulting in 25 independently trained models (five folds across five splits). Prediction uncertainty was quantified using p-values derived from bootstrap resampling of the probability distributions and used to classify VUSs.

Fig 1. Overview of the PON-Del development pipeline. (1) Data and feature collection: 611 features in four categories were obtained. (2) The data were split into training and testing sets. (3) Feature selection involved removing non-informative, zero-variance, and highly correlated features. (4) Seven frameworks were tested to identify the best-performing one. (5) The selected model was optimised through feature refinement and hyperparameter tuning to develop the final PON-Del predictor.

https://doi.org/10.1371/journal.pcbi.1014020.g001

Data collection and cleaning

We collected protein deletion variants from multiple sources to ensure comprehensive coverage. From dbSNP [24], we obtained 265 variants by filtering for “inframe_deletion” variants with clinical significance (pathogenic, likely pathogenic, benign, or likely benign), limiting to the deletion variation class and inframe deletion function class. From the LOVD database [23], we initially found 7,054 cases by selecting all transcript variants with clinical classification (benign or pathogenic), applying filters to exclude frameshift, termination, insertion, and ambiguous variants, and limiting to germline origin with valid genomic DNA change annotation. ClinVar [1] contained 2,256 deletion variants when searched for inframe_deletion variants with germline classification, clinical significance (pathogenic, likely pathogenic, benign, or likely benign), and deletion variation type. Additional benign deletions were collected from UniProtKB [10]. They represent natural protein isoforms and are phenotypically classified as benign. All human protein isoforms were collected from UniProtKB and matched with protein sequences corresponding to the MANE version 1.3 transcripts [25]. Pairwise sequence alignments were obtained using the needle algorithm for dynamic programming from the EMBOSS package [26]. After comparison to the MANE-based sequences, we identified 7,581 unique deletions from UniProtKB.

The variants were mapped to MANE-select transcripts. The genomic positions of the variants were defined using TransVar [27]. The genomic deletion length was ≤ 30 nucleotides to avoid exon skip events, as only a few longer variants were identified. The protein deletion length was thus ≤ 10 amino acids. Protein variants from dbSNP and LOVD not present in MANE-select transcripts were annotated using VEP [28]. We then integrated ClinVar and UniProtKB protein deletions, mapping them to MANE-select transcripts, and removed duplicates both within and across data sources. Finally, we added UniProt identifiers to ensure proper protein identification. After processing, the final dataset comprised 4,243 deletions from MANE-select transcripts: 1,555 from ClinVar, 178 from dbSNP, 1,368 from LOVD, and 1,142 from UniProtKB. The collected deletion datasets are freely available in VariBench [29] at https://structure.bmc.lu.se/VariBench/data/variationtype/deletions/Dataset2/data_pondel.csv.

Feature collection

We obtained an extensive set of 611 features in four categories: context-based (N = 5), content-based (N = 561), position-based (N = 32), and gene/protein-based features (N = 13).

Context-based features described the sequence information within and surrounding deletion sites. We analysed five residues upstream and downstream of each deletion, calculating position-specific bit scores from sequence alignments of all the benign and pathogenic deletions to capture conservation patterns and amino acid preferences. Sequence logos obtained with the SeqLogo program [30] were used to analyse these patterns and to define bit scores for each amino acid in each position. Bit scores for segments before and after each deletion were summed and used as features. In addition to bit scores, we extracted three sequence segments (deletion, upstream, and downstream), each padded or trimmed to 5 residues to ensure fixed-length input. These features were used specifically in the DL models.
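The summed bit-score features can be illustrated with a minimal pure-Python sketch. It computes the per-column information content (log2(20) minus the column entropy), as in sequence-logo displays, and sums it over a fixed-length flanking segment. The function names and toy segments are illustrative; the actual PON-Del features use per-amino-acid bit scores defined with SeqLogo.

```python
from collections import Counter
from math import log2

def column_bits(column):
    """Information content (in bits) of one alignment column:
    log2(20) minus the Shannon entropy of the observed residues."""
    counts = Counter(column)
    total = sum(counts.values())
    entropy = -sum((n / total) * log2(n / total) for n in counts.values())
    return log2(20) - entropy

def segment_bit_score(segments):
    """Sum the per-position bit scores over equal-length segments,
    e.g. the five residues upstream of each deletion site."""
    return sum(column_bits(col) for col in zip(*segments))
```

A perfectly conserved three-residue segment scores 3 × log2(20) bits, whereas a maximally variable one scores less, capturing the conservation patterns described above.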

Content-based features described the average properties of the deleted regions. Conservation was assessed using a position-specific scoring matrix (PSSM), and we used the information content per position (PSSM 3) [31] to assess the degree of conservation at each position. The pathogenicity score for each deletion was calculated as the average of the PON-P3 pathogenicity scores of all possible substitutions within the deleted region [31].

As previously described [31], the structures of MANE-selected proteins were predicted and used to obtain the solvent-accessible surface areas (SASA) of the original residues using FreeSASA [32]. Averages for deleted residues were calculated and used as a feature. The average amino acid physicochemical properties in the deletions were calculated with AAindex (553 features) [33] or ProtDCal propensities (16 features) [34], as previously described [31].

Position-based features described the localisation of deletions within the protein. The genomic start and end positions were annotated with start and end positions in protein using TransVar [27].

We calculated deletion length and relative position, along with the closest distance to the protein terminus. We annotated whether deletions overlapped with functional regions, including domains, palindromes, repeat regions, transmembrane (TM) regions, intrinsically disordered regions (IDRs), the last exon within a gene (with a 50 bp window), and 16 functional regions from SwissProt. Domains were identified with InterPro [35], repeats with T-REKS [36], and IDRs based on data in DisProt [37]. The Human Transmembrane Proteome database [38] provided information for TM regions. Palindromic regions were determined by identifying substrings within protein sequences and comparing each substring with its reverse using the R package “seqinr”. The last genomic exon regions were obtained from the MANE 1.4 GFF file. The secondary structural assignments were determined from protein structures or models obtained with AlphaFold [39,40] with STRIDE [41] and included eight classes: α-helix, β-strand, the two β-bridge classes (B and b), turn/coil, G-helix, π-helix, and low-confidence regions from AlphaFold models [40].
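The palindrome screen described above was implemented with the R package seqinr; an equivalent brute-force scan, comparing each substring with its reverse, can be sketched in Python. The minimum-length cutoff here is illustrative, not the one used in PON-Del.

```python
def palindromic_regions(seq, min_len=4):
    """Return (start, end) spans (0-based, end-exclusive) of substrings
    that read the same forwards and backwards. Brute force is adequate
    for protein-length sequences; min_len is an illustrative cutoff."""
    hits = []
    n = len(seq)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            if seq[i:j] == seq[i:j][::-1]:
                hits.append((i, j))
    return hits
```

For example, the sequence ACDCA contains one length-5 palindrome spanning the whole string, while ACDEF contains none of length ≥ 4.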

The gene/protein-based features describe gene-level characteristics. Classifications included housekeeping, haploinsufficient, redundant, and duplicated genes, as well as pseudogenes (see PON-P3 [31]). Six protein-protein interaction (PPI) metrics (degree, closeness, betweenness, harmonic centrality, hub score, and power centrality) were computed with igraph [42] from the data in the STRING database [43] to indicate the network impact of proteins. The age of each gene was determined using ProteinHistorian [44]. For further details on the features, see [31].
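As an illustration of the PPI metrics, the sketch below computes two of the six measures (degree and closeness) on a toy graph using plain breadth-first search. The actual pipeline uses igraph on the STRING network, which also provides betweenness, harmonic centrality, hub score, and power centrality, and may normalise values differently.

```python
from collections import deque

def centralities(adj):
    """Degree and closeness centrality for an undirected graph given as
    {node: set(neighbours)}; a minimal stand-in for the igraph calls."""
    out = {}
    for src in adj:
        # BFS shortest-path lengths from src
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        reach = sum(d for d in dist.values() if d > 0)
        closeness = (len(dist) - 1) / reach if reach else 0.0
        out[src] = {"degree": len(adj[src]), "closeness": closeness}
    return out

# Star graph: the hub has maximal degree and closeness.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
```

In the star graph, the hub reaches every other protein in one step (closeness 1.0), whereas the leaves must route through the hub, mirroring how these metrics capture network impact.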

After feature collection, we obtained a dataset of 3,912 unique deletions with complete feature information. The dataset was then split into a training set (3,100 deletions: 1,757 pathogenic and 1,343 benign) and a test set (812 deletions: 500 pathogenic and 312 benign) for cross-validation and final evaluation, see Table 1. Four additional train–test splits were generated to assess model robustness and to compute bootstrap p-values for VUS classification.

Table 1. Number of variations and proteins in the training and test datasets.

https://doi.org/10.1371/journal.pcbi.1014020.t001

Feature filtering

To refine the feature set, we selected features using the training data. For binary features (such as position-based features and gene/protein classifications), we removed columns with a minority class of 5 or fewer samples, as these features were considered unstable.

For numerical features, we first removed those with zero variance, as they had no discriminative power. Next, we computed the Spearman correlation matrix for the remaining features. For any pair of features with an absolute Spearman correlation greater than 0.8, one feature was removed to reduce redundancy. The three sequence-based inputs (deletion, upstream, and downstream segments) were not filtered. They were used in the DL models.

After these steps, the number of features was reduced from 611 to 198.
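The numerical filtering steps can be sketched in pure Python. The helper names are illustrative (the real pipeline would typically use library routines such as those in pandas or SciPy), but the logic follows the procedure above: drop zero-variance columns, then drop one member of every pair with |Spearman ρ| > 0.8.

```python
from math import sqrt

def ranks(xs):
    """Average ranks (1-based), handling ties as Spearman requires."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def filter_features(features, threshold=0.8):
    """Drop zero-variance columns, then one of each highly correlated
    pair (features: {name: list of values across variants})."""
    kept = {n: v for n, v in features.items() if len(set(v)) > 1}
    names = list(kept)
    dropped = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a not in dropped and b not in dropped:
                if abs(spearman(kept[a], kept[b])) > threshold:
                    dropped.add(b)
    return {n: kept[n] for n in names if n not in dropped}
```

Which member of a correlated pair is removed is arbitrary here (the later one in iteration order); the published description does not specify the tie-breaking rule.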

Framework selection

We compared seven algorithms in total (four statistical learning methods and three DL approaches) using 5-fold cross-validation.

The statistical learning methods included the gradient boosting algorithm LightGBM [45], logistic regression (LR), random forests (RF) [46], and support vector machine (SVM) [47]. LR was applied with balanced class weights and MinMax-scaled features. RF was constructed with 100 estimators to ensure robust ensemble predictions. SVM was implemented with a linear kernel and probability estimation enabled.

For the deep learning approaches, we implemented three architectures: a multi-layer perceptron (MLP) [48], convolutional neural network (CNN) [49], and gated recurrent unit (GRU) [50]. The MLP used a three-layer architecture with layer normalisation and LeakyReLU activation. The CNN combined sequence processing with feature analysis through a hybrid architecture. It processed three sequences (deletion, upstream, and downstream) in parallel, concatenated the extracted features, scaled them by a factor of 0.2, and fed them into a fully connected branch with layer normalisation and LeakyReLU activation functions. The GRU employed a bidirectional design to process sequence data, with both sequence and non-sequence features processed in parallel branches. Similar to the CNN, the GRU processed three sequences in parallel, concatenated and scaled the features by a factor of 0.2, and fed them into a fully connected branch.

All the DL models were trained using the Adam optimiser with a learning rate of 0.001 and binary cross-entropy loss. For models not using tree-based methods, features were scaled using MinMax normalisation. Early stopping was implemented with a patience of 30 epochs, based on validation loss. The final models were trained on the complete training dataset using the optimal hyperparameters and feature sets identified during the optimisation process.

Model optimisation

To optimise the performance of the selected framework, we applied model-specific optimisation strategies. As the best framework was a tree model, recursive feature elimination (RFE) [51] was used to assess feature importance by evaluating feature sets ranging from the top 10 to the top 190 features, with an interval of 10 features between each set. This approach allowed us to identify the optimal number of features. Then, we employed the hyperparameter optimisation framework Optuna 4.2.1 [52], which automates the search for the best parameters. The optimisation process involved defining a search space for key hyperparameters over 100 trials. The best parameters identified through this process were then used to train the final PON-Del model.
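Optuna defines trials over a search space and keeps the best-scoring parameters. As a rough, dependency-free analogue of that loop, the sketch below runs a random search over a LightGBM-style search space with a stand-in objective; the parameter ranges and the toy objective are assumptions for illustration only, not the ranges used for PON-Del.

```python
import random

# Illustrative search space mirroring some of the tuned parameters;
# the real pipeline uses Optuna with a LightGBM cross-validation objective.
SPACE = {
    "learning_rate": lambda: 10 ** random.uniform(-3, -1),
    "num_leaves": lambda: random.randint(15, 255),
    "reg_alpha": lambda: 10 ** random.uniform(-8, 1),
    "reg_lambda": lambda: 10 ** random.uniform(-8, 1),
}

def toy_objective(params):
    """Stand-in for a 5-fold CV score; peaks near learning_rate ~ 0.05."""
    return -abs(params["learning_rate"] - 0.05)

def random_search(objective, space, n_trials=100, seed=0):
    """Draw n_trials parameter sets and keep the best-scoring one."""
    random.seed(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: draw() for name, draw in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best, score = random_search(toy_objective, SPACE)
```

Optuna improves on this loop with adaptive samplers and pruning, but the trial structure (sample, evaluate, keep the best) is the same.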

Predictions and classification schemes

PON-Del provides two types of predictions. A traditional binary predictor outputs scores between 0 and 1, with values above 0.5 indicating pathogenicity and those below 0.5 indicating benignity. We defined VUSs based on the consistency of PON-Del predictions across multiple models. For each variant, we collected 25 predicted probabilities from independent PON-Del models and used bootstrap resampling (1,000 iterations) to estimate the sampling distribution of the mean predicted probability. We then performed a two-sided hypothesis test against a null mean of 0.5 (no evidence for pathogenicity versus benignity) and computed a p-value from the bootstrap distribution.

Variants whose mean predicted probability was significantly different from 0.5 (p < 0.05) were assigned to pathogenic or benign according to whether the predicted probability was > 0.5 or ≤ 0.5, respectively. Variants for which we could not reject the null hypothesis (p ≥ 0.05) were considered VUS, reflecting insufficient statistical evidence to confidently assign them to either the pathogenic or benign class.
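The three-state scheme can be sketched as follows. The exact resampling formulation is not spelled out above, so this sketch uses a standard shift-based bootstrap test of the mean of the 25 per-model probabilities against the null value 0.5; the function names and the shift construction are illustrative assumptions.

```python
import random
from statistics import mean

def bootstrap_pvalue(probs, n_boot=1000, null=0.5, seed=0):
    """Two-sided bootstrap test of whether the mean predicted probability
    differs from 0.5 (probs: the 25 per-model predictions for a variant).
    The shift-based null construction is an illustrative assumption."""
    rng = random.Random(seed)
    obs = mean(probs)
    shifted = [p - obs + null for p in probs]  # centre the sample at the null
    boots = [mean(rng.choices(shifted, k=len(shifted))) for _ in range(n_boot)]
    extreme = sum(abs(b - null) >= abs(obs - null) for b in boots)
    return extreme / n_boot

def classify(probs, alpha=0.05):
    """Pathogenic/benign when the null is rejected, otherwise VUS."""
    if bootstrap_pvalue(probs) >= alpha:
        return "VUS"
    return "pathogenic" if mean(probs) > 0.5 else "benign"
```

Consistently high or low probabilities across the 25 models yield a confident call, whereas predictions that straddle 0.5 fail to reject the null and are reported as VUS.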

Performance evaluation

A systematic performance assessment was performed according to published recommendations [53,29]. The measures included positive predictive value (PPV), negative predictive value (NPV), sensitivity, specificity, accuracy, the MCC, the overall performance measure (OPM) [54], and the area under the ROC curve (AUC). TP and TN are the numbers of correctly predicted pathogenic and neutral cases, respectively, and FN and FP are the numbers of incorrect predictions for pathogenic and neutral cases, respectively.

To address the slight class imbalance between pathogenic and benign variants, normalised metrics were calculated by adjusting the number of pathogenic variants to match the number of benign ones. A classification threshold of 0.5 was used for threshold-dependent metrics to distinguish between pathogenic and benign deletions. In contrast, AUC provides a threshold-independent evaluation of model performance.
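The threshold-dependent measures follow directly from the confusion matrix. The sketch below implements the standard definitions (OPM and AUC are omitted; see [54] and the ROC analysis) together with the class-balancing normalisation described above; the helper names are illustrative.

```python
from math import sqrt

def metrics(tp, tn, fp, fn):
    """Threshold-dependent measures from the confusion matrix."""
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"PPV": ppv, "NPV": npv, "sensitivity": sens,
            "specificity": spec, "accuracy": acc, "MCC": mcc}

def normalise(tp, tn, fp, fn):
    """Rescale the pathogenic counts so the two classes are equal in
    size, as done for the normalised metrics."""
    scale = (tn + fp) / (tp + fn)
    return tp * scale, tn, fp, fn * scale
```

Sensitivity and specificity are unaffected by the rescaling, whereas PPV, NPV, accuracy, and MCC shift toward their balanced-data values.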

Results

Both benign and pathogenic, verified sequence retaining deletions were collected from ClinVar [1], dbSNP [24], LOVD [23], and UniProtKB [10]. The clinical significance (pathogenic, benign) of the variants was retrieved from ClinVar, dbSNP, or LOVD. All these variants are sequence retaining, and the corresponding proteins are thus produced. Further benign variants included natural protein isoforms from UniProtKB.

We collected a total of 611 features for the deletions. After data cleaning and feature collection, there were in total 3,912 unique deletions with a full set of features (Table 1). The deletions appear in 1,904 different proteins. The distribution of all deletion lengths is shown in S1A Fig and, for the training and test sets, in Fig 1C. The vast majority of both pathogenic and benign deletions were one residue long; the distribution is shown in Fig 1B. Given that most deletions were short and few were longer than 10 amino acids, we decided to train a predictor for deletions of 1–10 amino acids. The developed tool might work even with longer deletions, but since there were not enough cases for training and testing, the range of deletions was limited.

The variants were divided into training (N = 3,100) and test (N = 812) data sets, both of which included a wide variety of proteins (Table 1, v1). The two data sets do not share any proteins. In the training set, benign and pathogenic variants had mean deletion lengths of 4.11 (SD = 2.87) and 2.71 (SD = 2.26), respectively. In the test set, the means were 4.51 (SD = 2.86) for benign and 2.51 (SD = 2.14) for pathogenic variants. The length differences between the training and test sets were small.

The flowchart of the method development is depicted in Fig 1. We collected an extensive set of features, tested seven algorithms, and hyperparameter-tuned the best-performing one.

Feature selection and choice of algorithm

We started with an extensive set of features. Initially, there were 611 features. The features were grouped into four categories. Context features describe the sequence environment of deletions. Content-based features are properties averaged over the deleted region and include, e.g., sequence conservation and physicochemical propensities. Position features capture the relative location within the sequence and proximity to structural or functional regions. Gene/protein features include PPI parameters, gene age, and localisation to functional regions.

First, we investigated which features were informative. The variances for three features were close to 0, so they were excluded. Two binary features with fewer than 5 annotated instances were removed. Next, we defined the Spearman correlation for all feature pairs. This led to the removal of 408 features with high (>0.8) correlations. With the stepwise process, we reduced the number to 198.

We used the remaining 198 features to train seven methods, i.e., CNN, GRU, LightGBM, LR, MLP, RF, and SVM. The results with 5-fold CV are shown in Table 2. Logistic regression, one of the simplest algorithms, was used to define the baseline performance. LightGBM achieved the best overall performance across most metrics, with an AUC of 0.91, accuracy of 0.83, MCC of 0.66, and OPM of 0.58 (Table 2). RF was the next best-performing method, followed by the DL models.

Table 2. Comparison of the performance of different algorithms in 5-fold CV. The best-performing method is shown in bold.

https://doi.org/10.1371/journal.pcbi.1014020.t002

The set of 198 remaining features was used for final feature selection by training LightGBM predictors with feature counts ranging from 10 to 190, in increments of 10. For results, see S2 Fig. To avoid the so-called curse of dimensionality, i.e., an excessive number of features, we chose the smallest number of features that provided optimal performance. RFE was used to rank the features by importance. The performance with 20 features was as good as with larger feature sets (S2A Fig). Therefore, we chose 20 features for training the final predictor, called PON-Del. Deletions are not predicted at the first position, since removal of the start codon would prevent protein synthesis.

Next, we optimised the predictor by tuning hyperparameters over 100 trials (S2B Fig). The tested parameters were, in order of significance, learning rate, boosting type, number of leaves, L1 regularisation (reg_alpha), bagging fraction, L2 regularisation (reg_lambda), minimum child weight, bagging frequency, minimum split gain, feature fraction, and minimum number of child samples. The results in Table 3 show that the effect of optimisation was marginal; performance increased by less than 1%. The unnormalised results reflect the ratio of pathogenic to benign variants in the dataset (58% vs 42%), whereas the normalised results correspond to balanced data.

Table 3. Comparison of the performance of different algorithms on the blind test set. The best-performing method is shown in bold.

https://doi.org/10.1371/journal.pcbi.1014020.t003

Table 3 shows the results for all the tested methods on the blind test data. The best performance measures were scattered among the algorithms. PON-Del achieved the highest overall performance, with an AUC of 0.91, an accuracy of 0.82, an MCC of 0.65, and an OPM of 0.56. It also showed high sensitivity (0.84) and strong precision (PPV: 0.81, NPV: 0.84) with balanced specificity (0.80). These results suggest that while different modelling frameworks exhibit varying performance, the choice of features has a greater impact on model effectiveness than the specific algorithm used. On the blind test set, GRU had the second-best performance. To evaluate the robustness of PON-Del predictions, we assessed performance under five independent train–test partitions. As shown in S3 Fig, PON-Del showed consistently high and stable performance, with very small differences between the splits.

We assessed how the p-value threshold used to define VUSs influenced the predictive performance of PON-Del. The evaluated thresholds ranged from 1.0 (equivalent to no uncertainty) down to 0.001. Performance increased as the threshold decreased (S4 Fig). We did not determine an optimal p-value threshold, as stricter thresholds improve performance at the cost of an increasing number of unclassified predictions. As a practical compromise, we used 0.05 as the threshold.

Interpretability analysis

To gain insight into the reasoning of the predictor, we investigated the selected features. Since most ML methods are largely black boxes, their interpretability is a concern. Of the tested algorithms, the decision process is intuitively understandable only for logistic regression. We used two analyses to assess the importance of the selected features and two analyses to investigate whether the datasets were biased.

The Shapley plot in Fig 2A shows the importance of the features to pathogenicity (positive values) and benign (negative values) prediction. Colour indicates the range of feature values; blue indicates low values, and red indicates high values. In the binary features, a missing property is indicated by a zero (blue), and the presence of the property is indicated by a red value. For features with a range of values, the colour scale indicates the increasing feature value.

Fig 2. Shapley plot and distribution of the scores for the 20 selected features, organised in descending order of importance.

A) The features are colored based on their value, ranging from blue to red. The SHAP value indicates the impact for both positive (pathogenic) and negative (benign) predictions. B) Distributions of the values for the selected features among pathogenic and benign data in the training set.

https://doi.org/10.1371/journal.pcbi.1014020.g002

The selected 20 features were arranged in descending importance in the Shapley plot [55]. The most important feature is haploinsufficiency, followed by structural and functional features, including low-confidence secondary structural assignments, sequence conservation, accessibility, and location within a domain (Fig 2A).

Almost half of the features, nine, are AAindex parameters of protein physical and chemical propensities. They included LEVM760103, side chain angle theta (AAR) [56]; RACS820103, average relative fractional occurrence in AL(i) [57]; GEOR030105, linker propensity [58]; SUEM840102, helix-coil stability constant [58]; ARGP820102, signal sequence helical potential [59]; NAKH920103, membrane protein amino acid composition [60]; CHOP780215, the frequency of the 4th residue in turn [61]; NAKH900104, membrane protein amino acid composition [60]; and BIGC670101, amino acid hydropathy [62].

Other types of features include protein length, closeness and hub score in the PPI networks, and location within turns or coils, repeats, and palindromes. Closeness and hub scores are overall measures of the topology of the PPI network. Low-confidence regions and turns and coils were the only protein secondary structural classes selected. Repeats and palindromes are short sequence stretches that either appear several times or read the same way from both ends.

As another analysis to highlight the decision-making process in PON-Del, we investigated the distributions of the values of the 20 features (Fig 2B). Haploinsufficiency and low-confidence AlphaFold predictions in the deleted region were the most important features. There are significant differences in the distribution patterns between pathogenic and benign deletions. Distinct patterns are also evident, e.g., in localisation to repeats, palindromes, turns or coils, and domains, as well as in sequence conservation, accessibility, and PPI scores. Further, the physicochemical AAindex features show different distributions; however, the differences are smaller.

To investigate the impact of deletion origin on predictor performance, we evaluated the classifier separately on subsets restricted to each source individually or, when benign cases were too few, combined with UniProtKB (ClinVar + UniProtKB, dbSNP + UniProtKB, LOVD + UniProtKB). Performance varied across databases (S5 Fig). The LOVD-only dataset differed from the others and showed the lowest AUC, PPV, accuracy, and MCC. This dataset contains only a very small number of benign deletions, which apparently are not fully representative. In summary, the combination of the datasets is the best option and was used to train and test PON-Del.

Next, we studied the performance for deletions in different types of proteins and genes, as well as in genes duplicated during human evolution (Fig 3). The functional features investigated included housekeeping, essential, and haploinsufficiency-related proteins. In all these categories, correct predictions were more common than misclassifications. In pseudogenes, the performance was almost equal. In proteins from duplicated genes, correct predictions were somewhat more common. The proportions in Fig 3 indicate the occurrence of the features in the dataset.

Fig 3. Comparison of correct and incorrect deletion predictions in protein functional categories, pseudogenes, redundant proteins, and those originating from gene duplications.

The percentages are the proportions of correctly predicted and misclassified variants for each feature. None of the categories covers all the variants.

https://doi.org/10.1371/journal.pcbi.1014020.g003

Combined, these analyses provide a clear understanding of the features and their contributions to PON-Del predictions. Deletions affect sequences and structures in many ways, as is evident from the wide spectrum of features. Three of the four feature categories are represented among the selected features.

Comparison to other tools

Several methods have been developed to predict the pathogenicity of sequence retaining deletions. Some were excluded from the comparison because they were unavailable or their outdated scripts could not be installed. SIFT-Indel was excluded because it predicted only a tiny fraction of the test cases. We could not run DDIG-IN or KD4i because no code was available. CAPICE did not provide results for all cases. PROVEAN is no longer supported. SHINE is limited to single-residue deletions; we used its pre-calculated scores for comparison.

We compared the performance of PON-Del to CADD, FATHMM-indel, INDELpred, MutPred-Indel, SHINE, and VEST-Indel (Table 4). These methods cover a wide range of algorithms and approaches. We used the prediction score to measure AUC for all methods except VEST-Indel, which provides only a p-value; that p-value was used to calculate its AUC. For binary classification, INDELpred and MutPred-Indel used a score threshold of 0.5, and VEST-Indel a p-value threshold of 0.05. FATHMM-indel provided binary pathogenicity labels directly. For CADD, the pathogenicity threshold was set to 20, as recommended by the developers.
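The method-specific binarisation rules above can be summarised in a few lines. This is a sketch of the thresholding logic described in the text, with invented example scores; it is not the actual benchmarking code.

```python
# Sketch of how heterogeneous predictor outputs were binarised for the
# comparison in Table 4. Thresholds follow the text: 0.5 for INDELpred and
# MutPred-Indel scores, p < 0.05 for VEST-Indel, and 20 for CADD PHRED
# scores. The example values below are invented.

def classify(method, value):
    if method in ("INDELpred", "MutPred-Indel"):
        return "pathogenic" if value >= 0.5 else "benign"
    if method == "VEST-Indel":            # value is a p-value
        return "pathogenic" if value < 0.05 else "benign"
    if method == "CADD":                  # value is a PHRED-scaled score
        return "pathogenic" if value >= 20 else "benign"
    raise ValueError(f"no threshold rule for {method}")

print(classify("CADD", 23.1))       # pathogenic
print(classify("VEST-Indel", 0.2))  # benign
```

FATHMM-indel needs no such rule, since it already outputs binary labels.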

Table 4. Performance comparison of deletion predictors on the blind test set. The best-performing method is shown in bold.

https://doi.org/10.1371/journal.pcbi.1014020.t004

Table 4 shows that both the 2-state and 3-state (with VUSs) versions of PON-Del outperformed the other deletion pathogenicity predictors on the blind test set, achieving the highest overall metrics. The PON-Del 3-state predictor obtained the highest scores for AUC (0.92), accuracy (0.83), MCC (0.66), and OPM (0.57). It also demonstrated balanced performance with high PPV (0.80), NPV (0.87), sensitivity (0.88), and specificity (0.78). The 2-state predictor had the best PPV (0.81) and specificity (0.80). While FATHMM-indel had the highest sensitivity (0.97) and NPV (0.83), its poor PPV (0.26) and specificity (0.24) reflect a strong positive bias. The other tools showed lower and less balanced performance, although VEST-Indel performed relatively well (AUC: 0.88, accuracy: 0.83). These results highlight that the superior performance of PON-Del is primarily driven by its carefully engineered features.
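All the quoted measures derive from a 2x2 confusion matrix. The sketch below computes them with the standard formulas (MCC as in Vihinen, BMC Genomics 2012 [53]); the counts are hypothetical, chosen only so the rounded values roughly match the 3-state figures above, and are not the actual confusion matrix.

```python
# Sketch: performance measures as in Table 4, derived from a 2x2 confusion
# matrix. Counts are invented for illustration.
import math

def metrics(tp, fn, tn, fp):
    ppv = tp / (tp + fp)                  # positive predictive value
    npv = tn / (tn + fn)                  # negative predictive value
    sens = tp / (tp + fn)                 # sensitivity (recall)
    spec = tn / (tn + fp)                 # specificity
    acc = (tp + tn) / (tp + fn + tn + fp)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return dict(PPV=ppv, NPV=npv, sensitivity=sens, specificity=spec,
                accuracy=acc, MCC=mcc)

m = metrics(tp=88, fn=12, tn=78, fp=22)
print({k: round(v, 2) for k, v in m.items()})
```

Reporting both predictive values and both rates, rather than accuracy alone, is what reveals the positive bias of a tool like FATHMM-indel.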

The other methods have been trained on ClinVar data; their performance measures are therefore likely inflated by circularity, and their true performance is probably even lower than the benchmark indicates.

Table 4 contains results for the two versions of PON-Del, the two-state pathogenic-benign predictor and the three-state predictor that includes VUSs. The two-state prediction is an oversimplification, since in reality there are always also VUSs that cannot be classified along the binary pathogenic-benign axis [22]. The results in Table 4 show that the three-state predictor achieved almost identical performance to the binary tool. Because three-state prediction is more difficult and VUSs can overlap with pathogenic or benign variants, this indicates that accommodating uncertainty did not compromise predictive power.

We could not test the performance on verified VUSs due to the lack of such data. If VUSs were included in the performance assessments, the advantage of the three-state predictor over the other tools would become apparent. All the other tools ignore VUSs; thus, in real-life situations, their performance is lower than shown in Table 4.

To evaluate the robustness of the methods, we assessed performance over five independent train-test partitions in the method comparison (S6 Fig). The results are consistent across the partitions, indicating that the partitions, and thus the blind test set, are representative. PON-Del was consistently the best-performing tool, or among the best methods. The other methods showed larger differences in the individual measures and less balanced predictions.

The results in S6 Fig show the consistency of the different train-test splits. The performances are comparable to those in Table 4, and the differences between the splits are largely consistent across the tools, indicating that the datasets do not introduce major bias. The order of performances is the same as in Table 4, and the results are most consistent between the splits for PON-Del.
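The repeated-partition protocol can be sketched as follows. The data and the majority-vote "model" are placeholders standing in for the real feature matrix and classifier; the point is only that a small spread of the metric across random splits indicates representative partitions.

```python
# Sketch of robustness assessment over repeated train-test partitions, as
# in S3/S6 Fig. All data and the model are placeholders.
import random

random.seed(0)
data = [(i, i % 2) for i in range(200)]   # (id, label) placeholders

def accuracy_on_split(data, test_frac=0.2):
    shuffled = data[:]
    random.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    train, test = shuffled[:cut], shuffled[cut:]
    # placeholder "model": predict the majority label of the training part
    majority = max((0, 1), key=[label for _, label in train].count)
    return sum(label == majority for _, label in test) / len(test)

accs = [accuracy_on_split(data) for _ in range(5)]
spread = max(accs) - min(accs)
print(accs, spread)   # a small spread suggests representative partitions
```

In the real evaluation the same five partitions were applied to every tool, so the per-split differences in S6 Fig are directly comparable across methods.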

Test case

As an example of the method's application, we show the predictions for all single amino acid deletions in the Bruton tyrosine kinase, BTK. Among the more than 500 human protein kinases, BTK has the largest number of different disease-causing variants. Several verified small deletions are listed in BTKbase, the database for BTK variants [63]. The positions of the disease-causing deletions and the PON-Del predictions are indicated in Fig 4. All the known short BTK-related deletions were predicted to be pathogenic. The positions of the harmful variants in the protein three-dimensional structure are shown in Fig 4B; they are distributed along the protein chain.

Fig 4. A) Prediction of all the one-amino acid deletions in BTK.

Most variations are deleterious, except in the polyproline segment in the TH region. This region is disordered and can adopt numerous different conformations. Yellow boxes indicate the positions of the XLA-causing short deletions listed in BTKbase. B) Distribution of the pathogenic and benign variants in the BTK structure, obtained with AlphaFold2 [40], file AF-Q06187-F1-model_v4. The pleckstrin homology domain is located at the top left, below it are the Src homology 3 and 2 domains, and the kinase domain is positioned to the right. α-Helices are shown as helices and β-strands as arrows. Known XLA-causing deletions are shown in yellow, predicted benign single amino acid deletions in blue, predicted pathogenic deletions in red, and VUSs in grey.

https://doi.org/10.1371/journal.pcbi.1014020.g004

BTK is sensitive to variations. It contains several domains with different functions. Prediction of the pathogenicity of amino acid substitutions with the reliable PON-P3 method [31] indicated that about 70% are likely disease-causing. The longest region of benign deletions is located in the polyproline segment in the Tec homology (TH) region [64], which is likely intrinsically disordered [63]. This is visible in the structure as a loosely packed string at the top of the protein (Fig 4B). Short deletions may not be harmful in this malleable region, which apparently binds to several partners, including the adjacent SH3 domain [65,66]. The other benign deletions mainly occur in connecting loops or towards the ends of secondary structural elements. Experimental studies and structural information are in line with the deletion predictions. In addition, the substitutions predicted by PON-P3 to be disease-causing lie mainly in the regions predicted not to tolerate short deletions.

Availability and future directions

PON-Del is freely available as a web service at http://structure.bmc.lu.se/pon_del. Unlike most other deletion predictors, which accept only genomic coordinates, variants can be submitted at the genomic, transcript, or protein level. The genome build used in PON-Del is GRCh38/hg38. Nucleotide variants are converted to protein alterations with TransVar. Only nucleotide deletions that lead to sequence retaining amino acid deletions are predicted; the allowed size range is 1–10 amino acids. Exon skipping variants are not allowed, and the nucleotide deletion cannot exceed 30 bases. The dataset for short exon skipping variants was too small to support the development of a reliable predictor. Up to 1,000 variants can be submitted at a time.
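The submission constraints above reduce to a simple length check. The sketch below is an illustrative pre-check a user might run before submission, not PON-Del's actual validation code; note that an in-frame deletion that straddles a codon boundary can additionally substitute one residue, which this simple check ignores.

```python
# Sketch of the submission constraints: a nucleotide deletion qualifies
# only if it is in-frame (length a multiple of 3, so the downstream
# sequence is retained) and removes 1-10 amino acids, i.e. at most
# 30 bases. Illustrative only.

def deletion_ok(del_len_nt):
    if del_len_nt % 3 != 0:
        return False            # frameshift: downstream sequence changes
    aa = del_len_nt // 3
    return 1 <= aa <= 10        # implies del_len_nt <= 30

print(deletion_ok(9))    # True: three-residue in-frame deletion
print(deletion_ok(33))   # False: 11 residues, exceeds the size limit
print(deletion_ok(7))    # False: frameshift
```

Variants failing these checks, such as frameshifts or exon skipping deletions longer than 30 bases, fall outside the scope of the predictor.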

PON-Del is entirely MANE-based. Therefore, the submitted variants must be mapped to MANE reference sequences. This is because many features are specific to a variation position and its context. If a submitted variant is in a protein for which all features cannot be obtained, a note is provided that prediction is not possible. This happens, e.g., when proteins are unique to humans and lack evolutionary details, or when no protein structure is available.

Users can submit multiple variations simultaneously across multiple genes or proteins. It is even possible to download a figure showing predictions for all one-amino acid deletions in a protein. These data can be searched by gene name, RefSeq transcript or protein ID, or Ensembl gene, transcript, or protein ID. The precalculated data are available for 19,354 unique MANE-compliant sequences.

PON-Del successfully predicts the disease relevance of short sequence retaining deletions. Once more data are available, it will be possible to expand to longer deletions, and the additional data will also facilitate further benchmarking. It will likewise be interesting to investigate how well the current version of PON-Del can extrapolate from short deletions to longer ones.

When verified VUSs are defined in deletions, they can be used both for training and testing of further versions. The future developments will be highly dependent on additional data and annotations.

Supporting information

S1 Fig. Distribution of the sizes of deletions obtained from the four databases.

Only variants 10 amino acids or shorter were used for method development due to the low number of longer deletions. Benign variants are in blue; pathogenic variants are in red. A) The deletion length distributions, B) the distribution of the deletion numbers per protein, and C) the length distribution of deletions in training and test datasets.

https://doi.org/10.1371/journal.pcbi.1014020.s001

(TIFF)

S2 Fig. Hyperparameter optimisation of PON-Del.

(A) Performance metrics across different numbers of top-ranked features. Red lines indicate the median, and boxplots represent the variability across cross-validation folds. (B) Hyperparameter tuning using Optuna. Left: optimisation history showing the progression of AUC values over 100 trials. Right: relative importance of hyperparameters, indicating that the number of leaves, boosting type, and learning rate contributed most to model performance.

https://doi.org/10.1371/journal.pcbi.1014020.s002

(TIFF)

S3 Fig. The performance of PON-Del and other frameworks on five train-test splits of the dataset.

https://doi.org/10.1371/journal.pcbi.1014020.s003

(TIFF)

S4 Fig. The effect of p-value threshold in the prediction of VUSs.

https://doi.org/10.1371/journal.pcbi.1014020.s004

(TIFF)

S5 Fig. The performance of PON-Del and other methods on the different datasets collected.

https://doi.org/10.1371/journal.pcbi.1014020.s005

(TIFF)

S6 Fig. The performance of PON-Del and other methods on five train-test splits of the dataset.

https://doi.org/10.1371/journal.pcbi.1014020.s006

(TIFF)

References

  1. Landrum MJ, Chitipiralla S, Brown GR, Chen C, Gu B, Hart J, et al. ClinVar: improvements to accessing data. Nucleic Acids Res. 2020;48(D1):D835–44. pmid:31777943
  2. Zhang H, Luan X, Vihinen M. Proteome-wide analysis of human deletions. Proteins. 2025.
  3. Vihinen M. Variation Ontology for annotation of variation effects and mechanisms. Genome Res. 2014;24(2):356–64. pmid:24162187
  4. Vihinen M. Systematic errors in annotations of truncations, loss-of-function and synonymous variants. Front Genet. 2023.
  5. Vihinen M. Functional effects of protein variants. Biochimie. 2021;180:104–20. pmid:33164889
  6. Kurosaki T, Popp MW, Maquat LE. Quality and quantity control of gene expression by nonsense-mediated mRNA decay. Nat Rev Mol Cell Biol. 2019;20(7):406–20. pmid:30992545
  7. Lindeboom RGH, Supek F, Lehner B. The rules and impact of nonsense-mediated mRNA decay in human cancers. Nat Genet. 2016;48(10):1112–8. pmid:27618451
  8. Vihinen M. Systematics for types and effects of RNA variations. RNA Biol. 2021;18(4):481–98. pmid:32951567
  9. Lam SD, Babu MM, Lees J, Orengo CA. Biological impact of mutually exclusive exon switching. PLoS Comput Biol. 2021;17(3):e1008708. pmid:33651795
  10. UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 2023;51(D1):D523–31. pmid:36408920
  11. Li S, van der Velde KJ, de Ridder D, van Dijk ADJ, Soudis D, Zwerwer LR, et al. CAPICE: a computational method for Consequence-Agnostic Pathogenicity Interpretation of Clinical Exome variations. Genome Med. 2020;12(1):75. pmid:32831124
  12. Folkman L, Yang Y, Li Z, Stantic B, Sattar A, Mort M, et al. DDIG-in: detecting disease-causing genetic variations due to frameshifting indels and nonsense mutations employing sequence and structural properties at nucleotide and protein levels. Bioinformatics. 2015;31(10):1599–606. pmid:25573915
  13. Ferlaino M, Rogers MF, Shihab HA, Mort M, Cooper DN, Gaunt TR, et al. An integrative approach to predicting the functional effects of small indels in non-coding regions of the human genome. BMC Bioinform. 2017;18(1):442. pmid:28985712
  14. Wei Y, Zhang T, Wang B, Jiang X, Ling F, Fang M, et al. INDELpred: improving the prediction and interpretation of indel pathogenicity within the clinical genome. HGG Adv. 2024;5(4):100325. pmid:38993112
  15. Bermejo-Das-Neves C, Nguyen H-N, Poch O, Thompson JD. A comprehensive study of small non-frameshift insertions/deletions in proteins and prediction of their phenotypic effects by a machine learning method (KD4i). BMC Bioinform. 2014;15:111. pmid:24742296
  16. Pagel KA, Antaki D, Lian A, Mort M, Cooper DN, Sebat J, et al. Pathogenicity and functional impact of non-frameshifting insertion/deletion variation in the human genome. PLoS Comput Biol. 2019;15(6):e1007112. pmid:31199787
  17. Choi Y, Sims GE, Murphy S, Miller JR, Chan AP. Predicting the functional effect of amino acid substitutions and indels. PLoS One. 2012;7(10):e46688. pmid:23056405
  18. Fan X, Pan H, Tian A, Chung WK, Shen Y. SHINE: protein language model-based pathogenicity prediction for short inframe insertion and deletion variants. Brief Bioinform. 2023;24(1):bbac584. pmid:36575831
  19. Hu J, Ng PC. SIFT Indel: predictions for the functional effects of amino acid insertions/deletions in proteins. PLoS One. 2013;8(10):e77940. pmid:24194902
  20. Douville C, Masica DL, Stenson PD, Cooper DN, Gygax DM, Kim R, et al. Assessing the pathogenicity of insertion and deletion variants with the variant effect scoring tool (VEST-Indel). Hum Mutat. 2016;37(1):28–35. pmid:26442818
  21. Cannon S, Williams M, Gunning AC, Wright CF. Evaluation of in silico pathogenicity prediction tools for the classification of small in-frame indels. BMC Med Genomics. 2023;16(1):36. pmid:36855133
  22. Zhang H, Kabir M, Ahmed S, Vihinen M. There will always be variants of uncertain significance. Analysis of VUSs. NAR Genom Bioinform. 2024;6(4):lqae154. pmid:39633727
  23. Fokkema IFAC, Kroon M, López Hernández JA, Asscheman D, Lugtenburg I, Hoogenboom J, et al. The LOVD3 platform: efficient genome-wide sharing of genetic variants. Eur J Hum Genet. 2021;29(12):1796–803. pmid:34521998
  24. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001;29(1):308–11. pmid:11125122
  25. Morales J, Pujar S, Loveland JE, Astashyn A, Bennett R, Berry A, et al. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature. 2022;604(7905):310–5. pmid:35388217
  26. Rice P, Longden I, Bleasby A. EMBOSS: the European molecular biology open software suite. Trends Genet. 2000;16(6):276–7. pmid:10827456
  27. Zhou W, Chen T, Chong Z, Rohrdanz MA, Melott JM, Wakefield C, et al. TransVar: a multilevel variant annotator for precision genomics. Nat Methods. 2015;12(11):1002–3. pmid:26513549
  28. McLaren W, et al. The Ensembl Variant Effect Predictor. Genome Biol. 2016;17(1):122.
  29. Vihinen M. Guidelines for reporting and using prediction tools for genetic variation analysis. Hum Mutat. 2013;34(2):275–82. pmid:23169447
  30. Wagih O. ggseqlogo: a versatile R package for drawing sequence logos. Bioinformatics. 2017;33(22):3645–7. pmid:29036507
  31. Kabir M, et al. PON-P3: accurate prediction of pathogenicity of amino acid substitutions. Int J Mol Sci. 2025;26(5).
  32. Mitternacht S. FreeSASA: an open source C library for solvent accessible surface area calculations. F1000Res. 2016;5:189. pmid:26973785
  33. Kawashima S, Kanehisa M. AAindex: amino acid index database. Nucleic Acids Res. 2000;28(1):374. pmid:10592278
  34. Ruiz-Blanco YB, Paz W, Green J, Marrero-Ponce Y. ProtDCal: a program to compute general-purpose-numerical descriptors for sequences and 3D-structures of proteins. BMC Bioinform. 2015;16:162. pmid:25982853
  35. Paysan-Lafosse T, et al. InterPro in 2022. Nucleic Acids Res. 2023;51(D1):D418–27.
  36. Jorda J, Kajava AV. T-REKS: identification of Tandem REpeats in sequences with a K-meanS based algorithm. Bioinformatics. 2009;25(20):2632–8. pmid:19671691
  37. Aspromonte MC, Nugnes MV, Quaglia F, Bouharoua A, DisProt Consortium, Tosatto SCE, et al. DisProt in 2024: improving function annotation of intrinsically disordered proteins. Nucleic Acids Res. 2024;52(D1):D434–41. pmid:37904585
  38. Dobson L, Reményi I, Tusnády GE. The human transmembrane proteome. Biol Direct. 2015;10:31. pmid:26018427
  39. Abramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630(8016):493–500. pmid:38718835
  40. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–9. pmid:34265844
  41. Heinig M, Frishman D. STRIDE: a web server for secondary structure assignment from known atomic coordinates of proteins. Nucleic Acids Res. 2004;32(Web Server issue):W500–2. pmid:15215436
  42. Csardi G, Nepusz T. The igraph software package for complex network research. InterJ Complex Syst. 2006:1695.
  43. Szklarczyk D, Gable AL, Nastou KC, Lyon D, Kirsch R, Pyysalo S, et al. The STRING database in 2021: customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res. 2021;49(D1):D605–12. pmid:33237311
  44. Capra JA, Williams AG, Pollard KS. ProteinHistorian: tools for the comparative analysis of eukaryote protein origin. PLoS Comput Biol. 2012;8(6):e1002567. pmid:22761559
  45. Ke G, et al. LightGBM: a highly efficient gradient boosting decision tree. La Jolla (CA): Neural Information Processing Systems; 2017.
  46. Breiman L. Random Forests. Mach Learn. 2001;45(1):5–32.
  47. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.
  48. Rumelhart DE, Hinton GE, Williams RJ. Learning internal representations by error propagation. In: Rumelhart DE, McClelland JL, PDP Research Group, editors. Parallel distributed processing, volume 1: explorations in the microstructure of cognition: foundations. MIT Press; 1986.
  49. Homma T, Atlas L, Marks R II. An artificial neural network for spatio-temporal bipolar patterns: application to phoneme classification. Adv Neural Inf Process Syst. 1987;1:31–40.
  50. Cho K, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. 2014.
  51. Darst BF, Malecki KC, Engelman CD. Using recursive feature elimination in random forest to account for correlated variables in high dimensional data. BMC Genet. 2018;19(Suppl 1):65. pmid:30255764
  52. Akiba T, et al. Optuna: a next-generation hyperparameter optimization framework. KDD ‘19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; 2019. p. 2623–31.
  53. Vihinen M. How to evaluate performance of prediction methods? Measures and their interpretation in variation effect analysis. BMC Genomics. 2012;13 Suppl 4(Suppl 4):S2. pmid:22759650
  54. Niroula A, Urolagin S, Vihinen M. PON-P2: prediction method for fast and reliable identification of harmful variants. PLoS One. 2015;10(2):e0117380. pmid:25647319
  55. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. 31st International Conference on Neural Information Processing Systems. Red Hook, NY, USA; 2017. p. 4768–77.
  56. Levitt M. A simplified representation of protein conformations for rapid simulation of protein folding. J Mol Biol. 1976;104(1):59–107. pmid:957439
  57. Zhao W, Tao Y, Xiong J, Liu L, Wang Z, Shao C, et al. GoFCards: an integrated database and analytic platform for gain of function variants in humans. Nucleic Acids Res. 2025;53(D1):D976–88. pmid:39578693
  58. George RA, Heringa J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 2002;15(11):871–9. pmid:12538906
  59. Argos P, Rao JK, Hargrave PA. Structural prediction of membrane-bound proteins. Eur J Biochem. 1982;128(2–3):565–75. pmid:7151796
  60. Nakashima H, Nishikawa K. The amino acid composition is different between the cytoplasmic and extracellular sides in membrane proteins. FEBS Lett. 1992;303(2–3):141–6.
  61. Chou PY, Fasman GD. Prediction of the secondary structure of proteins from their amino acid sequence. Adv Enzymol Relat Areas Mol Biol. 1978;47:45–148. pmid:364941
  62. Bigelow CC. On the average hydrophobicity of proteins and the relation between it and protein structure. J Theor Biol. 1967;16(2):187–211. pmid:6048539
  63. Schaafsma GCP, Väliaho J, Wang Q, Berglöf A, Zain R, Smith CIE, et al. BTKbase, Bruton tyrosine kinase variant database in X-linked agammaglobulinemia: looking back and ahead. Hum Mutat. 2023;2023:5797541. pmid:40225173
  64. Vihinen M, Nilsson L, Smith CI. Tec homology (TH) adjacent to the PH domain. FEBS Lett. 1994;350(2–3):263–5. pmid:8070576
  65. Hansson H, Okoh MP, Smith CI, Vihinen M, Härd T. Intermolecular interactions between the SH3 domain and the proline-rich TH region of Bruton’s tyrosine kinase. FEBS Lett. 2001;489(1):67–70. pmid:11231015
  66. Okoh MP, Vihinen M. Interaction between Btk TH and SH3 domain. Biopolymers. 2002;63(5):325–34. pmid:11877742