EUP: Enhanced cross-species prediction of ubiquitination sites via a conditional variational autoencoder network based on ESM2

doi:10.1371/journal.pcbi.1013268

Fig 1.

The overview framework of EUP.

The diagram outlines the workflow behind the EUP website in five stages: (1) data preparation, (2) protein lysine (K) embedding, (3) model training, (4) model interpretation, and (5) the web server. Protein data are collected from multiple species, and feature embeddings of lysine (K) sites are extracted with the pretrained ESM2 model. A cVAE then reduces these features to a latent feature representation, on which downstream models are built to predict ubiquitination sites. The predictions are delivered as interactive, interpretable output.
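
The latent step of a variational autoencoder, as used in the dimensionality-reduction stage, can be illustrated with a small sketch. This is not the EUP implementation: it assumes a diagonal Gaussian latent with a standard normal prior, and the helper names (`reparameterize`, `kl_to_standard_normal`, `with_condition`) and the concatenation-based conditioning are hypothetical choices made for illustration only.

```python
import math
import random


def with_condition(features, condition):
    """Append a condition vector to the features (assumed conditioning scheme)."""
    return list(features) + list(condition)


def reparameterize(mu, logvar, rng=random):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one draw per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """KL divergence from N(mu, diag(exp(logvar))) to N(0, I), summed over dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


if __name__ == "__main__":
    # Hypothetical encoder outputs for a 3-dimensional latent space.
    mu, logvar = [0.2, -0.1, 0.0], [-1.0, -0.5, 0.0]
    z = reparameterize(mu, logvar, rng=random.Random(0))
    print("latent sample:", z)
    print("KL term:", kl_to_standard_normal(mu, logvar))
```

In a full model the KL term would be added to a reconstruction loss, and the latent vector `z` (or the mean `mu`) would serve as the compact representation passed to a downstream classifier.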

Table 1.

Evaluation of Predictive Performance Across Models.

Fig 2.

Accuracy of ubiquitination site predictions across multiple species.

(a) Radar chart of prediction precision, (b) AUROC curves for assessing predictive performance, (c) a confusion matrix for predictive analysis, and (d) AUPRC curves for a comprehensive evaluation of prediction accuracy.
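
The two curve summaries named in panels (b) and (d) can be computed from labels and scores as sketched below. This is a minimal, dependency-free illustration rather than the evaluation code used for the figure; the function names are hypothetical, AUROC is computed through its pairwise-ranking interpretation, and AUPRC is approximated by average precision.

```python
def auroc(labels, scores):
    """Probability that a random positive scores above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive example")
    return total / hits


if __name__ == "__main__":
    # Toy labels (1 = ubiquitinated site) and made-up model scores.
    labels = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.1]
    print(auroc(labels, scores))              # 0.75
    print(average_precision(labels, scores))  # about 0.833
```

The pairwise AUROC is quadratic in the number of examples, which is fine for a small illustration; a sort-based computation would be preferable on full datasets.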

Table 2.

Inference latency analysis for different models.

Fig 3.

Model performance under different window sequences.

(A) Performance measured by MCC (Matthews correlation coefficient). (B) Performance measured by PRAUC (precision-recall area under the curve). The “window size” on the x-axis indicates the length of the local sequence centered on the lysine (K) residue, constructed by extending the sequence forward and backward from the lysine site, where and . These sequence windows were encoded using ESM2 to extract context-specific features for model input. The value “All_seq” corresponds to the full-length protein sequence.
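
The window construction described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the helper name `lysine_windows`, the symmetric `half_width` parameter, and truncation at the sequence termini (rather than padding) are all assumed, and `half_width=None` stands in for the “All_seq” setting.

```python
def lysine_windows(sequence, half_width=None):
    """Return (position, window) pairs for every lysine (K) in a protein sequence.

    position is 0-based. With half_width=None the full sequence is used for
    every site; otherwise the window extends half_width residues to each side
    and is truncated where it would run past either end of the sequence.
    """
    windows = []
    for i, residue in enumerate(sequence):
        if residue != "K":
            continue
        if half_width is None:
            windows.append((i, sequence))
        else:
            start = max(0, i - half_width)
            end = min(len(sequence), i + half_width + 1)
            windows.append((i, sequence[start:end]))
    return windows


if __name__ == "__main__":
    # Toy sequence with lysines at 0-based positions 2 and 4.
    for position, window in lysine_windows("MAKTKLV", half_width=2):
        print(position, window)
```

Each window string would then be passed to the sequence encoder, so the window size controls how much sequence context the embedding of a given lysine can reflect.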

Fig 4.

Accuracy of ubiquitination site predictions across multiple species.

(a) A bar chart of the accuracy (ACC) of the best-performing models for ubiquitination site prediction across ten species, highlighting model variations and species-specific performance differences. (b) Receiver operating characteristic (ROC) curves for the optimal models across ten species, with area under the curve (AUC) values reflecting classification performance in ubiquitination site prediction. (c) A radar chart comparing the overall performance of four models for Homo sapiens, showing balanced performance across all models. (d) Confusion matrices of the four models applied to Homo sapiens, visualizing true positive and true negative rates, as well as misclassification rates, as heatmaps.
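
The quantities behind panels (a) and (d) follow directly from the four confusion-matrix counts. The sketch below is a generic illustration with hypothetical function names and toy labels, not the evaluation code behind the figure; it also includes the Matthews correlation coefficient, defined here as 0 when any marginal count is zero.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Toy ground truth and predictions, invented for illustration.
    y_true = [1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 1, 0]
    counts = confusion_counts(y_true, y_pred)
    print(counts)             # (2, 1, 2, 1)
    print(accuracy(*counts))  # about 0.667
    print(mcc(*counts))       # about 0.333
```

Unlike accuracy, MCC uses all four counts symmetrically, which makes it more informative when ubiquitinated and non-ubiquitinated sites are imbalanced.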

Fig 5.

Integrated Gradients (IG) analysis of ubiquitination site predictions in Homo sapiens.

(a) Feature dependence plots for the four most important features (K_feature_1542, K_feature_696, K_feature_461, and K_feature_2502), illustrating the relationship between feature values and feature attribution (IG) values. (b) A density plot of the values of these four features, highlighting how their distributions differ. (c) A swarm plot of log-transformed IG values for the four features, accompanied by Mann-Whitney U tests for statistical significance. The median log-transformed IG values and p-values indicate distinct patterns in feature importance, with K_feature_1542 showing the highest impact.
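
How an Integrated Gradients attribution is obtained can be shown on a small stand-in model. The sketch below is not the analysis pipeline behind this figure: it assumes a logistic model with an analytic gradient in place of the paper's network, an all-zeros baseline, and a midpoint Riemann sum along the straight path from baseline to input. All names and numbers are hypothetical.

```python
import math


def predict(x, weights, bias):
    """Logistic model: sigmoid of a weighted sum of the features."""
    t = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-t))


def gradient(x, weights, bias):
    """Analytic gradient of the logistic output with respect to each feature."""
    p = predict(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Approximate IG_i = (x_i - b_i) * integral over alpha of dF/dx_i on the path b -> x."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point, weights, bias)):
            totals[i] += g
    return [(xi - b) * total / steps for xi, b, total in zip(x, baseline, totals)]


if __name__ == "__main__":
    weights, bias = [0.8, -0.3, 1.5], -0.2
    x, baseline = [1.0, -2.0, 0.5], [0.0, 0.0, 0.0]
    attributions = integrated_gradients(x, baseline, weights, bias)
    print(attributions)
    # Completeness check: attributions sum to F(x) - F(baseline), up to numerical error.
    print(sum(attributions), predict(x, weights, bias) - predict(baseline, weights, bias))
```

The completeness property checked at the end is what makes per-feature IG values comparable across features, since they partition the change in model output between the baseline and the input.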
