Fig 1.
Pipeline of the interactor algorithm.
(A) Extensive database of PDB files, where each chain is split into an individual PDB file and converted to MOL2 format. (B) The following metrics are calculated: atom-atom distances (d) and van der Waals radii (r); bond angles (α); geometric centers (g); interaction types and atom types. Molecular representation with carbon atoms in gray, oxygen in red, and nitrogen in blue. (C) The following features are calculated: interatomic interaction features (Table 1); side chain physicochemical properties (S1 Table); mono-, di-, and tripeptide composition (S2 Table). (D) The extracted features are exported in tabular format and are then used for feature selection, dimensionality reduction, clustering, visualization, and machine learning.
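The geometric quantities in panel B and the n-peptide composition in panel C can be illustrated with a minimal sketch. This is not the interactor implementation: the function names, the contact criterion (distance at most the sum of the two radii plus a tolerance), and the tolerance value are assumptions made for illustration. The radii are the commonly cited Bondi values.

```python
import math
from collections import Counter

# Commonly cited Bondi van der Waals radii in angstroms (illustrative subset).
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52}

def distance(a, b):
    """Atom-atom distance d between two 3D points."""
    return math.dist(a, b)

def geometric_center(coords):
    """Geometric center g: unweighted mean of the coordinates."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))

def bond_angle(a, b, c):
    """Angle alpha in degrees at vertex b, formed by points a-b-c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))

def in_vdw_contact(elem_a, pos_a, elem_b, pos_b, tolerance=0.5):
    """Hypothetical contact test: d <= r_a + r_b + tolerance (assumed cutoff)."""
    limit = VDW_RADII[elem_a] + VDW_RADII[elem_b] + tolerance
    return distance(pos_a, pos_b) <= limit

def peptide_composition(sequence, n):
    """Fraction of each n-peptide over all overlapping windows of length n."""
    windows = [sequence[i:i + n] for i in range(len(sequence) - n + 1)]
    counts = Counter(windows)
    total = len(windows)
    return {pep: count / total for pep, count in counts.items()}
```

Calling `peptide_composition(seq, 1)`, `peptide_composition(seq, 2)`, and `peptide_composition(seq, 3)` would give mono-, di-, and tripeptide fractions for a sequence `seq`. Overlapping windows are one possible convention, assumed here.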
Table 1.
Protein structural features based on interatomic interactions and physicochemical properties.
Table 2.
Overview of protein features extracted.
Fig 2.
Principal Component Analysis (PCA) of protein families and molecular functions.
A) Clustering of the most prominent protein families. B) Clustering of three representative groups based on molecular functions. C) Three-dimensional PCA representation of protein families, capturing the variance and complexity within different functional categories. D) Three-dimensional PCA visualization of molecular functions.
Fig 3.
Mutual Information (MI) scores across protein families and molecular functions.
A) Frequency distribution of MI scores across protein families. B) Density distribution of MI scores across protein families, focusing on the top 12 ranked features. C) Frequency distribution of MI scores for molecular functions. D) Density distribution of MI scores for molecular functions, focusing on the top 12 ranked features.
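For readers unfamiliar with MI-based feature ranking, the idea can be sketched as follows. The caption does not state which estimator was used, so this sketch assumes a simple plug-in estimate for discrete (or pre-binned) feature values. The names `mutual_information` and `rank_features` are hypothetical.

```python
import math
from collections import Counter

def mutual_information(feature_values, labels):
    """Plug-in MI estimate (in nats) between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    count_x = Counter(feature_values)
    count_y = Counter(labels)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y).
        mi += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return mi

def rank_features(features, labels, top_k=12):
    """Return the top_k (name, MI) pairs sorted by decreasing MI score."""
    scored = [(name, mutual_information(vals, labels))
              for name, vals in features.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

A feature that perfectly separates two equally sized classes scores ln 2 nats, while a feature independent of the labels scores 0. Continuous features would need binning or a different estimator before this sketch applies.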
Fig 4.
Clustering and visualization of protein families.
A) Hierarchical clustering of protein families, depicting relationships and groupings based on structural and functional similarities. B) t-SNE plot depicting inter-family relationships within a reduced-dimensional space. C) Optimized projection plot highlighting the most significant features for each group, illustrating key attributes that differentiate protein families.
Fig 5.
Machine learning benchmark across different feature sets.
A) F1 scores were computed for several classification models (shown as colored bars) across different feature sets. Feature sets were ranked by their impact on model performance, with higher F1 scores indicating superior predictive accuracy. B) Performance of the best model (Histogram Gradient Boosting) across different feature sets, showing mean validation accuracy over 5 runs.
Fig 6.
Feature importance (SHAP) in protein family classification.
A-H) The plot ranks features by their impact on model predictions, illustrating contribution distributions. Each dot represents a protein, with colors indicating feature values: red for high, blue for low. The horizontal spread of dots reflects the density of examples at given feature values. Negative SHAP values (left) indicate a reduced likelihood of classification, while positive SHAP values (right) suggest an increased likelihood. Colored tiles beside each feature indicate feature types: orange for 3D structural features (based on interatomic interactions and structural properties), green for CPAASC, and brown for sequence compositional (n-peptide) features. SHAP values were computed using family-specific explicands and baselines, ensuring consistent cross-group comparisons. I) Classification model performance measured by the F1 score for each protein family. The bars represent the mean F1 score obtained from 10 independent runs, with error bars corresponding to the standard deviation.
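The per-family summary in panel I (mean F1 score with standard-deviation error bars) can be sketched as below. This is an illustration rather than the authors' evaluation code: the helper names are hypothetical, and the sample standard deviation is assumed because the caption does not say which variant was used.

```python
from statistics import mean, stdev

def f1_for_class(y_true, y_pred, target):
    """One-vs-rest F1 score for a single class (e.g. one protein family)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
    return 2 * tp / denom if denom else 0.0

def summarize_runs(scores):
    """Mean and sample standard deviation of scores from independent runs."""
    return mean(scores), stdev(scores)
```

Collecting `f1_for_class` for one family across the independent runs and passing the list to `summarize_runs` would give the bar height and error bar for that family.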