OneProt: Towards multi-modal protein foundation models via latent space alignment of sequence, structure, binding sites and text encoders

doi:10.1371/journal.pcbi.1013679

Fig 1.

Overview of OneProt’s alignment of protein sequences with other modalities for comprehensive cross-modal integration.

Training is performed using pairs comprising a sequence and another modality, leading to the emergent alignment between all other modalities, as indicated by the dashed lines.

Fig 2.

Overview of the OneProt model.

The model aligns multiple modalities, including primary protein sequence, 3D protein structure, binding pockets, and text annotations. Each modality is processed by its respective encoder, generating embeddings aligned in a shared latent space, facilitating cross-modal learning and integration.

Table 1.

Overview of OneProt’s different encoders.

Table 2.

OneProt data overview.

Fig 3.

Alignment performance across modality combinations paired (left column) and not paired (emergent, right column) during training for OneProt-5 (top row) and OneProt-4 (bottom row).

The axes of the polygons correspond to the modality pairs, and the vertices correspond to R@1 (inner polygon), R@10 (middle polygon), and R@100 (outer polygon). These values are the fraction of queries for which the correct (ground-truth) match appears among the top 1, top 10, or top 100 retrieved embeddings, respectively; the best possible value is 1. MR is the median rank of the correct embedding in the other modality; the best possible value is 1.
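The retrieval metrics described in this caption can be illustrated with a minimal sketch. It assumes two embedding matrices in which row i of the query modality is the ground-truth match of row i of the target modality, and it ranks candidates by cosine similarity. The function name `retrieval_metrics` and the choice of cosine similarity are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np

def retrieval_metrics(query_emb, target_emb, ks=(1, 10, 100)):
    """R@k and median rank (MR), assuming row i of both arrays is a matched pair."""
    query_emb = np.asarray(query_emb, dtype=float)
    target_emb = np.asarray(target_emb, dtype=float)
    # L2-normalise so that the dot product equals cosine similarity (assumed ranking score).
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    sim = q @ t.T  # shape: (n_queries, n_targets)
    # Rank of the ground-truth target: 1 + number of targets scoring strictly higher.
    true_scores = np.diag(sim)[:, None]
    ranks = 1 + (sim > true_scores).sum(axis=1)
    # R@k is the fraction of queries whose ground-truth match is ranked within the top k.
    metrics = {f"R@{k}": float((ranks <= k).mean()) for k in ks}
    metrics["MR"] = float(np.median(ranks))
    return metrics
```

With perfectly aligned embeddings every ground-truth match is ranked first, so R@1 and MR both equal 1, which is the best case named in the caption.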

Table 3.

Performance comparison of OneProt, SaProt-LoRa, SaProt, ProTrek, ESM, and OpenFold on five diverse downstream biological protein tasks: ThermoStability (regression), HumanPPI, Metal Ion Binding, DeepLoc Binary (binary classification) and DeepLoc Subcellular (multiclass classification), using Spearman correlation for ThermoStability, and accuracy (ACC) and Area Under the Receiver Operating Characteristic Curve (AUC) for the remaining tasks.

Table 4.

Performance comparison of OneProt, SaProt-LoRa, SaProt, ProTrek, ESM, and OpenFold on four multi-label function prediction tasks: Enzyme Commission numbers (EC), Gene Ontology (GO) terms corresponding to Molecular Function (MF), Biological Process (BP), and Cellular Component (CC), using maximum F1-score metric (Fmax) defined by Eq (6).
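Eq (6) is not reproduced in this listing, so the following is only a sketch of the protein-centric Fmax that is commonly used for EC and GO prediction; the paper's exact definition may differ. At each threshold, precision is averaged over proteins with at least one predicted label and recall over proteins with at least one true label (both averaging conventions are assumptions here), and Fmax is the largest resulting F1 over all thresholds. The function name `fmax` and the threshold grid are illustrative.

```python
import numpy as np

def fmax(y_true, y_score, thresholds=None):
    """Protein-centric maximum F1 over decision thresholds (common convention, assumed)."""
    y_true = np.asarray(y_true, dtype=bool)     # shape: (n_proteins, n_labels)
    y_score = np.asarray(y_score, dtype=float)  # predicted scores in [0, 1]
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)  # assumed grid, not taken from the paper
    best = 0.0
    for t in thresholds:
        pred = y_score >= t
        tp = (pred & y_true).sum(axis=1)
        n_pred = pred.sum(axis=1)
        n_true = y_true.sum(axis=1)
        has_pred = n_pred > 0  # precision is averaged over these proteins only
        has_true = n_true > 0  # recall is averaged over annotated proteins
        if not has_pred.any() or not has_true.any():
            continue
        precision = (tp[has_pred] / n_pred[has_pred]).mean()
        recall = (tp[has_true] / n_true[has_true]).mean()
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return float(best)
```

A model whose scores reproduce the labels exactly reaches Fmax = 1 under this definition.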

Fig 4.

Model performance comparison based on Area Under Precision Recall curve (AUPR) scores for TopEnzyme.

Each boxplot shows the AUPR distribution for a method (TopEC, CLEAN, ESM-2, ProTrek-35M, ProTrek-650M, OneProt).

Fig 5.

Cosine similarity distributions for ESM-2, ProTrek-35M, ProTrek-650M, OneProt-4 and OneProt-5.

The plot shows the similarity of a given protein to three groups: the 50 most evolutionarily similar proteins, the 50 most evolutionarily divergent sequences, and 1000 unrelated sequences. While all models partially capture evolutionary relationships, OneProt distinctly separates the three classes, demonstrating its ability to generate meaningful sequence representations.
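The comparison in this figure rests on cosine similarity between one protein's embedding and the embeddings of each group. A minimal sketch follows, assuming embeddings are available as NumPy arrays; the function name `cosine_similarities` is hypothetical, and how the three groups are selected is not covered here.

```python
import numpy as np

def cosine_similarities(query, group):
    """Cosine similarity between one query embedding and each row of a group."""
    query = np.asarray(query, dtype=float)  # shape: (dim,)
    group = np.asarray(group, dtype=float)  # shape: (n_members, dim)
    # Normalise both sides so the dot product is the cosine of the angle between them.
    q = query / np.linalg.norm(query)
    g = group / np.linalg.norm(group, axis=1, keepdims=True)
    return g @ q  # shape: (n_members,), values in [-1, 1]
```

Calling the function once per group (similar, divergent, unrelated) yields three sets of values whose distributions can then be compared across models.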

Table 5.

AUC scores for the ProSPECCTs datasets. ST and SG stand for the Structure Token and Structure Graph modalities, respectively.

Table 6.

Downstream results of the selected ablations on the datasets from [28].

Abbreviations as in Tables 3–5.

Fig 6.

Normalized performance drops.

The heatmaps visualize Eq (7) applied to the OneProt ablations, where x is the model on the horizontal axis and y is the model on the vertical axis. Shades of blue correspond to negative values (the model on the horizontal axis outperforms the model on the vertical axis), and shades of red correspond to positive values.
