Fast protein structure comparison through effective representation learning with contrastive graph neural networks

doi:10.1371/journal.pcbi.1009986

Fig 1.

(A) A complete graph is constructed from the protein tertiary structure, with the adjacency matrix derived from the inter-residue distance matrix. (B) Raw node features consist of a distance-based feature x_v and an angle-based feature x_a.
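
As a rough illustration of panel (A), the sketch below builds a distance matrix from residue coordinates and maps it to a dense weighted adjacency matrix. The caption does not say how distances are converted to edge weights, so the `1 / (1 + d / scale)` mapping, the `scale` value, the function names, and the toy coordinates are all assumptions made for this example.

```python
import math

def distance_matrix(coords):
    """Pairwise Euclidean distances between residue coordinates (e.g. C-alpha atoms)."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def weighted_adjacency(dist, scale=4.0):
    """Map distances to edge weights in (0, 1]; closer residues get larger weights.

    The functional form and the scale are illustrative assumptions,
    not taken from the caption.
    """
    return [[1.0 / (1.0 + d / scale) for d in row] for row in dist]

# Three hypothetical residue positions, in angstroms.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
D = distance_matrix(coords)
A = weighted_adjacency(D)
```

Because every residue pair receives a weight, the resulting graph is complete, which matches the description in panel (A).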

Fig 2.

Architecture of GNN-based encoder.

The BiLSTM module extracts low-level node features from the primary structures of proteins. The graph convolution module extracts high-level node features based on the adjacency matrices. The readout module transforms node features into descriptors with a global max pooling layer. Each residual block (ResBlock) used in the graph convolution module consists of two graph convolutional (GC) layers.
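
The graph convolution and readout stages described above can be sketched in a few lines. This is a minimal sketch under several assumptions: the GC layer is taken to be the plain `adj @ h @ w` form, the skip connection and ReLU placement inside the residual block are guesses, the BiLSTM stage is omitted, and all names and toy values are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def gc_layer(adj, h, w):
    """One graph convolution: aggregate over neighbours (adj @ h), then project (@ w)."""
    return matmul(matmul(adj, h), w)

def res_block(adj, h, w1, w2):
    """Two GC layers plus a skip connection; w1 and w2 must keep the feature width."""
    out = relu(gc_layer(adj, h, w1))
    out = gc_layer(adj, out, w2)
    return relu([[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(out, h)])

def global_max_pool(h):
    """Readout: feature-wise maximum over all nodes gives one fixed-length descriptor."""
    return [max(col) for col in zip(*h)]

adj = [[1.0, 0.5], [0.5, 1.0]]       # toy 2-residue adjacency matrix
h = [[1.0, 2.0], [3.0, -1.0]]        # toy node features (stand-in for BiLSTM output)
identity = [[1.0, 0.0], [0.0, 1.0]]  # toy weights
descriptor = global_max_pool(res_block(adj, h, identity, identity))
```

Global max pooling makes the descriptor length independent of the number of residues, so proteins of different lengths can be compared directly.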

Fig 3.

The contrastive learning framework for protein structure representation learning.

At each iteration, raw features X_q and X_k are extracted from the query protein structure and the key protein structure, respectively. Descriptors y_q and y_k are then encoded by the query GNN encoder and the key GNN encoder, respectively. The value of the loss function guides the optimization of the parameters θ_q of the query encoder, while the parameters θ_k of the key encoder are updated based on θ_q. At the end of the current iteration, y_k is enqueued as a negative sample for the next iteration.
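
One iteration of such a framework might look like the sketch below. The caption only says that θ_k is updated "based on θ_q" and that y_k is enqueued; the exponential moving average update, the InfoNCE-style loss, the temperature `tau`, the queue size, and the treatment of y_k as a positive match for y_q in the current iteration are all assumptions borrowed from common momentum-contrast setups, not details confirmed by the source.

```python
import math
from collections import deque

def momentum_update(theta_k, theta_q, m=0.999):
    """Assumed update rule: theta_k follows theta_q as an exponential moving average."""
    return [m * k + (1.0 - m) * q for k, q in zip(theta_k, theta_q)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(y_q, y_k, negatives, tau=0.07):
    """Assumed loss: cross-entropy over one positive key and the queued negatives."""
    logits = [cosine(y_q, y_k) / tau] + [cosine(y_q, n) / tau for n in negatives]
    mx = max(logits)  # subtract the maximum for numerical stability
    log_sum = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return log_sum - logits[0]

# Toy descriptors; the queue holds keys from earlier iterations as negatives.
queue = deque([[0.0, 1.0], [-1.0, 0.0]], maxlen=4)
y_q, y_k = [1.0, 0.0], [0.9, 0.1]

loss = info_nce(y_q, y_k, list(queue))
theta_k = momentum_update([1.0, 1.0], [0.0, 2.0], m=0.9)
queue.append(y_k)  # y_k becomes a negative sample for the next iteration
```

Using a bounded `deque` means the oldest keys are dropped automatically once the queue is full, which keeps the set of negatives fresh without recomputing them.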

Table 1.

Ablation studies of length-scaling cosine distance, the dynamic training data partition strategy and the GNN-based encoder on SCOPe v2.07 and ind_PDB.

Table 2.

Ranking performance of GraSR and other baseline methods.

Fig 4.

Correlation between distance derived from the representations learned by GraSR/DeepFold and TM-score on (A) SCOPe v2.07 and (B) ind_PDB.

The Pearson correlation coefficient (PCC) is calculated for quantitative assessment.
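
For reference, the PCC between descriptor distances and TM-scores can be computed as in the sketch below. The numbers are invented for illustration and are not results from the paper; the function name is hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Made-up pairs: a small descriptor distance should go with a high TM-score.
distances = [0.1, 0.4, 0.7, 0.9]
tm_scores = [0.9, 0.6, 0.4, 0.2]
r = pearson(distances, tm_scores)
```

A strongly negative `r` here would indicate that the learned distance tracks structural similarity well, since TM-score grows as structures become more alike.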

Fig 5.

Per-class F1-scores of GraSR and other baseline methods on SCOPe.

a: All alpha proteins; b: All beta proteins; c: Alpha and beta proteins (a/b); d: Alpha and beta proteins (a+b); e: Multi-domain proteins (alpha and beta); f: Membrane and cell surface proteins and peptides; g: Small proteins.
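
A per-class F1-score of this kind can be computed as sketched below, using the standard definition F1 = 2TP / (2TP + FP + FN). The labels are toy values that reuse the class letters from the legend; they are not data from the paper, and the function name is hypothetical.

```python
def per_class_f1(y_true, y_pred):
    """Return {class: F1} for every class seen in the true or predicted labels."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

# Toy labels only.
y_true = ["a", "a", "b", "c", "c", "g"]
y_pred = ["a", "b", "b", "c", "d", "g"]
scores = per_class_f1(y_true, y_pred)
```

Reporting F1 per class rather than a single accuracy figure keeps small classes, such as f and g, from being masked by the larger ones.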

Table 3.

Multi-class classification performance of GraSR and other methods.

Table 4.

Time cost of GraSR and other methods for protein structure retrieval from ind_PDB.

Fig 6.

Visualization of descriptors learned by GraSR and other methods using t-SNE.

a: All alpha proteins; b: All beta proteins; c: Alpha and beta proteins (a/b); d: Alpha and beta proteins (a+b); e: Multi-domain proteins (alpha and beta); f: Membrane and cell surface proteins and peptides; g: Small proteins.

Fig 7.

Protein structure superposition derived from the residue-level descriptors of GraSR.

(A) SCOPe-sid: d1v59a2 (red) and d1h6va2 (blue). (B) SCOPe-sid: d5dqpa_ (red) and d1ezwa_ (blue).
