Fig 1.
The overall workflow of SPIN-CGNN, which employs a contact-based graph with improved edge and node updates.
Table 1.
Contact-based versus K-nearest neighbors in the absence of edge information for all methods (Model 1 for SPIN-CGNN) according to perplexity and median sequence recovery for two test datasets (CATH4.2-StructNR193 and PDB-StructNR156).
Table 2.
Impact of CGNN edge updates (symmetric information and second-order edge (SOE) information) according to perplexity and median sequence recovery for two test datasets (CATH4.2-StructNR193 and PDB-StructNR156).
Table 3.
Impact of the use of selective kernels in node update and edge update according to perplexity and median sequence recovery for two test datasets (CATH4.2-StructNR193 and PDB-StructNR156).
Table 4.
Method comparison on the whole CATH4.2 test set according to perplexity and median native sequence recovery.
Table 5.
Comparison of sequences designed by SPIN-CGNN, RosettaFixBB, OSCAR-design, ProteinMPNN, and PiFold on the CATH4.2-StructNR193 and PDB-StructNR156 test sets according to perplexity, median sequence recovery, the median relative deviation of the frequency of amino-acid residue types, the median relative BLOSUM score, the fraction of low-complexity regions, conservation of hydrophobic and hydrophilic sequence positions, the mean steric clash count of refolded structures, and the difference between refolded and target structures in terms of RMSD, GDT-TS, and TM-score.
Fig 2.
Deviation of the frequency of an amino acid in designed sequences from that in the native sequences by RosettaFixBB, OSCAR-design, ProteinMPNN, PiFold and SPIN-CGNN.
(A) CATH4.2-StructNR193 and (B) PDB-StructNR156 test set.
Fig 3.
Confusion matrix of SPIN-CGNN in comparison to the reference matrix BLOSUM62 on the CATH4.2-StructNR193 test set.
Positive values (colored) indicate substitutions between amino acids. ρ denotes the Pearson correlation coefficient between the confusion matrix of SPIN-CGNN and BLOSUM62.
Fig 4.
Deviations of the AlphaFold2-predicted structures of designed sequences from their respective target structures on four separate test sets (left to right panels: CATH4.2-StructNR193, PDB-StructNR156, Hallucination129, and Diffusion100), evaluated according to RMSD (Å), GDT-TS, and TM-score (top to bottom panels). The statistical significance of the difference between a given method and SPIN-CGNN is marked with ‘**’ for highly statistically significant (p-value<0.01), ‘*’ for statistically significant (0.01<p-value<0.05), and ‘-’ for not statistically significant (p-value>0.05). Specific p-values are presented in S4 Table.
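The significance markers described in the Fig 4 caption can be sketched as a small mapping from p-value to symbol. This is only an illustration: the function name `significance_marker` is hypothetical, and because the caption does not say how p-values exactly equal to 0.01 or 0.05 are treated, the sketch assumes they fall into the less significant category.

```python
def significance_marker(p_value: float) -> str:
    """Map a p-value to the marker convention stated in the Fig 4 caption.

    Assumption (not stated in the caption): boundary values 0.01 and 0.05
    are assigned to the less significant category.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if p_value < 0.01:
        return "**"  # highly statistically significant
    if p_value < 0.05:
        return "*"   # statistically significant
    return "-"       # not statistically significant
```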
Table 6.
Comparison of sequences designed by SPIN-CGNN, ProDesign-LE, RosettaFixBB, OSCAR-design, ProteinMPNN, and PiFold on the Hallucination129 and Diffusion100 test sets according to the fraction of low-complexity regions (LCR), the mean steric clash count of refolded structures, and the difference between refolded and target structures in terms of RMSD, GDT-TS, and TM-score.
Fig 5.
An illustrative example highlighting the dense local contacts accounted for by SPIN-CGNN to improve fixed-backbone design.
(A) The neighbors (magenta) determined by CGraph12 for residue 10 (yellow) in PDB 2AUV chain A, the residue with the greatest number of neighbors. (B) The AlphaFold2-predicted structure of the sequence designed by SPIN-CGNN (magenta), aligned to the native PDB structure (green). (C) The neighbors determined by KNN-30 (cyan) for the same protein. (D) The corresponding AlphaFold2-predicted structure of the sequence designed by PiFold (cyan).
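The contrast between panels (A) and (C) can be illustrated with a minimal sketch of the two neighbor definitions. It rests on assumptions not stated in the captions: that CGraph12 means a distance cutoff of 12 Å between one representative atom per residue, and that KNN-30 means the 30 nearest residues. The function names and the toy coordinates are hypothetical and are not taken from SPIN-CGNN or PiFold.

```python
import math

def pairwise_distances(coords):
    """Euclidean distance between every pair of residue coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def contact_neighbors(coords, cutoff=12.0):
    """Contact-based graph: neighbors are all residues within `cutoff` (assumed Å).

    The neighbor count varies per residue, so densely packed residues
    receive more neighbors than exposed ones.
    """
    dist = pairwise_distances(coords)
    return [
        [j for j, d in enumerate(row) if j != i and d <= cutoff]
        for i, row in enumerate(dist)
    ]

def knn_neighbors(coords, k=30):
    """K-nearest-neighbor graph: each residue gets at most `k` closest residues."""
    dist = pairwise_distances(coords)
    result = []
    for i, row in enumerate(dist):
        others = [j for j in range(len(coords)) if j != i]
        others.sort(key=lambda j: row[j])  # stable sort: ties keep index order
        result.append(others[:k])
    return result

# Toy example: ten points on a line, spaced 3.8 apart (hypothetical data).
coords = [(3.8 * i, 0.0, 0.0) for i in range(10)]
contacts = contact_neighbors(coords, cutoff=12.0)
knn = knn_neighbors(coords, k=2)
```

In this toy chain an end point has fewer contact neighbors than an interior point, whereas the K-nearest-neighbor graph assigns every point the same number of neighbors regardless of local density.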