Figure 1.
Flowchart overview of EPC-map, combining evolutionary information (upper box) and physicochemical information (lower box).
For evolutionary contact prediction, multiple-sequence alignments are constructed by searching the Uniprot20 database with HHblits. GREMLIN is then used to predict contacts from the alignments. For physicochemical contact prediction, decoys are generated with Rosetta. From each decoy, contact graphs are constructed and feature input vectors are computed. An SVM ensemble predicts the contact probability from each feature vector. The SVM probabilities and the contact occurrence statistics are then used to predict physicochemical contacts. Lastly, the evolutionary and physicochemical contact predictions are combined to form the output of EPC-map.
Figure 2.
Definition of graphs used to model the neighborhood of the contacting residues i and j: Nodes represent residues (circles), edges represent contacts (solid black lines).
A: The neighborhood graph for residue i contains all residues in contact with residues i-1, i, and i+1 (dark grey). B: The neighborhood graph for residue j. C: The shared neighborhood graph for the contact between residues i and j is defined by the intersection of the neighborhood graphs of i and j. Residues that belong to the shared neighborhood graph are shown in blue. Shared neighborhood graphs capture the local context of the shared neighborhood of the contacting residues. D: The immediate neighborhood graph is defined by all residues that are in contact with i or j. Residues that belong to the immediate neighborhood graph are shown in blue. Immediate neighborhood graphs capture the direct neighborhood of the contacting residues.
Table 1.
Overview of the features used for contact prediction. A detailed description of the features is given in the supporting information.
Figure 3.
Prediction performance overview for the CASP10 and CASP10hard data sets.
The figure shows the long-range contact prediction performance of the top-scoring L/5 contacts. Different methods are shown as color-coded violin plots. The lower and upper ends of the black vertical bar in each violin denote the accuracy at the 25th and 75th percentiles, respectively. White horizontal bars indicate the median accuracy, red horizontal bars the mean accuracy. The distribution of the prediction accuracies for individual proteins is indicated by the shape of the violin.
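The accuracy measure named in this caption, the fraction of correct contacts among the top-scoring L/5 long-range predictions, can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name and input layout are hypothetical, and the long-range cutoff of 24 residues of sequence separation is a commonly used convention that the caption itself does not specify.

```python
from typing import Iterable, Set, Tuple


def top_l5_accuracy(
    predictions: Iterable[Tuple[int, int, float]],
    native: Set[Tuple[int, int]],
    length: int,
    min_sep: int = 24,  # assumed long-range cutoff, not taken from the caption
) -> float:
    """Accuracy of the top L/5 long-range predictions.

    predictions: (i, j, score) triples, higher score = more confident.
    native: native contacts stored as (i, j) with i < j.
    length: sequence length L of the protein.
    """
    # Keep only pairs far enough apart in sequence, best score first.
    long_range = [p for p in predictions if abs(p[0] - p[1]) >= min_sep]
    long_range.sort(key=lambda p: p[2], reverse=True)

    top = long_range[: max(1, length // 5)]
    if not top:
        return 0.0

    hits = sum(1 for i, j, _ in top if (min(i, j), max(i, j)) in native)
    return hits / len(top)


# Toy input (hypothetical values): L = 10, so the top 2 predictions are scored.
toy_predictions = [(1, 30, 0.9), (2, 40, 0.8), (3, 50, 0.7), (4, 6, 0.99)]
toy_native = {(1, 30), (3, 50)}
toy_accuracy = top_l5_accuracy(toy_predictions, toy_native, length=10)
```

The short-range pair (4, 6) is filtered out despite its high score. The sketch divides by the number of predictions actually evaluated; dividing by L/5 even when fewer predictions are available is an equally plausible convention.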
Figure 4.
Prediction performance overview for the CASP9-10_hard, EPC-map_test, D329 and SVMCON_test data sets.
The figure shows the long-range contact prediction performance of the top-scoring L/5 contacts. Different methods are shown as color-coded violin plots. The lower and upper ends of the black vertical bar in each violin denote the accuracy at the 25th and 75th percentiles, respectively. White horizontal bars indicate the median accuracy, red horizontal bars the mean accuracy. The distribution of the prediction accuracies for individual proteins is indicated by the shape of the violin. Data sets are sorted from difficult (CASP9-10_hard) to easy (SVMCON_test). The last panel shows the pooled results for all proteins from these data sets.
Figure 5.
Alignment depth composition of the CASP9-10_hard, EPC-map_test, D329 and SVMCON_test data sets.
Proteins are grouped into bins based on the number of sequences in their alignment. Each color corresponds to one bin, from dark blue (few sequences) to red (many sequences). Data sets are sorted from difficult (CASP9-10_hard) to easy (SVMCON_test). The last panel shows the pooled results.
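Grouping proteins by alignment depth, as described in this caption, amounts to a histogram over bin edges. The sketch below is illustrative only: the bin edges and depth values are hypothetical placeholders, since the caption does not state the actual boundaries.

```python
from bisect import bisect_right
from typing import List, Sequence


def depth_bin_counts(depths: Sequence[int], edges: Sequence[int]) -> List[int]:
    """Count proteins per alignment-depth bin.

    edges must be sorted ascending; bin k holds depths d with
    edges[k-1] <= d < edges[k], with open-ended first and last bins.
    """
    counts = [0] * (len(edges) + 1)
    for depth in depths:
        counts[bisect_right(edges, depth)] += 1
    return counts


# Hypothetical bin edges and alignment depths, for illustration only.
toy_edges = [10, 100, 1000]
toy_depths = [5, 50, 500, 5000, 10]
toy_counts = depth_bin_counts(toy_depths, toy_edges)
```

Dividing each count by the number of proteins in a data set would give the per-bin fractions that a composition plot of this kind displays.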
Figure 6.
Prediction performance for proteins with increasing sequence alignment depth.
Results are shown for all proteins pooled from the CASP9-10_hard, EPC-map_test, D329 and SVMCON_test data sets. Different methods are shown as color-coded violin plots. The lower and upper ends of the black vertical bar in each violin denote the accuracy at the 25th and 75th percentiles, respectively. White horizontal bars indicate the median accuracy, red horizontal bars the mean accuracy. The distribution of the prediction accuracies for individual proteins is indicated by the shape of the violin. EPC-map is consistently more accurate than the other tested methods, regardless of how many sequences are available.
Table 2.
Contribution of the SVM component to contact prediction.
Figure 7.
Dependence of prediction accuracy on sequence length.
EPC-map is more accurate than or on par with GREMLIN, irrespective of sequence length. The performance increase over GREMLIN is most pronounced for proteins smaller than 250 residues. Counting performs better on smaller proteins. The SVM component of EPC-map consistently improves the contact prediction from decoys over Counting by leveraging physicochemical information.
Figure 8.
Comparison of ab initio structure prediction of 132 proteins from EPC-map_test with and without predicted contacts: each data point corresponds to the GDT_TS of the lowest-energy structure generated with and without the use of EPC-map predicted contacts.
EPC-map increases the average prediction accuracy by 7.8 GDT_TS points, from 33.1 to 40.9 (paired Student's t-test p-value).
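The paired comparison named in this caption can be sketched as a paired t statistic computed on per-protein GDT_TS differences. This is a minimal sketch with hypothetical names and toy scores, not the paper's data or analysis script. It stops at the t statistic; a p-value would follow from the t distribution with n-1 degrees of freedom, for example via `scipy.stats.ttest_rel`.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(
    with_contacts: Sequence[float], without_contacts: Sequence[float]
) -> Tuple[float, float]:
    """Return (mean difference, t statistic) for paired per-protein scores."""
    if len(with_contacts) != len(without_contacts):
        raise ValueError("paired samples must have equal length")
    if len(with_contacts) < 2:
        raise ValueError("need at least two pairs")

    # Per-protein improvement; pairing removes between-protein variance.
    diffs = [w - wo for w, wo in zip(with_contacts, without_contacts)]
    mean_diff = mean(diffs)
    std_err = stdev(diffs) / sqrt(len(diffs))  # sample std. dev. (n-1)
    return mean_diff, mean_diff / std_err


# Toy GDT_TS values (hypothetical, not results from the paper).
toy_with = [40.0, 50.0, 60.0]
toy_without = [35.0, 44.0, 52.0]
toy_mean_diff, toy_t = paired_t_statistic(toy_with, toy_without)
```

The sketch assumes the differences are not all identical; in that case the standard error is zero and the statistic is undefined.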
Figure 9.
Tertiary structure prediction improvement of the dissimilatory sulfite reductase D (PDB 1ucrA), of the E. coli SSB-DNA polymerase III (PDB 3sxuB) and of the GIT1 paxillin-binding domain (PDB 2jx0A).
Contact maps show false positive predictions in the upper triangle (red), true positive predictions in the lower triangle (blue) and native contacts in grey. For the shown predictions, native structures are shown in grey and predicted structures are colored from N-terminus (blue) to C-terminus (red). The predictions correspond to the lowest-energy structure generated without use of contacts (middle column) and with EPC-map predicted contacts (right column).