DeepCDpred: Inter-residue distance and contact prediction for improved prediction of protein structure

Rapid, accurate prediction of protein structure from amino acid sequence would accelerate fields as diverse as drug discovery, synthetic biology and disease diagnosis. Massively improved prediction of protein structures has been driven by improving the prediction of the amino acid residues that contact in their 3D structure. For an average globular protein, around 92% of all residue pairs are non-contacting, therefore accurate prediction of only a small percentage of inter-amino acid distances could increase the number of constraints to guide structure determination. We have trained deep neural networks to predict inter-residue contacts and distances. Distances are predicted with an accuracy better than most contact prediction techniques. Addition of distance constraints improved de novo structure predictions for test sets of 158 protein structures, as compared to using the best contact prediction methods alone. Importantly, usage of distance predictions allows the selection of better models from the structure pool without a need for an external model assessment tool. The results also indicate how the accuracy of distance prediction methods might be improved further.

The problem of predicting protein structure from amino acid sequence has been 2 transformed in the last decade from one of aspiration to one of application, although 3 prediction methods are not yet a routine laboratory tool. Recently, well founded structural models. More accurate structures can be generated and better models can be 14 selected from a pool of possible structures than when contact predictions alone are used 15 to constrain the models in the pool. Inter-residue distance predictions thus enhance the 16 ability to generate and select good quality models. 17 Contacts in a protein structure often involve amino acids that vary across homologs 18 in a correlated way, which is attributed to evolution selecting contacting amino acids to 19 maintain the structural stability of the protein [3] (Fig 1A). However, strong correlation 20 can arise due to two residues contacting a common third amino acid, referred to as a 21 transitive effect, and these residues can thus be falsely predicted as being in contact. 22 During the last decade, efficient global statistical techniques for removing the transitive 23 effect have been developed, thus allowing one to identify clearly the directly coupled 24 positions in multiple sequence alignments (MSAs) [3][4][5][6][7]. For an alignment of length L, 25 such statistical techniques can now predict L/10 contact pairs with accuracies as high as 26 70 to 80% [8]. A direct result of this has been a rapid improvement in de novo structure 27 prediction, e.g. [1,2], thus fulfilling the original hopes of Valencia and co-workers [9]. Co-variance between columns of a multiple sequence alignment can be used to predict inter-residue contacts and distances and thence protein structure.
Training neural networks provides further improvement to the accuracy of contact 29 predictions. MetaPSICOV [8] uses a two-layer neural network for contact prediction (i.e. 30 input layer, hidden layer, output layer), with an input vector of 672 features. The 31 feature vector includes local properties of the amino acids under consideration, such as 32 predicted secondary structure and solvent accessibility, properties of the whole sequence, 33 such as the average of the predicted solvent exposure, and coevolutionary scores for 34 directly coupled amino acid positions inferred from the aforementioned global statistical 35 techniques. MetaPSICOV pushes the accuracy of contact prediction to over 90% for the 36 predicted top L/10 contacts, where L in the number of amino acids in the target protein 37 sequence [8]. Very recent applications of deep learning, i.e. multi-layered neural 38 networks, are reported as surpassing the prediction accuracy of MetaPSICOV by 16 to 39 23% for the top L/5 long range contacts, for the CASP11 protein sets [10][11][12]. Deep 40 learning increases the number of processing layers so the network can learn to abstract 41 features from the input data, which the final layer can then use for classification.

42
Another recent paper uses a naïve Bayes classifier to calculate the posterior probabilities 43 of eight coevolution analysis, which are then processed by a shallow feed-forward neural 44 network [13]. 45 We hypothesised that pairs of spatially distant amino acids may also co-evolve. 46 Indeed, the literature has some evidence for this [14][15][16] and Pollastri and co-workers 47 have published distance predictions [17,18]. However, the RMSD between the actual 48 distance and the predicted distance by this method was over 8 for residues separated by 49 23 or more residues in sequence. Moreover, ab initio models generated with this data 50 had TM-scores of a little over 0.2. TM-score measures the similarity of two protein 51 structures, with 0.2 meaning the predicted structure and the native structure are 52 unrelated. 0.5 indicating that the structures are more likely to be the same fold than 53 not and 0.7 or greater indicating that proteins are almost certainly the same fold [19].

54
In test systems, adding distance information to contact information could improve 55 structure predictions significantly compared to using contacts alone, even when the 56 distance information was noisy [17,18]. Thus, accurately predicting precise inter-residue 57 distance from sequence would provide further constraints during structure prediction, 58 with the possibility of increased speed and accuracy ( Fig 1B). 59 We trained four feed-forward neural network models to distinguish residue pairs in 60 spatial distance ranges of 0-8Å (i.e. contact), 8-13Å, 13-18Å and 18-23Å, which 61 together we call DeepCDpred (deep contact distance prediction). Our method uses a 62 similar feature vector to MetaPSICOV but with eight hidden layers and with only one 63 prediction stage (while MetaPSICOV uses two). The key development is that we predict 64 accurate inter-residue distances, which we show improve structure prediction, and the 65 ability to choose better models from a pool of candidate models.

67
Neural Network Feature Vector and Set Up 68 All four distance ranges were trained using the same neural network architecture and 69 inputs, but with training data appropriate to that distance interval. Distances were 70 measured between the C β atoms (or C α in the case of glycine) of a pair of residues.

71
There are 733 features as input, described in the supplementary methods, and nine-layer 72 neural network model (i.e. one input layer, one output layer and eight hidden layers average output value of all these four. For each of the three distance prediction models, 77 only one neural network was produced. 78 We implemented here the QUIC algorithm [20] for sparse inverse covariance 79 estimation to calculate amino acid direct couplings for inclusion in the feature vectors. 80 QUIC is similar to PSICOV [5], solving a GLASSO problem, but our tests show that it 81 is much faster than PSICOV, taking on average a quarter of the time to calculate  Residue positions that are very close in protein sequence would be expected to be 86 close in the 3D structure without any need for sophisticated prediction tools. Thus, 87 since they may mask other significant co-evolutionary signals during neural network 88 training, residue pairs separated by 5 or fewer residues were ignored during training and 89 testing, as was also done by others [5][6][7][8]. 90 Similarly, it is trivial to predict the distance between two residues on the same 91 secondary structure element, if their sequence separation is known. For the distance 92 predictions, residue pairs on the same predicted alpha helix or beta strand were ignored, 93 to stop their trivial distance prediction; this was done for the training, test and 94 validation sets. In addition, a different minimum sequence separation was set for each 95 distance bin. If the sequence separation of a pair is 5 amino acids, even once residues on 96 the same secondary structure have been ignored, their spatial separation can still be  cut-off of 8 or more residues was used for distance bin 8-13Å and similarly, 13 was the 101 minimum sequence separation for distance bin 13-18Å. For the 18-23Å bin a separation 102 of 15 amino acids or more was chosen. For contact predictions, we kept in the data set 103 residue pairs that were predicted to be on the same secondary structure element.

104
Training and testing 105 The main test set consists of 108 from the 150 proteins of the MetaPSICOV test set, so 106 we can be sure that the test proteins are not in the training set of either MetaPSICOV 107 or DeepCDpred. Additionally, these 108 proteins are not listed in the training set of 108 RaptorX [11]. A chain was removed from the MetaPSICOV test set when a sequence 109 with >25% identity to it was found in the training set of SPIDER2 [21], since we used 110 SPIDER2 for secondary structure prediction, which is included in our feature vector 111 and for subsequent structural modelling. This gave 108 protein chains ranging from 52 112 to 266 amino acids with 25% or less sequence identity to each other. Based on 113 annotation in the PDB, 87 chains are monomers in the biological unit, and 21 are from 114 multimeric complexes of some sort, one chain is a membrane protein. The PDB IDs of 115 these 108 protein chains are listed in S2 Table. 116 Even though the maximum sequence identity is 25% between the training and the 117 test sets, some of the proteins in our test set have common topology classes (and 118 homologous superfamily classes) with the training set proteins, based on CATH 119 classification [22]. In order to test whether our trained model has a bias towards 120 predicting contacts and distances for structures with training set topologies, we 121 generated another test set with 50 proteins that do not have the same topology as any 122 of the training set proteins of DeepCDpred, RaptorX and MetaPSICOV which are listed 123 in S3 Table. 124 The training set was chosen from the PISCES set [23], downloaded in November 125 2016. The selected training set protein chains and 50 topologically independent test 126 protein structures were solved with no worse than 2Å resolution, a maximum R value 127 of 0.25, with no more than 25% pairwise sequence identity to each other or the test set, 128 and with fewer than 400 amino acids. Of these structures, 1701 chains were arbitrarily 129 selected.

130
The neural network training protocols are described in supplementary methods. The 131 accuracy of the test set predictions was calculated as the true positives divided by the 132 total number of predictions. Since we make no predictions for false positives, i.e. FP=0, 133 the standard formulae for accuracy and precision (PPV) become identical. DeepCDpred distance predictions. As constraints, the top 3L/2 scoring contacts were 139 used for structure predictions, applying the same Rosetta protocol for all predictions, as 140 described in the next subsection. Residue pairs predicted in the 8-13Å, 13-18Å and 141 18-23Å distance ranges were selected when they had a neural network score of greater 142 than or equal to 0.6, up to a maximum of 1.5L, L, and L pairs, respectively, for 8-13Å, 143 13-18Å and 18-23Å bins. For comparison, the RaptorX server was also used for 144 structure predictions.

145
Structure prediction protocol 146 All structures were predicted with AbinitioRelax from Rosetta [24], with constraints 147 applied to enforce predicted secondary structure, contacts and inter-residue distances, 148 as described in Supplementary Methods. Three-residue and nine-residue fragments were 149 created using the program make fragments.pl from the Rosetta suite with the option of 150 excluding homologous structures. We generated 100 candidate structures for each test 151 protein and the one with the lowest total Rosetta energy, including constraint energy, 152 was selected as the prediction, unless otherwise stated. The script for the protocol is 153 given in the supplementary material.

154
Determining whether certain residue types are over represented 155 in our predictions 156 Correctly predicted distances and contacts were examined for biases towards certain 157 residue types. For a score from our neural network of >= 0.7 , we examined the ratio of 158 the expected distribution (E) in a given distance bin for a given structure, assuming all 159 residue pairs were predicted equally well, to the bias in the distribution that was 160 actually observed (O).
In the above equations, AB is a given residue pair type and d is the distance bin to 162 which they are assigned based on their spatial separation in the structure under 163 consideration. The mean O/E over all structures was calculated for each pair type, AB. 164 We also calculated the fraction of predictions that were true positives for each pair of 165 residue types in each distance bin.

167
Distance predictions lead to improved structure prediction 168 Our nine-layer neural network was tested on two sets of proteins. The first set tests the 169 network's ability on the types of proteins that it might encounter in practice, and the 170 second set tests the network's ability to deal with totally novel folds. The first test set 171 consisted of 108 proteins with 25% or less sequence identity to each other, of which 80 172 belong to a CATH family homologous to one of the training proteins, 90 have the same 173 CATH topology as a training protein, and the remaining 18 being neither topological 174 nor homologous to our training set. This group represents the sort of sequences that 175 might routinely be submitted to a contact/structure prediction algorithm, 25% sequence 176 identity generally being considered too low for a reliable homology model, even where 177 family homology can be detected [25]. The second test set was 50 proteins topologically 178 different from the training set proteins of our network and of the MetaPSICOV and 179 RaptorX contact prediction neural networks.

180
The accuracies of the distance predictions for the 108 protein test set are higher than 181 the accuracies of the 50 protein set (Fig 2). For the test set with 108 proteins, distance 182 prediction accuracies are better than the contact prediction accuracies of many other  The model with the lowest Rosetta energy when modelled using distance and contact 194 constraints, from DeepCDpred and RaptorX respectively, is more similar to the 195 experimental structure than when using contact constraints alone. This is true for both 196 the 108 protein test set (Fig 3A) and the set of 50 proteins topologously distinct from 197 the training sets of RaptorX, MetaPSICOV and DeepCDpred (Fig 3C).

198
Selecting models by the lowest Rosetta energy, distance constraints in addition to 199 contact constraints improved the mean TM-score compared to experiment by ∼ 0.07 in 200 the 108 protein set and ∼ 0.03 in the 50 protein set (p-value = 9x10 −9 and p-value = 201 0.004 in two paired t-tests, respectively). Inclusion of distance constraints on average 202 improves the best model (i.e. highest TM-score compared to experiment) produced for 203 the 108 protein test set, compared to using constraints only (a p-value of 0.001 in a 204 paired t-test), and has a small but statistically insignificant improvement on the set of 205 50 proteins (p-value 0.158), with average TM-scores increased by ∼ 0.01 and ∼ 0.008 206 respectively (Fig 3B,D). 207 Thus, applying distance constraints in addition to contact constraints improves the 208 quality of models produced, increasing average TM-scores by ∼ 11% and ∼ 4% for the 209 108 and 50 protein sets, respectively, although much of this effect is achieved by the 210 model with the lowest Rosetta energy being close to the best model in the ensemble of 211 models produced by Rosetta, which is not true when contact constraints alone are used 212 (Fig 4). With both test sets ∼ 1.4% of the improvement is attributable to improved 213 modelling, with the rest attributable to model selection.  consideration, with aliphatic residues in particular being highly represented (Fig 5 ). As 222 the inter-residue distance increases, hydrophilic residues become more prevalent in the 223 predictions, but aliphatic residues are over-represented up to a distance of 18Å, with 224 the 18-23Å range having hydrophilic interactions disproportionately represented, albeit 225 not as strongly as the 0-8 range is dominated by aliphatic residues. Lysine, and to some 226 degree arginine and glutamate are the most disproportionately represented in the 13-18 227 and 18-23Å distance ranges, with valine also continuing to be over-represented. 228 However, the accuracy of a prediction for a given neural network score is largely  length; M ef f is the number of sequences with <80% sequence identity with respect to 237 each other [6]. It has been shown that there is a correlation between the Nf value and 238 the structure prediction accuracy [26]. Sequences were removed from the alignments of 239 the 108 protein set to give them the same Nf as the set of 50, which reduced the  It seems unlikely that DeepCDpred is overfitting the examples, since it was trained 246 with early stopping, and validation and test results give similar prediction accuracies to 247 those of the training data. Nonetheless, the results point to DeepCDpred having poorer 248 generality than RaptorX. Two obvious differences between the prediction methods are 249 that RaptorX uses 6767 training proteins compared to 1701 used for DeepCDpred, and 250 that RaptorX uses a residual neural network, a type of convolutional neural network, 251 whereas DeepCDpred uses a feed-forward network. Increasing the size of the 252 DeepCDpred training set is likely to improve its accuracy for contact and distance 253 prediction. Similarly, investigating the use of alternative neural network architectures, 254 may also lead to improved distance predictions.

255
It is reasonable to question whether there is a need to pursue improved predictions 256 of inter-residue distance, since one might assume that it is sufficient to be able to 257 predict all contacting residues. It seems unlikely that the goal of predicting all 258 contacting residues will be achievable, since it depends currently on co-variance in 259 residue substitution patterns and there will generally be many positions in a sequence 260 alignment that are totally or highly conserved. The results here demonstrate that using 261 distance prediction can help in model selection and thus improve the prediction of 262 model structure above what can be achieved by the best contact prediction method that 263 was available at the time we undertook this work. Moreover, others report that even in 264 the event of knowledge of all contacts further distance information can improve de novo 265 modelling [17,18]. Knowledge of why the distance between non-contacting residues can 266 be predicted is of interest for trying to improve predictions and also for the potential 267 insight into protein structure and function.

268
It may be anticipated that the distance prediction could be achieved simply by 269 realising that hydrophilic residues have longer inter-residue distances since they are on 270 the surface and thus in many cases on the far extreme ends of the protein from each distance the network has more sensitivity for finding hydrophobic pairs. As the distance 278 increases the network becomes more sensitive for finding hydrophilic residue pairs. It is 279 not clear why there should be this difference in sensitivity, although it may reflect the 280 relative number of examples of the different residue pair types in each distance range, at predicting the most abundant residue pair types in a given distance range.

284
The data show that inter-residue distances can be predicted reliably using DeepCDpred, 285 the method introduced here. The consequent addition of distance constraints into de 286 novo structural modelling leads to better models than when contact predictions alone 287 are used. Including distance constraint terms leads to the models with the lowest 288 Rosetta energy being much closer to the experimental structure than when using only 289 contact constraints together with the Rosetta forcefield. Although others have 290 previously pointed towards the usefulness of distance prediction [17,18],to our 291 knowledge, this is the first demonstration of the practical benefit of inter-residue 292 distance prediction in the structure prediction problem. We anticipate that improved 293 prediction of inter-residue distance is possible via the most recent developments in deep 294 learning and by understanding the intrinsic bias in amino-acid distribution within 295 protein structures and the effect that has on the accuracy of deep learning methods.       The scale is given on the right hand side for each plot. Precision is calculated as the 345 number of correctly predicted contacts for that pair of amino acid types divided by the 346 total number of contact predictions for that pair for the predictions with >=0.7 network 347 score. replicated (replica1 (r1) and replica2 (r2)). For Rosetta server predictions models were 351 selected either by the lowest energy score (CNS score) or the best model among the 5 352 structures that the server provides. For all other prediction methods, models were 353 selected either with the lowest Rosetta energy or the best TM-score. The calculations 354 were performed for the test set of 108 proteins. The upper and the lower edges of the 355 boxes indicate the 25 th and 75 th percentiles, respectively. The medians are shown with 356 the central lines, the means are shown with black '+' signs and the outliers are shown 357 with red '+' signs. Even though the first set of best models which were generated with 358 the restraints of RaptorX contact predictions (RaptorX r1) are significantly better than 359 the best models generated with DeepCDpred contact predictions, replication of the 360 structure predictions with RaptorX contacts (RaptorX r2) resulted in no significantly 361 different average TM-score than the predictions performed with DeepCDpred contacts 362 (paired t-test p-value: 0.507). The results from the RaptorX server were on average 363 worse than all other calculations except the use of MetaPSICOV contact restraints 364 together with Rosetta, presumably because CNS, used by the RaptorX server, is not as 365 good at modelling structures as Rosetta is.