Multifaceted analysis of training and testing convolutional neural networks for protein secondary structure prediction

Protein secondary structure prediction remains a vital topic with broad applications. Due to the lack of a widely accepted standard for evaluating secondary structure predictors, a fair comparison of predictors is challenging. A detailed examination of the factors that contribute to higher accuracy is also lacking. In this paper, we present: (1) new test sets, Test2018, Test2019, and Test2018-2019, consisting of proteins from structures released in 2018 and 2019 with less than 25% identity to any protein published before 2018; (2) a 4-layer convolutional neural network, SecNet, with an input window of ±14 amino acids, trained on proteins ≤25% identical to proteins in Test2018 and the commonly used CB513 test set; (3) an additional test set that shares no homologous domains with the training-set proteins, according to the Evolutionary Classification of Proteins (ECOD) database; (4) a detailed ablation study in which we reverse one algorithmic choice at a time in SecNet and evaluate the effect on prediction accuracy; (5) new 4- and 5-label prediction alphabets that may be more practical for tertiary structure prediction methods. The 3-label accuracy (helix, sheet, coil) of the leading predictors on both Test2018 and CB513 is 81–82%, while SecNet’s accuracy is 84% for both sets. Accuracy on the non-homologous ECOD set (83.9%) is only 0.6 points lower than on the Test2018-2019 set (84.5%). The ablation study of features, neural network architecture, and training hyper-parameters suggests that the best accuracy is achieved with good choices for each of them, while the neural network architecture is not as critical as long as it is not too simple. Protocols for generating and using unbiased test, validation, and training sets are provided. Our data sets, including input features and assigned labels, and the SecNet software, including third-party dependencies and databases, are downloadable from dunbrack.fccc.edu/ss and github.com/sh-maxim/ss.

By using two sequence alignment programs, the protocol removes from the test set any proteins with more than 25% sequence identity to any previously published PDB structure of any experimental type, resolution, or quality.

The test set guarantees unbiased accuracy estimation for our prediction method, SecNet, and any previous software trained and validated on proteins released before Jan 1, 2018.

For the training/validation set, the same resolution, R-factor, length, and sequence-identity criteria were applied; a detailed description is provided in Methods.

Training and validation sets were randomly divided with a 9:1 ratio from (All − Test) as full sequences for 10 cross-validation folds.

Rule #2, which converts (G,I,B,T,S,C) → C, has been used in some programs [53,76,81,96-99]. Several published methods did not clearly define which 3-label rule was used [73,77,78,82]. If we use Rule #1 on the training, validation, and testing data, we achieve a 3-label accuracy of 84.0% on Test2018. The confusion matrices, true positive rates (recalls), positive predictive values (precisions), false negative rates, and false discovery rates for the 8-label and 3-label Rule #1 predictions are presented in Tables 2 and 3 respectively. When SecNet is trained, validated, and tested on data derived using Rule #2, which is easier to predict [28,91], it has a higher accuracy of 86.0%.

We also tested SecNet with the popular CB513 test set derived by Cuff and Barton in 1999 for benchmarking of competing secondary structure prediction software [91] and adopted for accuracy assessment many times [73,77,79-81]. CB513 has several deficiencies. First, disordered residues that have no atoms present in the coordinates are deleted from the sequences and the DSSP strings altogether. This means that sequence profiles and HMMs will be distorted, since proteins in the sequence database will seem to have insertions relative to the query protein even when they do not. Second, it has relatively poor average and maximal resolutions (2.11 Å and 3.5 Å) compared to sets that can be derived from the PDB today (e.g., the Test2018 test set has average and maximal resolutions of 1.83 Å and 2.2 Å). The CB513 data set protocol initially required ≤2.5 Å resolution; however, 5% of its entries have resolution between 2.5 and 3.5 Å. Moreover, 32% of CB513 entries have a free R-factor in the worst 25th percentile for their resolution. Third, CB513 contains sequences as short as 20 amino-acid residues; 5% and 20% of its chains are shorter than 40 and 80 residues respectively. Finally, its secondary structure labels were generated with an obsolete version of DSSP that was available in 1998. Nevertheless, because we eliminated proteins from our training and validation sets with more than 25% sequence identity to any chain in CB513 (Fig 1), our accuracies on CB513 are unbiased estimates.

CB513 accuracies were compiled from recent papers of the fifth-generation, template-free methods. An accuracy taken from an original publication has the "authors" tag. If a study did not utilize the CB513 test set but a third party benchmarked the method, the latter source is reported instead and the sequence identity to CB513 is listed as "Any."

Some "authors" papers did not specify an identity cutoff between their training set and their test sets, including CB513 ("unspec"). The widely used PSIPRED program belongs to the third generation. We recalculated the eCRRNN and DeepCNF accuracies ourselves; our Test2018 set is independent of their training sets (at 25% sequence identity). We applied both methods to the Test2018 test sequences, and determined that our SecNet has higher accuracies than DeepCNF by 3.3-3.4% (Table 6). Our method's accuracies are 2.0-2.2% higher than eCRRNN on Test2018. Next, we ran both of these methods on the original CB513 test set. The DeepCNF accuracies that we calculated on CB513 closely reproduce those of the authors, with slightly higher 8-label and slightly lower 3-label accuracies (Table 5).

When the authors tested DeepCNF on CB513, they excluded 31 shorter entries, and this may account for these small observed differences. In contrast, our recalculated CB513 accuracies for eCRRNN indicated that the authors overestimated their accuracy by 3.0-3.8% (Table 6). We excluded the possibility that we incorrectly executed the eCRRNN software by reproducing their results for the three test sets included with their software. With these findings, we updated the CB513 accuracy for eCRRNN in Table 5 with our calculated values.

Finally, the accuracy of secondary structure prediction methods for 3 labels depends on the helix, sheet, and coil fractions in the test set. If we oversample E or H, we can vary the accuracy from 78% to 89% respectively (Fig 3). As it turns out, our training and test sets and CB513 are very similar to each other in terms of the 3-label fractions, with 37%, 24%, and 39% under Rule #1 and 34%, 23%, and 43% under Rule #2 for helix, sheet, and coil respectively (Fig 3 and Table 3). This should be kept in mind when deriving new test sets.
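The dependence of overall 3-label accuracy on class composition can be illustrated with a short calculation: overall accuracy is the per-class recall weighted by each class's fraction in the test set. The recalls below are hypothetical round numbers, not SecNet's measured values.

```python
def expected_accuracy(fractions, recalls):
    """Overall accuracy = sum over classes of (class fraction * class recall)."""
    assert abs(sum(fractions.values()) - 1.0) < 1e-9
    return sum(fractions[c] * recalls[c] for c in fractions)

# Hypothetical per-class recalls (helix is typically easiest, sheet hardest).
recalls = {"H": 0.90, "E": 0.78, "C": 0.82}

balanced = {"H": 0.37, "E": 0.24, "C": 0.39}    # Rule #1 fractions from the paper
helix_rich = {"H": 0.60, "E": 0.15, "C": 0.25}  # helix-oversampled test set

print(round(expected_accuracy(balanced, recalls), 3))    # 0.84
print(round(expected_accuracy(helix_rich, recalls), 3))  # higher, from extra helix
```

With these toy recalls, enriching the test set in helix raises the apparent accuracy without any change to the predictor, which is exactly why test sets should mirror the PDB's label composition.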

Choices that affect the accuracy of secondary structure prediction

We explored many different ideas and options during the development of SecNet. We can divide these choices into three categories: (1) neural network type, architecture, and complexity;

(2) input features; and (3) how to perform training of the neural network. After optimizing SecNet and using the Test2018 set only once (Table 4), we reversed one choice at a time and measured the effect on the accuracy.

The first category of choices involves the architecture of the neural network and its complexity. For example, secondary structure formation is influenced by short-, middle-, and long-range inter-residue interactions. We were interested to see how the accuracy relates to the size of the sequence "input window" (line 3 in Fig 4) centered on a residue subject to prediction. To investigate this, we varied the input window size and observed increasing accuracy from 69.5 to 71.9% for windows of 9 to 27 (left), the highest accuracy of 72.0% for the optimal window of 29 (middle), and subsequent degradation to 71.9% for windows of 31 to 39 (right). Severe degradation of the accuracy (1.7-2.5%) was observed for 9-13 amino-acid windows, with moderate reduction for intermediate windows.

The second category of choices involves the input features. Accuracy suffers by a hefty 2.7% with a partial NR database ("NR; partial" vs. NR's "complete" in line 7) and by only 0.6% for Uniref90 (line 8). PsiBlast produces profiles both in log-odds form and as amino-acid percentages.
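The windowed input described above can be sketched as follows, assuming one-hot amino-acid channels and zero-padding at the chain termini (SecNet's real input also carries profile features):

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot(seq):
    """L x 20 one-hot encoding; unknown residues ('X') stay all-zero."""
    x = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    for i, aa in enumerate(seq):
        if aa in AA_INDEX:
            x[i, AA_INDEX[aa]] = 1.0
    return x

def windows(features, half_window=14):
    """For each residue, stack the features of its +/-half_window neighbors
    (window size 2*14+1 = 29), zero-padded beyond the chain ends."""
    length, n_feat = features.shape
    padded = np.zeros((length + 2 * half_window, n_feat), dtype=features.dtype)
    padded[half_window:half_window + length] = features
    return np.stack([padded[i:i + 2 * half_window + 1] for i in range(length)])

x = windows(one_hot("MKTAYIAKQR"))
print(x.shape)  # (10, 29, 20): one 29-residue window per position
```

Varying `half_window` reproduces the window-size scan described in the text; 14 gives the optimal 29-residue window.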

The log-odds form ("score") is better than the percentage values by 2.5% (line 9). It is common to use a sigmoid function to transform a feature into a 0 to 1 range. The log-odds score is also better than its sigmoid-transformed values ("sigmoid(score)"), by 0.5%.
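The two encodings compared here can be sketched as below; the PSSM row is a made-up example, not real PSI-BLAST output.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hypothetical PSSM row: PSI-BLAST log-odds scores for the 20 amino acids.
log_odds = np.array([-3, -1, 0, 2, 5, -2, 1, 0, -4, 3,
                     -1, 0, 1, -2, 2, 0, -3, 4, -5, 1], dtype=np.float32)

raw = log_odds               # "score": fed to the network as-is
squashed = sigmoid(log_odds) # "sigmoid(score)": compressed into (0, 1)

# The sigmoid saturates for |score| greater than about 4, flattening the
# distinctions among strongly favored or disfavored residues that the raw
# log-odds scores preserve -- one plausible reason raw scores work better.
print(squashed.min(), squashed.max())
```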

The number of PsiBlast rounds used to generate the profiles has a significant effect (line 6). Profiles obtained from the third, fourth, and fifth rounds contain homologues that are very distant from the target sequence and therefore may have significant differences in secondary structure, which seem to adversely affect the accuracy. The accuracy of neural networks trained on profiles from the third, fourth, or fifth round alone drops by 1.7%, 2.2%, and 2.7% respectively, relative to the best option, which includes the profiles from both the first and second rounds. This best option is higher by 0.5% and 0.7% than using the first- or second-round profiles alone. We observe a steady but modest degradation of accuracy, by 0.0%, 0.2%, or 0.4%, when a combination of rounds 1-3, 1-4, or 1-5 is used. A possible explanation is that the additional rounds decrease the data/parameter ratio. The round 2-3 combination is worse by 0.9% than the round 1-2 combination.
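One plausible way to combine the round-1 and round-2 profiles is channel concatenation, sketched below; the paper does not spell out its exact combination scheme, so treat this as an assumption.

```python
import numpy as np

def combine_rounds(profiles):
    """Concatenate per-round PSSM blocks along the feature axis.

    profiles: list of (L x 20) arrays, one per PSI-BLAST round.
    Returns an L x (20 * n_rounds) feature matrix."""
    return np.concatenate(profiles, axis=1)

L = 120  # toy chain length
round1 = np.random.randn(L, 20).astype(np.float32)  # stand-in round-1 PSSM
round2 = np.random.randn(L, 20).astype(np.float32)  # stand-in round-2 PSSM

# The ablation's best option: rounds 1 and 2 together as 40 input channels.
features = combine_rounds([round1, round2])
print(features.shape)  # (120, 40)
```

Adding rounds 3-5 would simply append further 20-column blocks, which grows the input without adding reliable signal, consistent with the modest degradation reported above.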

The third category of choices involves the training options. A difference in training data set size of 1 versus 2 million amino acids has no effect, but a 10-fold training data set reduction to 100 thousand amino acids pushes accuracy down by 3% (line 10). For regularization, we chose the dropout approach, randomly dropping 30% of the unit activations during each training iteration (line 11). This approach results in 1.6% higher accuracy compared to training without any dropout. The 30% dropout is optimal; the accuracy drops when the dropout rate is either decreased or increased from 30%. The CNN is relatively insensitive to small changes in the learning rate around the optimal value of 0.01.
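The dropout used here can be sketched with the standard inverted-dropout formulation; this is an illustrative NumPy version, not SecNet's actual training code.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.3, training=True):
    """Inverted dropout: zero a random `rate` fraction of activations during
    training and rescale the survivors by 1/(1-rate), so no change is
    needed at test time."""
    if not training:
        return activations
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

h = np.ones((4, 8), dtype=np.float32)  # toy layer activations
out = dropout(h, rate=0.3)             # ~30% zeros, survivors scaled to 1/0.7
print(out)
```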

In order to find a better optimum during training, and especially toward its end, it is critical to reduce the size of the gradient-descent steps by decreasing the current learning rate ("decay"). A formula-based decay of "learning rate / epoch total", which is equal to 10^-5 with the optimal learning rate of 0.01 and 650 epochs over 4 days, is better by 0.6% than no decay (0.0) or a smaller decay of 10^-6; it is also better by 0.6, 2.8, and 6.7% than a larger decay of 10^-4, 10^-3, or 0.01 respectively (line 15). Finally, if we permute the data set and split it into the training and validation sets many times, the validation accuracy becomes overestimated by 2.3% due to easier targets in the favorable random validation set; however, we observe no improvement (0.0%) in the testing accuracy (line 16). Other training options were described earlier (lines 1-2), resulting in a final accuracy of 73.0%.
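The decay formula can be illustrated with the classic time-based schedule lr_t = lr0 / (1 + decay * t); whether the step counter advances per batch or per epoch in SecNet is an assumption here.

```python
def decayed_lr(lr0, decay, step):
    """Time-based decay: steps shrink as training proceeds, which helps
    settle into a better optimum near the end of training."""
    return lr0 / (1.0 + decay * step)

lr0, epochs = 0.01, 650
decay = lr0 / epochs  # the paper's "learning rate / epoch total", ~1.5e-5

for epoch in (0, 100, 649):
    print(epoch, decayed_lr(lr0, decay, epoch))
```

With decay = 0 the rate never shrinks (the "no decay" baseline), while a much larger decay such as 0.01 collapses the rate within a few epochs, matching the large accuracy losses reported for over-aggressive decay.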

We conclude that the best results are achieved with good choices for the input features, sequence databases and arguments, and the training pipeline and regularization techniques. It appears that the NN type, architecture, and complexity, and the associated number of training parameters, are not as critical as long as the NN is not too simple.

Most existing secondary structure prediction programs have been trained and tested on either 8-label or 3-label data sets. In most cases, the 8 DSSP labels are reduced to 3 by treating 3₁₀ helices ("G" in DSSP) and π helices ("I" in DSSP) as H, and single-residue beta strands (beta bridges or "B" in DSSP) as E; the other labels ("C" or coil, "T" or turns, "S" or bends) are reduced to C. We wondered how label sets with more than 3 but fewer than 8 labels would behave and how these might be developed.
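The two common 3-label reductions can be written as simple translation tables: Rule #1 as described above, and Rule #2, which keeps only H and E and sends the other six labels to C.

```python
# Reduce an 8-label DSSP string to 3 labels.
# Rule #1: (H,G,I) -> H, (E,B) -> E, (C,T,S) -> C
RULE1 = str.maketrans("HGIEBCTS", "HHHEECCC")
# Rule #2: (H) -> H, (E) -> E, all six other labels -> C
RULE2 = str.maketrans("HGIEBCTS", "HCCECCCC")

dssp = "CCHHHHGGGTTEEEEBSCC"          # toy 8-label string
print(dssp.translate(RULE1))          # CCHHHHHHHCCEEEEECCC
print(dssp.translate(RULE2))          # CCHHHHCCCCCEEEECCCC
```

As the example shows, the same 8-label prediction yields different 3-label strings under the two rules, which is why the rule in use must be reported alongside any 3-label accuracy.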

To investigate this, we calculated the confusion matrix for 8-label SecNet (Table 2).

The dominant FNR for C is 33.7% in the prevailing GGG subclass (55% of cases), with almost equal FNRs for H and C. We analyzed the TPR and FNR values for these three categories of residues in 3₁₀ helices from the 5-label predictions. Because these residues are structurally very similar to beta turns, retaining the prediction of turns (label T) may enable the sampling of these structures in loop regions. As before, B and S are converted to C, and I is converted to H.

The accuracy for this 4-label practical scheme is 79.5%; the column- and row-normalized confusion matrices for this scheme are provided in Table 9. It achieves very high TPRs of 94%, 79%, and 83% and PPVs of 90%, 75%, and 85% for H, C, and E; it achieves a satisfactory TPR of 50% and PPV of 71% for T. If simpler, practical labels are required, T is the best candidate for replacement with C, both structurally and in terms of having the highest false negative rate (33%) and highest false discovery rate (22%). The final accuracy for this 3-label prediction scheme is 86.0%; its TPR (recall), FNR, PPV (precision), and FDR are given in Table 10.
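A sketch of this practical 4-label reduction (G → H per the paper's final alphabet, B and S → C, T retained) and of its 3-label derivative formed by T → C:

```python
# Practical 4-label alphabet: keep turns (T) as their own class;
# 3-10 and pi helices join H, while B and S join the generic coil C.
FOUR_LABEL = str.maketrans("HGIEBCTS", "HHHECCTC")
# 3-label derivative: turns are absorbed into coil.
THREE_LABEL = str.maketrans("HGIEBCTS", "HHHECCCC")

dssp = "CCGGGTTEEEEBSHHHH"            # toy 8-label string
print(dssp.translate(FOUR_LABEL))     # CCHHHTTEEEECCHHHH
print(dssp.translate(THREE_LABEL))    # CCHHHCCEEEECCHHHH
```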

This last 3-label alphabet is neither the Rule #1 nor the Rule #2 3-label alphabet found in the literature.

It closely resembles the Rule #2 3-label alphabet except for the conversion of I (0.02% of all Set2018 residues); due to the negligible frequency of I, there is no observable difference in the overall accuracy (86.0%) between this rule and Rule #2. Table 11 summarizes the alphabets, their definitions, numbers of labels, and accuracies.

The main results of the ablation study were that window sizes larger than 29 residues did not increase the accuracy, and that PSI-BLAST profiles are more useful than HMMs in predicting secondary structure, even though the information in HMMs is inherently richer. Overlap between training and testing sets may lead to a higher accuracy that would not be sustained on proteins unrelated to the training and testing sets [84,100]. For example, we observed that if our model is trained and tested on data sets with 50% sequence identity instead of 25%, it has 1.3% higher accuracy (top of Fig 4).

This issue has been raised several times by different groups [34,46,90,91]. In this work, we developed a rigorous protocol that relies on two passes of two different sequence alignment programs (HHblits and Clustal-Omega) to enforce the 25% identity threshold within our data sets.
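The identity filter can be illustrated with a toy percent-identity calculation on a pairwise alignment; a real pipeline would score alignments produced by HHblits or Clustal-Omega rather than use this naive count.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns, ignoring gap-gap columns.
    Inputs are two equal-length rows of a pairwise alignment ('-' = gap)."""
    assert len(aligned_a) == len(aligned_b)
    cols = [(a, b) for a, b in zip(aligned_a, aligned_b)
            if not (a == "-" and b == "-")]
    matches = sum(a == b and a != "-" for a, b in cols)
    return 100.0 * matches / len(cols)

# Toy aligned pair: 7 identities over 10 scored columns.
pid = percent_identity("MKT-AYIAKQ", "MKSWAYLAKQ")
print(pid)  # 70.0

# A candidate test chain is kept only if it clears the threshold
# against every previously released chain.
keep_in_test = pid <= 25.0
```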

An unforeseen violation of a sequence-identity threshold may occur when a method relies on predictions from third-party software as input features and the third-party software was trained on protein structures that are more than 25% identical to sequences in the test set of the method using these features. This can be hard to determine, since not all training sets are defined by the authors of feature prediction programs, or there may be a set of nested programs, each trained on its own training set.

To avoid such a situation, we should enforce the pairwise similarity threshold between every new structure published after the third-party software release and those published before it. It is necessary to compare all sequences before and after the required date, regardless of resolution, experimental method, or structure quality. These filters can be applied to the training, validation, and test data later.

Another consideration for which chains to include in the training, validation, and test sets is their secondary structure content. Alpha helices are easier to predict, and a test set enriched with them is biased to have higher accuracy (Fig 3); therefore, data sets should have secondary structure labels representative of the general PDB population. The choice of 3-label reduction rule also matters: Rule #2 converts (H) → H, (E) → E, and (other 6 labels) → C. The 3-label Rule #2 is easier for prediction than Rule #1 by 2.0% (top of Fig 4). Some authors did not report which rule was used [73,77,78,82]; some continue to use the easier Rule #2 and compare their results to previous methods that used the harder Rule #1 [81,103]. Which rule is used should be clearly stated in a publication to avoid ambiguity.

Label sets greater in size than 3 but less than 8 may be useful in some contexts. We demonstrated that the B (beta bridge) and S (bend) labels have very little sequence signal, and since they may be difficult to sample anyway, they can productively be converted to the generic C label.

From each set of identical sequences, we selected a protein chain from a crystal structure having the highest resolution, followed by the best R-factor, and then the PDB chain having the largest number of coordinates.

This procedure produced the Test2018 test set (After2018 after the quality and sequence-identity filters were applied), which contains sequences that are not more than 25% similar to any chain in a PDB entry released before Jan 1, 2018. Similarly, the Set2018 training and validation sets were created from a 9-to-1 random split of the chains in Before2018 after the quality and sequence-identity filters were applied. We note that complete chain sequences are stored in our sets (not just the residues with coordinates). We store the original one-letter sequences using the standard representation, in which a non-standard amino acid receives the single-letter code of its closest analog among the 20 standard amino acids if one exists; otherwise "X" is used.
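The representative-chain selection reduces to a three-key sort: highest resolution first, then best R-factor, then the most residues with coordinates. The dictionary fields below are illustrative, not the paper's actual data schema.

```python
# Pick one representative chain per sequence: best (lowest) resolution,
# then best (lowest) R-factor, then the most residues with coordinates.
chains = [
    {"id": "1abcA", "resolution": 1.9, "r_factor": 0.21, "n_coords": 210},
    {"id": "2xyzB", "resolution": 1.9, "r_factor": 0.18, "n_coords": 200},
    {"id": "3pqrC", "resolution": 2.3, "r_factor": 0.17, "n_coords": 215},
]

best = min(chains,
           key=lambda c: (c["resolution"], c["r_factor"], -c["n_coords"]))
print(best["id"])  # 2xyzB: wins the resolution tie on its better R-factor
```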

Input labels and features

The 8 secondary-structure labels were obtained from the file "ss_dis.txt", which has records for the entire PDB and is available for download from the PDB. This file also contains strings that indicate the ordered and disordered residues in the sequence. The DSSP and disorder strings were combined, with spaces in the DSSP string replaced with "C" when the disorder string indicated an ordered residue ("-") and with "X" when the disorder string indicated a disordered residue ("X").

Training and testing were based on overall accuracy, which includes all residues with assigned secondary structure different from 'X' (disordered residues with missing coordinates).

The accuracy was calculated as the number of correctly predicted labels divided by the number of all assigned labels.
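The DSSP-disorder merge and the accuracy bookkeeping can be sketched as follows; the strings are toy examples, not actual ss_dis.txt records.

```python
def combine_dssp_disorder(dssp, disorder):
    """Merge a DSSP string (spaces = no assigned label) with a disorder
    string ('-' = ordered, 'X' = disordered) into one label string:
    unassigned ordered residues become 'C', disordered ones 'X'."""
    out = []
    for ss, dis in zip(dssp, disorder):
        if ss == " ":
            out.append("X" if dis == "X" else "C")
        else:
            out.append(ss)
    return "".join(out)

def accuracy(predicted, assigned):
    """Correctly predicted labels / all assigned labels; 'X' positions
    (disordered residues with missing coordinates) are excluded."""
    scored = [(p, a) for p, a in zip(predicted, assigned) if a != "X"]
    return sum(p == a for p, a in scored) / len(scored)

labels = combine_dssp_disorder("  HHHH  E ", "X---------")
print(labels)  # XCHHHHCCEC
```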

Benchmarking of competitor software

We ran DeepCNF and eCRRNN with default arguments. An ensemble of 10 models was enabled for eCRRNN. We made sure that we were executing these third-party programs properly by benchmarking them on the data sets included in their respective published studies and reproducing the same results within 0.1% accuracy for each included set.

S1 Supporting information. This file contains supplementary Figure A, Table A, and Table B.

The 4 labels are defined as: (S, B, G not abutting H) → C and (I, G abutting H) → H. This 4-label alphabet has a higher overall accuracy of 79.9% compared to 79.5% for our final 4-label alphabet (Table 9), which has all G → H. The 3-label derivative alphabet formed by T → C from this 4-label alphabet has no accuracy difference when compared to our final 3-label alphabet (Table 10). Thus this 4-label alphabet and its derivative 3-label alphabet were both rejected, despite a slightly higher accuracy (by 0.4%) and the same accuracy respectively, for the sake of the simpler rules in our final alphabets.