Protein model accuracy estimation based on local structure quality assessment using 3D convolutional neural network

In protein tertiary structure prediction, model quality assessment programs (MQAPs) are often used to select the final structural models from a pool of candidate models generated by multiple templates and prediction methods. The 3-dimensional convolutional neural network (3DCNN) is an expansion of the 2DCNN and has been applied in several fields, including object recognition. The 3DCNN is also used for MQA tasks, but the performance is low due to several technical limitations related to protein tertiary structures, such as orientation alignment. We proposed a novel single-model MQA method based on local structure quality evaluation using a deep neural network containing 3DCNN layers. The proposed method first assesses the quality of local structures for each residue and then evaluates the quality of whole structures by integrating estimated local qualities. We analyzed the model using the CASP11, CASP12, and 3D-Robot datasets and compared the performance of the model with that of the previous 3DCNN method based on whole protein structures. The proposed method showed a significant improvement compared to the previous 3DCNN method for multiple evaluation measures. We also compared the proposed method to other state-of-the-art methods. Our method showed better performance than the previous 3DCNN-based method and comparable accuracy as the current best single-model methods; particularly, in CASP11 stage2, our method showed a Pearson coefficient of 0.486, which was better than those of the best single-model methods (0.366–0.405). A standalone version of the proposed method and data files are available at https://github.com/ishidalab-titech/3DCNN_MQA.


Introduction
The three-dimensional (3D) structure of a protein is related to its function and is important for life science applications such as drug discovery; however, experimentally determining three-dimensional protein structures is costly and time-consuming. Thus, many computational methods for predicting protein 3D structures from amino acid sequences have been developed [1][2][3][4]. Current prediction schemes often output multiple structure models because PLOS  Additionally, even if enough rotations and translations were applied, the redundant dataset generated may cause over-fitting during the training processes. To solve this problem, Pagès et al. proposed a residue-wise scoring function (Ornate) that uses 3D density maps as input which corresponds to each residue and its neighboring residues with the backbone topology of the residue [24]. This approach succeeded to avoid the problem of ambiguous orientations of the initial models. However, the method proposed by Pagès et al. uses complex inputs and network topology for the neural network, and thus the performance of the method was lower than that of state-of-the-art single MQA methods.
In this study, we developed a novel MQA method based on a residue-wise assessment method for evaluating the local structure of each residue using 3DCNN. The proposed method sets a small bounding box for each residue, and thus the orientations of the boxes can be determined using main chain coordinates. We used simpler atom categories and network topologies that could be easily trained. We applied the proposed method and existing methods to the benchmarking datasets CASP11, CASP12, and 3DRobot [25]. The proposed method showed significantly better accuracy than the 3DCNN-based method developed by Derevyanko. Additionally, the proposed method exhibited the best accuracy compared to other state-of-the-art single-model methods.

Materials and methods
In contrast to the previous method developed by Derevyanko et al. [23] which uses a single large bounding box for the whole protein structure, our proposed method is based on residuewise 3DCNN, which evaluates the local structure of a residue using 3DCNN. This method assumes that the local structure quality implies global quality. The workflow of the proposed method is shown in Fig 1. The procedure is separated into three steps: (1) residue-wise lowlevel featurization, (2) 3DCNN-based local structure assessment, and (3) integration of residue-wise local results.

Residue-wise low-level featurization
To extract input data for a neural network from a protein structure, we first set a 3D grid bounding box centered by the C-alpha atom (CA) of a residue. One side of the box was 28 Å and the box was divided into 1-Å voxels. To determine the orientation of the box, the orthonormal basis calculated from the C-CA vector and N-CA vector and cross-product of the C-CA and N-CA vector was used as the axis of the bounding box according to similar definitions used in a related study [26]. Fig 2(b) shows how the orientation was determined. Atoms within a voxel were checked and the features were assigned to the voxel. Atom features were placed into 14 categories based on the atom type as shown in Table 1. The 11 categories were used according to a previous study of previous 3DCNN study based on whole protein structures [23]. We added 3 categories (CA atom, backbone chain atom, any atom). Each category feature was assigned to an independent channel of a neural network (Fig 2(c)).

3DCNN-based local structure assessment
In this step, we evaluated the local structure of a residue based on the voxel information generated in the previous step by supervised machine learning. To train the supervised machine learning model, a label indicating the local structure quality of each residue was required. We used lDDT as a label to describe the local structure quality [27]. To overcome the binary classification problem, the label was defined using the following formula: ( To predict local structure quality, we used a deep neural network including 3DCNN layers. A 3DCNN is an expansion of a 2DCNN, which is often used in image recognition. 3DCNN is used for object recognition [20] and can effectively extract features from 3D structured data, as conducted in feature learning of 2DCNN for image recognition. S1 Fig shows the neural network architecture. We designed the architecture based on previous 3DCNN research [26]. The last 3DCNN layer was connected to the global average pooling layer [28]. After each layer, PReLU [29] was used as the activation function other than the output layer. The batch normalization layer [30] was added before each activation function. The prediction problem is a binary classification, and thus sigmoid cross entropy was used as a loss function.

Integration of local results
A neural network for local structure assessment returns an estimated quality value for each residue. Thus, we integrated the local scores into a global score to evaluate the quality of a whole protein structure. It is difficult to use machine-learning methods to integrate these scores because the number of local scores is not fixed. Thus, the global score was simply calculated as the mean value of the local scores.

Dataset
To train the local structure assessment models, native structures and non-native decoy structures were collected from the targets and prediction results of CASP experiments [16]. We used the CASP 7-10 datasets obtained from the CASP homepage (http://predictioncenter.org/ download_area/) for training. Decoy structures were 10% randomly sampled for each target protein in the training datasets (26.6 models per protein were used). In total, the training dataset contained 11,582 protein structure models for 435 proteins. The training dataset included 968,869 positive data and 958,780 negative data. We used scwrl4 [31] to optimize side-chain conformations, as used in a previous 3DCNN study based on whole protein structures.
For the test datasets, we used the CASP11, 12 datasets and 3DRobot decoy sets [25], which were used in a previous study by Derevyanko et al. [23]. The test datasets from CASP include stage1 and 2 decoys; stage1 uses up to 20 selected predictions spanning the whole range of model accuracy and stage2 uses the best 150 server predictions according to the ranking from the DAVIS-EMAconsensus method [32]. Additionally, Targets T0797, T0798, and T0825 in CASP11 were removed from the benchmark because they were released for multimeric prediction. Similarly, we used scwrl4 for the test datasets. Table 2 shows additional details of the datasets. GDT_TS in the Results section was calculated by using TMscore [33]. When using TMscore, a target structure and model structure must be specified and a different value is returned if the model structure is considered as a target structure and target structure is considered as a model structure. In the Results section, GDT_TS was calculated using a target structure and model structure in an inverted manner according to Derevyanko et al. [23]. The results for non-inversed GDT_TS are shown in S1 Table.

Evaluation
We used the correlation between the predicted quality scores and GDT_TS values of models as evaluation measures. We used Pearson's correlation coefficient and Spearman's correlation as the test datasets. A test dataset contained many target proteins, and the correlations were calculated for each target. Thus, we used the average of these values. We also evaluated the nearnative selection performance of the method using two measures. We determined the difference value between the GDT_TS of a selected model by each assessment method and that of the best GDT_TS model (GDT_TS loss). We also used the score-based rank of the best GDT_TS model (best model rank).

Neural network training for local structure assessment and performance evaluation
We first trained a deep neural network including 3DCNN layers to assess the local structure quality of a residue. Thus, this analysis is based on the binary classification problem for each residue. We split the training set into a "neural network training set" and "validation set" with a split rate of 80% in target protein level. The validation dataset was used to determine the hyperparameters of a network. To evaluate the accuracy of the networks, we constructed a receiver operator characteristics (ROC) curve [34]. The area under the ROC curve (ROC_AUC) of the validation set was used to determine the stopping epoch in the training. In this study, one epoch was determined as the end of once training with whole the training data.
We used SMORMS3 at a learning rate of 0.001 [35], which is the default value, for optimization. The loss and AUC value during training are shown in S2

Model quality assessment performance evaluation
The previous section described that the proposed 3DCNN-based model achieved high accuracy for local structure quality assessment. However, to estimate the performance of the proposed method for assessing the quality of a whole protein structure model, other evaluation experiments should be performed. To determine the performance of the proposed method, we performed two performance evaluations. The first involved comparison to a 3DCNN method based on whole protein structures [23]. We also performed another performance evaluation to compare the performance of the proposed method with state-of-the-art single MAQ methods because the method described by Derevyanko et al. does not currently give the best results [23].

Performance comparison with a previous 3DCNN method based on whole protein structures
We evaluated the model performance using the CASP11, CASP12, and 3DRobot decoy sets and compared the results with those from the 3DCNN method developed by Dereveyanko et al. [23]. We compared the performance of the proposed method with the values obtained from the previous 3DCNN method based on whole protein structures in the article [23]. Table 3 shows the results of the evaluation tests. The values of the previous 3DCNN method were obtained from the article. The results showed that proposed method achieved better performance than the previous 3DCNN method for all measures. To confirm the significance of the improvement, we performed a Wilcoxon signed-rank test. The values in parenthesis under the values obtained using the proposed method are the p-values determined in statistical analysis. The score of the previous method for CASP12 was not available, and thus we did not perform the test in this case. The improvement was significant for all datasets. The results indicate that the proposed method is superior to the 3DCNN method based on whole protein structures in the MQA task.

Comparison to state-of-the-art methods
We also performed another performance evaluation to compare the performance of the proposed method with the best-performing single-model QA methods according to CASP11,12 assessment: SVMQA [6], ProQ2 [8], ProQ2-refine [8], ProQ3 [12], RFMQA [7], VoroMQA [9], MULTICOM-CLUSTER [10], and MULTICOM-NOVEL [10]. Additionally, we also conducted comparison with Ornate [24], which is a recent single-model QA method that uses residue-wise 3DCNN. For this evaluation, we used the dataset used in CASP official assessments. This dataset was slightly different from the dataset used in the previous section and included GDT_TS labels. The details of the dataset are shown in S2 Table. The CASP11 results of ProQ2, ProQ2-refine, RFMQA, VoroMQA, MULTICOM-CLUSTER, and MULTICOM-NO-VEL were extracted through blind prediction of CASP11. The CASP12 results of ProQ2, SVMQA, ProQ3, VoroMQA, and MULTICOM-CLUSTER were similarly extracted through blind prediction of CASP12. Only the results of Ornate were extracted from the previous article [24]. CASP11,12 stage2 results are shown in Tables 4 and 5 and CASP11,12 stage 1 results are shown in S3 and S4 Tables. The score of Ornate for each target was not available, and thus we did not perform statistical analysis in this case. The proposed method achieved better or comparable accuracy, particularly in CASP11 stage2, and the proposed method outperformed the other methods evaluated.

Influences of homologues between training and test datasets
We used similar training and test sets as used in previous studies. However, the test dataset included several homologues proteins to those in the training dataset. To evaluate the influence of homologues, we removed proteins with sequence similarity to those in the training set from the test set. We used NCBI BLASTP [36] and an e-value threshold of >1e-4 to identify the homologues. There were 8 homologues in the CASP11 dataset and 6 homologues in the CASP12 dataset. The detailed information is shown in S5 Table. S6-S9 Tables show the accuracy of model quality assessment without homology proteins for each test dataset. The accuracies of the proposed method for the non-homologue dataset were nearly the same as those in the Results section (for instance, Pearson's correlations for CASP11 stage2 dataset were 0.486 The legend is the same as that for the columns 2-6 in Table 3. https://doi.org/10.1371/journal.pone.0221347.t004 and 0.483, respectively). Additionally, the improvement compared to other state-of-the-art methods did not change. The information of homologues proteins is often useful for this application. However, we disintegrate the problem to the residue-level, and thus the influence was not critical.

Evaluation on non-CASP datasets
In the Results section, we mainly used datasets from the CASP experiments. The datasets were major in this field [23,24]. However CASP datasets were constructed for a competition so that the targets were not systematically selected. Thus, they are not perfectly non-redundant and do not cover whole protein structure space. Thus, we also evaluated non-CASP datasets: 3DRobot decoy set [25] and I-TASSER decoy set II [37]. The native structures were removed from all datasets. Protein sidechain structures were optimized by using Scwrl4 and the ground truth label was TMscore calculated by using TMscore software. SVMQA [6], RWplus [37], GOAP [38], and OPUS-PSP [39] were compared to the proposed method. Accuracies were extracted from the article [6]. The results are shown in S10 and S11 Tables. For the 3DRobot dataset, the proposed method showed comparable accuracies to SVMQA and outperformed the other methods. For the I-TASSER dataset, the proposed method also showed better accuracy than the other methods except for SVMQA but the accuracy of SVMQA was better in all measures. Our data did not reveal why SVMQA showed better accuracy with the I-TASSER dataset compared to that for the other test sets. In this comparison, only SVMQA used high-level information such as evolutionally information. Thus, such information may be effective for the I-TASSER dataset.

Performance of local structure assessment
In protein structure model quality assessment, local structure assessment, which evaluates the quality of a structure model in residue-level, is also important because a user can recognize which substructure needs to be improved. Although proposed method is for assessing the quality of a global structure, it outputs a score for each residue in the evaluation process. Thus, we also evaluated the accuracy of local structure assessment of proposed method based on perresidue error estimation. To evaluate the performance of local structure assessment, we used CASP12 stage2 dataset. In the dataset, a model structure and a native structure were superimposed by local-global alignment (LGA) [40] and the distance between a model structure and a native structure for a residue can be calculated. According to CASP assessment [16], we evaluated proposed method using two metrics. One is Pearson correlation coefficient between distances and predicted scores. The other is the ROC-AUC by considering the problem as binary classification. If a distance is smaller than 3.8Å, the prediction of a residue is considered as correct. S12 Table shows the result of local assessment evaluation. We compared the accuracy of proposed method with other single model assessment methods. We used only methods which can predict residue-wise quality. Proposed method showed comparable performance with the other methods in AUC. In contrast, the performance by Pearson correlation coefficient was the worst. This result seems to be reasonable because proposed method was trained as a binary classification, and thus it is difficult to estimate the quality of a local structure quantitatively.
To improve local structure assessment accuracy by Pearson correlation coefficient, we might change the problem from binary classification to regression. However, the training of a regression model is often more difficult to a classification model, and we considered it caused the decrease of global structure assessment performance.

Performance difference in core and surface residues
We investigated the local assessment accuracy of proposed method by dividing residues into core residues and surface residues. Residues on the protein surface often have a small number of contacting residues in the bounding box and insufficient information may decrease the accuracy of assessment. The class of a residue was defined by its relative solvent accessibility area (RSA). If the RSA was less than 25%, the residue was categorized into the core. The ROC-AUC was used to determine local assessment accuracy. The CASP11 stage2 datasets and their RSAs were calculated by using FreeSASA [41]. As a result, local assessment accuracy for the core residues (0.918) was superior to that for the surface residues (0.887). These results support the assumption that core residues are more important and indicated that surface residues may decrease assessment performance. Thus, we compared the quality assessment accuracy of the whole model between the proposed method and method only using core residues for assessment (S13 Table). The method only using core residue assessments showed decreased accuracy, indicating that assessment based on surface residues is more difficult but still useful and needed for better assessment. However, improvements can be made by using more sophisticated integration methods rather than the simple mean value of local assessments.

Conclusion
We proposed a novel model quality assessment method for protein tertiary structure prediction based on machine learning. The method evaluates the local structure quality of each residue using a deep neural network including 3DCNN layers and assesses the quality of the whole structure through integration. Evaluation tests with multiple datasets revealed that the proposed method achieved better accuracy than the previous 3DCNN method, which evaluates whole protein structures within a single large box. Compared to other state-of-the-art single-model methods, the proposed method showed comparable performance. Particularly, for the CASP11 stage2 dataset, the proposed method significantly outperformed the other methods.
Additional studies are needed to extend the training set. In this study, we used a relatively small dataset containing 435 proteins, but the Protein Data Bank contained more than 140,000 protein structures as of 2018 [42]. Thus, accuracy improvement can be achieved by generating more training sets. Additionally, we used a simple average to integrate the local assessment results because the size of the results was not fixed. However, current neural network techniques can deal with such data and may improve the accuracy of the method. Our method does not use high-level features used in other methods. Thus, using high-level features such as evolutionally information may improve the accuracy.
Supporting information S1 Table. Comparison with previous 3D-CNN method with different labeling. The legend is the same as that for columns 2-6 in Table 3. GDT_TS was calculated using TMscore with the non-invers native and model structure. Derevyanko+2018 result of CASP12 is not available.  Table. Comparison with single-model methods in CASP11 stage1. The legend is the same as that for Table 4 for the first five columns. (DOCX) S4 Table. Comparison with single-model methods in CASP12 stage1. The legend is the same as that for in Table 4 for the first five columns. (DOCX) S5 Table. Detailed information on homologous proteins in the test dataset. The first column represents the test dataset protein ID. The second and third columns, respectively, show the closest protein ID in train dataset and E-value. (DOCX) S6 Table. Comparison with single-model methods in CASP11 stage1 without homologous proteins. The legend is the same as that for Table 4 for the first five columns. (DOCX) S7 Table. Comparison with single-model methods in CASP11 stage2 without homologous proteins. The legend is the same as that for Table 4 for the first five columns. (DOCX) S8 Table. Comparison with single-model methods in CASP12 stage1 without homologous proteins. The legend is the same as that for Table 4 for the first five columns. (DOCX) S9 Table. Comparison with single-model methods in CASP12 stage2 without homologous proteins. The legend is the same as that for Table 4 for the first five columns. (DOCX) S10 Table. Comparison with single-model methods in I-TASSER. The first column represents the method name. The second and third columns, respectively, represent the average Pearson's correlation coefficient (Pearson) and average Spearman's correlation (Spearman) between the actual ranking and predicted ranking. The fourth column represents the average TMscore loss. Native structures were removed. (DOCX) S11 Table. Comparison with single-model methods in 3DRobot. The legend is the same as that for S11 Table for the first four columns. (DOCX) S12 Table. Local assessment performance comparison with other methods in CASP12 stage2. First column represents method name. Second and third columns represent AUC and Pearson value of local assessment. (DOCX) S13 Table. Comparison with the method using only core residues local assessment in CASP11 stage2. The legend is the same as that for Table 3