Fig 1.
Position-specific scoring matrix of amino acids vs. secondary structure codes.
Table 1.
Accuracy of the SSE-PSSM and several state-of-the-art SSP methods.
Table 2.
Accuracy of several SSP methods for proteins of different structural classes.
Table 3.
Accuracy of several SSP methods for proteins of different sizes.
Fig 2.
Performance of preliminary incorporation of SSE-PSSM into state-of-the-art SSP methods using PSI-BLAST to generate PSSM.
(A) QuerySet-I against TargetSet-nr25, the developmental PSSM target dataset of this study. (B) TS115 against UniRef90-2015, the standard PSSM target dataset used in most SSP works. (C) CASP12 against UniRef90-2015. (D) CASP13 against UniRef90-2015. The SSE-PSSM was preliminarily incorporated into different SSP methods using a second-level machine learning feature set integration strategy. After feature integration, the accuracy of most methods improved significantly, especially in Q8 and SOV8. The fundamental prediction feature set of all the tested SSP methods is the amino acid PSSM generated by PSI-BLAST.
Fig 3.
Performance of preliminary incorporation of SSE-PSSM into state-of-the-art SSP methods using HHBlits to generate PSSM.
(A) TS115 against the UniRef90-2015 PSSM target dataset. (C) CASP12 against UniRef90-2015. (D) CASP13 against UniRef90-2015. These methods used HHBlits as the main PSSM generator. Except for NetSurfP-2, they also applied the PSI-BLAST PSSM in their algorithms. We used both HHBlits and PSI-BLAST to implement the SSE-PSSM and preliminarily integrated it with these HHBlits-based SSP methods. Their accuracies were higher than those of the algorithms tested in Fig 2. However, since their programs were released after 2017, whereas the TS115 and CASP12 proteins were released before 2017, they might have learned some homologs of these datasets. Thus, CASP13, which comprises proteins released between 2017 and 2019, should be the most reliable independent test dataset among the three. Assessed with CASP13, the preliminary feature integration of SSE-PSSM into these HHBlits-based methods improved Q3 and Q8 by 1–3% and 2–6%, respectively.
Table 4.
Results of performance pretest for several state-of-the-art SSP methods.
Fig 4.
Flowchart of the SSE-PSSM algorithm.
(A) The core procedure of the algorithm. (B) Determination of the SSE sequence of a hit. A sequence similarity search is first performed to retrieve a hit list for the query sequence. Next, the SSE sequence of each hit protein is either obtained directly from its known structure or synthesized by position-specific voting among the homologs (Hom) of the hit, which are retrieved by a second-round sequence similarity search against a reference protein structure dataset. The sequence alignments between the query and the hits are then SSE-transformed by replacing the amino acids with SSE codes. Finally, the PSSM is generated from the transformed alignments.
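The steps summarized in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`vote_sse`, `transform_alignment`, `build_sse_pssm`), the three-state alphabet `HEC`, and the scoring scheme (log-odds against a uniform background with a fixed pseudocount) are all assumptions made for illustration. A PSSM produced by PSI-BLAST or HHBlits additionally involves sequence weighting and other refinements that are omitted here.

```python
import math
from collections import Counter

SSE_CODES = "HEC"  # assumed 3-state alphabet (helix, strand, coil)
GAP = "-"


def vote_sse(homolog_sses):
    """Synthesize an SSE string by position-specific majority vote.

    homolog_sses: equal-length SSE strings of the hit's homologs, already
    aligned to the hit (hypothetical input format); '-' marks no coverage.
    """
    voted = []
    for column in zip(*homolog_sses):
        counts = Counter(c for c in column if c != GAP)
        voted.append(counts.most_common(1)[0][0] if counts else GAP)
    return "".join(voted)


def transform_alignment(aligned_query, aligned_hit, hit_sse):
    """Replace the hit's amino acids with SSE codes, in query coordinates.

    aligned_query / aligned_hit: gapped strings of equal length.
    hit_sse: ungapped SSE string for the aligned segment of the hit.
    """
    row = []
    i = 0  # index into hit_sse
    for q, h in zip(aligned_query, aligned_hit):
        code = GAP
        if h != GAP:
            code = hit_sse[i]
            i += 1
        if q != GAP:  # keep only columns that correspond to a query residue
            row.append(code)
    return "".join(row)


def build_sse_pssm(query_len, transformed_rows, pseudocount=1.0):
    """Build a per-position log-odds matrix over SSE codes.

    Assumption: uniform background frequency and additive pseudocounts.
    """
    background = 1.0 / len(SSE_CODES)
    pssm = []
    for pos in range(query_len):
        counts = Counter(r[pos] for r in transformed_rows if r[pos] != GAP)
        total = sum(counts.values()) + pseudocount * len(SSE_CODES)
        pssm.append({
            c: math.log2(((counts[c] + pseudocount) / total) / background)
            for c in SSE_CODES
        })
    return pssm


# Toy example: a query "ACDE" with two hits.
hit1_sse = vote_sse(["HHEC", "HEEC", "HHE-"])       # SSE synthesized by voting
row1 = transform_alignment("ACDE", "ACDE", hit1_sse)
row2 = transform_alignment("AC-DE", "A-GDE", "HHEE")  # SSE from a known structure
pssm = build_sse_pssm(4, [row1, row2])
```

In this toy case both hits assign a helix code to the first query position, so `pssm[0]["H"]` is positive while the scores for the other codes are negative. Positions not covered by any hit fall back to a score of zero for every code under the assumed uniform background.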