Table 1.
Division of amino acids into three groups according to different physicochemical properties.
Figure 1.
Extraction process of the 188-dimensional (188D) feature vectors (FV).
Input sequences are processed by analyzing amino acid composition, distribution, and protein physicochemical properties; FV1–FV188 are output as the feature vector.
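The extraction in Figure 1 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes one decomposition that yields 188 dimensions: 20 amino acid composition values plus 21 values (3 composition, 3 transition, 15 distribution) for each of eight physicochemical properties. The names `composition_20`, `ctd_features`, and `feature_vector` are hypothetical, and the hydrophobicity grouping shown is only an illustrative stand-in for the groups defined in Table 1.

```python
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative three-group split for one property (an assumption);
# the actual groups are those defined in Table 1 of the paper.
HYDROPHOBICITY_GROUPS = ("RKEDQN", "GASTPHY", "CLVIMFW")


def composition_20(seq):
    """Fraction of each of the 20 standard amino acids in a non-empty sequence."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def ctd_features(seq, groups):
    """21 values for one property: 3 composition, 3 transition, 15 distribution."""
    # Map each residue to its group index (0, 1, or 2); unknown residues are skipped.
    lookup = {aa: g for g, members in enumerate(groups) for aa in members}
    encoded = [lookup[aa] for aa in seq if aa in lookup]
    n = len(encoded)

    # Composition: fraction of residues that fall in each group.
    comp = [encoded.count(g) / n for g in range(3)]

    # Transition: fraction of adjacent pairs that switch between two given groups.
    pairs = list(zip(encoded, encoded[1:]))
    trans = [
        sum(1 for x, y in pairs if {x, y} == {g, h}) / max(len(pairs), 1)
        for g, h in combinations(range(3), 2)
    ]

    # Distribution: relative position of the first, 25%, 50%, 75%, and last
    # residue of each group along the sequence.
    dist = []
    for g in range(3):
        positions = [i + 1 for i, x in enumerate(encoded) if x == g]
        if not positions:
            dist.extend([0.0] * 5)
            continue
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            k = max(int(frac * len(positions)), 1) - 1
            dist.append(positions[k] / n)
    return comp + trans + dist


def feature_vector(seq, property_groups):
    """Concatenate the 20 composition values with 21 values per property."""
    fv = composition_20(seq)
    for groups in property_groups:
        fv.extend(ctd_features(seq, groups))
    return fv
```

Calling `feature_vector(seq, property_groups)` with eight groupings, one per property, returns a list of 20 + 8 × 21 = 188 values; each property needs its own three-group split.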
Figure 2.
The architecture of our ensemble classifier.
The training dataset is classified by all base classifiers. After K-Means clustering and circulating combination, the best ensemble result is obtained.
Table 2.
Algorithm 1. Circulating Combination of EFSS.
Figure 3.
Protein structure levels in SCOP.
The classification of protein classes and the classification of protein folds are the first and second layers, respectively, of our hierarchical classification framework.
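The two-layer arrangement can be illustrated with a short sketch, given here only as an assumption about how such a hierarchy is typically wired: a first model predicts the structural class, and a second model, chosen by that class, predicts the fold. The function `hierarchical_predict` and the toy models are hypothetical placeholders, not the classifiers used in the paper.

```python
def hierarchical_predict(features, class_model, fold_models):
    """Layer 1 predicts the class; layer 2 predicts a fold within that class."""
    predicted_class = class_model(features)
    # Only the fold model trained for the predicted class is consulted.
    fold_model = fold_models[predicted_class]
    return predicted_class, fold_model(features)


# Toy stand-ins for trained models (hypothetical thresholds and labels).
def toy_class_model(features):
    return "a" if features[0] > 0.5 else "b"


toy_fold_models = {
    "a": lambda features: 1 if features[1] > 0.5 else 2,
    "b": lambda features: 1,
}

prediction = hierarchical_predict([0.9, 0.2], toy_class_model, toy_fold_models)
```

One consequence of this design is that a mistake in the first layer cannot be corrected by the second, so first-layer accuracy bounds the overall fold accuracy.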
Figure 4.
Comparison of the two datasets.
In each query description, the first letter represents the class name and the digit that follows represents the fold number. The SCOP dataset used in this paper is shown in red. Its seven classes, which together contain 1195 folds (omitted from the figure), are: (a) all-α proteins (284 folds), (b) all-β proteins (174 folds), (c) α/β proteins (147 folds), (d) α+β proteins (376 folds), (e) multi-domain proteins (66 folds), (f) membrane and cell surface proteins and peptides (58 folds), and (g) small proteins (90 folds). The benchmark dataset [3] proposed by Ding and Dubchak, composed of 27 folds extracted from SCOP, is shown in blue.
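As a quick consistency check on the caption, the per-class fold counts can be tallied and a label split into its class letter and fold number. The sketch below assumes labels written either as `a1` or in the SCOP style `a.1`; the exact label format in the figure is not stated here, and `parse_fold_label` is a hypothetical helper.

```python
# Fold counts per class, as listed in the caption of Figure 4.
FOLDS_PER_CLASS = {
    "a": 284,  # all-alpha proteins
    "b": 174,  # all-beta proteins
    "c": 147,  # alpha/beta proteins
    "d": 376,  # alpha+beta proteins
    "e": 66,   # multi-domain proteins
    "f": 58,   # membrane and cell surface proteins and peptides
    "g": 90,   # small proteins
}


def parse_fold_label(label):
    """Split a label such as 'a1' or 'a.1' into (class letter, fold number)."""
    class_letter = label[0]
    fold_number = int(label[1:].lstrip("."))
    return class_letter, fold_number


total_folds = sum(FOLDS_PER_CLASS.values())
```

The seven counts add up to the 1195 folds quoted in the caption.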
Figure 5.
Comparison of success rate among several studies.
Our work outperforms all previous works with an accuracy of 74.21%.
Figure 6.
Success rate achieved by three classifiers with different sequence identity.
The two graphs show the results on two datasets: (a) SCOP version 1.75 and (b) SCOP version 1.75A. Their similar success rates demonstrate the robustness of our model. As the sequence identity cutoff increases, the filtering becomes less stringent and the success rate rises. The graphs also show that our ensemble classifier outperforms the other two classifiers.
Table 3.
Performance of different classifiers on protein fold recognition (one sequence in each family).
Table 4.
Performance of different classifiers on protein fold recognition (sequences at 35% identity).
Table 5.
Preliminary results* of PCA analysis.
Table 6.
Loadings of the most informative features* on principal component factors.
Figure 7.
Success rate of seven subsets with different sequence identities.
The figure shows the factors that influence the success rate. The success rate tends to increase as sequence identity rises or as the number of classes drops.
Table 7.
Influential factors for success rate of 1st and 2nd hierarchical layers.