Figure 1.
Different sequence-based protein representations.
The different shades of gray denote predicted buried (B) and exposed (E) regions in the case of solvent accessibility, and predicted helix (H), strand (E), and random coil (C) regions in the case of secondary structure.
Table 1.
Predefined amino acid clusters.
Table 2.
Prediction performance scores (AUROC).
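As a reminder of what the AUROC score in Table 2 measures, the following is a minimal sketch that computes it directly from its pairwise definition: the probability that a randomly chosen positive example receives a higher classifier score than a randomly chosen negative one, with ties counted as one half. The function name `auroc` and the example labels and scores are hypothetical and are not taken from the study.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 class labels (1 = positive class).
    scores: iterable of classifier scores, higher = more positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical example: three of the four positive/negative pairs are ordered correctly.
example_score = auroc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
print(example_score)
```

The quadratic pairwise loop is chosen for readability; a rank-based formulation gives the same value more efficiently on large test sets.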
Figure 2.
ROC-curves of composition-based classifiers using the codon sequence (), the signal peptide sequence (
), and the protein sequence (
). Performances are shown for classifiers A) trained and tested on
, B) trained and tested on
, and C) trained on
and tested on
.
Figure 3.
Comparing hom and het classifiers.
Amino acid contributions obtained from and
trained classifiers are the
- and
-values respectively, the correlation is denoted by
. Contributions are normalized per classifier (axis): each contribution is divided by the maximum absolute contribution. The plots show the contributions obtained from classifiers trained using A) the protein amino acid composition (
) and B) the predefined amino acid cluster composition (
).
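The per-classifier normalization described in the Figure 3 caption (each contribution divided by the maximum absolute contribution) can be sketched as follows. The function name `normalize_contributions` and the example values are hypothetical; only the normalization rule itself comes from the caption.

```python
def normalize_contributions(contributions):
    """Scale feature contributions so the largest absolute value becomes 1.

    contributions: dict mapping a feature name (e.g. an amino acid letter)
    to its signed contribution in one classifier.
    """
    max_abs = max(abs(v) for v in contributions.values())
    if max_abs == 0:
        # All contributions are zero; nothing to scale.
        return {name: 0.0 for name in contributions}
    return {name: v / max_abs for name, v in contributions.items()}


# Hypothetical contributions for three amino acids in one classifier.
raw = {"A": 0.5, "K": -2.0, "L": 1.0}
normalized = normalize_contributions(raw)
print(normalized)
```

After this scaling every value lies in the interval [-1, 1] and the sign of each contribution is preserved, so contributions from two classifiers can be plotted on comparable axes.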
Figure 4.
Best performing amino acid clusters.
The heat maps show the combined result of the best performing clusters obtained in 10 CV-loops for both (A) and
(B). The values on the diagonal denote how often an amino acid ended up in a cluster (because only the optimal clusters are selected, an amino acid might not be selected at all). The colors of the off-diagonal cells denote how often two amino acids ended up in the same cluster. The heat map was clustered using complete-linkage hierarchical clustering with the Euclidean distance as the distance measure. The color of each amino acid letter indicates whether the amino acid has a positive (green) or negative (red) contribution in Figure 3A.
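The counts behind such a heat map can be sketched as below: the diagonal records how often an amino acid appears in any selected cluster, and the off-diagonal cells record how often two amino acids share a cluster, summed over the CV loops. This covers only the counting step, not the subsequent hierarchical clustering of rows and columns. The function name `cooccurrence_counts`, the four-letter alphabet, and the example clusters are hypothetical, and the sketch assumes that an amino acid occurs in at most one selected cluster per CV loop.

```python
from itertools import combinations


def cooccurrence_counts(cv_clusterings, alphabet):
    """Build a symmetric co-occurrence count matrix.

    cv_clusterings: one entry per CV loop, each a list of clusters,
                    each cluster a collection of amino acid letters.
    alphabet: ordered amino acid letters defining rows and columns.
    """
    index = {aa: i for i, aa in enumerate(alphabet)}
    n = len(alphabet)
    counts = [[0] * n for _ in range(n)]

    for clusters in cv_clusterings:
        for cluster in clusters:
            members = sorted(set(cluster))
            # Diagonal: the amino acid was selected in some cluster.
            for aa in members:
                counts[index[aa]][index[aa]] += 1
            # Off-diagonal: both amino acids are in the same cluster.
            for a, b in combinations(members, 2):
                counts[index[a]][index[b]] += 1
                counts[index[b]][index[a]] += 1
    return counts


# Hypothetical result of two CV loops over a reduced alphabet.
alphabet = ["R", "K", "H", "D"]
loops = [[["R", "K"]], [["R", "K", "H"]]]
matrix = cooccurrence_counts(loops, alphabet)
for row in matrix:
    print(row)
```

In this toy example D is never selected, so its row and column stay at zero, which corresponds to the case mentioned in the caption where an amino acid is not selected at all.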
Figure 5.
For the first three feature selection iterations (-axis), the bar plot shows how often features were selected in the 10 CV-loops for both
(A) and
(B). Features with a different shade of the same color are correlated (
). The letters in brackets in the legend are amino acids: they denote either which amino acids are in the cluster, e.g. the basic cluster contains amino acids R, K, and H, or which amino acid a codon encodes, e.g. codon TAC encodes Y.