Figure 1.
A method for prediction of thiol oxidoreductases in structure databases.
(A) Overview of the complete method for prediction of thiol oxidoreductases. Red rectangles correspond to the Active Site Similarity portion of the method, and blue rectangles to the Cys Reactivity portion, each composed of several steps as discussed in the text. Each step was carried out independently, and the results were combined into the final scoring function (SF), which ranged from 0 to 6.0. For each step, the range of values in the final SF is reported in brackets, reflecting the different weights of the components. (B) The Active Site Similarity method. This part of the complete procedure is illustrated with an example of a Cys-containing protein (the Cys is shown as sticks). The active site (AS) around the Cys is analyzed in the following independent steps: (i) amino acid composition; (ii) secondary structure content; (iii) 3D structural profile analysis. For each of the three steps, the range of values in the final SF is reported in brackets. The amino acid composition step is further subdivided into three subparts, as detailed in the text.
Figure 2.
Amino acid composition of thiol oxidoreductases and control proteins.
Average amino acid composition within 6 Å (A) and 8 Å (B) of the Cys for proteins in the thiol oxidoreductase and control datasets, separated by function and fold as described in the text. Abbreviations: Trx fold OxR, thioredoxin-fold thiol oxidoreductases; Non Trx fold OxR, other thiol oxidoreductases; FAD OxR, FAD-containing thiol oxidoreductases; Non OxR, proteins in the control dataset (Table S2). Random Cys refers to a set of 1800 proteins randomly selected from the PDB, as mentioned in the text. The calculated standard deviations are also shown.
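The composition around a Cys at a given distance cutoff can be illustrated with a minimal sketch. It assumes each residue is reduced to one representative atom given as a (name, x, y, z) tuple and that the position of the Cys sulfur is known. The function name `composition_within` and the toy coordinates are hypothetical; the atoms actually used to define the distance are those described in the text, not necessarily the ones assumed here.

```python
import math
from collections import Counter

def composition_within(center, residues, cutoff):
    """Return the fraction of each residue type whose representative
    atom lies within `cutoff` angstroms of `center`."""
    counts = Counter(
        name
        for name, x, y, z in residues
        if math.dist(center, (x, y, z)) <= cutoff
    )
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}

# Toy data (hypothetical): Cys sulfur at the origin, four neighbours.
cys_sg = (0.0, 0.0, 0.0)
residues = [
    ("GLY", 3.0, 0.0, 0.0),
    ("PRO", 0.0, 5.0, 0.0),
    ("TRP", 0.0, 0.0, 7.0),
    ("LYS", 9.0, 0.0, 0.0),
]

comp_6 = composition_within(cys_sg, residues, 6.0)  # GLY and PRO only
comp_8 = composition_within(cys_sg, residues, 8.0)  # GLY, PRO and TRP
```

Averaging such per-protein dictionaries over a dataset would give one bar per amino acid, as plotted in the figure.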
Figure 3.
Calculated p-values for average occurrence of amino acids.
(A) p-values for the occurrence of each amino acid within 6 Å of the catalytic Cys (Figure 2A) for different types of proteins in our dataset. Values were obtained with a t-test (comparing the whole population of each type with the whole population of the Random Cys set) using GraphPad. (B) p-values for the occurrence of each amino acid within 8 Å of the catalytic Cys (Figure 2B) for different types of proteins in our dataset. Residues with the most significant p-values are highlighted in bold.
Figure 4.
Application of methods for active site recognition to thiol oxidoreductases.
Benchmarking of five publicly available programs for prediction of active site residues: Q-site finder, Pocket finder, THEMATICS, FOD and SARIG. For Q-site finder and Pocket finder, the percentage of correctly predicted proteins is shown when only the first three ranked sites (dark grey cylinder) or all 10 sites (light grey cylinder) are considered. For FOD, the dark grey bar shows the true positive rate using the standard cutoff, while the light grey bar represents the true positive rate when a more permissive cutoff is employed (i.e., active site residues have a normalized ΔH score > 0.5). For SARIG, the dark grey bar represents the true positive rate with the standard cutoff, while the light grey bar shows the true positive rate when more permissive cutoff values (closeness Z-score > 0.75 and 3 Å² < RSA < 200 Å²) are employed.
Figure 5.
The calculated curves are shown in red and the corresponding standard Henderson-Hasselbalch (HH) titration curves in blue. (A) An example of a highly deviating theoretical titration. (B) An example of no deviation (this is the most common situation, since most Cys residues behave in this manner).
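For reference, the standard HH curve for a single titratable group can be sketched as follows. The deprotonated fraction follows directly from the Henderson-Hasselbalch equation; the `max_deviation` helper is a hypothetical illustration of how a calculated curve might be compared with the HH curve and is not the deviation measure used in this work. The pKa and the pH grid are arbitrary example values.

```python
def hh_fraction_deprotonated(ph, pka):
    """Deprotonated (thiolate) fraction of an ideal single-site titration."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))

def max_deviation(calculated, pka, ph_values):
    """Largest absolute difference between a calculated titration curve
    and the ideal HH curve with the same pKa (hypothetical measure)."""
    return max(
        abs(c - hh_fraction_deprotonated(ph, pka))
        for c, ph in zip(calculated, ph_values)
    )

# Example values (assumptions for illustration only).
pka = 8.3
ph_values = [6.0, 7.0, 8.0, 9.0, 10.0]

# A curve that coincides with the HH curve shows no deviation (as in panel B).
ideal_curve = [hh_fraction_deprotonated(ph, pka) for ph in ph_values]
no_deviation = max_deviation(ideal_curve, pka, ph_values)

# A flattened curve deviates from the HH shape (qualitatively as in panel A).
flat_curve = [0.5 for _ in ph_values]
some_deviation = max_deviation(flat_curve, pka, ph_values)
```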
Figure 6.
Effect of variation in the parameters used on method performance.
The difference between the True Positive Rate (TPR) and the False Positive Rate (FPR) is plotted for each parameter weight. True positives were correctly predicted thiol oxidoreductases from the Test Case (Table S3), and false positives were control proteins (Table S3) predicted as thiol oxidoreductases (i.e., containing a Cys scoring higher than the cut-off). A Y-axis value of 1 indicates complete separation of the dataset. Starting from the optimized weights described in the text, each parameter was varied separately in the range 0–4 (X-axis values). The graph should be read as follows: each parameter is represented by a curve, and each point represents the TPR-FPR value (the higher the better) for a specific weight. Maxima represent the best scoring values (compare to Figure 1A). Decreased values to the right of the maximum reflect a tendency of the parameter to give False Positive predictions when over-weighted, whereas decreased values to the left of the maximum reflect a tendency to underestimate the number of True Positives when the parameter is under-weighted. This analysis provides insights into the variability introduced in the performance when the parameter weights are changed, in particular showing that the optimized parameters cannot vary over a broad range of values. However, for two parameters alternative values were possible: the Profile scoring (values could vary in the range 2.5–2.75) and the H++ contribution (1–1.25). For all other parameters, clear maxima (corresponding to the optimized values described in the text) were observed.
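The kind of weight scan summarized in this figure can be sketched as below. The sketch assumes, for illustration only, that the total score is a weighted sum of per-component scores; the component names, toy scores, weights, and cut-off are hypothetical and are not the values used in this work.

```python
def tpr_minus_fpr(positive_scores, negative_scores, cutoff):
    """TPR - FPR when a protein is predicted positive if its score exceeds `cutoff`."""
    tpr = sum(s > cutoff for s in positive_scores) / len(positive_scores)
    fpr = sum(s > cutoff for s in negative_scores) / len(negative_scores)
    return tpr - fpr

def total_score(components, weights):
    """Weighted sum of component scores (an assumed form of the scoring function)."""
    return sum(weights[name] * value for name, value in components.items())

def scan_weight(positives, negatives, weights, name, values, cutoff):
    """Vary one weight over `values`, keeping the others fixed, and
    return (weight, TPR - FPR) pairs."""
    curve = []
    for w in values:
        trial = dict(weights, **{name: w})
        pos = [total_score(c, trial) for c in positives]
        neg = [total_score(c, trial) for c in negatives]
        curve.append((w, tpr_minus_fpr(pos, neg, cutoff)))
    return curve

# Toy data (hypothetical component scores in the range 0-1).
positives = [{"profile": 0.9, "titration": 0.8}, {"profile": 0.7, "titration": 0.9}]
negatives = [{"profile": 0.4, "titration": 0.2}, {"profile": 0.5, "titration": 0.1}]
weights = {"profile": 2.0, "titration": 1.0}

curve = scan_weight(positives, negatives, weights, "profile",
                    [0.0, 1.0, 2.0, 3.0, 4.0], cutoff=2.0)
```

In this toy case the curve is low when the weight is too small (true positives are missed), reaches 1 at intermediate weights, and drops again at the largest weight because a control protein crosses the cut-off.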
Figure 7.
Green circles show thiol oxidoreductases and grey circles other proteins of the Test Case. The Test Case is described in the text and in Table S3.
Figure 8.
Analysis of the Saccharomyces cerevisiae proteome.
The score for each protein of the yeast proteome (i.e., the score for the highest scoring Cys) is plotted (model index numbers follow the order in Table S5). Green circles represent thiol oxidoreductases, and yellow circles show proteins that score in the same range as thiol oxidoreductases but are not known to be redox catalysts. Grey circles represent other proteins. All but one of the known thiol oxidoreductases were detected.
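Assigning each protein the score of its highest scoring Cys amounts to taking a maximum over per-Cys scores, as in this minimal sketch. The protein identifiers, residue labels, and score values are hypothetical.

```python
def protein_scores(cys_scores):
    """Map each protein to the score of its highest scoring Cys.
    Proteins without any scored Cys are omitted."""
    return {
        protein: max(per_cys.values())
        for protein, per_cys in cys_scores.items()
        if per_cys
    }

# Hypothetical per-Cys scores for three proteins.
cys_scores = {
    "protA": {"C30": 4.2, "C33": 1.1},
    "protB": {"C12": 0.7},
    "protC": {},
}

best = protein_scores(cys_scores)
```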