Accurate Protein Structure Annotation through Competitive Diffusion of Enzymatic Functions over a Network of Local Evolutionary Similarities

doi:10.1371/journal.pone.0014286

Figure 1.

Overview of ETA Network Diffusion.

1A. We detect similarities between proteins using Evolutionary Trace Annotation (ETA), which consists of three steps. First, the Evolutionary Trace (ET) algorithm ranks positions in aligned sequences by the correlation of their variations with evolutionary divergence. These ranks of evolutionary importance are mapped onto the protein structure. Second, six amino acids are selected heuristically based on their evolutionary importance, proximity and surface exposure, forming a structural template (red spheres). Third, the template is matched against proteins with known function. These steps are repeated for the matched proteins in order to verify that the match is reciprocal. Significant matches are selected by an SVM (not depicted). 1B. We construct a graph using ETA matches so that nodes represent protein chains and edges represent evolutionary and structural similarity. We select an enzymatic function and apply one of three labels to every node in the network: blue if the node is known to have that function, white if it is known to not have that function, or “?” if it is unknown whether or not the node has that function. We then allow these labels to “diffuse” to all other nodes in the network based on the strength and number of connections. This results in a weight assigned to every node for all enzymatic functions present in our network. In a final step (not depicted) we normalize the weights assigned to a particular node with respect to all other un-annotated nodes in the network. The normalized weights (called z-scores) are compared. The functional label with the highest z-score is taken as the prediction, and the magnitude of the z-score is used as a measure of confidence.
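The diffusion and z-score steps described in 1B can be illustrated with a minimal sketch. It assumes a simple iterative scheme in which known labels stay fixed (1 for "has the function", 0 for "known not to have it") and each unknown node repeatedly takes the weighted average of its neighbors. This is one common way to propagate labels over a graph and is not necessarily the formulation used by the authors. The function names, edge weights and toy network are hypothetical.

```python
from statistics import mean, pstdev


def diffuse(edges, labels, n_iter=200):
    """Propagate one functional label over a weighted, undirected graph.

    edges:  {(u, v): weight}
    labels: {node: 1 (has function), 0 (known not), None (unknown)}
    Returns the diffused weight of every unknown node.
    """
    neighbors = {}
    for (u, v), w in edges.items():
        neighbors.setdefault(u, []).append((v, w))
        neighbors.setdefault(v, []).append((u, w))
    score = {n: (0.0 if lab is None else float(lab)) for n, lab in labels.items()}
    unknown = [n for n, lab in labels.items() if lab is None]
    for _ in range(n_iter):
        updated = dict(score)
        for n in unknown:
            total = sum(w for _, w in neighbors.get(n, []))
            if total > 0:
                # Weighted average of neighbor scores; known nodes never change.
                updated[n] = sum(w * score[m] for m, w in neighbors[n]) / total
        score = updated
    return {n: score[n] for n in unknown}


def z_scores(weights):
    """Normalize diffused weights across all unknown (un-annotated) nodes."""
    mu = mean(weights.values())
    sd = pstdev(weights.values())
    return {n: ((w - mu) / sd if sd > 0 else 0.0) for n, w in weights.items()}


def predict(edges, labels_by_function):
    """For each unknown node, pick the function with the highest z-score."""
    predictions = {}
    for function, labels in labels_by_function.items():
        for node, z in z_scores(diffuse(edges, labels)).items():
            if node not in predictions or z > predictions[node][1]:
                predictions[node] = (function, z)
    return predictions


# Hypothetical toy network: q1 and q2 are un-annotated chains.
edges = {("a", "q1"): 1.0, ("b", "q1"): 1.0,
         ("c", "q2"): 1.0, ("d", "q2"): 1.0,
         ("q1", "q2"): 0.5}
labels_by_function = {
    "F1": {"a": 1, "b": 1, "c": 0, "d": 0, "q1": None, "q2": None},
    "F2": {"a": 0, "b": 0, "c": 1, "d": 1, "q1": None, "q2": None},
}
predictions = predict(edges, labels_by_function)
```

In this toy case `predictions` maps each unknown node to a `(function, z-score)` pair, with the z-score serving as the confidence of the call.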

Figure 2.

Performance on the FLORA test set.

The diffusion method shows a clear improvement at higher sensitivities.

Figure 3.

4 EC Performance on Structural Genomics test set.

3A. Accuracy/coverage tradeoffs of ETA network diffusion and nearest neighbors are shown as red and blue circles, respectively. Coverage (percentage of the entire test set) increases as confidence decreases, so at 10% coverage we show the accuracy (number of true predictions / number of predictions made) of our 10% most confident predictions. The blue triangle shows the performance of ETA. Diffusion gives clear accuracy advantages at most coverage cutoffs. 3B. Performance compared to the top match from a BLAST search of Swiss-Prot. Diffusion on an ETA network clearly outperforms BLAST (black circles) at most coverages on this dataset, demonstrating the need for complementary structure-based methods. 3C. Accuracies when the z-score cutoff is varied. For each z-score, we plot the accuracy of all predictions with that score or higher. Accuracy and z-score show a positive correlation. Accuracy shows a steep decline after z = 0.4. 3D. A magnified view of the beginning of the steep decline.
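The two quantities plotted in this figure can be computed as in the following sketch. It assumes each prediction is stored as a `(confidence, correct)` pair and that coverage is measured against the size of the whole test set. The function names and the example data are hypothetical and are not taken from the paper.

```python
def accuracy_at_coverage(preds, coverage, n_test):
    """Accuracy of the most confident predictions covering a fraction of the test set.

    preds:    list of (confidence, correct) pairs, correct being True/False
    coverage: fraction of the entire test set, e.g. 0.10 for 10%
    n_test:   number of proteins in the entire test set
    """
    ranked = sorted(preds, key=lambda p: p[0], reverse=True)
    top = ranked[:round(coverage * n_test)]
    if not top:
        return None
    return sum(correct for _, correct in top) / len(top)


def accuracy_at_cutoff(preds, z_cut):
    """Accuracy of all predictions whose z-score is z_cut or higher."""
    kept = [correct for z, correct in preds if z >= z_cut]
    return sum(kept) / len(kept) if kept else None


# Made-up predictions for a test set of 10 proteins (5 received a prediction).
example = [(3.0, True), (2.0, True), (1.0, False), (0.5, True), (0.2, False)]
acc_20 = accuracy_at_coverage(example, 0.20, n_test=10)
acc_50 = accuracy_at_coverage(example, 0.50, n_test=10)
acc_z1 = accuracy_at_cutoff(example, 1.0)
```

Sweeping `coverage` or `z_cut` over a range of values yields curves of the kind shown in panels 3A and 3C.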

Figure 4.

In vitro biochemical assay confirms the ETA network diffusion prediction of 3h04 as a carboxylesterase.

A) The prediction of carboxylesterase function for this unknown protein is based on ETA template matches to three chains, all of which have identical function and fold, and low sequence identity with the query protein. B) 10 µg of purified 3h04 was run on an SDS-12% polyacrylamide gel and stained with Coomassie brilliant blue. The single band shown at 35 kDa corresponds to His-tagged 3h04. C) Plot of absorbance at 405 nm vs. time for 3h04 (blue), esterase from porcine liver (Sigma, red), and BSA (Sigma, green). D) The specific activity of 3h04, 193±8 (blue), is similar to that of the esterase from porcine liver, 166±51 (Sigma, red). Specific activity is given in Units (U) per mg of protein. All error bars depict standard deviation.

Figure 5.

Performance penalty as edges are removed from a graph according to the sequence similarity of the nodes they connect for 4 EC predictions.

Accuracy/coverage tradeoffs of ETA network diffusion, nearest neighbor, and the top match from a BLAST search against Swiss-Prot are shown as red, blue and black circles, respectively. Coverage increases as confidence decreases, meaning that at 10% coverage we show the accuracy of our 10% most confident predictions. Maximum allowed sequence identity is 80% in 5A, 60% in 5B, 40% in 5C and 20% in 5D. Accuracies decline with each removal, but ETA network diffusion maintains higher accuracy at high confidence/low coverage.
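The edge-removal step can be sketched as a simple filter, assuming each edge carries the pairwise sequence identity of the two chains it connects. The data layout, names and values below are hypothetical.

```python
def prune_edges(edges, identity, max_identity):
    """Keep only edges whose endpoints share at most max_identity sequence identity.

    edges:        {(u, v): weight}
    identity:     {(u, v): fractional sequence identity between u and v}
    max_identity: cutoff, e.g. 0.80 for 80%
    """
    return {pair: w for pair, w in edges.items() if identity[pair] <= max_identity}


# Made-up example: three edges with different sequence identities.
edges = {("a", "b"): 1.0, ("a", "c"): 0.7, ("b", "c"): 0.4}
identity = {("a", "b"): 0.95, ("a", "c"): 0.55, ("b", "c"): 0.18}

# One pruned graph per cutoff, from the most to the least permissive.
pruned = {cut: prune_edges(edges, identity, cut) for cut in (0.80, 0.60, 0.40, 0.20)}
```

Each pruned graph would then be passed to the prediction method under test, so that any remaining accuracy reflects matches between more distantly related proteins.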

Figure 6.

Network neighborhood of PDB structure 2dz9A.

The network neighborhood within 2 steps of structure 2dz9A is shown. Structures in red are annotated as biotin—acetyl-CoA-carboxylase ligases (6.3.4.15). White structures have no known function or are part of the test set. The nearest neighbor method makes no prediction for 2dz9A because all of its matches are to proteins without known function, but diffusion makes a correct prediction because of the proximity to that functional label and the high connectivity.
