BEE, MIJ, and SEB conceived and designed the experiments. BEE performed the experiments. KEM conceived, designed, and performed the experimental characterization of the adenosine deaminase protein. BEE, MIJ, and SEB analyzed the data. BEE, MIJ, KEM, and SEB contributed reagents/materials/analysis tools. BEE, MIJ, KEM, and SEB wrote the paper.
The authors have declared that no competing interests exist.
We present a statistical graphical model to infer specific molecular function for unannotated protein sequences using homology. Based on phylogenomic principles, SIFTER (Statistical Inference of Function Through Evolutionary Relationships) accurately predicts molecular function for members of a protein family given a reconciled phylogeny and available function annotations, even when the data are sparse or noisy. Our method produced specific and consistent molecular function predictions across 100 Pfam families in comparison to the Gene Ontology annotation database, BLAST, GOtcha, and Orthostrapper. We performed a more detailed exploration of functional predictions on the adenosine-5′-monophosphate/adenosine deaminase family and the lactate/malate dehydrogenase family, in the former case comparing the predictions against a gold standard set of published functional characterizations. Given function annotations for 3% of the proteins in the deaminase family, SIFTER achieves 96% accuracy in predicting molecular function for experimentally characterized proteins as reported in the literature. The accuracy of SIFTER on this dataset is a significant improvement over other currently available methods such as BLAST (75%), GeneQuiz (64%), GOtcha (89%), and Orthostrapper (11%). We also experimentally characterized the adenosine deaminase from
New genome sequences continue to be published at a prodigious rate. However, unannotated sequences are of limited use to biologists. To computationally annotate a hypothetical protein for molecular function, researchers generally attempt to carry out some form of information transfer from evolutionarily related proteins. Such transfer is most successfully achieved within the context of phylogenetic relationships, exploiting the comprehensive knowledge that is available regarding molecular evolution within a given protein family. This general approach to molecular function annotation is known as phylogenomics, and it is the best method currently available for providing high-quality annotations. A drawback of phylogenomics, however, is that it is a time-consuming manual process requiring expert knowledge. In the current paper, the authors have developed a statistical approach—referred to as SIFTER (Statistical Inference of Function Through Evolutionary Relationships)—that allows phylogenomic analyses to be carried out automatically.
The authors present the results of running SIFTER on a collection of 100 protein families. They also validate their method on a specific family for which a gold standard set of experimental annotations is available. They show that SIFTER annotates 96% of the gold standard proteins correctly, outperforming popular annotation methods including BLAST-based annotation (75%), GOtcha (89%), GeneQuiz (64%), and Orthostrapper (11%). The results support the feasibility of carrying out high-quality phylogenomic analyses of entire genomes.
The post-genomic era has revealed the nucleic and amino acid sequences for large numbers of genes and proteins, but the rate of sequence acquisition far surpasses the rate of accurate protein function determination. Sequences that lack molecular function annotation are of limited use to researchers, so automated methods for molecular function annotation attempt to make up for this deficiency. But the large number of errors in protein function annotation propagated by automated methods reduces their reliability and utility [
Most of the well-known methods or resources for molecular function annotation, such as BLAST [
SIFTER (Statistical Inference of Function Through Evolutionary Relationships) takes a different approach to function annotation. Phylogenetic information, if leveraged correctly, addresses many of the weaknesses of sequence-similarity-based annotation transfer [
Other approaches, referred to as context methods, predict protein function using evolutionary information and protein expression and interaction data [
Phylogenomics is a methodology for annotating the specific molecular function of a protein using the evolutionary history of that protein as captured by a phylogenetic tree [
Phylogenomics applies knowledge about how molecular function evolves to improve function prediction. Specifically, phylogenomics is based on the assertion that protein function evolves in parallel with sequence [
It is broadly recognized that this method produces high-quality results for annotating proteins with specific molecular functions [
Bayesian methodologies have influenced computational biology for many years [
Three properties of the Bayesian approach make it uniquely suited to molecular function prediction. First, Bayesian inference exploits all of the available observations, a feature that proves to be essential in this inherently observation-sparse problem. Second, the constraints of phylogenomics—that function mutation tends to occur after a duplication event or that function evolution proceeds parsimoniously—are imposed as prior biases, not as hard constraints. This provides a degree of robustness to assumptions that is important in a biological context. Third, Bayesian methods also tend to be robust to errors in the data. This is critical in our setting, not only because of existing errors in functional annotations, but also because phylogeny reconstruction and reconciliation often imperfectly reflect evolutionary history.
The current instantiation of SIFTER uses Bayesian inference to combine all molecular function evidence within a single phylogenetic tree, using an evolutionary model of molecular function. A fully Bayesian approach to phylogenomics would integrate over all sources of uncertainty in the function annotation problem, including uncertainty in the phylogeny and its reconciliation, and uncertainty in the evolutionary model for molecular function. It is important to be clear at the outset that the current instantiation of SIFTER stops well short of full Bayesian integration. Rather, we have focused on a key inferential problem that is readily treated with Bayesian methods and is not accommodated by current tools in the literature—that of combining all of the evidence within a single inferred tree using probabilistic methods. Technically, this limited use of the Bayesian formalism is referred to as “empirical Bayes” [
Extensions to a more fully Bayesian methodology are readily contemplated; for example, we could use techniques such as those used by MrBayes [
All automated function annotation methods require a vocabulary of molecular function names, whether the names are from the set of Enzyme Commission (EC) numbers, Gene Ontology (GO) molecular function names [
SIFTER builds upon phylogenomics by employing statistical inference algorithms to propagate available function annotations within a phylogeny, instead of relying on manual inference, as fully described in
A “duplication event” captures a single instance of a gene duplicating into divergent copies of that gene within a single genome; a “speciation event” captures a single instance of a gene in an ancestral species evolving into divergent copies of a gene in distinct genomes of different species. Each of the internal nodes of a phylogeny represents one of these two events, although a standard phylogeny does not distinguish between the two. The reconciled phylogeny for a protein family, which discriminates duplication events and speciation events [
The available, or observed, function annotations, associated with individual proteins at the leaves of the phylogeny, are propagated towards the root of the phylogeny and then propagated back out to the leaves of the phylogeny, based on a set of update equations defined by the model of function evolution. The result of the inference procedure is a posterior probability of each molecular function for every node in the tree (including the leaves), conditioned on the set of observed functions. The posterior probabilities at each node do not actually select a unique functional annotation for that node, so functional predictions may be selected using a decision rule based on the posterior probabilities of all of the molecular functions. This procedure gives statistical meaning to the phylogenomic notion of propagating functional annotations throughout each clade descendant from a molecular function mutation event. We do not require that the mutation event coincide with a duplication event.
The inference algorithm used in SIFTER has linear complexity in the size of the tree and thus is viable for large families. The complexity of SIFTER is exponential in the number of possible molecular functions in a family, owing to the fact that we compute posterior probabilities for all possible subsets of functions. In the families that we studied, the number of functions was small and this computation was not rate-limiting; in general, however, it may be necessary to restrict the computation to smaller collections of subsets. The rate-limiting step of applying SIFTER is phylogeny reconstruction; a full-genome analysis, given limited computational resources, might use lower-quality or precomputed phylogenies along with bootstrapping, or a subset of closely related species for the larger protein families. We found that lower-quality trees do not significantly diminish the quality of the results (results will be detailed elsewhere).
In this report, we use only GO IDA- and IMP-derived annotations as observations for SIFTER, because of the high error rate and contradictions in the non-experimental annotations (i.e., all annotation types besides IDA and IMP). However, SIFTER can also incorporate other types of annotations, weighted according to their reliability.
We first present results for SIFTER's performance on a large set of proteins to show general trends in prediction and to evaluate the scalability of SIFTER. We then present results for a single protein family with a gold standard set of function characterizations to evaluate prediction quality in detail. We also describe results for the lactate/malate dehydrogenase family, although it does not have a gold standard dataset. The decisive benefit of a statistical approach to phylogenomics is evidenced on each of these different datasets.
To evaluate the scalability, applicability, and relative performance of SIFTER, we predicted molecular function for proteins from 100 protein families available in Pfam [
For each family in our 100-family dataset, we ran SIFTER on the associated reconciled tree with the experimental annotations (IDA and IMP) from the GOA database. SIFTER produced a total of 23,514 function predictions; we selected the subset of 18,736 that had non-experimental annotations from the GOA database and applied BLASTC, GOtcha, and Orthostrapper to this set (
Comparison of Predicted Annotations on 18,736 Proteins from 100 Pfam Families
We chose these 100 families to meet one of the following two criteria: (1) greater than 10% proteins with experimental annotations (and more than 25 proteins), or (2) more than nine experimental annotations. Families with fewer than two incompatible experimental GO functions were excluded. The families had an average of 235 proteins, ranging from 25 to 1,116 proteins. On average, 3.3% of the proteins in a family had IDA annotations, and 0.4% had IMP annotations. Both SIFTER and Orthostrapper relied on this particularly sparse dataset for inference; evaluative techniques involving the removal of any of these annotations from inference tended to trivialize the results (e.g., removing a lone experimental annotation for a particular function did not enable that function prediction for homologous proteins). Selecting well-annotated families via these criteria assists SIFTER, but it should also enhance the performance of all of the function transfer methods evaluated here. Note also that SIFTER does not require this level of annotation accuracy to be effective, as discussed below. Finally, it is important to note that many of the IEA annotations from the GOA database may come from one of the assessed methods, so we can expect consistency to be quite high.
Of the 8,501 SIFTER predictions that were either identical or incompatible to the GO non-experimental annotations, 83.1% were identical. The average percentage of identical function predictions by family was 82.9%, signifying that the size of the family does not appear to impact this percentage. The median identity by family was 90.7%, and the mode was 100% (representing 25 families). The minimum identity was 14.4% (Pfam family PF00536). We estimate that 38 of the families contained non-enzyme proteins, and we found no difference in the identity percentage of SIFTER on enzyme families versus non-enzyme families. Similarly, the total number of functional annotations used as observations in SIFTER does not appear to impact the identity percentage (although percentage of proteins with annotations does appear to impact identity percentage). These data suggest that a large percentage of incompatibility is concentrated within a few families. It is not entirely clear what property of those families contributes to the greater incompatibility; it may reflect how well studied the families are relative to the number of proteins in the family.
Not all of the annotation methods predict functions for 100% of the proteins. Indeed, as shown in
Prediction Coverage on 18,736 Proteins from 100 Pfam Families
The percentage of Orthostrapper predictions that were identical or compatible with the non-experimental GOA database function annotations in the 100-family dataset was 88%, but only 7% of proteins received Orthostrapper predictions (
The difficulties encountered by Orthostrapper arise from the small number of proteins that are placed in statistically supported clusters, and the lack of annotations in these clusters. The latter limits the usefulness of the method to protein families with a high percentage of known protein functions, or to observed annotations with a low error rate. These results highlight the impact of the modeling choices in SIFTER and Orthostrapper. SIFTER uses Bayesian inference in a single phylogeny, addressing uncertainty in the ancestral variables in the phylogeny but presently not addressing uncertainty in the phylogeny itself. In contrast, Orthostrapper's approach of bootstrapped orthology addresses uncertainty in the phylogeny, but neglects uncertainty in the ancestral variables. Our results indicate that the gains to be realized by treating uncertainty within a tree may outweigh those to be realized by incorporating uncertainty among trees, but it would certainly be of interest to implement a more fully Bayesian version of SIFTER that accounts for both sources of uncertainty.
We compared SIFTER's prediction (the function with the single highest posterior probability) to the top-ranked prediction from BLAST-based methods, the top-ranked prediction from GOtcha, the unranked set of non-experimental terms from the GOA database, and unranked Orthostrapper predictions. On this broad set of proteins, SIFTER's predictions were compatible or identical with the non-experimental annotations from the GOA database for 80% of the predictions, while 67% of BLAST-based predictions, 80% of GOtcha predictions, and 78% of GOtcha-ni predictions were compatible or identical to the non-experimental GOA database annotations. It is not entirely clear what these numbers represent, in particular because some unknowable fraction of the IEA annotations in the GOA database were derived using these or related methods. Orthostrapper predictions achieved 88% (Ortho) and 92% (Ortho-ns) compatibility or identity with the GOA database, but because of the small percentage of proteins receiving predictions using Orthostrapper, the absolute number of compatible or identical annotations is much lower. All pairwise comparison data are in
The number of incompatible annotations is noteworthy: exact term agreement ranges from 16% to 91%, and the percentage of compatible or identical terms ranges from 45% to 95%. Collectively the methods must be producing a large number of incorrect annotations as evidenced by the high percentage of disagreement in predictions. It appears that there is no gold standard for comparison in the case of electronic annotation methods other than experimental characterization.
We selected a well-characterized protein family, the adenosine-5′-monophosphate (AMP)/adenosine deaminase family, for evaluation of SIFTER's predictions against a gold standard set of function annotations. We assessed these using experimental annotations that we manually identified in the literature, accepting only first-hand experimental results that were successful in unambiguously characterizing the specific chemical reaction in question. References are provided in
The AMP/adenosine deaminase Pfam family contains 128 proteins. Based on five proteins with experimental annotations from the GOA database, we ran SIFTER to make predictions for the remaining 123 proteins. Of these remaining proteins, 28 had experimental characterizations found by the manual literature search. SIFTER achieved 96% accuracy (27 of 28) for predicting a correct function against this gold standard dataset. SIFTER performed better than BLAST, GeneQuiz, GOtcha, GOtcha-exp (GOtcha transferring only experimental GO annotations), and Orthostrapper (75%, 64%, 89%, 79%, and 11% accuracy, respectively). The comparative results are summarized in
Results for SIFTER, BLASTA (the most significant non-identity annotated sequence), BLASTB (the most significant non-identity sequence), GeneQuiz, GOtcha, GOtcha-exp (only experimental GO annotations used), Orthostrapper (significant clusters), and Orthostrapper-ns (non-significant clusters). The gold standard test set was manually compiled based on a literature search. All percentages are of true positives relative to the test set.
(A) Results for discrimination between just the three deaminase substrates, as a percentage of the 28 possible correct functions.
(B) Results for discrimination between the three deaminase substrates plus the additional growth factor domain, as a percentage of the 36 possible correct functions; for BLAST, GeneQuiz, Orthostrapper, and Orthostrapper-ns, we required the transferred annotation to contain both functions; for SIFTER, GOtcha, and GOtcha-exp we required that the two correct functions have the two highest ranking posterior probabilities or scores.
The general role of the AMP/adenosine deaminase proteins is to remove an amine group from the purine base of the substrate. The AMP/adenosine deaminase family has four GO functions associated with member proteins (
Double ovals represent the four functions, none of which are compatible, corresponding to the random variables associated with the random vector used for inference in SIFTER.
The prediction results, a subset of which are shown in
The reconciled phylogeny used in inference is shown, along with inferential results (both the posterior probabilities for the deaminase substrates and the function prediction based on the maximum posterior probability). Eight of the proteins in this tree were annotated with growth factor activity, with the second highest probability being adenosine deaminase. The function observations used for inference are denoted by filled boxes to the left of the column with the posterior probabilities. For each substrate specificity that arises, a single edge in the phylogeny identifies a possible location for that mutation. The highlighted sequences are discussed in the text. The blue vertices represent speciation events and the red vertices represent duplication events. The tree was rendered using ATV software, version 1.92 [
An alternate method to evaluate prediction accuracy is the receiver operating characteristic (ROC) plot.
These ROC curves were computed over the 28 proteins in the test set for the deaminase family. This figure presents the ROC plot for both the posterior probabilities produced by SIFTER (and normalized for SIFTER-N) and the
Unmodified posterior probabilities allow us to assess the quality of a functional prediction across a family. When considering subsets of functional predictions, however, the maximum posterior for a protein may be small compared to the maximum posterior for other proteins, but we still would like to select this functional assignment because the other posteriors for this protein are smaller still. This can be achieved by normalizing the posteriors for a given protein across the subset of functional assignments of interest. For discriminating the three deaminase substrates,
We compared SIFTER's predictions in this family to four available protein function annotation methods: BLAST (in two approaches called BLASTA and BLASTB, as described in
On this gold standard annotation dataset, SIFTER predictions were more accurate than the alternative methods. Caveats must be mentioned for two of the methods. We ran GOtcha in two different ways on this dataset, detailed in
If we accept compatible annotations, the accuracy for BLASTA improves by a single protein (Q9VHH7) to 79% (22 of 28), and BLASTB results remain the same (see
The deaminase results focus on a single homologous protein domain that deaminates three possible substrates (AMP, adenosine, and adenine). A few of the proteins in this Pfam family have an additional N-terminal domain. This extra domain (PB003508) has growth factor activity (GO:0008083, not an enzyme function), while the AMP/adenosine binding domain (PF00962) has adenosine deaminase function. We built the phylogeny for this family using only the common AMP/adenosine binding domain; the functional annotations, however, are affiliated with the entire protein sequence. Because the phylogenomic model currently does not explicitly address domain fusion events, we did not consider the molecular function associated with the additional domain in the analyses described thus far.
We reevaluated the results, requiring, where appropriate, growth factor activity to be in the transferred functional annotation (for BLAST, Orthostrapper, and GeneQuiz), or requiring “adenosine deaminase” and “growth factor activity” to have the two highest rankings (ranked by posterior probabilities for SIFTER, or scores for GOtcha and GOtcha-exp). This provided a total of 36 molecular functions to be annotated on the 28 proteins. When evaluating the ability of methods to also correctly annotate this additional role, the accuracy for every method decreases or remains consistent. SIFTER achieved 78% accuracy (28 of 36), while BLAST achieved 75% accuracy (27 of 36), and GeneQuiz achieved 58% accuracy (21 of 36). GOtcha predictions achieved 89% accuracy in the multi-domain setting (32 of 36), and GOtcha-exp achieved 75% (27 out of 36). Of the 11% of proteins that Orthostrapper annotated, none were in the set of proteins with growth factor (so overall accuracy is 8% of the 36 functions to annotate); considering non-statistically significant annotations, Orthostrapper (Ortho-ns) achieved 35% (58% accuracy for 61% of proteins annotated). These results are summarized in
We experimentally characterized the substrate specificity of a deaminase (Q8IJA9) from the human malarial parasite,
where
The open circles are individual data points, while the solid line is the fit of the data to Equation 1. The inset shows raw data for the deamination of three substrates by Q8IJA9_PLAFA as detected by loss of absorbance at 265 nm. The bold, thin, and dashed lines are data for 100 μM adenine, AMP, and adenosine, respectively. The reactions with adenine and AMP contained 860 nM enzyme, while the assay containing adenosine had only 17 nM enzyme. Reaction conditions for all assays were 25 °C in 50 mM potassium phosphate (pH 7.4).
A second family we chose for detailed analysis and validation is the lactate/malate dehydrogenase family. We used the Pfam family PF00056, representing the NAD+/NADP+ binding domain of this family of proteins. This Pfam family contains 605 proteins, 34 of which have function annotations supported by experimental evidence in the GOA database or in literature references; these 34 were used as evidence for SIFTER.
There are three GO functions associated with proteins in this family. L-lactate dehydrogenase (L-LDH) (GO:0004459; EC:1.1.1.27) catalyzes the final step in anaerobic glycolysis, converting L-lactate to pyruvate and oxidizing NADH [
An interesting property of the SIFTER analysis is that it reports three instances of convergent evolution in the dehydrogenase family, all of which are supported in the literature, but only one of which is explicitly documented as convergent evolution [
Although there is no gold standard dataset for this family of proteins, based on a manual literature search we gathered 421 proteins in this family that scientists have non-experimentally annotated with a specific function, including substrate and cofactor. Our comparison metric is consistency, or the percentage of protein predictions that are identical to the set of 421 available (although non-experimental) annotations. It appears that the task of discriminating the substrates of this enzyme, i.e., predicting LDH or MDH, is not a difficult one, as all of the methods achieve a high consistency: SIFTER achieves 97% consistency, BLASTA achieves 93% consistency, GeneQuiz achieves 98% consistency, and GOtcha achieves 95% consistency with a set of non-experimental annotations. The methods mentioned here made predictions for all of the 421 proteins.
When we changed the task to include discriminating function at the cofactor level (i.e., predicting one of L-LDH, MDH NAD+, and MDH NADP+, so predicting LDH or MDH is inconsistent), the prediction task became more difficult. On this task, BLASTA consistency drops to 32%, GeneQuiz consistency drops to 68%, and GOtcha consistency drops to 73%. SIFTER's consistency, however, drops only slightly, to 95%, on this set of 421 proteins. One of the primary advantages of SIFTER for scientists is the ability to produce specific function annotations, which was originally a motivation for performing a manual phylogenomic analysis. Often substrate specificity or cofactor changes the biological role of a protein significantly, as in this case and in the deaminase family; being able to differentiate between protein functions with greater precision facilitates characterization and allows subtle but significant functional distinctions to be made.
This point can also be illustrated on a larger scale using the 100-family dataset. For each compatible (but not identical) pair of predictions, we checked which of the two predictions was more specific in the GO DAG. The results are shown in
Comparison of Compatible (but Not Identical) Predicted Annotations on 18,736 Proteins from 100 Pfam Families
Annotation of protein function through computational techniques relies on many error-prone steps and incomplete function descriptions; SIFTER is no different than other methods in this regard. But a significant component of SIFTER is a statistical model of how protein function evolves. Devos and Valencia postulate that “the construction of a complete description of function requires extensive knowledge of the evolution of protein function that is not yet available” [
The accuracy of SIFTER's results lets us revisit the assumptions of phylogenomics with an eye towards lessons about molecular function evolution. The improvement obtained by using a tree-structured evolutionary history and evolutionary distance, as in SIFTER, versus a measure of evolutionary distance alone, as in BLAST, GOtcha, or GeneQuiz, implies that the information in evolutionary tree structure corrects the systematic errors inherent in pairwise distance methods [
Comparing SIFTER to BLAST, GeneQuiz, and GOtcha at the cofactor level for the dehydrogenase family (95% consistency versus 32%, 68%, and 73%, respectively) exemplifies the power of SIFTER over other methods for specific function prediction, and the results from the 100-family dataset lend further strength to this comparison. The difference in consistency of BLAST predictions for general substrate discrimination versus specific cofactor discrimination (61% difference) reflects the disparity between the availability of general function descriptions and specific function descriptions evolutionarily proximate to each query protein. By employing phylogenomic principles, SIFTER leveraged evolutionarily distant function observations, incorporating more specific but sparser annotations and enabling SIFTER to make specific function predictions across an entire family. BLAST, GeneQuiz, and GOtcha were limited in their ability to detail molecular function at the cofactor level because of the relative sparsity of functions reported at the cofactor level.
A single protein sequence may contain multiple domains with several functions. There are many cases of individual domains in multi-domain proteins assuming a diverse set of functions, depending on the adjoined domains (e.g., [
The sparsity of reliable data is inherent to the task of predicting protein function. In the case of the 100-family dataset, 3.7% of proteins (on average) had experimental function annotations; in the AMP/adenosine deaminase family, 2.6% of proteins had experimental function annotations. Despite this sparseness, SIFTER achieves 96% accuracy in predicting function for homologous proteins for the latter gold standard dataset. Relying exclusively on evidence derived from experimental assays ensured that the quality of the annotations was high. For the AMP/adenosine family in Pfam, there were 348 non-experimental GO annotations (for 127 of the proteins) versus three proteins with IDA (inferred from direct assay) annotations and two proteins with IMP (inferred from mutant phenotype) annotations.
There are methods of extracting annotations from literature (e.g., [
SIFTER's primary role may be to reliably predict protein function for many of the Pfam families or more generic sets of homologous proteins. The argument can be made that no automated function annotation method should be used in some of these cases because the data within a family are too sparse to support annotation transfer. Thus, a second role for SIFTER may be to quantify the reliability of function transfer in under-annotated sets of homologous proteins, by using the posterior probabilities as a measure of confidence in annotation transfer. A third role may be to select targets for functional assays so as to provide maximum coverage based on function transfer for automated annotation techniques. Because of its Bayesian foundations, SIFTER is uniquely qualified to address these alternate questions in a quantifiable and robust way.
Molecular function predictions cannot replace direct experimental evidence for producing flawless function annotations [
In this section we first present the modeling, algorithmic, and implementational choices that were made in SIFTER. We then turn to a discussion of the methods that we chose for empirical comparisons. Finally, we present the protocol followed for the deaminase activity assays.
In classical phylogenetic analysis, probabilistic methods are used to model the evolution of characters (e.g., nucleotides and amino acids) along the branches of a phylogenetic tree [
SIFTER borrows much of the probabilistic machinery of phylogenetic analysis in the service of an inference procedure for molecular function evolution. The major new issues include the following: (1) given our choice of GO as a source of functional labels, functions are not a simple list of mutually exclusive characters, but are vertices in a DAG; (2) we require a model akin to Jukes–Cantor but appropriate for molecular function; (3) generally only a small subset of the proteins in a family are annotated, and the annotations have different degrees of reliability. We describe our approach to these issues below.
The first step of SIFTER is conventional sequence-based phylogenetic reconstruction and reconciliation.
Phylogenetic reconstruction is the computational bottleneck in the application of SIFTER. Thus, in the current implementation of SIFTER we have made use of parsimony methods instead of more computationally intense likelihood-based or Bayesian methods in phylogenetic reconstruction. This “empirical Bayes” simplification makes it possible to apply SIFTER to genome-scale problems.
In detail, the steps of phylogenetic reconstruction implemented in SIFTER are as follows. Given a query protein, we (1) find a Pfam family of a homologous domain [
The result of this procedure is a “reconciled phylogeny,” a rooted phylogenetic tree with branch lengths and duplication events annotated at the internal nodes [
Subsequent stages of SIFTER retain these structural elements of the phylogeny, but replace the amino acid characters with vectors of molecular function annotations and place a model of molecular function evolution on the branches of the phylogeny.
We use the following process to define a vector of candidate molecular function annotations for a given query protein and for the other proteins in the phylogeny.
Given a Pfam family of a homologous domain for a query protein, we index into the GOA database [
We treat the elements of this nad subset as the indices of a vector of candidate functional annotations, to be referred to as the “annotation vector” in the remainder of this section. Subsequent inferential stages of SIFTER treat this vector as a Boolean random vector. That is, we assume that each function can be asserted as either present or absent for a given protein, and we allow more than one function to be asserted as being present.
The goal is to compute posterior probabilities for all unannotated proteins in the family of interest, conditioning on experimentally derived annotations (i.e., IDA or IMP annotations) associated with some of the proteins in the family. To accommodate the fact that IDA annotations may be more reliable than IMP annotations according to the experiments by which they are generated, and to allow users to make use of other, possibly less reliable, annotations, SIFTER distinguishes between a notion of “true function” and “annotated function,” and defines a likelihood function linking these variables. In particular, the current implementation of SIFTER defines expert-elicited probabilities that an experimentally derived annotation is correct given the method of annotation: IDA annotations are treated as having a likelihood of 0.9 of being correct, and IMP as having a likelihood of 0.8.
GOA database annotations are not restricted to the leaves of the ontology but can be found throughout the DAG. To incorporate all such annotations in SIFTER, we need to propagate annotations to the nad subset. In particular, annotations at nodes that are ancestors to nad nodes need to be propagated downward to the nad nodes. (By definition, there can be no annotated descendants of nad nodes.) We do this by treating evidence at an ancestor node as evidence for all possible combinations of its descendants, according to the distribution
We turn to a description of the model of molecular function evolution that SIFTER associates with the branches of the phylogeny. For each node in the phylogeny, corresponding to a single protein, this model defines the conditional probability for the vector of function annotations at the node, conditioning on the value of the vector of function annotations at the ancestor of the node.
(A) Two proteins, Q9VFS0 and Q9VFS1, both from
(B) Protein Q9VFS1 has a functional observation for adenosine deaminase (the center rectangle). Also shown are the posterior probabilities for each molecular function as grayscale (white indicating zero and black indicating one) of the annotation vector after inference. Each component of the vector corresponds to a particular deaminase substrate.
(C) The noisy-OR model that underlies the inference procedure. We focus on the adenosine deaminase random variable in protein Q9VFS0. The transition probability for this random variable depends on all of the ancestor random variables and the transition parameters
We chose a statistical model known as a loglinear model for the model of function evolution. We make no claims for any theoretical justification of this model. It is simply a phenomenological model that captures in broad outlines some of the desiderata of an evolutionary model for function and has worked well in practice in our phylogenomic setting.
Let
where
To capture the notion that a transition should be less probable the less “similar” two functions are, we defined
We also need to parameterize the transition rate as a function of the branch length in the phylogeny. This is achieved by defining
Having defined a probabilistic transition model for the branches of the phylogeny, and having defined a mechanism whereby evidence is incorporated into the tree, it remains to solve the problem of computing the posterior probability of the unobserved functions in the tree conditional on the evidence.
This problem is readily solved using standard probabilistic propagation algorithms. Specifically, all posterior probabilities can be obtained in linear time via the classical pruning algorithm [
The BLAST version 2.2.4 [
To build the ROC plots for the BLASTC comparison, for each protein in the selected families we searched the BLAST output for the highest scoring sequence (with the most significant
To build the BLASTC set of annotations for the set of 18,736 proteins from the 100-family dataset, we built a keyword search with 260 GO terms, including all of the terms from the SIFTER analysis and other terms common to the BLAST search results. From this keyword search we extracted a set of terms ranked by
For GeneQuiz, we ran each member of the AMP/adenosine deaminase and lactate/malate dehydrogenase families on the GeneQuiz server (publicly available at EBI) from August 22, 2004, to September 1, 2004 [
For GOtcha, we ran the first publicly available version of the GOtcha software [
For Orthostrapper [
In each cluster, we transferred all experimentally derived GO annotations from member proteins onto the remaining proteins without experimentally derived GO annotations. If a cluster did not contain a protein with an experimentally derived GO annotation, no functions were transferred; if a protein was present in multiple clusters, it would receive annotations transferred within each of those clusters. This method yields an unranked set of predictions for each protein.
Purified Q8IJA9_PLAFA was the kind gift of Erica Boni, Chris Mehlin, and Wim Hol of the Stuctural Genomics of Pathogenic Protozoa project at the University of Washington. Adenosine and adenine were from Sigma-Aldrich (St. Louis, Missouri, United States), AMP was from Schwarz Laboratories (Mt. Vernon, New York, United States), and monobasic and dibasic potassium phosphate were from EMD Chemicals (Gibbstown, New Jersey, United States).
The loss of absorbance at 265 nm was monitored with an Agilent Technologies (Palo Alto, California, United States) 8453 spectrophotometer. The Δɛ between substrate adenosine and product inosine is 7,740 AU M−1 cm−1 [
(7.5 MB TGZ)
The Swiss-Prot (
This material is based upon work supported under a National Science Foundation Graduate Research Fellowship, National Institutes of Health (NIH) grant K22 HG00056, the Searle Scholars Program (1-L-110), a grant from Microsoft Research, a grant from the Intel Corporation, NIH grant R33 HG003070, an IBM SUR grant, and NIH grant GM35393 to J. F. Kirsch, whose generous help enabled the rapid characterization of Q8IJA9_PLAFA. Purified Q8IJA9_PLAFA was the kind gift of Erica Boni, Chris Mehlin, and Wim Hol of the Stuctural Genomics of Pathogenic Protozoa project at the University of Washington.
adenosine-5′-monophosphate
directed acyclic graph
Enzyme Commission
Gene Ontology
Gene Ontology annotation
lactate dehydrogenase
malate dehydrogenase
receiver operating characteristic