A Factor Graph Approach to Automated GO Annotation

As the volume of genomic data grows, computational methods become essential for providing a first glimpse of gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with a main focus on annotation precision, and heuristic alternatives, with a main focus on scalability issues, have been described in the literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose, noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of the most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum.


Introduction
A fundamental step, and significant bottleneck, in the acquisition of biological knowledge from genomic data is the characterization of gene product properties. The Gene Ontology (GO) Consortium provides an ontology of terms for describing gene product properties and their relationships in a species-independent manner. The automated association between ontologies and genes and gene products, i.e., the automated annotation of genes, is one of the great challenges of computational biology. When probabilistic predictions were required, sigmoid fitting over SVM outputs was performed [24]. Recently, a revised version of the TPR heuristic valid for DAG-structured ontologies like GO has been presented [25]. In this new version, a two-way flow of information is performed. Experimental results on a human annotation problem involving thousands of classes from the human phenotype ontology [26] suggest that the TPR algorithm is indeed effective for improving raw class predictions so they consistently match a predefined target ontology. It should be noted, however, that consistent TPR class predictions may not be unique and may not be optimal with respect to the minimization of the probability of erroneous class predictions.
In this paper, a factor graph approach to automated GO annotation is presented. Briefly, a factor graph is a "bipartite graph that expresses how a global function of several variables factors into a product of local functions" [27]. Among their many applications [28,29], factor graphs are well suited for behavioral system modeling. In this type of application, a boolean function over the variables of a system describes its valid configurations. If such a boolean function can be expressed as a set of predicates over subsets of system variables, a factor graph representation follows. Recalling that any boolean function can be represented as a rooted DAG and that domain-specific GO structures are rooted DAGs, it follows that factor graphs can also be used for GO modeling. On the other hand, factor graphs are also well suited for the probabilistic modeling of errors arising in problems of information transmission in the presence of noise. Recalling that misclassification errors of practical binary classifiers can be formulated as an instance of such a class of information theory problems, factor graphs can also be used for the probabilistic modeling of noisy GO-term predictions. Having a unique factor graph that formally includes the True Path Graph (TPG) constraint and a model of prediction noise at binary classifiers, latent GO-term predictions can be obtained from their maximum a posteriori (MAP) probability estimates. These probabilities can in turn be computed by an iterative message passing algorithm between nodes of the factor graph. To validate our proposal, the annotation of protein sequences of three biological models, S. cerevisiae, A. thaliana, and D. melanogaster, in the GO Molecular Function domain was first considered. These model organisms were chosen to span both unicellular (yeast) and multicellular (plant and animal) eukaryotes. To conclude, the annotation of four proteins of unknown function in Solanum lycopersicum with A. thaliana annotation data was analyzed.

Method
We devise a method, hereafter called Factor Graph GO Annotation (FGGA), that exploits the ability of factor graphs to model logical and statistical constraints over system variables, e.g., the key TPG constraint over GO-term annotations or a convenient probability distribution of raw GO-term predictions over actual GO-term annotations. The FGGA approach is split into three steps. The first deals with the construction of a core factor graph (FG) from a predefined GO-DAG. The second deals with the enrichment of the core factor graph to take into account the noisy nature of GO-term predictions. Finally, the third step deals with the proper setting of a message passing algorithm to infer GO annotations of unannotated samples.

Matching a GO-DAG to a core Factor Graph
Given a GO subgraph, GO-terms GO:i are mapped to binary-valued variable nodes x_i of a factor graph, while relationships between GO-terms are mapped to logical factor nodes f_k describing valid GO:i configurations under the TPG constraint. This matching task can be accomplished through an adapted version of the Breadth-First Search (BFS) [30] algorithm. As shown in Fig 1a, starting from the top root node of a given GO-DAG, the identity of visited child nodes is preserved so that a new factor node between a parent and a child node is introduced only when the child node has not been previously visited; otherwise, the parent node is attached to the previously created factor node of the revisited child node.
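The adapted BFS construction can be sketched as follows. This is an illustrative reading of the procedure just described, not the authors' implementation; the dictionary-based DAG encoding and factor naming are our own.

```python
from collections import deque

def build_core_factor_graph(children, root):
    """Map a GO-DAG (parent -> list of child terms) to the factor nodes
    of a core factor graph. A logical factor node is created for a child
    term on its first visit; when the child is reached again through
    another parent (multiple inheritance), that parent is attached to
    the previously created factor node."""
    factor_of = {}   # child term -> factor node id
    factors = {}     # factor id -> {"child": term, "parents": [terms]}
    visited = {root}
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for child in children.get(parent, []):
            if child not in factor_of:          # first visit: new factor
                fid = "f%d" % (len(factors) + 1)
                factor_of[child] = fid
                factors[fid] = {"child": child, "parents": [parent]}
            else:                               # revisited: attach parent
                factors[factor_of[child]]["parents"].append(parent)
            if child not in visited:
                visited.add(child)
                queue.append(child)
    return factors

# Toy DAG mirroring Fig 1: GO:4 inherits from both GO:2 and GO:3.
dag = {"GO:1": ["GO:2", "GO:3"], "GO:2": ["GO:4"], "GO:3": ["GO:4"]}
fg = build_core_factor_graph(dag, "GO:1")
```

On this toy DAG, the factor created for GO:4 ends up connected to both of its parents, matching the multiple-inheritance factor f_3 of Fig 1b.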
Practically, logical factor nodes f_k are implemented with truth tables of 2^(#children + #parents) entries. At each of these entries, the specific parent/child role of participating variable nodes is required to check the TPG constraint. As shown in Table 1, where 1/0 denotes positive/negative annotation respectively, the logical factor f_3 in Fig 1b ensures that the TPG constraint over variable nodes x_2, x_3 and x_4 is fulfilled whenever x_4 is a child node and x_2 and x_3 are its parent nodes (multiple inheritance over x_4).
Formally, logical factor nodes f_k over subsets of variable nodes ensure the local satisfiability of the TPG constraint. With this aim, two logical rules are repeatedly evaluated. Specifically, if a child GO-term is annotated positive, then its parent GO-term(s) must also be annotated positive. On the other hand, if a parent GO-term is annotated negative, then its child GO-term(s) must also be annotated negative. In predicate logic language [31], let is_a(GO:j, GO:i) denote that GO:i is the parent of the child GO:j. Similarly, let annotated(·) denote the positive annotation of any GO-term. As a result, at least one of the following rules, Eq (1) or Eq (2), must be active and fulfilled by any pair of GO-terms involved in an is_a relationship:

is_a(GO:j, GO:i) ∧ annotated(GO:j) → annotated(GO:i)    (1)
is_a(GO:j, GO:i) ∧ ¬annotated(GO:i) → ¬annotated(GO:j)    (2)

Regarding Table 1, let "x_4 is the child of x_3" and "x_4 is the child of x_2" denote the two is_a relations between the GO:4, GO:3 and GO:2 terms. For both these relations, rule (2) is active and fulfilled at row 1 and thus f_3 is true. On the other hand, rule (1) is active for both relations at row 4 but fulfilled by only one of them and thus f_3 is false.
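A minimal sketch of how such a logical factor can be evaluated, assuming the implication reading of the two rules above (both reduce to "child implies parent" for every parent); the encoding is our own and Table 1's row ordering is not reproduced:

```python
from itertools import product

def tpg_factor(child, parents):
    """Evaluate a logical factor node for one child term and its parent
    terms (0/1 annotations). A positive child requires every parent to
    be positive; equivalently, a negative parent forces a negative
    child, so the check is the implication child => parent."""
    return all((not child) or p for p in parents)

# Truth table for f_3 in Fig 1b: child x4 with parents x2 and x3.
table = {(x2, x3, x4): tpg_factor(x4, (x2, x3))
         for x2, x3, x4 in product([0, 1], repeat=3)}
```

Only five of the eight configurations satisfy the constraint: the four with x4 = 0 and the single one with x2 = x3 = x4 = 1.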

Enrichment of the core GO-FG
In practice, actual GO:i annotations of unannotated samples are estimated as accurately as possible with binary classifiers. Therefore, variable nodes x_i in the core GO-FG must be considered latent behind a new class of variable leaf nodes y_i modeling observable, but uncertain, GO:i term predictions. The enrichment of the core GO-FG with variable leaf nodes y_i requires the introduction of a new class of probabilistic factor nodes g_i modeling their statistical dependence on latent variable nodes x_i (see Fig 2). For this purpose, a communication channel model between latent binary inputs x_i and observable real-valued outputs y_i can be assumed. Hence, latent binary inputs x_i (±1) corrupted with additive white zero-mean Gaussian noise z_i of variance η_i, so that y_i = x_i + z_i holds, can be assumed [32]. In such a case, probabilistic factor nodes g_i can be set [33] to

p(x_i | y_i) = 1 / (1 + exp(−2 y_i x_i / η_i))

To completely describe an FG model for the automated GO annotation problem, the estimation of noise variances η_i remains to be done. Under the assumption of symmetrical conditional probability distributions for predictions y_i over latent annotations x_i, variances η_i can be easily estimated using a validation dataset of positively/negatively annotated samples. Specifically, let D be a validation dataset with L⁺ positively annotated samples. Hence,

η_i = (1/L⁺) Σ_{l=1..L⁺} (y_i^l − x_i^l)²

where x_i^l = +1 is the positive annotation of the l-th data sample to the GO:i term and y_i^l is the corresponding real-valued classifier prediction.
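Both quantities can be sketched numerically. The posterior below is the standard result for a ±1 input observed through an AWGN channel, and the variance estimator is our reading of the validation-set estimate above; the sample predictions and annotations are made-up values.

```python
import math

def channel_posterior(y, eta):
    """p(x = +1 | y) for a binary input x in {-1, +1} observed through
    an additive white Gaussian noise channel y = x + z, z ~ N(0, eta);
    the standard posterior for this channel model."""
    return 1.0 / (1.0 + math.exp(-2.0 * y / eta))

def estimate_noise_variance(preds, labels):
    """Estimate eta as the mean squared deviation of real-valued
    predictions from the ideal +1 annotation over the positively
    annotated validation samples."""
    sq = [(y - 1.0) ** 2 for y, x in zip(preds, labels) if x == 1]
    return sum(sq) / len(sq)

# Made-up validation predictions/annotations for a single GO-term.
eta = estimate_noise_variance([0.8, 1.2, -0.6, 0.9], [1, 1, -1, 1])
```

A prediction of y = 0 yields a posterior of exactly 0.5 (maximal uncertainty), while predictions near ±1 drive the posterior toward 1 or 0 at a rate governed by the estimated noise variance.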

Message passing algorithm in FGGA
Once a factor graph model F for the automated GO annotation problem has been defined, an iterative message passing algorithm [34] between nodes of F can be used to compute maximum a posteriori (MAP) estimates x̂_i of the variable nodes x_i modeling actual GO:i annotations. Given an unannotated input sample s, all variable leaf nodes y_i in F get instantiated by a set Y(s) of real-valued GO-term predictions issued by a set of base binary classifiers. Without loss of generality, let binary classifier outputs be characterized by a set of variances η describing conditional Gaussian probability distributions of real-valued GO-term predictions y_i over latent annotations x_i. Hence, starting from the instantiated variable leaf nodes y_i, an iterative message passing algorithm is performed until some convergence criterion ξ, e.g. ξ = 10⁻³, is met or a maximum number I_max of iterations, e.g. I_max = 50, is reached. At this stage, marginal probabilities p(x_i | Y(s)) can be approximated by their iterative version p_t(x_i | Y(s)) provided by a Sum-Product algorithm at the t-th iteration step. In message passing terms, probabilities p_t(x_i | Y(s)) follow from the product of the last incoming and outgoing messages at any of the edges linking x_i. From these probabilities, MAP estimates x̂_i can be computed and used to provide consistent GO:i annotations of unannotated samples s in a predefined GO domain. As a result, the following FGGA algorithm (see Alg. 1) follows:

Algorithm 1: FGGA
Input: GO factor graph F with n GO-terms, a sample s to be annotated, a set Y(s) of n predictions over s, a set η of prediction noise variances, a convergence criterion ξ, a maximum number I_max of iteration steps
Output: MAP estimates x̂_i of actual GO-term annotations on s, i = 1, ..., n
1: for t = 1 to I_max do
2:    update all variable-to-factor and factor-to-variable messages in F
3:    compute p_t(x_i | Y(s)) for i = 1, ..., n
4:    if max_i |p_t(x_i | Y(s)) − p_{t−1}(x_i | Y(s))| < ξ then break
5: end for
6: x̂_i ← argmax over x_i of p_t(x_i | Y(s)), i = 1, ..., n
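For a toy graph, the marginals that this message passing approximates can be computed exactly by enumerating all ±1 configurations consistent with the TPG constraint, which makes a convenient correctness check for an implementation. The sketch below weights each valid configuration with Gaussian channel factors; the graph, variances and raw predictions are made-up values.

```python
import math
from itertools import product

def brute_force_marginals(edges, etas, y):
    """Exact p(x_i = +1 | Y(s)) by enumeration over TPG-consistent
    configurations. `edges` lists (parent, child) pairs, `etas` the
    channel noise variances and `y` the raw real-valued predictions."""
    terms = sorted(etas)
    pos_mass = {t: 0.0 for t in terms}
    Z = 0.0
    for cfg in product([-1, 1], repeat=len(terms)):
        x = dict(zip(terms, cfg))
        # logical factors: a positive child forces a positive parent
        if any(x[c] == 1 and x[p] == -1 for p, c in edges):
            continue
        # probabilistic factors: Gaussian channel y_i = x_i + z_i
        w = math.prod(math.exp(-(y[t] - x[t]) ** 2 / (2.0 * etas[t]))
                      for t in terms)
        Z += w
        for t in terms:
            if x[t] == 1:
                pos_mass[t] += w
    return {t: pos_mass[t] / Z for t in terms}

edges = [("GO:1", "GO:2"), ("GO:1", "GO:3"),
         ("GO:2", "GO:4"), ("GO:3", "GO:4")]
etas = {t: 0.5 for t in ("GO:1", "GO:2", "GO:3", "GO:4")}
raw = {"GO:1": 0.9, "GO:2": 0.7, "GO:3": -0.4, "GO:4": 0.6}
marg = brute_force_marginals(edges, etas, raw)
```

Although the raw prediction for GO:4 is positive, its leveraged marginal is pulled down by the negative evidence on its parent GO:3, and the resulting marginals respect the hierarchy: a parent is never less probable than any of its children.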

Results and Discussion

Experimental Protocol
Three model organisms, S. cerevisiae [35], A. thaliana [36], and D. melanogaster [37], were considered. For each of them, robust and loose annotation datasets in the GO Molecular Function domain were generated using different subsets of GO annotation evidence codes (see Table 2). Robust annotation datasets were built from protein sequences with defined GO experimental evidence codes (http://geneontology.org/page/guide-go-evidence-codes), i.e., inferred from mutant phenotype (IMP), inferred from genetic interaction (IGI), inferred from physical interaction (IPI), inferred from expression pattern (IEP) and inferred from direct assay (IDA). On the other hand, loose annotation datasets were built from protein sequences with the former experimental evidence codes, traceable author statement (TAS) evidence codes and inferred from electronic annotation (IEA) evidence codes. Recalling that a minimum amount of annotation data is required for the prediction of individual GO-terms, GO subgraphs with a minimum of 50/10 positively annotated protein sequences per individual GO-term were respectively considered for robust/loose annotation datasets. To assemble conveniently balanced binary training datasets [38], positively annotated protein sequences for individual GO-terms were complemented with negatively annotated instances using the inclusive separation policy described in [39].
Concerning characterization methods of individual protein sequences in terms of a fixed number of input features, the presence/absence of conserved domains in the Pfam database [40] and the measurement of 457 physicochemical/secondary structure properties, 453 of the physicochemical type [41] and four of the secondary structure type [42,43], were considered. Pfam data was obtained for each protein as a binary vector whose elements indicate the presence/absence of individual domains. Physicochemical and secondary structure data was obtained for each protein as a real-valued vector whose elements hold the values of individual physicochemical/secondary structure properties. The choice of these baseline characterization methods was guided by the desire to develop in-silico annotation methods of broad applicability across organisms, including non-model instances for which more advanced characterizations, e.g., gene expression or protein-protein interaction data [44], are unlikely to be available. Notably, for many genes coding for proteins of unknown function, and thus with no significant Pfam hit, the naive physicochemical and secondary structure characterization, hereafter "Physicochemical+", remains valid. Practically, protein sequence characterization methods were implemented with the help of the EMBL-EBI Pfam [45] database services and the Bio.SeqUtils [46] package.

Concerning baseline binary classifiers, differently from [18,19], where the costly bootstrap aggregation of hard-margin linear Support Vector Machine (SVM) classifiers with unthresholded outputs was used to fulfill the Gaussian assumption of prediction noise, single soft-margin linear SVMs with default constant complexity C = 1 were used for both the FGGA and TPR-DAG methods. To fulfill the Gaussian assumption in the FGGA approach, real-valued predictions y_i were set to the margins of SVM classifier outputs.
On the other hand, the probabilistic linear SVM outputs required by the TPR-DAG method were computed using the standard Platt sigmoid fitting approach. Practically, SVMs were implemented with the e1071 R package [47].
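Platt scaling maps a raw SVM margin to a probability through a fitted sigmoid. A minimal sketch follows, with made-up coefficients A and B standing in for the values normally fitted by maximum likelihood on held-out decision values:

```python
import math

def platt_sigmoid(margin, A=-1.5, B=0.0):
    """Platt scaling: p = 1 / (1 + exp(A * margin + B)).
    A < 0 makes larger (more positive) margins map to probabilities
    closer to 1. A and B here are illustrative placeholders."""
    return 1.0 / (1.0 + math.exp(A * margin + B))
```

With these placeholder coefficients, a zero margin maps to 0.5 and the mapping is monotonically increasing in the margin, which is the behavior the TPR-DAG method relies on.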
Concerning the TPR-DAG method, the algorithm described in [25] was implemented in C++. Briefly, for each GO-term in a given GO subgraph, its maximum distance to the root node is first computed. Starting from the set of most distant GO-terms, flat SVM predictions of individual GO-terms are updated using the TPR heuristic. A consensus prediction for each GO-term is obtained by averaging its flat SVM prediction and those of its positive child GO-terms. Without loss of generality, the threshold for positive predictions is set to 0.5. This bottom-up update process over flat SVM predictions is repeated until the root node is reached. To accomplish a consistent set of hierarchical GO-term predictions, a final top-down update process on consensus GO-term predictions is performed. As a result, a child GO-term with a consensus prediction stronger than any of its parents is forced to update its value with that of its weakest parent. This process is repeated until the most distant GO-terms are reached.
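Our reading of this two-pass heuristic can be sketched as follows. This is an illustration of the description above, not the authors' C++ code, and the toy chain and scores are made up:

```python
def tpr_dag(preds, parents, children, threshold=0.5):
    """Two-pass TPR-DAG leveraging sketch. Bottom-up: each term's flat
    prediction is averaged with the consensus values of its positive
    children. Top-down: a child exceeding its weakest parent is capped
    at that parent's consensus value."""
    depth = {}
    def max_depth(t):  # maximum distance to the root
        if t not in depth:
            ps = parents.get(t, [])
            depth[t] = 0 if not ps else 1 + max(max_depth(p) for p in ps)
        return depth[t]
    deepest_first = sorted(preds, key=max_depth, reverse=True)
    cons = dict(preds)
    for t in deepest_first:                       # bottom-up pass
        pos = [cons[c] for c in children.get(t, []) if cons[c] > threshold]
        cons[t] = (preds[t] + sum(pos)) / (1 + len(pos))
    for t in reversed(deepest_first):             # top-down pass
        ps = parents.get(t, [])
        if ps:
            cons[t] = min(cons[t], min(cons[p] for p in ps))
    return cons

# Made-up chain root -> A -> B with an over-confident leaf prediction.
out = tpr_dag({"root": 0.9, "A": 0.4, "B": 0.8},
              parents={"A": ["root"], "B": ["A"]},
              children={"root": ["A"], "A": ["B"]})
```

On this chain, B's flat score of 0.8 first lifts A's consensus to 0.6 and the root's to 0.75, and is then capped at 0.6 by the top-down pass, yielding a hierarchy-consistent set of scores.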
Concerning the evaluation of the predictive performance of the FGGA approach, a 5-fold cross-validation test was performed using the TPR-DAG algorithm as a reference for comparison. To shed light on the absolute improvements of FGGA predictions, baseline SVM classifiers were also evaluated. Aiming to rank the prediction accuracy of FGGA, TPR-DAG-leveraged binary classifiers and baseline SVM classifiers, per-GO-term average AUC scores [48] were computed. Taking into account that GO annotation gets harder as deeper levels of the hierarchy are considered [49], prediction performance was also measured by means of the hierarchical precision (HP), the hierarchical recall (HR), and the hierarchical balanced F-score (HF) reflecting their trade-off. Comparisons between the FGGA and TPR-DAG methods were performed with the Wilcoxon rank sum test at the p-value = 0.01 significance level. Finally, to evaluate the ability of the FGGA approach to extend biological knowledge, a molecular function annotation problem in tomato (Solanum lycopersicum cv. Heinz 1706) [50] was considered.
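The hierarchical metrics over ancestor-closed predicted and true GO-term sets can be computed as below. This follows the common set-based definitions; the paper's exact formulas are given in its cited references, and the example sets are made up.

```python
def hierarchical_prf(predicted, true):
    """Hierarchical precision, recall and balanced F-score for one
    sample, where both arguments are ancestor-closed sets of GO-terms:
    HP = |P & T| / |P|, HR = |P & T| / |T|, HF = harmonic mean."""
    inter = len(predicted & true)
    hp = inter / len(predicted) if predicted else 0.0
    hr = inter / len(true) if true else 0.0
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf

# Two of three predicted terms are correct; one true term is missed.
hp, hr, hf = hierarchical_prf({"GO:1", "GO:2", "GO:4"},
                              {"GO:1", "GO:2", "GO:3"})
```

Because the sets are ancestor-closed, a prediction deep in the hierarchy that misses the exact leaf still earns partial credit for every correctly predicted ancestor, which is the behavior motivating these metrics.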

Prediction performance on held-out data
Whatever the organism, characterization method and training data policy, FGGA improved both baseline SVM and TPR-DAG classifiers. This was particularly evident in the annotation of D. melanogaster protein sequences, for which the deepest, broadest and richest (in terms of jumping edges) GO-DAGs were observed. As shown in Fig 3, with Pfam characterization and loose annotation data, both TPR-DAG and FGGA classifiers improve the average AUC of their baseline SVM classifiers. However, FGGA improvements are remarkably higher. Similar results were observed under other experimental conditions (see, e.g., S1 Fig).
Specific average AUC improvements of FGGA over TPR-DAG classifiers for D. melanogaster protein sequences and Pfam characterization are shown in Fig 4. FGGA improvements reached roughly 87% of the GO-terms, independently of the use of robust (198 out of 226 GO-terms) or loose (581 out of 656 GO-terms) annotation data at the training stage. In the latter case, a closer look revealed 21 GO-terms belonging to the deeper levels of the GO hierarchy, their minimum level being 6, with average AUC improvements above the 10% margin. On the other hand, only 8 GO-terms above the 10% margin were identified for TPR-DAG classifiers. Conversely, these GO-terms belonged to rather superficial levels of the GO hierarchy, their maximum level being 5. Similar results were obtained for A. thaliana (see S1 Fig).

Average AUC results per individual GO-term correlate well with hierarchical metrics of annotation performance. As shown in Table 3, whatever the experimental arrangement, FGGA outperforms TPR-DAG based on hierarchical F-score results (p-value = 0.01). This is consistent with hierarchical precision and recall results showing FGGA improvements over TPR-DAG in at least 8 of 12 experimental instances and ties in the remaining cases. Furthermore, although the use of the Physicochemical+ characterization strongly reduces hierarchical precision levels in both methods, a more precise annotation performance is still accomplished by FGGA, which outperforms TPR-DAG in 5 of the 6 experimental instances (p-value = 0.01). For the sake of completeness, hierarchical F-scores of baseline SVM classifiers were also evaluated. As expected, the value of consistent FGGA predictions against independent ones in baseline SVM classifiers can be clearly appreciated (see S1 Table).
Complementary evaluations were performed to shed some light on the relation between the variance of base binary classifiers and FGGA hierarchical F-scores under increasing annotation noise, as is the case when using the naive Physicochemical+ characterization or the loose training policy. As expected, whatever the organism, the variance of base binary classifiers increased with the Physicochemical+ characterization, and these increases were more evident with the loose training policy. For D. melanogaster and robust training, variances of base binary classifiers grew from an average of 0.27 to 0.33 when changing from the Pfam to the Physicochemical+ alternative. This was consistent with a reduction of the hierarchical F-score from 0.75 to 0.55. Likewise, with loose training, variances grew from an average of 0.30 to 0.41, with a reduction of the F-score from 0.65 to 0.50. Similar results were observed for the other two model organisms. Overall, these results suggest that, within the FGGA framework, increasing the confidence of base binary classifiers by reducing their variances may be effectively rewarded with improved annotation performance.

Annotation of proteins of unknown function in plants
The physical distribution of genes in plants appears to be non-random, and physical clusters of genes with related functional classes can be expected [51]. We consider a GO Molecular Function annotation problem in Solanum lycopersicum cv. Heinz 1706. Four small heat shock protein (shsp) genes [52], Solyc06g076520, Solyc06g076540, Solyc06g076560 and Solyc06g076570, of well-known chaperone function in fruit ripening and heat shock stress [53,54], map together to a ~15 Kbp region of chromosome 6, suggesting the existence of a region of functionally related genes. In a wider region of ~30 Kbp, these genes map together with a phosphoserine phosphatase SerB gene (Solyc06g076510) and four genes of unknown function, Solyc06g076500, Solyc06g076530, Solyc06g076580 and Solyc06g076590. We hypothesize that FGGA classifiers trained on loose A. thaliana annotation datasets characterized by naive physicochemical/secondary structure properties may shed some light on the four genes of unknown function in Solanum lycopersicum. To support this hypothesis, Solyc06g076540 and Solyc06g076510 were first used as positive controls. Since Solyc06g076540 lacks a GO molecular function annotation, the "protein self-association" annotation of its HSP17.8 ortholog in A. thaliana was used. On the other hand, the "phosphatase activity" and "magnesium ion binding" GO annotations were used for Solyc06g076510. Recalling that for hierarchical classifiers a prediction is considered correct provided it is included in the predicted graph, the two controls were satisfied (see S5 and S6 Figs). Based on these positive annotation results, FGGA predictions on the four genes of unknown function were performed.
Aiming to recover the most specific and confident FGGA predictions for guiding experimental studies, a cut threshold of 0.95 for leaf nodes was set for the analysis of predicted graphs, i.e., leaf GO-terms whose estimated probabilities were below the threshold were disregarded. For Solyc06g076500, a subgraph containing 122 out of the 659 original GO-terms, 54 of them being leaf GO-terms, was recovered (see S7 Fig). Among the top five scoring leaf GO-terms, GO:0016893 (endonuclease activity, active with either ribo- or deoxyribonucleic acids), whose ancestor is GO:0004518 (nuclease activity), appears as a candidate annotation term [55]. For Solyc06g076530, a subgraph containing 185 out of the 659 original GO-terms, 75 of them being leaf GO-terms, was recovered. Among the top five scoring leaf GO-terms, GO:0004722 (protein serine/threonine phosphatase activity), whose ancestor is GO:0016791 (phosphatase activity), appears as a candidate annotation term [56]. For Solyc06g076580, a subgraph containing 52 out of the 659 original GO-terms, 21 of them being leaf GO-terms, was recovered. Among the top five scoring leaf GO-terms, GO:0016209 (antioxidant activity) appears as a candidate annotation term [57]. Finally, for Solyc06g076590, a subgraph containing 45 out of the 659 original GO-terms, 31 of them being leaf GO-terms, was recovered. Among the top five scoring leaf GO-terms, GO:0046983 (protein dimerization activity), whose ancestor is GO:0005515 (protein binding), appears as a candidate annotation term [58]. Altogether, these results suggest that all genes in the target region could be involved in a chaperone network induced during fruit ripening or heat shock stress [59,60].
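The leaf-filtering step can be sketched as follows, with hypothetical term names and probabilities standing in for a real predicted subgraph:

```python
def prune_leaves(probs, children, threshold=0.95):
    """Discard leaf GO-terms of a predicted subgraph whose estimated
    probability falls below the cut threshold; internal GO-terms are
    kept regardless, since only leaves are thresholded."""
    return {t: p for t, p in probs.items()
            if children.get(t) or p >= threshold}

# Made-up predicted subgraph: GO:a is internal, GO:b and GO:c are leaves.
probs = {"GO:a": 0.99, "GO:b": 0.97, "GO:c": 0.40}
kids = {"GO:a": ["GO:b", "GO:c"]}
kept = prune_leaves(probs, kids)
```

Here the low-confidence leaf GO:c is disregarded while the internal term GO:a and the confident leaf GO:b survive, mirroring how the candidate annotation subgraphs above were obtained.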

Conclusions
A factor graph based method for automated GO annotation has been presented. The method, called FGGA, embodies elements of predicate logic, communication theory, supervised learning and inference in graphical models. Elements of predicate logic allow a formal, yet intuitive, treatment of relationships between GO-terms. Although only the main is_a relationship has been considered in practice, other types of transitive relationships, such as part_of or has_part, are also possible. Likewise, elements of communication theory allow a formal, yet practical, treatment of the uncertainty in practical GO-term predictions. Since these predictions are issued by practical binary classifiers, key factors affecting the generalization performance of the resulting ensemble can then be taken into account. In particular, under the assumption of a Gaussian communication channel model corrupting ideal GO-term predictions, the variances of base binary classifiers can be used to model their individual confidences. Similarly, under the assumption of linear soft-margin SVM binary classifiers, observed margins can be used to model the confidence of GO-term predictions. Both types of confidences, known to affect the generalization of overall ensemble classifiers [61,62], are fully exploited within the FGGA framework. This is accomplished at the FGGA inference stage with an adapted version of the widely used sum-product algorithm for factor graphs. This iterative message passing algorithm is used to approximate MAP estimates of latent GO-term annotations. Evaluations on S. cerevisiae, A. thaliana, and D. melanogaster protein sequences suggest that improved correctness (precision) and completeness (recall) of annotation results with respect to the TPR-DAG heuristic are the payoff for FGGA modeling efforts. In this regard, an insight into the power of the FGGA approach for studying proteins of unknown function has also been presented.
Although throughout this paper only the automated annotation of protein sequences has been considered in practice, the annotation of other types of notable functional gene products is also possible, e.g., long non-coding RNAs [63]. Since these RNA sequences are weakly conserved across species [64], except in mammals [65], uncovering their functional annotation entails a challenging bioinformatics problem [66]. FGGA may bring an opportunity to improve the annotation of long non-coding RNA sequences by boosting the confidence of base binary classifiers through the integration of multiple sources of annotation data, e.g., the Rfam database [67]. Interestingly, the complexity of such an integration process could remain hidden at the base binary classifiers.