Statistical Tests for Associations between Two Directed Acyclic Graphs

Biological data, and particularly annotation data, are increasingly being represented in directed acyclic graphs (DAGs). However, while relevant biological information is implicit in the links between multiple domains, annotations from these different domains are usually represented in distinct, unconnected DAGs, making links between the domains represented difficult to determine. We develop a novel family of general statistical tests for the discovery of strong associations between two directed acyclic graphs. Our method takes the topology of the input graphs and the specificity and relevance of associations between nodes into consideration. We apply our method to the extraction of associations between biomedical ontologies in an extensive use-case. Through a manual and an automatic evaluation, we show that our tests discover biologically relevant relations. The suite of statistical tests we develop for this purpose is implemented and freely available for download.


Introduction
An increasing number of discoveries, particularly in biomedicine, are facilitated by statistical analyses of data annotated to biomedical ontologies [1]. Biomedical ontologies are generally represented as DAGs, and specific domains are usually represented in distinct, separate DAGs [2][3][4].
Statistical tests that utilize a single graph can only consider the given domain. However, entities from different domain are linked via biomedical relations [5]. These relations can be vital for the discovery of novel biomedical knowledge. We have designed a family of novel statistical tests to identify strong associations between nodes from two directed acyclic graphs. The tests combine measures of relevance and specificity.
We evaluated our statistical method through an extensive usecase in which we applied our tests to the detection of strong semantic associations between the Gene Ontology [3] and the Celltype Ontology [6] based on co-occurrence in scientific literature. In this use-case, we annotated the ontologies with occurrence and cooccurrence count data of the ontologies category labels in full text scientific articles. The strongest associations identified through our tests are biologically relevant relations.
An implementation of the six novel statistical tests to identify associations between directed acyclic graphs is available as free software from our project webpage at http://bioonto.de/pmwiki. php/Main/ExtractingBiologicalRelations.

State of the art
Our approach to the computation of the strength of the association between two graphs relies on approaches for capturing the semantic similarity between categories in ontologies and for propagating these similarities within DAGs. In the following, we give a brief overview of methods for computing the similarity of categories (a more complete overview can be found in [7]). Most of the existing semantic similarity approaches assume that ontologies contain categories C i that are annotated with terms t i1 :::t in~w (C i ). Based on this assumption, the computation of the semantic similarity of two categories C 1 and C 2 can be carried out by using the structure of the ontology to which C 1 and C 2 belong (edge-based approaches), the nodes and their properties (e.g., similarity between w(C 1 ) and w(C 2 )) (node-based approaches) or by combining structural knowledge and annotations (hybrid approaches).
The most common edge-based approach consist of using a function of the number of edges between C 1 and C 2 as semantic similarity measure [8,9]. Other approaches combine the previous approach with the lenght of the path from the most specific common ancestor of C 1 and C 2 and the root node [10,11]. Edgebased approaches rely on the nodes being elements of the same graph. Thus, they cannot be utilized when trying to compute the similarity of two nodes from distinct DAGs.
The second category of approaches, the node-based approaches, use the properties of the nodes themselves to compute their similarity. One of the central concept for using annotations to compute similarity is that of information content, which is the negative loglikehood {log(p(C i )) of a term C i where p(C i ) is the probability of occurrence of the terms in w(C i ) in a certain corpus. Based on this value, several similarity metrics have been developed including the information content of the most informative common ancestor used in [12,13] or of the disjoint common ancestors [14].
In recent years, hybrid similarity measures that combine nodeand edge-based approaches have been developed. Most of these approaches utilize the information content. For example [15] utilize a combination of edge weights based on node depth and node link density and of the difference of information content of the nodes linked by that edge. Other approaches such as that described in [16] compute edge weights by using a scheme that takes the type of the edge into consideration. The semantic similarity between two terms is set to a function of the maximum of the product of best path between the terms. Again, these approaches can only compute the similarity of terms from the same DAG.
The aim of our approach is to provide a means for the computation of the association between nodes from 2 DAGs, which are, in general, distinct. We do not make similar assumptions about the annotation of edges and nodes as other approaches to semantic similarity. Instead, we go beyong current semantic similarity measures by providing a measure of statistical significance in a distribution of arbitrary node and edge annotations. When applying out method to semantic similarity between ontologies, we can compute initial semantic similarity values for categories which do not belong to the same ontologies.

Statistics on graphs
Preliminaries of directed acyclic graphs. Our tests take as input two directed acyclic graphs, G 1~( V 1 ,E 1 ) and G 2~( V 2 ,E 2 ) that are disjoint (V 1 \V 2~1 ). From these two graphs, a graph G~(V 1 |V 2 ,E 1 |E 2 |C) with C~V 1 |V 2 |V 2 |V 1 is constructed. We denote an edge as an ordered pair of vertices. If an edge connects v 1 and v 2 , e~(v 1 ,v 2 ), we call v 2 the child of v 1 and v 1 the parent of v 2 . If there is a path from v 1 to v 2 , we call v 1 a predecessor of v 2 and v 2 a successor of v 1 .
In addition to the two graphs, two functions d' 1 and d' 2 are given as input such that d' 1 :V 1 |V 2 .R and d' 2 :C.R. From these two functions, a graph decoration for G is constructed based on the assumption that the two input functions are transitive over the DAG: the decoration d 1 (v) of a vertex v [ V 1 |V 2 is the union of d' 1 (v) and the values of d' 1 (u) for all successors u of v. Similarly, the decoration d 2 (e) of an edge e~(v 1 ,v 2 ) for e [ C is the union of d' 2 (e) and the values of d' 2 (f ) for all edges f between the successors of v 1 and v 2 .
The third component of the input is a score function score : V 1 |V 2 |V 2 |V 1 .R. We assume that the value of the score function between the vertices v 1 and v 2 depends only on the graph decorations d 1 (v 1 ) of v 1 and d 1 (v 2 ) of v 2 as well as the decoration d 2 (e) of the edge e~(v 1 ,v 2 ).
The score function is not symmetric, i.e., it is not necessary that score(x,y)~score(y,x). It is intended to measure the association strength between two vertices from the input graphs. Our method identifies whether the score between two vertices is significantly high. A graphical overview of our test method is shown in Figure 1.
Determining the Random Distribution. The score between two vertices v 1 and v 2 is influenced by the topology of the input DAGs: a vertex v that is more general has a larger decoration set d 1 (v) due to our basic assumption about transitivity of input graph decorations. Similarily, the cardinality of the decoration set of the edges between nodes from the two input DAGs is larger when the edges connect more general vertices. Therefore, it is insufficient to test for a high score between vertices to consider the score between two vertices as significantly high. A random distribution of the scores of each pair of vertices v 1 and v 2 provides a means for determining the significance of the score between v 1 and v 2 . This random distribution depends on the functions d' 1 and d' 2 , the score function and the topology of the input graphs. Hence, we cannot assume any statistical distribution of scores ab initio. Instead, we simulate the random distribution of the scores between each vertex pair through multiple random permutations: the d' 1 -values that are given as input for our method are randomly swapped with the d' 1 -values of vertices in the input DAG from which they originate. There are two options for permutating the d' 2 -values for edges: either they are, mutatis mutandis, permutated similarily to the d' 1 -values of the vertices, or they are permutated depending on the permutation of d' 1 -values; in the latter case, when the d' 1 -values of v 1 and v 2 are swapped, so are the values of d' 2 (v 1 ,x) and d' 2 (v 2 ,x) for any vertex x.
Because our test is intended to identify associations between vertices, we do not assume that the values of d' 1 and d' 2 are independent. We therefore prefer to use the second option, i.e., that the permutation of the d' 2 values depends on the permutation of the d' 1 -values. Based on these permutations, we first rebuild the graph decorations d 1 and d 2 . Then, we calculate and record the values of the score function score(v 1 ,v 2 ) for all pairs of vertices v 1 and v 2 . In addition, for each vertex u, such that v 1 is a direct successor of u, we calculate and record the score difference score(v 1 ,v 2 ){score(u,v 2 ). Further, for each vertex w with the direct predecessor v 1 , we calculate and record the difference Hence, the results of this step are threefold. First, we approximate the random score distribution for each pair of vertices through multiple random permutations. Second, each triple of vertices u, v and w [ children(u) gives rise to a random distribution of score differences between (u,v) and (w,v). Third, each triple u, v and w [ parents(u) yields a random distribution of score differences between (w,v) and (u,v).

Ontologies as graphs
While the tests we develop can be applied to any DAG that satisfies the conditions specified above, their primary application is to test the significance of an association between categories from two ontologies. An ontology is the specification of a conceptualization of a domain [17,18]. Many biological ontologies are represented as directed acyclic graphs (DAGs) and are available in the OBO flatfile format [2]. In these DAGs, nodes represent categories and edges represent relations between these categories. A category, also called kind, class or universal, is an entity that is general in reality. Examples are dog, apoptosis or red. Categories may have instances, of which some may not be further instantiated. These are called individuals. We call the set of all categories in an ontology O Cat(O).
Categories may be related to other categories. The most important relation between two categories A and B is the isA relation, isA(A,B). The relation isA(A,B) can be defined by using the instantiation relation: when isA(A,B), then all instances a of A are instances of B [18]. This definition implies that the isA relation is reflexive, transitive and antisymmetric.
A set of categories with the isA relation among them form a taxonomy. These taxonomies are often the backbone of the OBO ontologies' DAG structure. We call the set of all successors of a category A the sub-categories subcat(A)~fBDisA(B,A)g and its predecessors the super-categories supcat(A)~fBDisA(A,B)g. The direct successors of A in the taxonomy are called children (children , while the direct predecessors are called parents. In the OBO flatfile format, ontologies are assigned a namespace. Category identifiers are prefixed with the namespace of the ontology to which they belong. Identifiers are therefore unique within the OBO ontologies. In addition to a unique identifier, categories are assigned a name and a set of synonyms. Neither the name nor the set of synonyms must be unique.

Statistics on graphs
To identify strong associations, we designed a family of tests for the score of each edge between the two input DAGs that considers a fragment of the path in the DAG. The tests are designed to measure the significance of the score between vertices v 1 and v 2 based on three criteria: (1) the score score(v 1 ,v 2 ) for the association should be higher than expected; (2) for each child u of v 1 , score(v 1 ,v 2 ){score(u,v 2 ) should be higher than expected; and (3) for each parent w of v 1 , score(w,v 2 ){score(v 1 ,v 2 ) should be lower than expected.
The first criterion of our tests identifies hypothetical associations between nodes from two graphs. The second and third criteria are used to verify whether the pair is the best selection, or whether a more specific or more general association is preferable. For this purpose, the second and third criteria test for novelty of the association (compared to the child and parent nodes).
Within this section, let u and v be fixed vertices from the DAGs G 1 and G 2 , respectively. Furthermore, let N be the number of permutations that were used to determine the random distributions. The first test we designed, H 1 , depends on the vertices u and v, the DAG structure and the number of permutations N. It tests for the following properties: N the score between u and v is high, N the difference between score(u,v) and score(u',v) for every child u' of v is high, N the difference between score(u,v) and score(u'',v) for every parent u'' of v is low.
''Being high'' and ''being low'' are captured using the values of the cumulative distribution functions (CDFs) obtained by the N permutations performed in the previous step: one function for each pair of categories u and v, one function for each triple of categories u, v and u' where u' is a child of u, and one for each triple u, v and u'' where u'' is a parent of u. We combine the pvalues of the score differences to children in a single value using their geometric mean. A similar combination of the score differences' p-values to the parent categories of u is carried out: here, the combined value is the geometric mean of 1{x, where x is the p-value in the corresponding CDF.
Formally, let u and v be fixed vertices from the directed acyclic graphs G 1~( V 1 ,E 1 ) and G 2~( V 2 ,E 2 ), respectively, and let N N be the number of permutations, N score n (u,v) be the score between u and v in the n th N DQ uj (x,u,v)~P(score n (u,v){score n (u j ,v)ƒx), 1ƒnƒN, be the CDF of the difference between the vertex u and its j th child vertex, N DQ(x,u,v)~fDQ uj (x,u,v)Du j [ child(u)g, N MQ u k (x,u,v)~P(score n (u k ,v){score n (u,v)ƒx), 1ƒnƒN, be the CDF of the score difference between the vertex u and its k th parent vertex, , be the CDF of the variances Var of the distribution NQ(x,x 1 ,x 2 ), and VQ DQ and VQ MQ for the distributions DQ(x,x 1 ,x 2 ) and MQ(x,x 1 ,x 2 ), respectively.
For each child u j of u, we calculate the difference in scores d d (u j )~score(u,v){score(u j ,v). Then, we compute the geometric mean j of all values DQ(d d (u j ),u,v). Similarly, we calculate d m (u k )~score(u k ,v){score(u,v) for each parent u k of u, and the geometric mean y of all values MQ(1{d m (u k ),u,v). Then we define as our first test All other tests are extensions of the first test. The second test, H 2 , uses the minimum function instead of the geometric mean to combine the p-values in the CDFs of the score differences to parents and children.
The first two tests H 1 and H 2 do not consider the variances of the distributions of scores, differences in scores to children and differences in scores to parents. Therefore, we extend these tests by weighting all three components of the tests with the variances of their corresponding distributions. In these tests, high variance lowers the impact of the result, while lower variance strengthens it.
We define three new distributions for the variances and choose the p-value in the respective CDF as a weight in our tests. We compute the scores for each pair of category N times, resulting in one distribution of scores for each pair of categories. Each of these distributions has a variance. The score variance distribution is the finite distribution (containing N elements) of the variances of each of these distributions. We define the variance distribution for score difference to parent and child analogously.
The tests H 3 and H 4 use only the variance distribution of scores, while H 5 and H 6 use all three variance distributions. These tests are one-sided, i.e., they are not symmetric. We define two-sided, symmetric tests t i (u,v) for all vertices u and v as Table 1 lists the combination of properties for all tests. The precise formulation of all six tests can be found in the supplement S1.

Application to biomedical ontologies
Occurrence and co-occurrence count data as graph decoration. To verify whether the tests we designed yield reasonable results, we applied our method to the detection of significant co-occurrences between ontological categories in natural language texts, as a precursor to the detection of relations between ontological categories. For this purpose, we make the following assumptions: 1. A term occurs in a portion of text if it is an exact substring of this portion of text. 2. Terms can designate ontological categories; the terms that designate the same category are henceforth called the category's synset. Every occurrence of an element of the category C's synset is called an occurrence of C. Every cooccurrence of an element of the category C's synset with an element of the category D's synset is called a co-occurrence of C and D.

If
A is a sub-category of B, then every co-occurrence of A with C is a co-occurrence of B with C. Additionally, every occurrence of A counts as an occurrence of B.
To test our method, we used the Gene Ontology (GO) [3] and the Celltype Ontology (CL) [6] as input DAGs. The GO is an ontology specifically designed to describe gene products. It contains three separate ontologies: the biological process, molecular function and cellular component ontologies. Gene products can be tagged with ontology categories to describe and classify them. The CL is an ontology for types of cells. It classifies cells based on criteria such as structure or function.
Based on the input requirements of our test, we constructed synsets from the synonyms attached to each category in the input ontologies, and counted the occurrences and co-occurrences of the categories based on two contexts: single sentences and sentences in documents. The second context refers to whole documents, but cooccurrence is based on single sentences. Therefore, when two terms co-occur in two or more sentences within one document, their co-occurrence is only counted once. The functions that assign the occurrence and co-occurrence count values to a synset of a category for each context are called d and f , respectively.
We used exact string matching to identify terms in text. Our evaluation was conducted using a 2.2 GB text corpus containing 60143 fulltext articles from Open Access journals listed in Pubmed Central. The aim of our method is to test for significant cooccurrences between categories.
Text Processing. First, we counted the number of occurrences and co-occurrences of the terms contained in synsets of categories from the input ontologies. Table 2 shows examples for the synsets of categories. We counted the total number of sentences and documents in which at least one element of a synset was found by using exact matching. For each pair of categories, we counted the total number of co-occurrences of elements of their respective synsets in sentences. Furthermore, we counted the number of documents in which they co-occured within at least one sentence. We used exact matching and abstained from using any more sophisticated methods for recognizing the ontologies' categories in text [19,20] to evaluate our method. Exact matching provides a large dataset for the evaluation of our method. For practical applications such as relationship extraction, more advanced methods should be chosen.
The text processing yielded, for each category C, both its frequency f (C) and the total number of documents in which C occurred, d(C). Furthermore, for each pair of categories C 1 and C 2 , we obtained both the total number of co-occurrences in sentences f (C 1 ,C 2 ) and the total number of documents containing these co-occurrences d(C 1 ,C 2 ).
Count data over ontologies. The first component in our method implements the assumption that the input graph decorations are transitive over the DAG structure. In the case of ontologies, this implements the assumption that occurrence and co-occurrence between categories is transitive over the isA relation between categories.
We assumed that when two categories C and C' stand in the isA relation, isA(C,C'), then every occurrence of C is also an occurrence of C'. This means that the synset-closure synclos(C) of a category C can be constructed as follows: isA(C,C')?(syn(C)(synclos(C')) ð4Þ For count data, the decoration value of a vertex v in the DAG is equal to the sum of the input value pair d(v) and f (v) and the corresponding input values for v's successors. Therefore, for all categories C, we define f t (C) and d t (C) to represent the sum of the values f (C') and d(C') over all of C's sub-categories C'. Furthermore, for all categories C 1 and C 2 , we compute the cumulated f -and d-values dubbed f t (C 1 ,C 2 ) and d t (C 1 ,C 2 ): Again, for count data, co-occurrence values between nodes v 1 and v 2 can be summed up over the successors of v 1 and v 2 to yield the decoration of the edge between v 1 and v 2 .
A score for occurrences and co-occurrences. For all categories C 1 and C 2 , we defined the following score function: score(C 1 ,C 2 )~l og f t (C 1 ,C 2 ) log(1zf t (C 1 ))zlog(1zf t (C 2 )) : log(d t (C 1 ,C 2 )) log(1zmax(d t (C 1 ),d t (C 2 ))) ð7Þ The first component of the score function implements the natural logarithm of the Pointwise Mutual Information (PMI) [21] score achieved by the categories with respect to their co-occurrence within sentences. PMI has been successfully used in several text mining tools (see, e.g., [22]). To avoid divisions by 0, the denominators of all members of the score function were incremented. The second component measures a similar value using documents as context. The aim of the score function is to ensure that categories that co-occur relatively often are assigned a high score. The range of the score function is between 0 and 1.

Evaluation
We applied the tests to the biological process (BP) branch of the GO and the CL. To recognize the categories in text, we used the identifier of the category, the name and all exact synonyms of the category. On average, every category had 2.1 synonyms. Using exact matching, we identified 3,751 out of BP's 14,542 (26%) categories in our text corpus. We found 491 of 754 (65%) categories from the CL. Categories from the BP co-occurred 70,967 times with CL categories.
Using our method, we identified a total number of 202,627 cooccurrences between categories. After applying our tests, 157,894 co-occurrences produced test values distinct from 0. The remainder obtained a test value of 0 due to numerical restrictions.
They were subsequently excluded, because they were indistinguishable from the absence of co-occurrence. We illustrate the quantiles obtained for different p-values in our six tests, t i , in Table 3. The distribution of scores for t 1 and t 6 are shown in Figure 2. The remaining plots are included in the supplement S1.
We found that the tests using the minimum instead of the geometric mean of p-values of score differences to parent and child categories are generally more restrictive, i.e., they include fewer co-occurrences for a given cutoff. Similarly, tests including the variance for scores are generally more restrictive than tests that are not weighted by the variance of score distributions. In this sense, the tests t 5 and t 6 are the most restrictive. Table 4 shows example associations, and Table 5 shows the kind of relationship between categories that our tests identified for the 100 top-scoring results with respect to the test t 1 . The hasparticipant relation is defined in the OBO Relationship Ontology (RO) [5] as a relation that holds between two categories, where every instance of one category participate in some instance of the other. We define the Participates-in relation as a relation between two categories: C 1 Participates-in C 2 uVx,t 1 (instanceOf (x,C 1 ,t 1 )? At 2 ,y (instanceOf (y,C 2 ,t 2 )^participatesin (x,y,t 2 ))), where participates-in is the primitive participation relation between individuals as defined in the RO. We extend the definition of located-in in the RO to a relation Located-in between processes and objects, which holds when all participants of a process are located-in a structure during the entire duration of the process.
In our sample, 38 associations do not fall under one of the three relations that we investigated. We discovered several kinds of unclassified relations. First, mismatches in granularity lead to strong associations for unrelated categories. For example, xanthine transport and erythrocyte are closely related according to t 1 . Erythrocytes are involved in the transport of xanthine. However, the GO category xanthine transport refers to the inter-and intracellular level of granularity, while erythrocytes transport nutrients between organs. Second, some categories are indirectly related via another category. For example, osteoclasts and lymph node development are related via the protein RANK. Third, when cells have closely related functions, we sometimes identify too specific or too generic cell types as in the case of the association between basophil degranulation and mast cell. Finally, 6 out of 100 associations in our sample seem erroneous. We were not able to compute precision or recall for our method due to the absence of a gold standard. However, we compared our method with the GO-CL crossproducts available from the OBO Foundry. The dataset contains manually verified relations between categories from the GO and the CL that have been extracted using pattern matching on category names [23]. As this method is based on the compositional nature of terms in the GO, it exclusively identifies relations in which one category name (usually a type of cell) is a substring of another category name (usually a GO category).
The GO-CL crossproduct contains 396 relations between GO and CL categories. From these 396, we identified 73 that co-occurred in our text corpus. Table 6 shows the percentage of significant co-occurrences within these 73 relations for different cutoffs in our six tests. Figure 2 shows the distribution of the 73 pairs with respect to t 1 and t 6 .
As our method relies exclusively on the distribution of terms and not on their syntactic structure, it permits the recognition of associations between categories that cannot be recognized using syntactic patterns. An example of such an association is myoepithelial cell (cells located in the mammary gland) and milk ejection.
Important potential applications for our tests arise from the fact that annotations of a large set of biomedical ontologies satisfy the conditions for our tests. Annotations satisfy the True Path Rule [3]: if two categories C and D stand in the is-a or part-of relation, then any annotation of C is also an annotation of D. Therefore, if gene annotations are used as graph decorations for the two input graphs of our method, the conditions for applying our tests are satisfied. For detecting associations between annotations, an appropriate score function must be chosen based on the hypothesis that is to be tested.
Another potential application of our tests lies in the field of relation extraction. The evaluation of our tests with the GO and The plot on the left shows the distribution of the test results for t 1 . On the right, the same is shown for t 6 . It can be seen that a test using the minimum function (t 6 ) is more restrictive than a test using the geometric mean (t 1 ). Furthermore, weighting the tests with the CDFs of the variances (t 6 ) produces stronger results than the basic test (t 1 CL reveals that we are able to detect biologically relevant associations between these ontologies. 94 of the best 100 associations retrieved by t 1 have biological meaning, as shown in Table 5. Although our approach is unable to detect the types of the biological relations, the associations provide a good starting point for an elaborate approach to the extraction of biological relations.
Our method is designed for the detection of associations between two DAGs. However, it can be generalized to test for associations between n graphs. The result of the tests would then be significant n-ary associations between n nodes from n graphs.

Conclusions
We developed a family of novel statistical tests for associations between two directed acyclic graphs. The tests account for the graphs' topologies and test for relevance and specificity of associations. The tests are suitable for the detection of associations between categories from two biomedical ontologies, in particular those which comply with the OBO criteria [24].
In an extensive use-case, we applied our tests to the discovery of associations between categories from the Gene Ontology and the Celltype Ontology that were decorated with the number of occurrences and co-occurrences of the categories' labels in a large corpus of full-text articles. Our results show that a large proportion of the associations discovered by our tests are biologically relevant relations.
The family of tests is implemented in a Java library, which is available as free software from our project webpage at http:// bioonto.de/pmwiki.php/Main/ExtractingBiologicalRelations.

Supporting Information
Supplement S1 Statistical tests for associations between two directed acyclic graphs and their application to biomedical ontologies. Found at: doi:10.1371/journal.pone.0010996.s001 (0.14 MB PDF) Table 6. Evaluation of our approach with respect to the GO-CL dataset [23]. The dataset we used for comparison consists of the 73 relations from the GO-CL crossproduct [23] found in our text corpus. Columns two to seven show the cutoff values required to identify the percentage given in column one of associations as significant using tests one to six. For example, at a cutoff of 0:502, 50% of the relations found in the dataset were significant according to test t 1 . doi:10.1371/journal.pone.0010996.t006