Optimal Threshold Determination for Interpreting Semantic Similarity and Particularity: Application to the Comparison of Gene Sets and Metabolic Pathways Using GO and ChEBI

Charles Bettembourg; Christian Diot; Olivier Dameron

doi:10.1371/journal.pone.0133579

Abstract

Background

The analysis of gene annotations referencing back to Gene Ontology plays an important role in the interpretation of the results of high-throughput experiments. This analysis typically involves semantic similarity and particularity measures that quantify the importance of the Gene Ontology annotations. However, there is currently no sound method supporting the interpretation of the similarity and particularity values in order to determine whether two genes are similar or whether one gene has some significant particular function. Interpretation is frequently based either on an implicit threshold, or an arbitrary one (typically 0.5). Here we investigate a method for determining thresholds supporting the interpretation of the results of a semantic comparison.

Results

We propose a method for determining the optimal similarity threshold by minimizing the proportions of false-positive and false-negative similarity matches. We compared the distributions of the similarity values of pairs of similar genes and pairs of non-similar genes. These comparisons were performed separately for all three branches of the Gene Ontology. In all situations, we found overlap between the similar and the non-similar distributions, indicating that some similar genes had a similarity value lower than the similarity value of some non-similar genes. We then extend this method to the semantic particularity measure and to a similarity measure applied to the ChEBI ontology. Thresholds were evaluated over the whole HomoloGene database. For each group of homologous genes, we computed all the similarity and particularity values between pairs of genes. Finally, we focused on the PPAR multigene family to show that the similarity and particularity patterns obtained with our thresholds were better at discriminating orthologs and paralogs than those obtained using default thresholds.

Conclusion

We developed a method for determining optimal semantic similarity and particularity thresholds. We applied this method to the GO and ChEBI ontologies. Qualitative analysis using the thresholds on the PPAR multigene family yielded biologically-relevant patterns.

Citation: Bettembourg C, Diot C, Dameron O (2015) Optimal Threshold Determination for Interpreting Semantic Similarity and Particularity: Application to the Comparison of Gene Sets and Metabolic Pathways Using GO and ChEBI. PLoS ONE 10(7): e0133579. https://doi.org/10.1371/journal.pone.0133579

Editor: Christos A. Ouzounis, Hellas, GREECE

Received: March 21, 2014; Accepted: June 30, 2015; Published: July 31, 2015

Copyright: © 2015 Bettembourg et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: CB was supported by a fellowship from the French ministry of research. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Need for thresholds

Comparing several gene sets to identify and quantify the features they share and the features that differentiate them is central to the functional analysis of gene sets [1–3]. These operations hinge on comparing sets of Gene Ontology (GO) terms [4]. The links between genes and GO terms are provided by the Gene Ontology Annotation (GOA) database for multiple species [5]. Numerous semantic similarity measures have been developed [6–8]. We recently proposed to combine semantic similarity measures and a new semantic particularity measure to improve the results of gene set analysis [9]. The analysis of results on similarity and particularity is based on an interpretation that contrasts the genes with particular functions among similar genes. The main focus of studies to date has been on defining the measures, but there is no extensive study on the interpretation of the values obtained with these measures. As a result, interpretation is frequently based on either an implicit threshold (for example: “a similarity of 0.83 is high enough to consider that two genes are similar”) or an arbitrary one (typically 0.5 for measures in [0;1] even though no mathematical property of the measures supports this choice). Moreover, the value of these thresholds may vary over time, as both GO and GOA evolve [10]. Here, we propose a method to define suitable thresholds based on analysis of the distributions of similarity values. We then extend this method to the semantic particularity measure and to a similarity measure applied to the Chemical Entities of Biological Interest ontology (ChEBI) [11].

Metrics background

The GO terms annotating genes describe the biological processes, molecular functions and cellular components each gene is involved in. If these terms were independent, functional gene characterization could be performed by a straightforward set-based approach such as the Jaccard index or Dice’s coefficient. However, GO terms are hierarchically-linked, which means the characterization needs to take into account the underlying ontological structure of the GO annotations [12]. There are several semantic similarity measures that exploit the formal representation of the meaning of the terms by considering the relations between the terms.

Classification of semantic similarity measures

Pesquita et al. classified semantic similarity measures into two categories, node-based and edge-based measures, with some hybrid measures combining both [6].

Node-based measures assign an Information Content (IC) value to each ontology term, with the least-frequent terms given the highest IC value. This IC concept, borrowed from Shannon’s information theory [13], was used to measure similarities using ontologies [14–16] such as WordNet [17]. Node-based measures consider that the similarity between two terms relies on their most informative common ancestor. These measures developed in linguistics have been applied to GO [18, 19], where the IC of a GO term is inversely proportional to the frequency with which it annotates a gene using the Gene Ontology Annotations (GOA) database [5]. In the context of gene comparisons, IC-based measures carry three main limitations tied to their dependence on a GOA-based corpus. First, it can prove difficult or even impossible to obtain a relevant corpus. GOA provides single and multi-species tables of annotation. Although using a species-specific table is well suited to intra-species comparisons, it becomes problematic for inter-species comparisons. Second, using a multi-species table (like the UniprotKB table) for cross-species studies is biased towards the most extensively annotated species such as humans or mice. Third, the most extensively studied areas of biology have high annotation frequencies and are therefore less informative and see their importance downgraded, whereas the less-studied areas are artificially emphasized [20–22].

Edge-based measures compute a distance between GO terms using the directed graph topology. This distance can be the shortest path between two compared terms [23] or the length of the path between the root of the ontology and the lowest common ancestor of the compared terms [24–28]. This root to ancestor distance makes terms with a deep common ancestor more similar than terms with a common ancestor close to the root. Unlike node-based measures, edge-based measures are not corpus-dependent. However, granularity is not uniform in GO, so terms at the same depth can have different levels of specificity [29].

Hybrid measures combine different aspects of node-based and edge-based measures. Wang et al.’s measure assigns each term a “semantic value” that represents how informative the term is, which conforms to the node-based approach [30]. However, the semantic value of a term is obtained by following the path from this term to the root and summing the semantic contributions of all the ancestors of this term. As semantic value depends on ontology topology, it also conforms to the edge-based approach. Most hybrid measures are designed to compare terms but not sets of terms (as needed to compare genes). Common approaches proposed to compare genes consider the average [18], the maximum [31] of all pairwise similarities, or only the best matching pairs [32, 33]. Pesquita et al. concluded that best-match average variants are the best overall. They also highlighted a graph-based groupwise approach that avoids combining pairwise similarities between terms. Several measures employ this groupwise approach [34–37], including the simUI and simGIC measures used by Ferreira et al. to compute similarities on ChEBI [38]. Pesquita et al. do not single out any specific semantic similarity measure as the best, as the optimal measure will depend on the data to compare and the level of detail expected in the results. The main advantage of Wang’s measure over pure node-based measures is that unlike the IC, the semantic value is not GOA-dependent, which thus makes it well suited to cross-species comparisons.

Semantic similarity measures typically focus on what is common between the two compared entities. We recently developed a semantic particularity measure to also take into account what distinguishes each compared entity from the other one [9]. The semantic particularity of a set of GO terms “Sg1” compared to another set of GO terms “Sg2” depends on the informativeness of the “Sg1” terms that are not in “Sg2”. This informativeness is measured using either Wang’s semantic value or an IC value. This particularity concept should be used in combination with semantic similarity in order to improve the functional analysis of gene sets.

Data analysis often hinges on a qualitative interpretation of the similarity values in order to contrast similar and dissimilar pairs of genes. This discretization of the similarity and particularity values makes the interpretation easier. It determines whether a functional difference between two genes is or is not marginal. However, there has never been a systematic analysis of the optimal threshold value separating similar from dissimilar. Some studies avoid the problem by focusing only on “high” or “low” values (without mentioning when a value reaches this point). Other studies draw the line at 0.5 (for no other reason than the fact that 0.5 is the mid-range value of the similarity interval). There are cases where a threshold of 0.5 may be ill-adapted. For example, the similarity value between protein tyrosine kinase 2 (PTK2) and Ubiquitin B (UBB) is 0.502 using Wang’s similarity measure on their Biological Processes (BP) annotations. This value is just above the intuitive mid-interval threshold. These two genes are well annotated, with 73 and 79 distinct BP annotations, respectively. According to Entrez Gene, PTK2 is involved in cell growth and intracellular signal transduction pathways triggered in response to certain neural peptides or cell interactions with the extracellular matrix, while UBB is required for ATP-dependent, nonlysosomal intracellular protein degradation of abnormal proteins and normal proteins with rapid turnover. These processes cannot be considered “similar”. Consequently, a similarity value of 0.502 should not lead us to consider PTK2 and UBB as similar genes with respect to the BP they participate in.

The main factors influencing the similarity values are granularity differences in GO, GO topology differences between BP, MF and CC, the quantity and “quality” of gene annotations, and GO temporal evolution [10]. There is a need for a systematic study of semantic measure values in order to determine optimal similarity and particularity thresholds for the qualitative part of functional gene set analysis. Note that the method for determining these thresholds should also be applicable to all categories of semantic similarity measures as well as to ontologies other than GO.

Here we propose a generic method to define a threshold. We applied this method to a node-based and a hybrid semantic similarity measure as well as to the corresponding semantic particularity measures. All these measures are able to compare two genes. When comparing more than two genes, the measures have to be applied on each pair of genes. These measures are described below.

Semantic similarity

Lin developed a widely-used node-based similarity measure that employs the IC concept [15]. Several of the tools available have implemented this measure. The IC of a term t depends on its log probability P(t):

$$IC(t) = -\log P(t)$$

Working with GO terms, this IC is inversely proportional to the frequency with which the terms annotate a gene using the Gene Ontology Annotations (GOA) database. When comparing two GO terms t1 and t2 having a most informative common ancestor t0, Lin defines their similarity as follows:

$$\mathrm{sim}_{Lin}(t_1, t_2) = \frac{2 \times IC(t_0)}{IC(t_1) + IC(t_2)}$$
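As an illustration, Lin's term similarity can be sketched in a few lines, assuming the standard form in which IC(t) = -log P(t). The function names and the probabilities below are hypothetical and chosen only for the example; they do not come from GOA.

```python
import math

def information_content(p):
    """IC of a term from its annotation probability p (0 < p <= 1)."""
    return -math.log(p)

def lin_similarity(p_t1, p_t2, p_mica):
    """Lin's similarity from the probabilities of t1, t2 and of their
    most informative common ancestor t0 (p_mica)."""
    denom = information_content(p_t1) + information_content(p_t2)
    if denom == 0:  # both terms are the root (P = 1); treated as identical
        return 1.0
    return 2 * information_content(p_mica) / denom

# Hypothetical probabilities, for illustration only.
sim = lin_similarity(0.01, 0.02, 0.1)
```

A term compared with itself yields 1, and two terms whose only common ancestor is the root (P = 1, hence IC = 0) yield 0.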

Wang’s hybrid measure depends solely on the GO graph and does not need an annotation corpus, thus allowing cross-species comparisons [30]. For each term A, the first step of the measure is to compute the semantic contributions of its ancestors, following:

$$S_A(A) = 1, \qquad S_A(t) = \max \{ w_e \times S_A(t') \mid t' \in \mathrm{children}(t) \} \quad \text{if } t \neq A$$

where S_A(t) is the semantic contribution of term t to term A and w_e is the semantic contribution factor for edge e linking term t to its child term t’. Following Wang, we used a semantic contribution factor of 0.8 for the “is a” relations and 0.6 for the “part of” relations, and we added a 0.7 factor for the “[positively] [negatively] regulates” relations. Then, for each target term A to compare, the semantic value (SV) is the sum of the semantic contributions of all its ancestors, where T_A denotes the set composed of A and its ancestors:

$$SV(A) = \sum_{t \in T_A} S_A(t)$$

The comparison of two terms A and B is computed as follows:

$$\mathrm{sim}_{Wang}(A, B) = \frac{\sum_{t \in T_A \cap T_B} \left( S_A(t) + S_B(t) \right)}{SV(A) + SV(B)}$$

The similarity between a GO term “go” and a set of GO terms “Sg” is:

$$\mathrm{sim}(go, Sg) = \max_{go_i \in Sg} \mathrm{sim}_{Wang}(go, go_i)$$

Finally, the similarity between two genes G1 and G2, annotated by the sets of GO terms Sg1 and Sg2 respectively, is:

$$\mathrm{sim}(G_1, G_2) = \frac{\sum_{go \in Sg_1} \mathrm{sim}(go, Sg_2) + \sum_{go \in Sg_2} \mathrm{sim}(go, Sg_1)}{|Sg_1| + |Sg_2|}$$

Gentleman developed a graph-based measure for the R package GOstats called simUI [36]. simUI defines the semantic similarity between two sets of terms corresponding to two sub-graphs of the ontology as the ratio of the number of terms in the intersection of those graphs to the number of GO terms in their union.
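Wang's measure, from the semantic contributions up to the gene-level similarity, can be sketched on a toy graph as follows. This is a minimal illustration and not the in-house implementation used in this study: the `parents` mapping, the term names and the function names are hypothetical, and the edge weights are the factors quoted above.

```python
WEIGHTS = {"is_a": 0.8, "part_of": 0.6, "regulates": 0.7}

def s_values(term, parents):
    """Semantic contributions S_term(t) of `term` and all its ancestors."""
    s = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, relation in parents.get(child, []):
            contribution = WEIGHTS[relation] * s[child]
            # Keep the best (maximum) contribution over all paths.
            if contribution > s.get(parent, 0.0):
                s[parent] = contribution
                stack.append(parent)
    return s

def term_sim(a, b, parents):
    """Wang's similarity between two terms."""
    sa, sb = s_values(a, parents), s_values(b, parents)
    common = sa.keys() & sb.keys()
    shared = sum(sa[t] + sb[t] for t in common)
    return shared / (sum(sa.values()) + sum(sb.values()))

def term_set_sim(term, terms, parents):
    """Similarity between one term and a set of terms (best match)."""
    return max(term_sim(term, other, parents) for other in terms)

def gene_sim(sg1, sg2, parents):
    """Similarity between two genes given their annotation sets."""
    total = sum(term_set_sim(t, sg2, parents) for t in sg1)
    total += sum(term_set_sim(t, sg1, parents) for t in sg2)
    return total / (len(sg1) + len(sg2))

# Toy ontology (hypothetical): child -> [(parent, relation), ...]
parents = {
    "A": [("root", "is_a")],
    "B": [("A", "is_a")],
    "C": [("A", "is_a")],
    "D": [("root", "part_of")],
}
```

In this toy graph, the sibling terms B and C share the ancestors A and root, so their similarity is higher than that of B and D, which share only the root.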

Pesquita et al. proposed simGIC, a method combining the graph-based simUI metric with the IC of the terms involved in the computation [37]. In simGIC, each term is weighted by its IC.
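The two groupwise measures can be sketched as set operations on the sub-graphs induced by each set of terms. The sketch below is an illustration under stated assumptions: the `parents` mapping and the `ic` table are hypothetical toy data, and returning 0 when the union carries no information is a convention chosen here, not part of the published definitions.

```python
def induced_terms(terms, parents):
    """The given terms plus all their ancestors (nodes of the sub-graph)."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, []))
    return seen

def sim_ui(s1, s2, parents):
    """Ratio of the number of shared terms to the number of terms in the union."""
    g1, g2 = induced_terms(s1, parents), induced_terms(s2, parents)
    return len(g1 & g2) / len(g1 | g2)

def sim_gic(s1, s2, parents, ic):
    """Same ratio as simUI, with each term weighted by its IC."""
    g1, g2 = induced_terms(s1, parents), induced_terms(s2, parents)
    union_ic = sum(ic[t] for t in g1 | g2)
    if union_ic == 0:  # convention chosen for this sketch
        return 0.0
    return sum(ic[t] for t in g1 & g2) / union_ic

# Toy ontology and IC values (hypothetical): child -> [parents]
parents = {"A": ["root"], "B": ["A"], "C": ["A"]}
ic = {"root": 0.0, "A": 1.0, "B": 3.0, "C": 3.0}
```

With these toy values, simUI between {B} and {C} is 2/4, whereas simGIC is 1/7 because the shared terms (A and the root) carry little information compared with the specific terms B and C.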

Semantic particularity

In a previous article, we defined the semantic particularity of a set of GO terms Sg1 compared to another set of GO terms Sg2 [9].

Some of the terms of Sg1 that are not members of Sg2 may be linked in the graph. Taking several linked terms into account would result in considering them several times over. To overcome this issue, the particularity measure focuses only on those terms of Sg1 that do not have any descendant in Sg1 and that are not members of Sg2. Some of these terms might be ancestors of terms of Sg2 and should be considered common to Sg1 and Sg2. Sg* is the union of Sg and the sets of ancestors of each term of Sg. MPT(Sg1, Sg2) is the set of the most particular terms of Sg1 compared to Sg2, i.e. the set of terms of Sg1 that do not have any descendant in Sg1 and that are not members of Sg2*. PI(Sg1, Sg2) is the particular informativeness (PI) of a set of GO terms Sg1 compared to another set of GO terms Sg2, i.e. the sum of the differences between the informativeness (I) of each term t_p of MPT(Sg1, Sg2) and the informativeness of the most informative common ancestor (MICA) between t_p and Sg2. The informativeness measure can be either Wang’s semantic value or an IC value. The PI of a set of terms is the information that is not shared with the other set.

PI is normalized to compute Par(Sg1, Sg2), the semantic particularity of the set of GO terms Sg1 compared to the set of GO terms Sg2. MCT(Sg1, Sg2) is the set of the most informative common terms of Sg1 and Sg2, i.e. the set of the terms belonging to the intersection of Sg1* and Sg2* that do not have any descendant in either Sg1* or Sg2*. Par(Sg1, Sg2) is the ratio of PI(Sg1, Sg2) and the sum of the informativeness of most informative Sg1 terms (i.e. those that are Sg1-specific and those that are common with Sg2; the MICA in the PI formula for Sg1-specific terms guarantees that the informativeness of common terms is not counted twice).
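As a reading aid, the verbal definitions of PI and Par given above can be transcribed as follows. This transcription is only one reading of the text, with I(t) denoting the informativeness of term t; the formulas published in [9] remain the reference.

$$PI(Sg_1, Sg_2) = \sum_{t_p \in MPT(Sg_1, Sg_2)} \left[ I(t_p) - I\left(MICA(t_p, Sg_2)\right) \right]$$

$$Par(Sg_1, Sg_2) = \frac{PI(Sg_1, Sg_2)}{PI(Sg_1, Sg_2) + \sum_{t_c \in MCT(Sg_1, Sg_2)} I(t_c)}$$

Under this reading, Par(Sg1, Sg2) equals 0 when Sg1 has no particular term compared to Sg2, and approaches 1 when almost all the informativeness of Sg1 lies in terms that are not shared with Sg2.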

Method

We first describe our generic method for determining the optimal threshold for a semantic similarity measure. We then apply it to GO for a node-based measure and for a hybrid measure. Finally, we generalize our approach by applying the method to another semantic measure, particularity, and to another ontology.

Similarity threshold determination process

Fig 1 illustrates the process for determining a similarity threshold. This process is composed of three steps:

Fig 1. Flowchart for threshold determination.

1) Define at least two distinct groups of genes expected to be similar. 2) Compute the intra- and inter-group similarities and compile the results into S and N distributions. If these two distributions are significantly different, the groups of genes are relevant. 3) If S and N do not overlap, define threshold τ_sim using any value between τ_S (the lowest value of S) and τ_N (the highest value of N). Else, considering every value of S under the threshold as FN and every value of N above the threshold as FP, compute the FN proportion in the S distribution (3a) and the FP proportion in the N distribution (3b) for all samples of the similarity threshold between τ_N and τ_S. 3c) For each possible threshold value, sum the FN and FP proportions obtained in steps 3a and 3b. The similarity threshold τ_sim is the one that minimizes this sum.

https://doi.org/10.1371/journal.pone.0133579.g001

1. Define at least two different groups of genes for the species of interest. Within a group, the genes should share some common characteristics. Genes from different groups should share as few characteristics as possible.
2. Compute the intra-group and inter-group similarities:
   a. In each group, compute the similarities between each pair of genes (i.e. the intra-group similarities). Gather all the similarity results to obtain an S distribution of similar genes.
   b. Compute the similarities between each combination of a gene from one group and a gene from another group (i.e. the inter-group similarities). Gather all the similarity results to obtain an N distribution of non-similar genes.
3. If the ranges (min, max) of the S and N distributions do not overlap, define the threshold τ_sim using any value between τ_S (the lowest value of S) and τ_N (the highest value of N). Else, there are some false negatives (FN) and some false positives (FP):
   a. Compute the proportion of FN in the S distribution for all samples of the similarity threshold between τ_N and τ_S. In this step, consider every value under the similarity threshold as a FN.
   b. Compute the proportion of FP in the N distribution for all samples of the similarity threshold between τ_N and τ_S. In this step, consider every value above the similarity threshold as a FP.
   c. For each possible threshold value, sum the FN and FP proportions obtained in steps 3a and 3b. The similarity threshold τ_sim is the threshold that minimizes this sum.
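Step 3 of this process can be sketched as follows. This is a minimal illustration rather than the code used in this study. It makes two assumptions: τ_S and τ_N are taken from the raw minimum of S and maximum of N, and candidate thresholds are the observed similarity values lying between τ_N and τ_S, which is sufficient because the sum of proportions only changes at observed values. The function names and the toy distributions are hypothetical.

```python
def error_sum(S, N, tau):
    """Sum of the FN proportion in S and the FP proportion in N at tau."""
    fn = sum(s < tau for s in S) / len(S)  # similar pairs below tau
    fp = sum(n > tau for n in N) / len(N)  # non-similar pairs above tau
    return fn + fp

def similarity_thresholds(S, N):
    """Return (tau_N, tau_S, tau_sim) for the S and N distributions."""
    a, b = min(S), max(N)
    tau_S, tau_N = max(a, b), min(a, b)
    if a > b:
        # No overlap: any value between b and a works; take the midpoint.
        return tau_N, tau_S, (a + b) / 2
    candidates = sorted({v for v in list(S) + list(N) if tau_N <= v <= tau_S})
    # On ties, min() keeps the lowest candidate.
    tau_sim = min(candidates, key=lambda tau: error_sum(S, N, tau))
    return tau_N, tau_S, tau_sim

# Toy distributions (hypothetical values).
S = [0.4, 0.6, 0.7, 0.8, 0.9]
N = [0.1, 0.2, 0.3, 0.5]
tau_N, tau_S, tau_sim = similarity_thresholds(S, N)
```

With these toy values, S and N overlap between 0.4 and 0.5. A threshold of 0.4 leaves one of four N values above it (sum of proportions 0.25), whereas a threshold of 0.5 leaves one of five S values below it (sum 0.20), so τ_sim is 0.5.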

We ran a statistical test to determine whether the S and N distributions obtained at step 2 are significantly different. As we cannot assume that the S and N variances are similar, we used an unequal variance t-test (Welch’s t-test), which is the recommended test when considering different-sized distributions like S and N. Welch’s t-test performs better than Student’s t-test when the variances are unequal, yet still performs on a par with Student’s t-test when the variances are equal [39]. If the test concludes that the S and N distributions are not significantly different, the groups of genes are not relevant and the process has to be restarted from its first step.
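For illustration, the Welch statistic and its approximate degrees of freedom (Welch–Satterthwaite) can be computed with the Python standard library alone. The function name and the sample values are hypothetical; this sketch does not compute a p-value.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t statistic and approximate degrees of freedom."""
    vx = variance(x) / len(x)  # squared standard error of mean(x)
    vy = variance(y) / len(y)  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

# Hypothetical S and N samples of different sizes.
t_stat, dof = welch_t([0.6, 0.7, 0.8, 0.9], [0.1, 0.2, 0.3])
```

When SciPy is available, `scipy.stats.ttest_ind(x, y, equal_var=False)` performs the same test and also returns the p-value.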

The minimization at step 3c has to be done on FN and FP proportions as the N and S distributions have different sizes.

We applied this method to compute Lin’s and Wang’s semantic similarity thresholds on GO, the corresponding IC-based and SV-based semantic particularity thresholds on GO, and the simUI and simGIC thresholds on ChEBI. For all the pairs of genes compared, we used the GO annotations from the August 2013 version of GOA. We computed Lin’s similarity with the GOSemSim R package [40] (version 1.18.0) using its GO and IC tables and the best-match average approach to compare genes. Pesquita et al. showed that the best-match average approach performs best [6]. We computed Wang’s similarity, IC-based particularity and SV-based particularity using an in-house implementation of each measure and the August 2013 version of GO. We computed simUI and simGIC similarities using the web tool CMPSim provided by the XLDB research group [41]. CMPSim implements both measures for ChEBI.

Similarity threshold determination using two groups of similar genes

We first applied our method to determine the similarity threshold for the Biological Processes (BP) using two groups of similar genes. We determined thresholds using first Wang’s and then Lin’s similarity measures.

Group determination.

We composed two groups of similar genes from two families of the Protein ANalysis THrough Evolutionary Relationships database (PANTHER). The union of the pairs of genes within each family constituted the S distribution. The PANTHER database classifies proteins (and their genes) to facilitate high-throughput analysis [42]. PANTHER families are composed of genes sharing evolutionary history, molecular functions and biological processes annotations, and involvement in the same biological pathways. We assumed that genes belonging to the same PANTHER family share enough features to be considered as involved in similar biological processes. Conversely, we assumed that two genes belonging to two different PANTHER families should not be considered as involved in similar biological processes.

Intra-group and inter-group similarity measure.

We computed the similarity values for each pair of genes of the first family and for each pair of genes of the second family, and compiled them together in the S distribution. We then computed the N distribution composed of the similarity values between each gene from the first family and each gene from the second family.
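The construction of the S and N distributions can be sketched as follows for any number of groups. The `sim` argument stands for any gene-level similarity function; the toy function and gene names used here are hypothetical placeholders.

```python
from itertools import combinations, product

def build_distributions(groups, sim):
    """Return (S, N): intra-group and inter-group similarity values."""
    # S: every pair of genes within the same group.
    S = [sim(g1, g2) for group in groups for g1, g2 in combinations(group, 2)]
    # N: every gene of one group against every gene of another group.
    N = [
        sim(g1, g2)
        for group_a, group_b in combinations(groups, 2)
        for g1, g2 in product(group_a, group_b)
    ]
    return S, N

# Toy example: genes are similar if their names share the same first letter.
toy_sim = lambda g1, g2: 1.0 if g1[0] == g2[0] else 0.0
S, N = build_distributions([["a1", "a2", "a3"], ["b1", "b2"]], toy_sim)
```

With a family of three genes and a family of two genes, S contains 3 + 1 = 4 values and N contains 3 × 2 = 6 values, which illustrates why the two distributions generally have different sizes.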

Similar and non-similar distribution comparison.

When comparing the distributions of similar genes (S) to non-similar genes (N), if the minimum value of S is smaller than the maximum value of N, then the S and N distributions overlap and any threshold would lead to FPs or FNs.

Fig 2 illustrates the case without overlap, where min(S) = a, max(N) = b and a > b. A similarity value greater than a means that the genes compared are similar. A similarity value lower than b means that the genes compared are non-similar. A similarity value between a and b means that the genes compared are nearly similar and thus require expert opinion to interpret the result.

Fig 2. Ideal case of threshold determination.

The threshold should be located between the lowest whisker of the similar distribution (a) and the upmost whisker of the non-similar distribution (b).

https://doi.org/10.1371/journal.pone.0133579.g002

Fig 3 illustrates the case where the S and N distributions overlap (a < b), meaning that there are some FPs (i.e. pairs of genes from N that are non-similar but that have a similarity value greater than a) and FNs (i.e. pairs of genes from S that are similar but have a similarity value lower than b). In this case, a similarity value lower than a means that the genes compared are non-similar, and a similarity value greater than b means that the genes compared are similar. Again, expert opinion would be required to interpret a value between a and b. However, in this case, it is possible to determine the threshold value that minimizes the sum of the FP and FN proportions.

Fig 3. Overlap case of threshold determination.

The similar and non-similar boxes overlap. In this case, there are false-positive and false-negative results between the lowest whisker of the similar distribution (a) and the upmost whisker of the non-similar distribution (b).

https://doi.org/10.1371/journal.pone.0133579.g003

We established a general framework that is suitable for the two cases described in this section. Under this framework, we define three threshold values:

τ_S = max(a, b) is the threshold value above which the two compared genes are similar. There cannot be any FP above τ_S, but there may be some FN below τ_S if a < b.
τ_N = min(a, b) is the threshold value under which the two compared genes are non-similar. There cannot be any FN below τ_N, but there may be some FP above τ_N if a < b.
τ_sim is the threshold value located between τ_N and τ_S that minimizes the sum of the proportions of FP and FN. As τ_sim gets closer to τ_S, there will be more FN and fewer FP. Conversely, as τ_sim gets closer to τ_N, there will be more FP and fewer FN. τ_sim has to be computed using the proportions of FP and FN as the S and N distributions have different sizes.

Threshold stability study

Extension to multiple families.

The more groups we build to constitute the S and N distributions, the more reliable the thresholds obtained become. We generalized the above-described process using five groups of similar genes for CC and six groups for BP and MF in order to determine τ_S, τ_N and τ_sim for Wang’s and Lin’s measures.

For BP, we computed the S distribution by gathering the similarity values of each pair of genes inside six different PANTHER families. We then computed the fifteen inter-family distributions corresponding to all the pairs of families among these six. Each of these distributions is composed of the similarity values between each gene from the first family and each gene from the second family. We combined all these inter-family similarity values into a global N distribution.

For MF, we used the same six gene families to compute our S and N distributions, as the PANTHER families are also homogeneous in terms of molecular functions.

For CC, we used the genes from five different pathways, each located in a different cellular compartment, to compute our S and N distributions. The lists of genes were borrowed from the Reactome database [43].

Robustness of threshold determination.

We validated our study using a leave-one-out approach that consisted in successively recomputing the thresholds using all the sets but one. This approach provides an evaluation of threshold stability.
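The leave-one-out procedure can be sketched generically as follows. The `compute_threshold` argument stands for any function that maps a list of gene groups to a threshold; it is a hypothetical placeholder, as are the function name and the dummy data.

```python
def leave_one_out(groups, compute_threshold):
    """Recompute the threshold once per group, leaving that group out."""
    return [
        compute_threshold(groups[:i] + groups[i + 1:])
        for i in range(len(groups))
    ]

# Dummy stand-in for the threshold computation, for illustration only:
# it simply counts the genes that remain.
count_genes = lambda remaining: sum(len(group) for group in remaining)
values = leave_one_out([["g1", "g2"], ["g3", "g4", "g5"], ["g6"]], count_genes)
spread = max(values) - min(values)
```

In practice, the spread of the recomputed thresholds (here `spread`) gives a simple indication of stability: the smaller it is, the less the threshold depends on any single group.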

Generalization

We generalized the approach by applying the method to another semantic measure and another ontology.

Particularity threshold.

In addition to determining the similarity thresholds, we used the same approach to compute semantic particularity thresholds on BP, CC and MF in order to determine the comparison profile of two genes G1 and G2. The procedure consisted in comparing each value of the triple (Similarity(G1, G2); Particularity(G1, G2); Particularity(G2, G1)) with its respective threshold (noted “+” if the value is greater than the threshold, and “-” otherwise). The results of comparing two genes on their similarity and particularity values can be classified into eight distinct patterns described in Table 1. A comparison should result in neither a “+ + +” nor a “- - -” pattern. Indeed, a “+ + +” pattern would mean that the two genes compared share enough features to be considered similar yet, at the same time, that each has enough particular features to be considered particular. Conversely, a “- - -” pattern would mean that the two genes compared are neither similar nor particular.
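The discretization of a comparison triple into one of the eight patterns can be sketched as follows. The function name and the example values, including the particularity threshold of 0.5, are hypothetical and serve only to show the mechanics.

```python
def comparison_pattern(sim, par_12, par_21, tau_sim, tau_par):
    """Return the pattern, e.g. '+ - +', for one pair of genes."""
    pairs = ((sim, tau_sim), (par_12, tau_par), (par_21, tau_par))
    # '+' if the value is greater than its threshold, '-' otherwise.
    return " ".join("+" if value > tau else "-" for value, tau in pairs)

# Hypothetical values: similar genes, only G2 has particular functions.
pattern = comparison_pattern(0.6, 0.2, 0.9, tau_sim=0.42, tau_par=0.5)
is_unexpected = pattern in ("+ + +", "- - -")
```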

Table 1. Patterns of similarity and particularity.

https://doi.org/10.1371/journal.pone.0133579.t001

We applied the threshold determination process described in Fig 1 to obtain a particularity threshold. For the first step, we composed the same gene groups as those used to compute the similarity threshold. For the second step, we computed all the intra-group and inter-group particularity values between all possible pairs of genes. At the third step, we considered neither FPs nor FNs, as genes belonging to the same group can have some degree of particularity even if they are similar. However, knowing the similarity threshold, we computed the proportion of “+ + +” and “- - -” patterns found in the results as the particularity threshold varied. For this step, three similarity thresholds were available: τ_N, τ_S and τ_sim. Let sim be the result of a semantic similarity measure between two genes G1 and G2.

If sim is lower than τ_N, we can conclude that G1 and G2 are strictly non-similar. Conversely, if sim is greater than τ_N, we can only conclude that G1 and G2 are possibly similar but with no certainty.
If sim is greater than τ_S, we can conclude that G1 and G2 are strictly similar. Conversely, if sim is lower than τ_S, we can only conclude that G1 and G2 are possibly non-similar but with no certainty.
Using τ_sim cannot lead to a conclusion with absolute certainty, but it does lead to the smallest number of errors.

Using τ_N can result in a lot of FPs and using τ_S can result in a lot of FNs. Consequently, we computed the particularity threshold τ_par using the similarity threshold τ_sim. For step 3c, we summed the “+ + +” and “- - -” proportions for each possible particularity threshold value. The particularity threshold τ_par was the one that minimized this sum.
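The selection of τ_par can be sketched as follows: for each candidate value, count the proportion of comparisons that fall into the “+ + +” or “- - -” patterns and keep the candidate with the lowest proportion. The function name, the candidate grid and the triples are hypothetical.

```python
def particularity_threshold(triples, tau_sim, candidates):
    """Pick the candidate minimizing the '+ + +' and '- - -' proportion.

    triples: (sim, par_12, par_21) values, one per pair of genes.
    """
    def unexpected_proportion(tau_par):
        count = 0
        for sim, par_12, par_21 in triples:
            s, p1, p2 = sim > tau_sim, par_12 > tau_par, par_21 > tau_par
            if s == p1 == p2:  # all '+' or all '-'
                count += 1
        return count / len(triples)
    # On ties, min() keeps the first candidate in the list.
    return min(candidates, key=unexpected_proportion)

# Hypothetical comparison results.
triples = [(0.9, 0.6, 0.7), (0.8, 0.1, 0.65), (0.2, 0.3, 0.8), (0.1, 0.8, 0.9)]
tau_par = particularity_threshold(triples, tau_sim=0.5, candidates=[0.2, 0.5, 0.75])
```

In this toy example, the first triple yields a “+ + +” pattern with candidates 0.2 and 0.5 but a “+ - -” pattern with 0.75, so 0.75 is selected.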

ChEBI.

As the threshold determination process is neither specific to GO nor to the previously used measures, we applied our method to another ontology using two other similarity measures. We compared families of molecules using the ChEBI ontology and the simUI and the simGIC similarity measures. We composed our S and N distributions from the pairwise similarities obtained comparing all the children of two ChEBI entities. These entities were two distinct general (i.e. with no common descendants) ChEBI terms, each of which is the parent of numerous specific terms in the ChEBI ontology. This process allowed us to compare two distinct families of molecules.

Evaluation

The evaluation study involved first quantifying the extent of the changes resulting from using the threshold computed by our method instead of the default 0.5 and then determining whether these changes are biologically relevant.

The first part of this study focused on the changes in the results of the whole HomoloGene database intra-group gene comparisons. HomoloGene is a system that automatically detects homologs, including paralogs and orthologs, among the genes of 21 fully-sequenced eukaryotic genomes [44].

In the second part of this study, we computed the similarity and particularity measures on the well annotated peroxisome proliferator activated receptor (PPAR) multigene family. PPARα, PPARβ and PPARγ are involved in different processes [45] as transcription factors. Each member of this family uses the same molecular mechanisms in different metabolic pathways. The family is evolutionarily well conserved [46]. We expected a similarity value above the threshold for BP when comparing PPAR orthologs in several species. However, the ortholog conjecture assumes that orthologs generally share more functions than paralogs. We consequently expected some similarity values below the threshold when comparing PPAR paralogs within a species and between species. The goal was to determine whether our similarity and particularity thresholds lead to biologically more relevant interpretations than the default approach.

Results and Discussion

BP similarity threshold using two groups of similar genes

We studied the similarity values obtained when comparing genes known to be functionally close and genes without functional proximity. This study was performed using a hybrid semantic similarity measure (Wang) and a node-based measure (Lin).

Fig 4 presents the distribution of the BP similarity values obtained for two intra-family comparisons and the corresponding inter-family comparisons. The two PANTHER families were “neurotransmitter gated ion channel” (pthr18945) and “tyrosine-protein kinase receptor” (pthr24416).

Fig 4. Intra- and inter-family semantic similarity distributions using two families of similar genes.

Part A presents the results obtained using Wang’s measure and part B presents the results obtained using Lin’s measure. In both parts, the left side separately presents the two intra-family distributions in blue and the inter-family distribution in yellow. The right side presents the S distribution that gathers all the intra-family similarity values in blue and the N distribution that gathers all the inter-family similarity values in yellow.

https://doi.org/10.1371/journal.pone.0133579.g004

As expected, similarity values obtained using either Wang’s (Fig 4A) or Lin’s measure (Fig 4B) were significantly higher in the intra-family comparisons than the inter-family comparisons (Welch’s t-tests; see S1 File). We observed an overlap between the S and N distributions, which corresponds to the situation shown in Fig 3. τ_N was located at the lowest whisker of the intra-family S blue box, i.e. 0.096 with Wang’s measure and 0.364 with Lin’s measure. τ_S was located at the upmost whisker of the inter-family N yellow box, i.e. 0.519 with Wang’s measure and 0.588 with Lin’s measure.

We also determined the optimal similarity threshold value τ_sim that minimizes the sum of FP and FN proportions. Fig 5 reports the results for Wang’s measure and Fig 6 reports the results for Lin’s measure. The abscissa of the minimum of the curve in Figs 5 and 6 gives the threshold for BP using Wang’s (0.42) and Lin’s (0.49) measures, respectively.

Fig 5. Determination of Wang’s similarity threshold using two families of similar genes.

The minimum of false-positive and false-negative proportions gives the similarity threshold (τ_sim).

https://doi.org/10.1371/journal.pone.0133579.g005

Fig 6. Determination of Lin’s similarity threshold using two families of similar genes.

The minimum of false-positive and false-negative proportions gives the similarity threshold (τ_sim).

https://doi.org/10.1371/journal.pone.0133579.g006

Threshold stability

A threshold determined using only two groups of genes is exposed to bias. In order to obtain a more reliable threshold, we extended the threshold determination process by including the genes from six PANTHER families for BP and MF and the genes from five pathways for CC. We then performed a leave-one-out study to assess the stability of the threshold.