Fig 1.
Flowchart for threshold determination.
1) Define at least two distinct groups of genes, each made up of genes expected to be similar to one another. 2) Compute the intra- and inter-group similarities and compile the results into the S and N distributions, respectively. If these two distributions are significantly different, the groups of genes are relevant. 3) If S and N do not overlap, define the threshold τsim as any value between τN (the highest value of N) and τS (the lowest value of S). Otherwise, treating every S value below the threshold as a false negative (FN) and every N value above the threshold as a false positive (FP), compute the FN proportion in the S distribution (3a) and the FP proportion in the N distribution (3b) for each sampled value of the similarity threshold between τS and τN. 3c) For each candidate threshold value, sum the FN and FP proportions obtained in steps 3a and 3b. The similarity threshold τsim is the value that minimizes this sum.
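Step 3 of the flowchart can be illustrated with a minimal Python sketch. It assumes that S and N are plain lists of similarity values, that candidate thresholds are sampled on an evenly spaced grid (the `steps` parameter is hypothetical), and that the midpoint is taken when the distributions do not overlap. The function names are illustrative and are not taken from the authors' implementation, and the significance test of step 2 is not covered.

```python
from typing import Sequence, Tuple


def fn_proportion(s_values: Sequence[float], threshold: float) -> float:
    """Proportion of S values (similar pairs) below the threshold: false negatives."""
    return sum(v < threshold for v in s_values) / len(s_values)


def fp_proportion(n_values: Sequence[float], threshold: float) -> float:
    """Proportion of N values (non-similar pairs) above the threshold: false positives."""
    return sum(v > threshold for v in n_values) / len(n_values)


def determine_threshold(
    s_values: Sequence[float], n_values: Sequence[float], steps: int = 100
) -> Tuple[float, float]:
    """Return (tau_sim, FN + FP proportion at tau_sim)."""
    tau_s = min(s_values)  # lowest value of S
    tau_n = max(n_values)  # highest value of N

    if tau_n <= tau_s:
        # No overlap: any value between tau_N and tau_S is acceptable.
        # Taking the midpoint is an arbitrary choice made for this sketch.
        return (tau_s + tau_n) / 2, 0.0

    def error_sum(t: float) -> float:
        # Step 3c: sum of the FN (3a) and FP (3b) proportions.
        return fn_proportion(s_values, t) + fp_proportion(n_values, t)

    # Overlap: sample candidate thresholds between tau_S and tau_N.
    # The evenly spaced grid and its resolution are assumptions.
    candidates = [tau_s + (tau_n - tau_s) * i / steps for i in range(steps + 1)]
    # min() keeps the first candidate on ties, i.e. the lowest such threshold.
    best = min(candidates, key=error_sum)
    return best, error_sum(best)


# Made-up values: one low outlier in S and one high outlier in N.
similar = [0.2, 0.6, 0.7, 0.8, 0.9]
non_similar = [0.1, 0.15, 0.3, 0.4, 0.85]
tau_sim, total_error = determine_threshold(similar, non_similar)
print(tau_sim, total_error)
```

With these made-up values, every threshold from 0.4 to 0.6 gives the same minimal sum (0.2 FN + 0.2 FP), so the sketch returns the first grid point in that range. A different tie-breaking rule, such as the centre of the minimal range, would be an equally defensible choice.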
Fig 2.
Ideal case of threshold determination.
The threshold should be located between the lower whisker of the similar distribution (a) and the upper whisker of the non-similar distribution (b).
Fig 3.
Overlap case of threshold determination.
The similar and non-similar boxes overlap. In this case, there are false-positive and false-negative results between the lower whisker of the similar distribution (a) and the upper whisker of the non-similar distribution (b).
Table 1.
Patterns of similarity and particularity.
Fig 4.
Intra- and inter-family semantic similarity distributions using two families of similar genes.
Part A presents the results obtained using Wang’s measure and part B presents the results obtained using Lin’s measure. In both parts, the left side separately presents the two intra-family distributions in blue and the inter-family distribution in yellow. The right side presents the S distribution that gathers all the intra-family similarity values in blue and the N distribution that gathers all the inter-family similarity values in yellow.
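The pooling shown on the right side of each part can be sketched as follows. This is only an illustration under stated assumptions: the gene identifiers, the annotation sets, and `toy_similarity` (a plain Jaccard overlap of annotation sets) are hypothetical stand-ins, and neither Wang's nor Lin's measure is implemented here.

```python
from itertools import combinations, product
from typing import Callable, Dict, List, Tuple

Similarity = Callable[[str, str], float]


def build_distributions(
    families: Dict[str, List[str]], sim: Similarity
) -> Tuple[List[float], List[float]]:
    """Pool pairwise similarities into S (intra-family) and N (inter-family)."""
    # S: every pair of distinct genes taken inside the same family.
    s_values = [
        sim(a, b)
        for genes in families.values()
        for a, b in combinations(genes, 2)
    ]
    # N: every pair made of one gene from each of two different families.
    n_values = [
        sim(a, b)
        for genes_1, genes_2 in combinations(families.values(), 2)
        for a, b in product(genes_1, genes_2)
    ]
    return s_values, n_values


# Hypothetical annotation sets, used only to make the sketch runnable.
ANNOTATIONS = {
    "a1": {"t1", "t2", "t3"},
    "a2": {"t1", "t2"},
    "b1": {"t4", "t5"},
    "b2": {"t4", "t5", "t1"},
}


def toy_similarity(gene_a: str, gene_b: str) -> float:
    """Jaccard overlap of annotation sets: a placeholder similarity measure."""
    x, y = ANNOTATIONS[gene_a], ANNOTATIONS[gene_b]
    return len(x & y) / len(x | y)


FAMILIES = {"A": ["a1", "a2"], "B": ["b1", "b2"]}
s_values, n_values = build_distributions(FAMILIES, toy_similarity)
print(len(s_values), len(n_values))
```

Any function that takes two gene identifiers and returns a similarity score could be passed in place of `toy_similarity`.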
Fig 5.
Determination of Wang’s similarity threshold using two families of similar genes.
The similarity threshold (τsim) is the value that minimizes the sum of the false-positive and false-negative proportions.
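The criterion can be written out explicitly. The formulation below restates step 3 of the flowchart in Fig 1; the function names FN(τ) and FP(τ) are notation introduced here for illustration and are not the authors' own.

```latex
\[
\mathrm{FN}(\tau) = \frac{\bigl|\{\, s \in S : s < \tau \,\}\bigr|}{|S|},
\qquad
\mathrm{FP}(\tau) = \frac{\bigl|\{\, n \in N : n > \tau \,\}\bigr|}{|N|}
\]
\[
\tau_{sim} = \operatorname*{arg\,min}_{\tau \in [\tau_S,\, \tau_N]}
\bigl( \mathrm{FN}(\tau) + \mathrm{FP}(\tau) \bigr)
\]
```

Here τS is the lowest value of S and τN is the highest value of N, so the interval [τS, τN] is non-empty only when the two distributions overlap.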
Fig 6.
Determination of Lin’s similarity threshold using two families of similar genes.
The similarity threshold (τsim) is the value that minimizes the sum of the false-positive and false-negative proportions.
Fig 7.
BP distribution of similarity values comparing similar and non-similar genes.
Part A gives results using Wang’s similarity measure. Part B gives results using Lin’s similarity measure.
Fig 8.
MF distribution of similarity values comparing similar and non-similar genes.
Part A gives results using Wang’s similarity measure. Part B gives results using Lin’s similarity measure.
Fig 9.
CC distribution of similarity values comparing similar and non-similar genes.
Part A gives results using Wang’s similarity measure. Part B gives results using Lin’s similarity measure.
Fig 10.
Determination of Wang’s similarity threshold.
The similarity threshold (τsim) is the value that minimizes the sum of the false-positive and false-negative proportions. The overlapping parts of the boxplots (between τN and τS) from part A of Figs 7, 8 and 9 are shown in the lower part of the figure. The thresholds are located between the similar and non-similar boxes.
Fig 11.
Determination of Lin’s similarity threshold.
The similarity threshold (τsim) is the value that minimizes the sum of the false-positive and false-negative proportions. The overlapping parts of the boxplots (between τN and τS) from part B of Figs 7, 8 and 9 are shown in the lower part of the figure. The thresholds are located between the similar and non-similar boxes.
Table 2.
Semantic similarity thresholds for Wang’s and Lin’s measures.
Table 3.
Similarity threshold variations considering full and partial datasets (Wang’s measure).
Table 4.
Similarity threshold variations considering full and partial datasets (Lin’s measure).
Table 5.
Semantic SV-based and IC-based particularity thresholds.
Fig 12.
Distribution of similarity values comparing similar and non-similar ChEBI entities.
Part A gives results using the simUI similarity measure. Part B gives results using the simGIC similarity measure. The S and N distributions did not overlap. For both measures, τsim was between τS (lower whisker of the blue intra-family S box) and τN (upper whisker of the yellow inter-family N box).
Table 6.
Evolution of patterns in the results of HomoloGene intra-group BP comparisons.
Table 7.
Evolution of patterns in the results of HomoloGene intra-group MF comparisons.
Table 8.
Evolution of patterns in the results of HomoloGene intra-group CC comparisons.