Optimal Threshold Determination for Interpreting Semantic Similarity and Particularity: Application to the Comparison of Gene Sets and Metabolic Pathways Using GO and ChEBI | PLOS One


Fig 1.

Flowchart for threshold determination.
1) Define at least two distinct groups of genes expected to be similar. 2) Compute the intra- and inter-group similarities and compile the results into the S and N distributions. If these two distributions are significantly different, the groups of genes are relevant. 3) If S and N do not overlap, set the threshold τ_sim to any value between τ_N (the highest value of N) and τ_S (the lowest value of S). Otherwise, treat every S value below a candidate threshold as a false negative (FN) and every N value above it as a false positive (FP), and, for each candidate threshold between τ_S and τ_N, compute the FN proportion in the S distribution (3a) and the FP proportion in the N distribution (3b). 3c) For each candidate threshold, sum the FN and FP proportions obtained in steps 3a and 3b. The similarity threshold τ_sim is the value that minimizes this sum.
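The flowchart steps can be sketched in a few lines of Python. This is only an illustrative reading of the caption, not the authors' implementation: the function name `similarity_threshold`, the evenly spaced candidate grid (`num_candidates`), and the choice of the midpoint in the non-overlapping case are all assumptions made for the example.

```python
import numpy as np

def similarity_threshold(s_values, n_values, num_candidates=101):
    """Hypothetical helper: pick tau_sim from S (intra-group) and N (inter-group) values."""
    s = np.asarray(s_values, dtype=float)
    n = np.asarray(n_values, dtype=float)
    tau_s = s.min()  # lowest value of S
    tau_n = n.max()  # highest value of N

    if tau_n < tau_s:
        # No overlap: any value between tau_N and tau_S is acceptable.
        # Assumption: take the midpoint.
        return float((tau_s + tau_n) / 2.0)

    # Overlap: scan candidate thresholds between tau_S and tau_N.
    # Assumption: an evenly spaced grid of candidates.
    candidates = np.linspace(tau_s, tau_n, num_candidates)
    # 3a) proportion of S values below each candidate (false negatives)
    fn = np.array([(s < t).mean() for t in candidates])
    # 3b) proportion of N values above each candidate (false positives)
    fp = np.array([(n > t).mean() for t in candidates])
    # 3c) keep the candidate minimizing FN + FP (first one in case of ties)
    return float(candidates[np.argmin(fn + fp)])
```

When several candidates give the same minimal sum, this sketch returns the lowest one; the caption does not say how ties should be broken.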


Fig 2.

Ideal case of threshold determination.
The threshold should be located between the lowest whisker of the similar distribution (a) and the uppermost whisker of the non-similar distribution (b).


Fig 3.

Overlap case of threshold determination.
The similar and non-similar boxes overlap. In this case, any threshold placed between the lowest whisker of the similar distribution (a) and the uppermost whisker of the non-similar distribution (b) yields false-positive and false-negative results.
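One way to write the overlap case formally is given below. The notation FN(τ), FP(τ) and the set-cardinality form are assumptions introduced here for illustration; τ_S, τ_N and τ_sim follow the Fig 1 caption.

```latex
% Assumed notation: S = intra-group similarity values, N = inter-group values,
% \tau = candidate threshold with \tau_S \le \tau \le \tau_N.
\[
\mathrm{FN}(\tau) = \frac{\lvert \{ s \in S : s < \tau \} \rvert}{\lvert S \rvert},
\qquad
\mathrm{FP}(\tau) = \frac{\lvert \{ n \in N : n > \tau \} \rvert}{\lvert N \rvert},
\]
\[
\tau_{\mathrm{sim}} = \operatorname*{arg\,min}_{\tau_S \le \tau \le \tau_N}
\bigl[ \mathrm{FN}(\tau) + \mathrm{FP}(\tau) \bigr].
\]
```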


Table 1.

Patterns of similarity and particularity.


Fig 4.

Intra- and inter-family semantic similarity distributions using two families of similar genes.
Part A presents the results obtained using Wang’s measure and part B presents the results obtained using Lin’s measure. In both parts, the left side separately presents the two intra-family distributions in blue and the inter-family distribution in yellow. The right side presents the S distribution that gathers all the intra-family similarity values in blue and the N distribution that gathers all the inter-family similarity values in yellow.


Fig 5.

Determination of Wang’s similarity threshold using two families of similar genes.
The minimum of the summed false-positive and false-negative proportions gives the similarity threshold (τ_sim).


Fig 6.

Determination of Lin’s similarity threshold using two families of similar genes.
The minimum of the summed false-positive and false-negative proportions gives the similarity threshold (τ_sim).


Fig 7.

BP distribution of similarity values comparing similar and non-similar genes.
Part A gives results using Wang’s similarity measure. Part B gives results using Lin’s similarity measure.


Fig 8.

MF distribution of similarity values comparing similar and non-similar genes.
Part A gives results using Wang’s similarity measure. Part B gives results using Lin’s similarity measure.


Fig 9.

CC distribution of similarity values comparing similar and non-similar genes.
Part A gives results using Wang’s similarity measure. Part B gives results using Lin’s similarity measure.


Fig 10.

Determination of Wang’s similarity threshold.
The minimum of the summed false-positive and false-negative proportions gives the similarity threshold (τ_sim). The overlapping parts of the boxplots (between τ_S and τ_N) from part A of Figs 7, 8 and 9 are shown in the lower part of the figure. The thresholds are located between the similar and non-similar boxes.


Fig 11.

Determination of Lin’s similarity threshold.
The minimum of the summed false-positive and false-negative proportions gives the similarity threshold (τ_sim). The overlapping parts of the boxplots (between τ_S and τ_N) from part B of Figs 7, 8 and 9 are shown in the lower part of the figure. The thresholds are located between the similar and non-similar boxes.


Table 2.

Semantic similarity thresholds for Wang’s and Lin’s measures.


Table 3.

Similarity threshold variations considering full and partial datasets (Wang’s measure).


Table 4.

Similarity threshold variations considering full and partial datasets (Lin’s measure).


Table 5.

Semantic SV-based and IC-based particularity thresholds.


Fig 12.

Distribution of similarity values comparing similar and non-similar ChEBI entities.
Part A gives results using the simUI similarity measure. Part B gives results using the simGIC similarity measure. The S and N distributions did not overlap. For both measures, τ_sim was between τ_S (lowest whisker of the blue intra-family S box) and τ_N (uppermost whisker of the yellow inter-family N box).


Table 6.

Evolution in patterns in results on HomoloGene intra-group BP comparisons.


Table 7.

Evolution in patterns in results on HomoloGene intra-group MF comparisons.


Table 8.

Evolution in patterns in results on HomoloGene intra-group CC comparisons.
