Table 1.
Similarity measures evaluated in this study.
Figure 1.
Comparison of similarity measures applied to genetic interaction datasets.
Gene pair correlations derived from each similarity measure were benchmarked against a Gene Ontology-based standard using precision-recall statistics. The comparison was conducted on (A) S. cerevisiae genetic interaction data (Costanzo et al. 2010), query gene similarities; (B) S. cerevisiae genetic interaction data, array gene similarities; (C) S. pombe genetic interaction data (Ryan et al. 2012), query gene similarities; and (D) S. pombe genetic interaction data, array gene similarities. The horizontal dotted line shows the background precision expected from a randomized ranking of gene pairs. The bar plot in the upper right corner of each panel shows the area under the precision-recall curve (AUPRC) above the background for each similarity measure. The area was calculated by summing the areas of trapezoids at increments of 2^n (log2 units). The bars are sorted by their respective areas above background.
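The area-above-background calculation described in this legend can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that the increments are powers of two in the number of true positives recovered (1, 2, 4, ...), that each trapezoid is one log2 unit wide, and that the background precision is the fraction of positive pairs in the standard. The function names are hypothetical.

```python
import numpy as np

def precision_at_true_positives(scores, labels):
    """Precision measured each time a true positive is retrieved.

    Element k-1 of the result is the precision at the k-th true positive
    when gene pairs are ranked by decreasing similarity score.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    order = np.argsort(-scores, kind="stable")
    hits = labels[order]
    true_positives = np.cumsum(hits)
    ranks = np.arange(1, len(hits) + 1)
    return (true_positives / ranks)[hits]

def auprc_above_background(scores, labels):
    """Trapezoid sum of (precision - background) on a log2 recall axis.

    Assumption: evaluation points are 2^n true positives and consecutive
    points are one log2 unit apart. At least one positive label is needed.
    """
    precision = precision_at_true_positives(scores, labels)
    background = float(np.mean(np.asarray(labels).astype(bool)))
    max_power = int(np.floor(np.log2(len(precision))))
    counts = 2 ** np.arange(max_power + 1)          # 1, 2, 4, ...
    excess = precision[counts - 1] - background
    # Each trapezoid has width 1 (log2 unit) and the mean of its two heights.
    return float(np.sum((excess[:-1] + excess[1:]) / 2.0))
```

For example, `auprc_above_background(scores, labels)` takes one similarity score and one 0/1 co-annotation label per gene pair and returns a single number that can be compared across similarity measures.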
Figure 2.
Comparison of similarity measures based on highly similar gene pairs and stability on partial data.
(A) The overlap of the top 1000 most similar gene pairs was compared between each pair of similarity measures on the query gene side of the S. cerevisiae genetic interaction data. (B) The stability of the similarity measures was assessed by comparing the overlap of the top 1000 most similar gene pairs computed using 10 different random selections of 50% of the data in each profile.
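The two comparisons in this legend can be illustrated with a short sketch. It is written under several assumptions that the legend does not state: profiles are rows of a matrix, a random selection keeps the same 50% of columns for every profile, and stability is summarized as the mean pairwise overlap between the top-k sets of the random selections. All function names are hypothetical, and only Pearson correlation and the dot product are shown as example measures.

```python
import itertools
import numpy as np

def top_pairs(similarity, k):
    """Return the k highest-scoring gene pairs (i < j) as a set of index tuples."""
    rows, cols = np.triu_indices_from(similarity, k=1)
    order = np.argsort(-similarity[rows, cols], kind="stable")[:k]
    return {(int(rows[i]), int(cols[i])) for i in order}

def pearson_similarity(profiles):
    return np.corrcoef(profiles)

def dot_similarity(profiles):
    return profiles @ profiles.T

def subsample_stability(profiles, similarity_fn, k, n_repeats=10,
                        fraction=0.5, seed=0):
    """Mean pairwise overlap of top-k pairs across random column subsets."""
    rng = np.random.default_rng(seed)
    n_cols = profiles.shape[1]
    n_keep = int(round(fraction * n_cols))
    top_sets = []
    for _ in range(n_repeats):
        # Assumption: one shared random subset of columns per repeat.
        cols = rng.choice(n_cols, size=n_keep, replace=False)
        top_sets.append(top_pairs(similarity_fn(profiles[:, cols]), k))
    overlaps = [len(a & b) / k
                for a, b in itertools.combinations(top_sets, 2)]
    return float(np.mean(overlaps))
```

The overlap between two measures, as in panel (A), is then `len(top_pairs(pearson_similarity(x), k) & top_pairs(dot_similarity(x), k))` for a profile matrix `x`.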
Figure 3.
Role of thresholding genetic interaction data in the performance of similarity measures.
The precision-recall plots were compared on the query side of the S. cerevisiae genetic interaction data at several thresholds: (A) ε < −0.08, only negative genetic interactions at an intermediate threshold; (B) ε < −0.2, only negative genetic interactions at a stringent threshold; (C) ε > 0.08, only positive genetic interactions at an intermediate threshold; (D) ε > 0.2, only positive genetic interactions at a stringent threshold; (E) |ε| > 0.08, negative and positive interactions at an intermediate threshold; and (F) |ε| > 0.2, negative and positive interactions at a stringent threshold. The bar plot in the upper right corner of each panel shows the area under the precision-recall curve (AUPRC) above the background for each similarity measure. The area was calculated by summing the areas of trapezoids at increments of 2^n (log2 units). The bars are sorted by their respective areas above background.
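The six thresholding conditions can be expressed with a single helper. This sketch assumes that entries failing the threshold are set to zero while passing entries keep their original ε value; the legend does not say whether values were retained or binarized. The function name and the `mode` labels are hypothetical.

```python
import numpy as np

def threshold_interactions(eps, cutoff, mode):
    """Zero out genetic interaction scores that fail a threshold.

    mode="negative": keep eps < -cutoff
    mode="positive": keep eps >  cutoff
    mode="both":     keep |eps| > cutoff
    """
    eps = np.asarray(eps, dtype=float)
    if mode == "negative":
        keep = eps < -cutoff
    elif mode == "positive":
        keep = eps > cutoff
    elif mode == "both":
        keep = np.abs(eps) > cutoff
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # Assumption: surviving entries keep their score rather than becoming 1.
    return np.where(keep, eps, 0.0)
```

Panel (A) would correspond to `threshold_interactions(eps, 0.08, "negative")` and panel (F) to `threshold_interactions(eps, 0.2, "both")`.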
Figure 4.
Investigation of Pearson correlation relative to the dot product for thresholded genetic interaction data.
In each panel, three instances of the genetic interaction data were used: the original data; the original data with all interactions whose absolute value was less than 0.2 set to zero; and the original data with all interactions whose interaction value was less than 0.2 randomly reorganized. The three data instances were investigated using (A) the precision-recall performance of Pearson correlation on each instance, (B) the performance of the dot product on the same three instances, (C) a histogram of the normalization factor (1/norm) of the profiles for the three instances, and (D) a histogram of the profile means for the three instances.
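The quantities examined in panels (C) and (D) are the two ingredients that separate Pearson correlation from the dot product: Pearson correlation is the dot product of mean-centered profiles, scaled by a 1/norm factor for each profile. The sketch below shows that decomposition together with one way to build the three data instances. It assumes that "norm" means the Euclidean norm of the mean-centered profile and that the randomized instance permutes the entries with absolute value below the cutoff among their own positions; neither detail is confirmed by the legend. Function names are hypothetical.

```python
import numpy as np

def pearson_via_dot(x, y):
    """Pearson correlation written as a centered, normalized dot product."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()            # removes the profile mean (panel D)
    y_centered = y - y.mean()
    x_factor = 1.0 / np.linalg.norm(x_centered)   # normalization factor (panel C)
    y_factor = 1.0 / np.linalg.norm(y_centered)
    return float((x_centered @ y_centered) * x_factor * y_factor)

def make_instances(data, cutoff=0.2, seed=0):
    """Return (original, weak-values-zeroed, weak-values-shuffled) copies."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    weak = np.abs(data) < cutoff          # assumption: absolute value is used
    zeroed = np.where(weak, 0.0, data)
    shuffled = data.copy()
    # Permute the weak values among the weak positions only.
    shuffled[weak] = rng.permutation(data[weak])
    return data, zeroed, shuffled
```

Comparing `pearson_via_dot` with a plain `x @ y` on the three instances shows how much of any difference comes from the centering and normalization terms rather than from the raw interaction values.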
Figure 5.
Role of noise in the genetic interaction data on similarity measure performance.
In each panel, simulated noise was added to the S. cerevisiae genetic interaction data, and query correlations were used to compare the similarity measures. The simulated noise conditions are (A) false negatives: 95% of the significant interactions (absolute interaction value greater than 0.08) were randomly set to 0; (B) false positives: values were randomly sampled from the set of genetic interactions whose absolute interaction value was greater than 0.08 and substituted in place of randomly selected non-interactions, and this sampling was repeated until 10 times the number of significant interactions in the original data had been added as false positives; and (C) Gaussian noise: random values from a Gaussian distribution with mean 0 and standard deviation 0.08 were added to all values (interactions and non-interactions) in the dataset. The bar plot in the upper right corner of each panel shows the area under the precision-recall curve (AUPRC) above the background for each similarity measure. The area was calculated by summing the areas of trapezoids at increments of 2^n (log2 units). The bars are sorted by their respective areas above background.
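The three noise conditions can be sketched as below. This is an illustration under stated assumptions rather than the procedure actually used: "non-interactions" are taken to be entries with absolute value at or below the cutoff, false-positive target positions are drawn without replacement, and the substituted values are drawn with replacement from the significant interactions. Function names are hypothetical.

```python
import numpy as np

def add_false_negatives(data, rng, fraction=0.95, cutoff=0.08):
    """Set a random fraction of significant interactions to zero."""
    out = np.array(data, dtype=float)     # C-contiguous copy
    flat = out.ravel()                    # view onto the copy
    significant = np.flatnonzero(np.abs(flat) > cutoff)
    n_drop = int(round(fraction * len(significant)))
    drop = rng.choice(significant, size=n_drop, replace=False)
    flat[drop] = 0.0
    return out

def add_false_positives(data, rng, multiple=10, cutoff=0.08):
    """Overwrite random non-interactions with sampled significant values."""
    out = np.array(data, dtype=float)
    flat = out.ravel()
    significant = np.flatnonzero(np.abs(flat) > cutoff)
    non_interactions = np.flatnonzero(np.abs(flat) <= cutoff)
    n_add = multiple * len(significant)
    if n_add > len(non_interactions):
        raise ValueError("not enough non-interactions to overwrite")
    targets = rng.choice(non_interactions, size=n_add, replace=False)
    flat[targets] = rng.choice(flat[significant], size=n_add, replace=True)
    return out

def add_gaussian_noise(data, rng, sigma=0.08):
    """Add independent N(0, sigma) noise to every entry."""
    data = np.asarray(data, dtype=float)
    return data + rng.normal(0.0, sigma, size=data.shape)
```

Each helper takes a `numpy.random.Generator`, for example `rng = np.random.default_rng(0)`, so that a simulated dataset can be regenerated exactly.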
Figure 6.
Role of simulated batch effects in genetic interaction data on similarity measure performance.
(A) Performance of similarity measures on the query side of the S. cerevisiae genetic interaction network when simulated intermediate batch effects were added to the data. The batch effects were added by creating random batches of size 5 and adding Gaussian noise (μ = 0, σ = 0.02) to each batch. Furthermore, Gaussian noise (μ = 0, σ = 0.02) was added to the entire dataset. (B) A stronger batch effect signature and noise were added (μ = 0, σ = 0.04 for both batch effect and noise). (C) and (D) are the corresponding plots for the query side of the S. pombe genetic interaction data (μ = 0, σ = 1 for (C), and μ = 0, σ = 2 for (D)). The bar plot in the upper right corner of each panel shows the area under the precision-recall curve (AUPRC) above the background for each similarity measure. The area was calculated by summing the areas of trapezoids at increments of 2^n (log2 units). The bars are sorted by their respective areas above background.
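One possible reading of the batch-effect simulation is sketched below. The legend does not specify what is batched or how the per-batch noise is shared, so this sketch makes explicit assumptions: profiles (rows) are randomly partitioned into batches of 5, each batch receives one Gaussian offset vector that is shared by all of its rows, and independent Gaussian noise is then added to every entry. The function name and return values are hypothetical.

```python
import numpy as np

def add_batch_effects(data, batch_size=5, sigma_batch=0.02,
                      sigma_noise=0.02, seed=0):
    """Add a shared per-batch offset plus independent noise to each entry.

    Returns the perturbed matrix and the batch index assigned to each row.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n_rows, n_cols = data.shape
    order = rng.permutation(n_rows)       # random assignment of rows to batches
    out = data.copy()
    batch_of_row = np.empty(n_rows, dtype=int)
    for batch, start in enumerate(range(0, n_rows, batch_size)):
        rows = order[start:start + batch_size]
        # Assumption: one offset vector per batch, shared by its rows.
        out[rows] += rng.normal(0.0, sigma_batch, size=n_cols)
        batch_of_row[rows] = batch
    # Independent noise on the entire dataset.
    out += rng.normal(0.0, sigma_noise, size=out.shape)
    return out, batch_of_row
```

Under this reading, panel (B) would use `sigma_batch=0.04, sigma_noise=0.04`, and the S. pombe panels would use the larger σ values given in the legend.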