Multiple Genetic Interaction Experiments Provide Complementary Information Useful for Gene Function Prediction

Genetic interactions help map biological processes and their functional relationships. A genetic interaction is defined as a deviation from the expected phenotype when combining multiple genetic mutations. In Saccharomyces cerevisiae, most genetic interactions are measured under a single phenotype - growth rate in standard laboratory conditions. Recently genetic interactions have been collected under different phenotypic readouts and experimental conditions. How different are these networks and what can we learn from their differences? We conducted a systematic analysis of quantitative genetic interaction networks in yeast performed under different experimental conditions. We find that networks obtained using different phenotypic readouts, in different conditions and from different laboratories overlap less than expected and provide significant unique information. To exploit this information, we develop a novel method to combine individual genetic interaction data sets and show that the resulting network improves gene function prediction performance, demonstrating that individual networks provide complementary information. Our results support the notion that using diverse phenotypic readouts and experimental conditions will substantially increase the amount of gene function information produced by genetic interaction screens.


Genetic interaction networks

Data sets
All genetic interaction data sets were downloaded from the original publications or requested from the authors. When comparing two data sets, we only consider gene pairs tested in both. All networks were treated as undirected (query and array genes, where reported, play the same role in our analysis).

Table S1. Description of the genetic interaction data sets.

Definition of the common space
To compare data sets meaningfully, we consider only gene pairs that were tested in all of them. Genes that were not present in every data set being compared are filtered out.
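
As a minimal sketch of this filtering step, the Python below keys each tested pair by an unordered gene pair and keeps only the pairs tested in every data set. The gene names, scores, and function names are hypothetical and serve only as illustration; this is not the code used in the study.

```python
from typing import Dict, FrozenSet, Set, Tuple

GenePair = FrozenSet[str]


def to_undirected(scores: Dict[Tuple[str, str], float]) -> Dict[GenePair, float]:
    """Key scores by unordered pair so (query, array) and (array, query) coincide.

    If a pair appears in both orientations, the later score overwrites the earlier one.
    """
    return {frozenset(pair): score for pair, score in scores.items()}


def common_space(*datasets: Dict[Tuple[str, str], float]) -> Set[GenePair]:
    """Return the gene pairs that were tested in every data set."""
    keyed = [set(to_undirected(d)) for d in datasets]
    return set.intersection(*keyed)


# Toy data (hypothetical gene names and scores).
sga = {("geneA", "geneB"): -0.12, ("geneA", "geneC"): 0.01, ("geneB", "geneD"): 0.09}
other = {("geneB", "geneA"): -0.30, ("geneC", "geneD"): 0.20, ("geneD", "geneB"): 0.02}

shared = common_space(sga, other)
```

Here `shared` contains the two pairs tested in both toy data sets, regardless of which gene was the query and which the array.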

Filtering interactions
The SGA data set is defined as the intermediate data set in Costanzo et al. (|epsilon| > 0.08 and p < 0.05). We define the cutoffs for the other data sets so that the numbers of positive and negative observed interactions are the same as for SGA.

Table S8. Sensitivity and precision of SGA genetic interaction scores [1].

The probability of a false negative (FN), or false negative rate, follows directly from the sensitivity (false negative rate = 1 - sensitivity). Unfortunately, we do not know the error rates for the other data sets. Consequently, we used the values from the SGA data set as an estimate of these rates. We define the cutoff so that the numbers of observed interactions match between the two data sets (if both data sets are sampled from the same model with the same error rates, we expect the same numbers of interactions on the common space).
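
The count-matching idea can be sketched as below. It assumes that higher scores are more positive, that scores are restricted to the common space, and that ties at the cutoff can be ignored; the function name and the toy scores are hypothetical.

```python
from typing import List, Optional, Tuple


def matching_cutoffs(
    scores: List[float], n_positive: int, n_negative: int
) -> Tuple[Optional[float], Optional[float]]:
    """Pick cutoffs so that the other data set yields the same interaction counts as SGA.

    Pairs with score >= positive cutoff are called positive interactions and
    pairs with score <= negative cutoff are called negative interactions.
    """
    ordered = sorted(scores)
    positive_cutoff = ordered[-n_positive] if n_positive > 0 else None
    negative_cutoff = ordered[n_negative - 1] if n_negative > 0 else None
    return positive_cutoff, negative_cutoff


# Toy scores for the other data set; suppose SGA has 2 positive and 1 negative
# interactions on the same gene pairs.
toy_scores = [-0.5, -0.3, -0.1, 0.0, 0.05, 0.2, 0.4]
pos_cut, neg_cut = matching_cutoffs(toy_scores, n_positive=2, n_negative=1)
```

With real data, tied scores at the cutoff would yield slightly more interactions than requested, so a tie-breaking rule would be needed.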

Expected counts
Using the notation of Chiang et al. [7], we define gene pairs as being interacting (I) or not interacting (Ic). Given the values n=|I| and m=|Ic|, we can define the expected values of three random variables: the number of gene pairs where no edge exists in either data set (X0), the number of gene pairs where an edge exists in exactly one data set and not the other (X1), and the number of gene pairs where an edge exists in both data sets (X2). (1) The total number of gene pairs considered is N = n+m.
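
The expression for these expected values is not reproduced in this text. As an illustration only, the sketch below shows one way they could be computed under an explicit assumption: each data set observes every gene pair independently, with the same false negative rate and false positive rate in both data sets. This independence model, the parameter names, and the numbers are assumptions made for the example and may differ from the model of Chiang et al. [7].

```python
from typing import Tuple


def expected_counts(
    n: int, m: int, fn_rate: float, fp_rate: float
) -> Tuple[float, float, float]:
    """Expected (X0, X1, X2) for two data sets under an assumed independent-error model.

    n: number of truly interacting pairs, m: number of non-interacting pairs.
    """
    tp_rate = 1.0 - fn_rate  # chance that a true interaction is observed
    tn_rate = 1.0 - fp_rate  # chance that a non-interaction is not observed
    x2 = n * tp_rate ** 2 + m * fp_rate ** 2                      # edge in both
    x1 = 2 * n * tp_rate * fn_rate + 2 * m * fp_rate * tn_rate    # edge in exactly one
    x0 = n * fn_rate ** 2 + m * tn_rate ** 2                      # edge in neither
    return x0, x1, x2


# Hypothetical values for illustration.
x0, x1, x2 = expected_counts(n=100, m=900, fn_rate=0.4, fp_rate=0.01)
```

Under this model the three expected counts always sum to N = n + m, which gives a quick sanity check.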

Comparison of observed and expected
We consider here the presence or absence of an interaction between each tested gene pair. The overlap is measured by the Jaccard coefficient (intersection / union).

Table S9. Comparison of expected and observed measures between the reference SGA network and other networks, separated by type (positive/negative interactions). The column D indicates whether the observed overlap/unique measure is more (+) or less (-) than expected. P-values are computed using a Fisher's exact test between expected and observed counts.
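
As a small sketch of the observed side of this comparison, the code below counts gene pairs with an edge in neither, exactly one, or both networks, and derives the Jaccard coefficient as X2 / (X1 + X2). The function names and the toy sets are hypothetical; any hashable identifier can stand in for a gene pair.

```python
from typing import Hashable, Set, Tuple


def overlap_counts(
    edges_a: Set[Hashable], edges_b: Set[Hashable], tested: Set[Hashable]
) -> Tuple[int, int, int]:
    """Return observed (X0, X1, X2) over the gene pairs tested in both networks."""
    a = edges_a & tested
    b = edges_b & tested
    x2 = len(a & b)              # edge in both networks
    x1 = len(a ^ b)              # edge in exactly one network
    x0 = len(tested) - x1 - x2   # edge in neither network
    return x0, x1, x2


def jaccard(edges_a: Set[Hashable], edges_b: Set[Hashable], tested: Set[Hashable]) -> float:
    """Jaccard coefficient: intersection / union of the two edge sets."""
    _, x1, x2 = overlap_counts(edges_a, edges_b, tested)
    union = x1 + x2
    return x2 / union if union else 0.0


# Toy example with integer identifiers standing in for gene pairs.
tested_pairs = {1, 2, 3, 4, 5, 6}
network_a = {1, 2, 3}
network_b = {2, 3, 4}
observed = overlap_counts(network_a, network_b, tested_pairs)
overlap = jaccard(network_a, network_b, tested_pairs)
```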

All tested gene pairs
For two given network groups, we test the difference between the means of each measure with a Student's t-test. When a group contains a single network (MMS), we assess significance using a normal distribution whose mean and standard deviation are estimated from the control distribution, which is assumed to be normally distributed (normality is not rejected by the Shapiro-Wilk test).
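
The single-network case can be sketched as follows. The example assumes a two-sided test, which is not stated above, and uses made-up control values; the function name is hypothetical.

```python
from statistics import NormalDist, mean, stdev
from typing import Sequence


def single_network_pvalue(value: float, control_values: Sequence[float]) -> float:
    """Two-sided p-value of one observation against a normal fit of the controls.

    The mean and (sample) standard deviation are estimated from the control
    distribution, which is assumed to be normal.
    """
    control = NormalDist(mean(control_values), stdev(control_values))
    lower_tail = control.cdf(value)
    return 2.0 * min(lower_tail, 1.0 - lower_tail)


# Made-up control measurements and two made-up observations.
controls = [1.0, 2.0, 3.0]
p_typical = single_network_pvalue(2.0, controls)
p_extreme = single_network_pvalue(10.0, controls)
```

An observation equal to the control mean gives a p-value of 1, while one far in either tail gives a p-value close to 0.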

Triplets of gene pairs tested across reference, control and condition
We consider here only gene pairs that were tested in the reference network and in a PHENO/MMS network and a CONTROL network. There are a total of 48499 of these triplets of gene pairs. We computed the similarity measures on the subset of gene pairs present in a given triplet of networks described in Table S12.

Gene function prediction performance
For each network, we only consider GO terms with at least five genes in the network.

Figure S2. Performance of the combined and reference networks as measured by the area under the PR curve.

Table S11. Improvement of the gene function prediction performance of the combined network compared with the condition and reference networks alone, as measured by the area under the PR curve. The relative improvement of the combined network C obtained from two individual networks A and B is computed as follows:

All tested gene pairs
is the mean score of the two individual networks A and B. Significant outliers were identified based on their residuals to the linear fit. P-values were then computed under the assumption that the distribution of the residuals is normal, and were further corrected for multiple testing using the Benjamini-Hochberg method (FDR < 0.05).
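
The outlier step can be sketched as below, assuming an ordinary least-squares fit of one score against another, two-sided p-values from a normal distribution fitted to the residuals, and Benjamini-Hochberg adjusted p-values. Which two quantities enter the fit is an assumption of the example, and the function names and toy data are hypothetical.

```python
from statistics import NormalDist, mean, stdev
from typing import List, Sequence


def residuals_to_linear_fit(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Residuals of y to the least-squares line y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_outliers(x: Sequence[float], y: Sequence[float], fdr: float = 0.05) -> List[int]:
    """Indices whose residual is significant after multiple-testing correction."""
    residuals = residuals_to_linear_fit(x, y)
    fitted = NormalDist(0.0, stdev(residuals))  # residuals assumed normal, mean 0
    pvalues = [2.0 * (1.0 - fitted.cdf(abs(r))) for r in residuals]
    adjusted = benjamini_hochberg(pvalues)
    return [i for i, q in enumerate(adjusted) if q < fdr]


# Toy data: a perfect line with one point displaced upward.
xs = [float(i) for i in range(20)]
ys = [xi + (5.0 if i == 10 else 0.0) for i, xi in enumerate(xs)]
outliers = significant_outliers(xs, ys)
```

In this toy case only the displaced point is reported. With very few points a single outlier inflates the residual standard deviation enough to mask itself, so the approach is more informative when many terms are fitted.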

Triplets of gene pairs tested across reference, control and condition
We study gene function prediction performance on the sets of interactions present in three data sets as described above (triplets). To assess the complementarity of the networks, we consider the gene function prediction performance when combining the PHENO/MMS network with the SGA reference and when combining the CONTROL network with the reference. The relative improvement of the combined network C obtained from two individual networks A and B is computed as:

Figure S8. Clustering of the data sets based on the gene profile correlation values. The hierarchical clustering was done using different criteria (Ward, Complete, Average, Median).