Predicting Quantitative Genetic Interactions by Means of Sequential Matrix Approximation

Despite the emerging experimental techniques for perturbing multiple genes and measuring their quantitative phenotypic effects, genetic interactions have remained extremely difficult to predict on a large scale. Using a recent high-resolution screen of genetic interactions in yeast as a case study, we investigated whether the extraction of pertinent information encoded in the quantitative phenotypic measurements could be improved by computational means. By taking advantage of the observation that most gene pairs in the genetic interaction screens have no significant interactions with each other, we developed a sequential approximation procedure which ranks the mutation pairs in order of evidence for a genetic interaction. The sequential approximations can efficiently remove background variation in the double-mutation screens and give increasingly accurate estimates of the single-mutant fitness measurements. Interestingly, these estimates not only provide predictions for genetic interactions which are consistent with those obtained using the measured fitness, but they can even significantly improve the accuracy with which one can distinguish functionally-related gene pairs from the non-interacting pairs. The computational approach, in general, enables an efficient exploration and classification of genetic interactions in other studies and systems as well.


Introduction
The systematic mapping of genetic interactions in biological systems has the potential to provide a better understanding of how genes function as networks to carry out and regulate cellular processes. In particular, recent advances in the experimental technologies which allow for the large-scale screening of the effects of combinatorial gene deletions are providing an exciting glimpse into the organization of complex genetic networks in terms of revealing novel interacting cellular components and compensatory pathways involved in many cell functions. Comprehensive maps of genetic interactions in model organisms, such as yeast, may also provide a valuable template for understanding the basic principles underlying the relationships between genotype and phenotype in other populations [1]. In humans, genetic interactions are involved in many complex phenotypes and they contribute to most genetic disorders, but the organization of the underlying networks is largely unknown [2,3]. Due to their combinatorial nature, the mapping of genetic interactions is highly labor-intensive even in genetically amenable organisms. Efficient computational frameworks are therefore required to underpin the full potential of these experiments.
Several large-scale studies, especially in yeast Saccharomyces cerevisiae, have already identified a number of synthetic lethal interactions, in which a combination of two individually non-lethal mutations results in lethality [1,4]. Genome-wide screening strategies for synthetic sick or lethal interactions, such as those based on synthetic genetic arrays (SGA) or the diploid synthetic lethality analysis by microarray (dSLAM), have successfully been used for providing insights into the nature of genetic robustness [5], and for identifying functional relationships among the genes and pathways [6]. In addition to this rather limited spectrum of observed phenotypes (synthetic sick/lethal vs. non-interacting pairs), quantitative phenotypes, such as the relative growth rate of yeast colonies, have recently been explored systematically using high-throughput screening approaches like epistatic miniarray profiling (E-MAP) and genetic interaction mapping (GIM) [7,8]. The importance of measuring a broader spectrum of genetic interactions when identifying functionally-related genes and pathway organizations has been demonstrated in theoretical and experimental studies [9][10][11]. To provide reliable information on genetic interactions, customized data handling and pre-processing pipelines have been developed for the different screening approaches [12][13].
Regardless of the experimental technology used, the screening strategies aim to quantify the extent to which a mutation in one gene modulates the phenotype (or fitness) associated with altering the second gene, either by explicitly measuring or analytically comparing the observed fitness of double-mutants to those of single-mutants. More formally, a genetic interaction between mutants i and j can be defined by the deviation (e ij ) of an observed double-mutant phenotype (P ij ) from the expected neutral phenotype of an organism's fitness (E ij ) under the hypothesis that it carries two non-interacting mutations (the null hypothesis). If the fitness P ij is evaluated in terms of the growth rate of double-mutant w ij , relative to the wild-type growth rate, and E ij is a function g(w i , w j ) of the relative single-mutant fitness values w i and w j , this definition can be formulated as: When testing the null hypothesis, a large absolute deviation |e ij | provides evidence for genetic interaction, while deviations close to zero indicate non-interacting gene pairs. Significant genetic interactions can further be classified into so-called synergistic interactions (e ij ,0) and alleviating interactions (e ij .0). Synergistic interactions occur when a double-mutant has a more extreme effect on the fitness than would be expected from independent single mutants alone, and can therefore identify e.g. complementary pathways, with synthetic lethality being the extreme case (w ij = 0). Alleviating interactions, in which the double-mutant phenotype is less severe than expected, can occur, for example, when the first mutation already impairs the function of a whole pathway and thereby masks the effect of the second mutation in the same pathway.
Recently, Mani et al. [14] demonstrated that the product function g(w i , w j ) = w i w j provides a convenient null model in the sense that it yields a distribution with location close to zero and low dispersion over all of the measured deviations. The comparison was based on the principle that, as the vast majority of gene pairs should be non-interacting, the rare gene pairs sharing a specific function should be distinguishable from this background distribution as outlying cases with extreme absolute deviations. Accordingly, it was shown that the observed deviations based on the multiplicative null model were indeed most accurate at identifying functional relationships between the genes [14]. In the present study, we asked two follow-up questions: (1) whether the observed deviations could be estimated directly from the doublemutant phenotypes under the multiplicative model; and (2) whether the prediction of specific functional links could be made at a similar accuracy without utilizing the measurements of singlemutant phenotypes. Under the assumptions that significant genetic interactions are rare and that the multiplicative model is a reasonable approximation in the case of no interaction, we developed a sequential approach that enables a multi-resolution approximation of the double-mutant fitness matrix to address these particular questions, and more generally, to provide a computational framework for exploring genetic interaction datasets.

Estimating single-mutant fitness values
As an initial study objective, we sought to assess the accuracy to which the single-mutant fitness vector w~w i ð Þ n i~1 could be estimated directly from the double-mutant fitness matrix W~w ij À Á n i,j~1 . In the quantitative genetic interaction dataset of St Onge et al. [10], which was used in the following results, w is a 26-dimensional column vector and W is a 26626-dimensional symmetric matrix. Under the multiplicative null model, Eq. 1 takes the form: where w6w is the tensor product (or outer-product) of the vector w with itself, and E is the n6n-matrix comprising the e ij values as its elements for each gene pair i, j = 1,2,…,n. In the ideal case, when there are no measurement inaccuracies, the approximation problem of Eq. 2 could easily be solved using the well-established machinery of linear algebra. More precisely, using the spectral decomposition theorem, one can represent any symmetric real matrix as W~le6e z E, where l is the largest eigenvalue of W and e is the corresponding eigenvector [15]. Under the unrealistic assumption that there are no genetic interactions among any of the gene pairs, the approximation would in fact be exact, that is, the residual error E equals zero. However, as the genetic interaction screens are bound to present with experimental variation, missing data, and hopefully also with significant genetic interactions, the estimation problem must in practice be solved by numerical means.
In the present work, we developed a sequential matrix approximation procedure, which uses increasingly larger subsets of mutation pairs in W to provide a series of estimates for w as solutions of the weighted least squares problem [16]. During the first rounds, the procedure solves the approximation problem of Eq. 2 using only those mutation pairs that best fit the multiplicative model, and then gradually extends the subset to also include pairs with larger residual errors (see Methods for details). Already when using all but the diagonal and missing entries of the double-mutant fitness matrix in the dataset of St Onge et al. [10], we obtained an estimate relatively close to the actual measured single-mutant fitness vector, as compared to the conventional median estimate ( Figure 1A). The estimation accuracy could be markedly improved by excluding those pairs with the largest residual errors from the approximation subset. The pairs having the greatest impact on the approximation error, in fact, corresponded to the five confirmed synthetic lethal pairs ( Figure 1B). When the sequential procedure omitted those pairs from the approximation process, an accurate estimate was obtained for each of the single-mutant fitness measurements (see Figure S1). As will be seen in the following subsections, however, the subset of mutation pairs which gives the most accurate estimates of the single-mutant fitness values does not necessarily lead to the best predictive power when identifying functionally related genes.

Predicting specific functional relationships
Beyond the dynamic variability in the estimates of the singlemutant fitness values, the behavior of the approximation procedure with different subsets of mutation pairs revealed another interesting observation: the order in which the mutation pairs were added into the approximation process reflects on average the relative order of their actual measured deviations, even if the measurements of single-mutant fitness were not employed ( Figures 1C and D). In fact, the measured deviations e ij and the ranked deviationsẽ e ij obtained from the approximation procedure were highly correlated (Pearson correlation equals 0.964, see Figure S2). This led us to investigate whether such procedure-ranked deviations could be used instead of the measured deviations when predicting functional links between the mutations. To this end, we took the same set of gene pairs which were found to have a highly specific shared function in the previous study by Mani et al. [14] (see Methods for their definition). Interestingly, the majority of the mutation pairs selected towards the end of the approximation procedure shared a specific function ( Figure 2). The rate of the functional enrichment observed among the 50 pairs with the largestẽ e ij values was significantly higher than expected (p,10 211 ), whereas the functional enrichment was exceptionally low among the 50 pairs with the lowestẽ e ij values (p = 0.998). These results show that the sequential matrix approximation procedure gives as its byproduct a ranking of the mutation pairs that is in good agreement with their likelihood of sharing a specific function.
To test more systematically whether the prediction of the functional relationships could be made to an accuracy similar to that obtained when using the measured single-mutant fitness values, we assessed how well the ranking based onẽ e ij values can discriminate between gene pairs with or without a specific functional link. Similarly as in the original study of St Onge et al. [10], the prediction capability was evaluated using the receiver operating characteristic (ROC) curves that show the relative tradeoff between the sensitivity and specificity of the predictions at multiple decision thresholds ( Figure 3). Surprisingly, the ranked deviations gave even a better prediction accuracy than the measured deviations according to the area under the ROC curve (AUC) values (AUC = 0.780 vs. AUC = 0.662, p = 0.003). The prediction capability was improved systematically at each specificity-level, demonstrating that the procedure could distinguish with high accuracy the functionally related gene pairs over the whole spectrum of the exceptional deviations ( Figure 3). The order in which the mutation pairs were included into the approximation process further improved the relative classification power (AUC = 0.789, p = 0.002). The prediction accuracy of the double-mutant fitness values alone was similar to that of a random classifier (AUC = 0.506), demonstrating that the normalization by the measured or estimated single-mutant fitness values can, in any case, improve the prediction of the functional links. The estimates achieved using different subsets of mutation pairs can also lead to different degrees of prediction accuracy ( Table 1).

Distribution of the estimated deviations
Finally, we investigated how the estimated deviations b e e k ð Þ ij obtained from the approximation procedure at different subset sizes, k, are distributed relative to the true deviations e ij obtained when using the measured single-mutant fitness values. When comparing the different distributions, we used the same interpretation rule as in the earlier comparison by Mani et al. [14]: an ideal definition of genetic interaction should result in a tight distribution (indicating low variability) that is centered at zero (indicating low bias) for the bulk of the measured interactions (reflecting the background distribution of non-interacting genes). The subset of mutation pairs being used in the approximation process had a marked effect on the distribution of the estimated deviations ( Figure 4). Moderate subset sizes generated distributions with a lower bias and variability than those obtained using smaller subset sizes or all of the mutations (Table 1). Surprisingly, the cutoff point, k = 317, which gave the most accurate estimates for the single-mutant fitness values resulted in a relatively weak prediction accuracy for the functional links (AUC = 0.679). Although the   Figure 4). The estimated deviations obtained using this particular subset of mutation pairs were also most successful in predicting the functional relationships (AUC = 0.810). These results indicate that the distribution characteristics of the estimated deviations could serve as a guide to choosing the optimal subset of gene pairs for defining genetic interactions in a given dataset. Both the optimal subset size and the overall performance of the method is likely to depend on the properties of the dataset being analysed, including the number of gene pairs and and the degree of their functional homogeneity.
To investigate how the differences in the distributions are visible in the conventional heat map visualizations, we displayed the color-coded deviations on a two-dimensional grid spanned by the individual mutations. Here, a special emphasis was placed on analyzing the estimated deviations b e e 136 ð Þ ij , which provided the most ideal definition of genetic interactions in terms of both distribution characteristics and predictive power, relative to the measured deviations e ij ( Figure 5). In general, the interaction patterns were relatively similar between the measured and estimated deviations. However, the transformation of the double-mutation fitness measurements through the approximation process seemed to emphasize the mutation pairs with exceptionally large absolute deviations (putative genetic interactions), and pointed out especially those pairs having positive deviations (alleviating interactions), while it diminished certain subsets of mutation pairs with negative deviations (synergistic interactions). For instance, a considerable number of double-deletion strains involving either hpr5 or sgs1 mutations showed marked evidence for alleviating interactions in the color map of the estimated deviations ( Figure 5A), while these pairs were not identifiable from the original map ( Figure 5B). At the same time, the approximation algorithm blotted out certain moderate degree synergistic interactions in the hpr5 deletion strain, including the effects of additional mutations of mms4, mph1, or mus81. Although these and other changes contributed positively to the prediction of functional relationships in the dataset of St Onge et al., further evaluation how well these findings can be generalized beyond this relatively small set of functionally related genes is required on independent datasets.

Discussion
The growing availability of large-scale genetic interaction datasets is enabling computational methods to systematically explore how genes interact to produce phenotypes on a global network-level. While these datasets yield an unprecedented insight into the organization and function of complex genetic networks, their analysis also poses many challenging computa-  with different sizes of subsets of those gene pairs used in the approximation process (see Figure 4). The distribution of the measured deviations e ij is used as a references value for the different parameters (the first row). As robust measures of location (bias) and dispersion (variability), we calculated the trimmed mean and interquartile range, respectively, in addition to the median and median absolute deviation that were used in the previous comparative study [14]. The bold type indicates the subset size which provided the most accurate prediction of the functional links in terms of the area under the receiver operating characteristic curve (AUC). doi:10.1371/journal.pone.0003284.t001 tional problems. Using a high-resolution screen of genetic interactions in yeast as an example dataset of St Onge et al. [10], we have demonstrated that the computational approach based on sequential matrix approximations facilitates extraction of pertinent information from the background variation. A key finding of the present work is that the double-mutant fitness matrix alone carries enough information for accurate estimation of the single-mutant fitness values and for prediction of functional relationships among the genes. This makes it possible to avoid performing the single-mutant growth experiments, without compromising -if not even improving -the functional prediction power encoded in the double-mutant measurements. Surprisingly, the subset of mutation pairs which gave the most accurate estimates for the single-mutant fitness values did not lead to the most accurate predictions of the functional links. This may be due to experimental variability, such as differences in growth or screening conditions when measuring the strains carrying either single or double mutations, which may be beyond the capacity of the standard data pre-processing but can be normalized by the sequential approximation procedure. Other possible explanations for this surprising observation include biases in the definition of the gene pairs sharing a specific function or in the targeted set of genes pairs chosen for the particular interaction screen. Further study is therefore needed to confirm whether similar results can be obtained also in larger genetic interaction datasets, in which genes with much wider variety of functional categories are studied.

Limitations and extensions of the procedure
Perhaps the biggest technical limitation of the present work concerns the heuristic way in which the subsets of mutation pairs were selected for the approximation of the double-mutant fitness matrix. The greedy subset selection scheme was motivated by a similar approach successfully being used in many feature selection problems [17]. An adaptive version of such a forward floating selection method was applied here because of its low computational complexity and because it was capable of excluding the most prominent outliers during the sequential approximation process (Fig. 1). Similarly, despite the weighted least squares matrix approximation algorithm being based on a rather straightforward decomposition method, it was able to reduce some degree of background variation in the data (Fig. 4). However, more sophisticated search and approximation schemes based on e.g. genetic algorithms or simulated annealing should lead to even better estimation and prediction results, or at least reduce the computational complexity. Additional modifications to enhance the present framework either in biological and/or computational terms include using deviations from the expected fitness as weights in the least squares approximation and using the sign of the deviations when including or excluding a mutation pair over the course of the sequential approximation process. While the present results were based on the rank-one approximation only, which enabled the partitioning of pairs of genes into two categories (interacting or non-interacting), utilizing the higher order ranks could allow us to classify the quantitative measurements  Table 1 lists the shape parameters of these distributions calculated over all of the mutation pairs. The two distributions in each individual plot are scaled according to their total number of pairs. The non-scaled versions of the same distributions are provided as Figure S3, which allows for better visual comparison between the functionally-linked and the functionally non-linked pairs. doi:10.1371/journal.pone.0003284.g004 into several categories, for instance, synergistic and alleviating interactions, or even more fine-grained classification of interactions that can occur between genes [18]. This could also help us also to distinguish those biological modules in which the distribution of genetic interactions does not follow the ideal tight and zero-centered distribution that has been used traditionally [9,14].

Future applications and research directions
In spite of the above mentioned technical limitations, the present results support the feasibility of the approximation framework for systematic exploration of genetic interaction data, and warrant its applications to larger-scale datasets, such as those generated with the E-MAP or GIM screens, to confirm whether similar findings can be extrapolated to the genetic interactions data derived from high-throughput technologies. Other quantitative phenotypes or experimental techniques for defining or measuring genetic interactions could, in principle, also be used, although certain modifications will be needed to adapt the procedure to the specific characteristics of each genetic interaction screen. In the larger-scale screens, the gene pairs under analysis can be selected more randomly among a wider range of functions, thus increasing the expected proportion of non-interacting pairs. Accordingly, the more interactions there are being measured, the better the assumptions behind the multiplicative model will be justified, provided that significant genetic interactions remain relatively rare. The many missing values typically occurring in the large-scale screens should not pose a major problem for our approach either, due to its sequential nature being able to adapt to those subsets of doublemutant measurements with the best approximation power. Hence, the approximation approach is likely to yield even better results with larger and unbiased datasets. Similarly, even if the assumption of low frequency of significant interactions may become compromised in more targeted studies, such as the one of St Onge et al, this should not have a major effect on the results as the strongest genetic interaction pairs are effectively filtered out in the sequential estimation process. For such smaller-scale and more targeted genetic interaction studies, a further increase in the performance could be obtained by modifying the null model for non-interacting pairs to take into account the multitude of singlemutants affecting the particular double-mutant fitness value. This is one of the modeling challenges which we aim to tackle in the future.

Integrative analysis together with other data sources
More generally, the computational approach based on the sequential matrix approximation can provide a principled framework for exploring and classifying genetic networks and interactions using a wide spectrum of global data sources, including the localization, mRNA or protein expression, physical interaction and functional annotation of the proteins encoded by the genes [19]. It has previously been demonstrated that physical protein-protein interactions, in particular, provide useful information that is, by and large, complementary to that obtained from the functional genetic interactions [6,20]. To reveal the modular structure of the underlying networks and functional organization the multitude of pathways reflected in such largescale data types, various network partitioning methods have been used to detect either densely-or similarly-connected clusters as well as significantly-repeated motifs in the individual or integrated interaction networks [21][22][23][24]. However, many open questions still remain about the integrative analysis strategy of these datasets and the most meaningful interpretation of their results. For instance, the extent to which the genetic interaction could be explained by the other information sources, such as protein-protein, protein-DNA, metabolic network and protein structure data [25][26][27][28], and how these should be efficiently employed when scoring genetic interactions using measures such , which generates the most ideal distribution and provides the best discrimination between the functionally-linked and the non-functionally linked pairs (see Table 1). While the five confirmed synthetic lethal pairs (sgs1D mus81D, sgs1Dmms4D, sgs1Dslx4D, sgs1Dhpr5D and rad54Dhpr5D) are clearly visible in both of the maps, there are marked differences in the more subtle interaction scores at many places of the matrix between the estimated deviations and the measured deviations e ij . Red color corresponds to synergistic interaction scores and blue to alleviating interactions. The grey boxes indicate missing data points. doi:10.1371/journal.pone.0003284.g005 as the pairwise deviations, the S-and COP-scores, or the correlation and congruence between the interaction patterns [9][10][11][12][27][28][29]]. Finally, the success of any computational approach for constructing genetic interaction networks is likely to be driven by parallel improvements in the experimental technologies, such as enabling measurement of phenotypic effects in response to the mutation of more than two genes in combination.

Materials and Methods
Approximating the double-mutant fitness matrix generally congruent with the ranked deviations. The distributions of the estimated deviations with different subset sizes k are illustrated in Figure 4 in the dataset of St Onge et al. [10].

Quantitative genetic interaction measurements
To evaluate the performance of the sequential matrix approximation procedure in practice, we applied it to a recent high-resolution genetic interaction study of St Onge et al. [10]. This particular study was chosen because it contains quantitative growth-rate measurements of both single-and double-mutant cell populations for a targeted set of 26 genes related to DNA repair in yeast S. cerevisiae. The detailed time course fitness measurements were performed in the presence and absence of the DNAdamaging agent methyl methaneusulfonate (MMS). The results of the present study were based on the growth measurements in the absence of MMS. The prediction of functionally related gene pairs was in fact more challenging in this case than in the data measured in the presence of MMS. The measured and estimated singlemutant fitness values and the double-deletion deviations in the dataset are shown in Figure S1 and Figure 5, respectively.

Defining gene pairs sharing a specific function
Functional links among the 26 genes were defined using the same approach as in many previous genetic interaction studies [6,10,14]. Briefly, a term in the Biological Process branch of the Gene Ontology was considered specific if it was associated with fewer than 30 yeast genes, and two genes were considered to have a specific functional relationship if they shared any of those specific terms [14]. This resulted in the set of 35 specific functional links in the dataset of St Onge et al. [10].

Statistical evaluation of the predictive power
Statistical enrichment of the specific functional links among a set of mutation pairs selected by the sequential approximation procedure was assessed using the standard hypergeometric test. Briefly, if t is the number of top mutation pairs selected according to their residual errors, and M is the total number of the functionally-related links, then the probability of obtaining at least m functionally-related pairs when selecting pairs at random from the set of K mutation pairs can be calculated using the cumulative distribution function: The enrichment for the M = 35 functionally-linked pairs among the t = 50 mutation pairs selected on the basis of either small (m = 1) or large (m = 22) residual errors is shown in Figure 2. The dotted line shows the expected rate of the functional links when selecting mutation pairs at random, that is, M/K = 35/323 = 0.108.
The predictive power of the measured and estimated deviations was assessed using the receiver operating characteristic (ROC) curves that characterize the relative trade-off between the true positive rate (sensitivity) and the false positive rate (1 -specificity). The overall predictive performance was summarized using the area under the ROC curve (AUC). The statistical significance of the difference in the AUC values between two genetic interaction measures was assessed using a custom written algorithm based on the method of DeLong et al. [30]. This nonparametric method uses the theory of generalized U-statistics to calculate an estimated covariance matrix and hence it can also take into account the correlated nature of the data. The ROC curves and the corresponding AUC values for the prediction of the 35 functional links in the dataset of St Onge et al. [10] are shown in Figure 3.