
Conceived and designed the experiments: KWM FM XW. Performed the experiments: XW KWM. Analyzed the data: XW MAC. Contributed reagents/materials/analysis tools: XW MAC. Wrote the paper: XW KWM FM.

The authors have declared that no competing interests exist.

Combinatorial gene perturbations provide rich information for a systematic exploration of genetic interactions. Despite successful applications to bacteria and yeast, the scalability of this approach remains a major challenge for higher organisms such as humans. Here, we report a novel experimental and computational framework to efficiently address this challenge by limiting the ‘search space’ for important genetic interactions. We propose to integrate rich phenotypes of multiple single gene perturbations to robustly predict functional modules, which can subsequently be subjected to further experimental investigations such as combinatorial gene silencing. We present posterior association networks (

Synthetic genetic interactions estimated from combinatorial gene perturbation screens provide systematic insights into synergistic interactions of genes in a biological process. However, this approach lacks scalability for large-scale genetic interaction profiling in metazoan organisms such as humans. We contribute to this field by proposing a more scalable and affordable approach, which takes advantage of multiple single gene perturbation data to predict coherent functional modules, followed by genetic interaction investigation using combinatorial perturbations. We developed a versatile computational framework (

An important goal of systems biology is to understand how genes act in concert with each other to control a biological process. Large-scale gene silencing coupled with rich phenotypic screening paves the road towards a systematic understanding of gene functions. Rich phenotypes can result from quantifying many different phenotypic changes in an organism or population of cells

Quantitative synthetic genetic interactions evaluated from combinatorial perturbations provide rich information about underlying network structure of biological processes

A major limitation of combinatorial gene silencing, however, lies in its scalability in higher organisms such as humans. Genetic interaction profiling requires double knock-down experiments over all possible combinations of RNAi reagents targeting each pair of genes; thus, the very recent application to

Our biological strategy poses two key challenges to computation: (a) how to assess the statistical significance of functional interactions computed from phenotyping screens of single gene perturbations; (b) how to integrate complementary data, such as protein-protein interactions, as

Previous methods to predict genetic interactions in model organisms have made use of physical interactions

Clustering methods have been used for functional module searching from rich RNAi phenotyping screens

Synthetic genetic interaction profiling lacks scalability to metazoans such as

The rich phenotyping screens can be obtained from public data sets or generated specifically for the study. In the first case study of the paper, the data came from published high-throughput RNAi screens using a kinome siRNA library in four different cancer cell lines

In our second application, we generated our own perturbation data to explore functional interactions between chromatin factors in epidermal stem cells. A typical experimental workflow includes RNAi transfection, different biochemical treatments, reporting phenotypes as well as data preprocessing (

(A) Experimental strategy. A typical experimental workflow for RNAi screening involves RNAi transfection, different biochemical treatments, reporting phenotypes as well as data preprocessing. The schematic figure illustrates how to customize rich phenotyping screens to study epidermal stem cell fate. (B) Computational framework.

We demonstrate the general applicability of our computational methodology on a publicly available data set of single RNAi perturbations across four cell lines in Ewing's sarcoma (ES)

We first describe a unified framework for predicting functional interactions and enriched modules and then assess its power in the controlled setting of a comprehensive simulation study. Finally, we describe novel biological insights made possible by our approach in two case studies: the first on prioritizing a potential therapeutic network for Ewing's sarcoma, and the second on predicting and confirming a genetic interaction network controlling stem cell fate.

To represent functional interactions between perturbed genes, we introduce posterior association networks (

A conventional way to quantify the functional association between two genes is to compute the similarity between their phenotypic profiles based on correlation coefficients (e.g.

Motivated by the density pattern of association profiles, we propose to model functional associations by a mixture of three components representing positive association (

To assess the strength of evidence for having a functional interaction, a model selection step is performed for each pair of genes. We compute signal-to-noise ratios (SNRs), which are posterior odds for edge

A cutoff score

We search for coherent functional modules in the inferred PAN by performing hierarchical clustering on functional association profiles, each of which is a vector of cosine similarities between one gene and all genes screened. The method compares functional profiles of genes instead of their individual functions, and it has been demonstrated to be a highly desirable measure to group genes with similar interaction patterns
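The association profiles described above can be illustrated with a minimal sketch. It assumes a small hypothetical Z-score matrix (`phenotypes`, genes by conditions) and uses one minus the cosine similarity between profiles as one plausible distance for hierarchical clustering; the function names and the distance choice are illustrative assumptions, not the implementation used in the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a flat profile carries no association
    return dot / (norm_u * norm_v)


def association_profiles(phenotypes):
    """Row i holds the cosine similarities between gene i and all genes."""
    return [[cosine_similarity(p, q) for q in phenotypes] for p in phenotypes]


# Hypothetical Z-scores: 4 genes x 3 conditions.
phenotypes = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [-1.0, -2.0, -3.0],
    [3.0, -1.0, 0.5],
]
profiles = association_profiles(phenotypes)

# Distance between association profiles (not between raw phenotypes);
# a matrix like this could be passed to a hierarchical clustering routine.
profile_distance = [
    [1.0 - cosine_similarity(p, q) for q in profiles] for p in profiles
]
```

In this toy example the first two genes have proportional phenotypes, so their association profiles coincide and their profile distance is zero up to rounding.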

More details for the above procedures can be found in the

In this section, we demonstrate the effectiveness of

The performance of

In our simulations, we model replicate number by the sample size of a multivariate normal distribution and interaction strength by Pearson correlation coefficient. Considering 100 genes in total, we set two modules (with 30 genes for each) with positive internal interactions and negative external interactions to each other. We enumerated replicate size (from 2 to 20) and varied interaction strength by introducing random noise (

(A) Simulation on the effect of replicate sample size and interaction strength. The black and red dashed lines indicate the base line (AUC = 0.5) and a high prediction performance (AUC = 0.8), respectively. The performance of

The simulation results suggest that our approach tends to identify those modules that are highly enriched for functional interactions. Increasing the number of replicates can improve the prediction accuracy for modules with weaker interaction strength. When genes are completely randomly associated (100% random noise in the correlation matrix), as expected,

In this simulation, we demonstrate that

Taking one parameter setting (8 replicates,

Having established our computational framework, we first demonstrate its general applicability on biological data sets that are publicly available. In this case study, we use RNAi phenotyping screens across multiple cell lines to infer functional modules of kinases that are critical for growth and proliferation of Ewing's sarcoma. We demonstrate that our model can make efficient use of single gene perturbation data to predict robust functional interactions.

The data used in this case study is a matrix (

To predict the functional interactions between genes, the proposed beta-mixture model was applied to quantify the significance of their associations, which are measured by cosine similarities computed from the Z-score matrix. We first permuted the Z-score matrix 20 times, computing cosine similarities and fitting a null distribution by maximum likelihood estimation using the function

(A) Fitting a beta distribution to permuted screens. The transformed cosine similarity density curves of the permuted data are colored in grey. The fitted beta distribution is plotted as a dashed green curve. (B) Fitting a beta-mixture distribution to screening data. The transformed cosine similarities of the real screening data are shown in the grey histogram. Fitted beta distributions representing the

Having fixed the parameters for the

Having fitted the global mixture model to data successfully, we inferred a network of functional interactions between kinases based on the proposed edge inference approach. Setting the cutoff SNR score at 10, which is interpreted as ‘strong’ evidence in Bayesian inference

Hierarchical clustering with multiscale bootstrap resampling was conducted subsequently using the R package

The first module (upper left in

Previous RNAi screening studies such as

Among the top significant pathways (

Similar pathway analyses were also performed on the other four modules separately, but no KEGG pathway was significantly overrepresented in any of them. Taken together, the second module is highly enriched for clinically confirmed and potential therapeutic targets, and associated with signalling pathways that are crucial for growth and proliferation of Ewing's sarcoma, demonstrating the prediction power of

Having demonstrated its applicability, we applied the proposed computational framework to study self-renewal of epidermal stem cells using RNA interference screening data for 332 known and predicted chromatin modifiers. We predicted a highly significant module enriched for functional interactions, and confirmed their dense genetic interactions using combinatorial gene perturbation. Further experimental follow-up suggests that their genetic interactions may involve transcriptional cross-regulation.

RNAi screening data were obtained for 332 chromatin factors under five conditions: vehicle, AG1478, BMP2/7, AG1478+BMP2/7 and serum stimulation, each in triplicate. In detail, siRNAs targeting these genes were placed in four 96-well plates, each of which includes two independent siRNAs targeting controls. For each well in each plate, the endogenous levels of transglutaminase I (TG1) protein and DRAQ5 signal were screened to measure differentiation per cell. TG1 is the key enzyme that mediates the assembly of the epidermal cornified envelope and is a marker of differentiated cells, while the DRAQ5 signal is used to measure all cells. More details about the siRNA screening experiment can be found in our accompanying paper

To correct for plate-to-plate variability, the raw screening measurement

Similar to the previous case study, we first fit the global mixture model to functional interaction profiles quantified by cosine similarities on the Z-score matrix. The fitting results of the null and mixture model are shown in

(A) Fitting a beta distribution to functional associations computed from permuted screening data. For each of the 100 permuted data sets, association densities were computed and a beta distribution was fitted. Each fitted distribution is plotted as a grey curve. The median scores of the two shape parameters of fitted beta distributions were selected to fix the

The matrix (54946 pairs of genes

We performed GSEA for each mixture component using R package

(A), (B) and (C) correspond to enrichment analysis of protein-protein interactions (PPIs) in the posterior probabilities for associations belonging to the

As shown in the simulations, with complementary data our extended beta-mixture model can greatly improve prediction accuracy of functional interactions (

Similarly, we first fit a null beta distribution to each of 100 permuted data sets, and used the median values of the fitted parameters to fix the

The whole set of gene pairs is stratified into a PPI (protein-protein interaction) group and a non-PPI group. The extended beta-mixture model is fitted to functional associations, setting different prior probabilities (mixture coefficients) for these two groups. The fitting results for the PPI group are illustrated in (A), and those for the non-PPI group in (B). The histogram and the dashed curves show the real distribution of transformed association scores and the fitting result, respectively. Fitted distributions for positive, negative and lack of association are illustrated by red, blue and green dashed curves, respectively. The fitting results suggest that gene pairs in the PPI group have a higher probability of being functionally connected than those in the non-PPI group.
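To make the effect of stratum-specific mixture coefficients concrete, the following sketch computes posterior component probabilities for the same transformed association score in the PPI and non-PPI groups. All shape parameters and mixture coefficients (`SHAPES`, `WEIGHTS`) are hypothetical values chosen for illustration, not the fitted values from this study.

```python
import math


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1), computed via log-gamma."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(
        log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)
    )


# Hypothetical shape parameters shared by both strata.
SHAPES = {"neg": (2.0, 6.0), "null": (5.0, 5.0), "pos": (6.0, 2.0)}

# Hypothetical mixture coefficients: they differ between strata,
# which is the only place where the prior knowledge enters.
WEIGHTS = {
    "ppi": {"neg": 0.15, "null": 0.55, "pos": 0.30},
    "non_ppi": {"neg": 0.05, "null": 0.90, "pos": 0.05},
}


def posterior(x, stratum):
    """Posterior probability of each component for one association score."""
    joint = {k: WEIGHTS[stratum][k] * beta_pdf(x, *SHAPES[k]) for k in SHAPES}
    total = sum(joint.values())
    return {k: v / total for k, v in joint.items()}


post_ppi = posterior(0.85, "ppi")
post_non = posterior(0.85, "non_ppi")
```

With these illustrative numbers, the same score receives a noticeably higher posterior probability of positive association when the gene pair belongs to the PPI group.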

Based on the fitting results of the extended mixture model, we next inferred a network of functional interactions between the chromatin factors. We weighted the edges using signal-to-noise ratios (SNRs), which are essentially posterior odds of gene pairs in favor of signal (association) to noise (lack of association). The sign of each edge was determined by comparing the posterior probabilities belonging to the positive and negative association components. Setting a cutoff SNR score at 10, we obtained a sparse network of 165 genes with only 848 positive and 878 negative edges (12.8% of all gene pairs).
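The edge inference rule can be sketched as follows, assuming each gene pair already has posterior probabilities for the positive, negative and null components. The function names, the example posteriors and the strict inequality at the cutoff are illustrative assumptions.

```python
def signal_to_noise(p_pos, p_neg, p_null):
    """Posterior odds of association (either sign) versus lack of association."""
    if p_null == 0.0:
        return float("inf")
    return (p_pos + p_neg) / p_null


def infer_edge(p_pos, p_neg, p_null, cutoff=10.0):
    """Return +1 or -1 for a signed edge, or 0 when the SNR does not pass the cutoff."""
    if signal_to_noise(p_pos, p_neg, p_null) <= cutoff:
        return 0
    return 1 if p_pos > p_neg else -1


# Hypothetical posteriors (p_pos, p_neg, p_null) for three gene pairs.
posteriors = {
    ("A", "B"): (0.95, 0.01, 0.04),  # SNR = 24, positive edge
    ("A", "C"): (0.02, 0.93, 0.05),  # SNR = 19, negative edge
    ("B", "C"): (0.30, 0.10, 0.60),  # SNR < 1, no edge
}
edges = {pair: infer_edge(*p) for pair, p in posteriors.items()}

# Consistency check on the reported sparsity: 165 genes give 13530 pairs.
n_pairs = 165 * 164 // 2
edge_fraction = (848 + 878) / n_pairs
```

The final two lines reproduce the reported sparsity: 1726 edges among 13530 possible pairs is about 12.8%.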

To assess the uncertainty of the clustering analysis, we computed a

Of all modules predicted using

Nodes with purple colors represent positive perturbation effects. Node colors are scaled according to their averaged perturbation effects under the vehicle condition. Node sizes are scaled in proportion to their degrees. Edge widths are in proportion to log signal-to-noise ratios. Edges colored in green and grey represent positive interactions inside modules and summed interactions between modules, respectively. This figure illustrates top significant modules and their dense functional interactions. Genes colored in red were selected for further experimental investigation.

The dense functional connections between

(A) The predicted functional module examined by further experiments. Figure legends are the same as

To understand the basis of their genetic interactions, we further looked for possible transcriptional regulation among them. Chromatin immunoprecipitation coupled with massively parallel sequencing (ChIP-seq) analysis was conducted for

Interestingly, in our ChIP-qPCR analysing

Recent years have seen an increasing interest in using massive combinatorial perturbations to study genetic interactions systematically. This approach has only been applied to model organisms such as yeast and bacteria on a large scale due to its limited scalability on metazoans. In this paper, we reported a scalable and affordable strategy to predict functional interactions from single gene perturbation screens. As demonstrated in our two applications,

As shown in our second case study, protein-protein interactions are found to be significantly enriched for functional interactions. Such prior information is informative but poses big challenges to conventional parametric or permutation-based nonparametric hypothesis tests.

To show the general applicability to real biological data, we applied

In our two applications, only a handful of top significant modules are obtained because: a) a stringent SNR cutoff was deliberately chosen to select highly significant functional interactions, and b) a few filtering steps are involved to select modules enriched for significant interactions (

Although not found in our applications, it could happen in principle that no phenotypic change is observed upon single gene perturbation. These extreme cases could be explained when two genes in two distinct but combinatorial pathways fully compensate for each other's function. The functional associations between these genes have a much higher chance to belong to the

Cosine similarity measures the similarity of two vectors as the cosine of the angle between them. Let

Finite mixture models have been used to identify co-expressed genes from gene expression data

For simplicity, we denote the set of association scores (e.g. cosine similarities) as

We assume that

Let

We demonstrated in our application to epidermal stem cells that gene pairs with evidence of protein-protein interactions in the nucleus tend to have higher functional associations. However, such prior information is ignored in the above global mixture model, which treats every association as multinomially distributed with the same parameters. Inspired by the stratified Gaussian mixture model proposed by Pan

The full set of associations

To obtain smoother estimates of the parameters and guide the selection of model structures, we perform Bayesian regularization for the mixture model by introducing Dirichlet priors for the likelihood:

The corresponding log-posterior probability is:

For a Dirichlet prior distribution

Having estimated the parameters in the beta-mixture model, the posterior probability for association

We propose to perform MAP estimation using an EM algorithm similar to that of Ji et al., which alternates between computing the expectation of the log-posterior probability based on the current estimates for the latent variables and maximizing the expected log-posterior:
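As a simplified illustration of this alternation, the sketch below runs MAP-EM for the mixture coefficients only, holding the beta shape parameters fixed. This is a deliberate simplification: the shape parameters would also need updating, which requires numerical optimization and is omitted here. The shapes, the Dirichlet hyperparameters `alpha` and the scores are hypothetical.

```python
import math


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1), computed via log-gamma."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(
        log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)
    )


def map_em_weights(scores, shapes, alpha, n_iter=200):
    """MAP-EM for mixture coefficients; component shapes are held fixed.

    alpha holds Dirichlet hyperparameters (each assumed >= 1 so that the
    MAP update stays non-negative).
    """
    k = len(shapes)
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: accumulate posterior responsibilities per component.
        totals = [0.0] * k
        for x in scores:
            joint = [w * beta_pdf(x, a, b) for w, (a, b) in zip(weights, shapes)]
            norm = sum(joint)
            for j in range(k):
                totals[j] += joint[j] / norm
        # M-step: mode of the Dirichlet posterior over the coefficients.
        denom = len(scores) + sum(alpha) - k
        weights = [(t + a_j - 1.0) / denom for t, a_j in zip(totals, alpha)]
    return weights


# Hypothetical transformed association scores, mostly near 0.5.
scores = [0.1, 0.15,
          0.4, 0.45, 0.47, 0.48, 0.5, 0.5, 0.52, 0.53, 0.55, 0.6,
          0.85, 0.88, 0.9]
shapes = [(2.0, 6.0), (5.0, 5.0), (6.0, 2.0)]  # negative, null, positive
weights = map_em_weights(scores, shapes, alpha=(2.0, 2.0, 2.0))
```

Because most of the example scores sit near 0.5, the null component ends up with the largest coefficient, while the Dirichlet prior keeps the two association components away from zero.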

Because a closed-form expression for estimating the parameters of beta distributions is difficult to derive, similar to Ji et al.

In practice, our method differs from the global beta-mixture model proposed by Ji et al. in the following aspects:

The global beta-mixture model proposed by Ji et al. faces the challenge of determining the number of beta distributions using a model selection criterion (e.g. AIC, BIC or ICL-BIC). We deliberately apply a three-component beta-mixture model to fit association densities of perturbation screens, under the biological assumption discussed above.

We fit a beta distribution to association scores computed from permuted screening data to fix the mixture component representing lack of association. This strategy can help avoid potential overfitting in the global model.

Our extended stratified mixture model allows integration of prior knowledge such as protein-protein interactions.
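The second point above, fixing the null component from permuted screens, can be sketched as follows. This sketch makes several assumptions that are not taken from the paper: cosine similarities are mapped to the unit interval by `(x + 1) / 2`, permutation shuffles all matrix entries, and the beta parameters are estimated by the method of moments rather than by maximum likelihood. The screen itself is simulated random data.

```python
import math
import random
import statistics


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def transformed_associations(matrix):
    """Pairwise cosine similarities mapped from [-1, 1] to [0, 1] (assumed transform)."""
    scores = []
    for i in range(len(matrix)):
        for j in range(i + 1, len(matrix)):
            scores.append((cosine(matrix[i], matrix[j]) + 1.0) / 2.0)
    return scores


def permute_matrix(matrix, rng):
    """Shuffle all entries across the matrix, destroying gene-wise structure."""
    flat = [x for row in matrix for x in row]
    rng.shuffle(flat)
    n_cols = len(matrix[0])
    return [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]


def fit_beta_moments(samples):
    """Method-of-moments estimate of Beta shape parameters (not MLE)."""
    mean = statistics.fmean(samples)
    var = statistics.variance(samples)
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common


rng = random.Random(0)
# Simulated screen: 30 genes x 8 conditions of standard normal Z-scores.
screen = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(30)]

# Fit one beta distribution per permuted screen, then take the medians.
fits = [
    fit_beta_moments(transformed_associations(permute_matrix(screen, rng)))
    for _ in range(20)
]
null_a = statistics.median(a for a, _ in fits)
null_b = statistics.median(b for _, b in fits)
```

The two medians would then be held fixed as the shape parameters of the lack-of-association component while the remaining components are fitted to the real screen.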

The preprocessed phenotyping screens can be considered as samples drawn from multivariate normal distributions. Considering

In the ‘signal’ matrix (the left triangular matrix),

To evaluate the uncertainty of cluster analysis, a conventional approach is to perform ordinary bootstrap resampling of data

Functional modules are generated by superimposing clusters, obtained from hierarchical clustering on functional profiles, onto inferred posterior association networks. To select highly significant functional modules, we applied a few filtering procedures (

Select significant modules that are strongly supported by data. The significance of clusters is quantified by

Exclude extremely big or small modules.

Select modules that are densely functionally connected. Graph (or module) density, the ratio of predicted significant associations to all possible associations, is computed for each module to assess how densely genes are functionally connected.

Select modules associated with specific function of interest. Identified functional modules could be dominated by genes associated with positive or negative loss-of-functions. This filtering step can be applied in many real applications to focus on a specific function of interest. For example, in the application to epidermal stem cells, modules associated with positive loss-of-function (increased differentiation upon perturbation) were selected because we are only interested in chromatin factors regulating self-renewal.
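The filtering steps listed above can be sketched as a single pass over candidate modules. Every threshold, the `support` score (a stand-in for whatever cluster significance measure is used, with higher meaning stronger support) and the example data are hypothetical; only the density definition follows the text.

```python
def module_density(genes, edges):
    """Ratio of predicted significant associations to all possible gene pairs."""
    genes = set(genes)
    possible = len(genes) * (len(genes) - 1) // 2
    if possible == 0:
        return 0.0
    present = sum(1 for pair in edges if pair <= genes)
    return present / possible


def filter_modules(modules, edges, effects,
                   min_support, min_size, max_size, min_density):
    """Keep modules passing the support, size, density and sign filters."""
    kept = []
    for module in modules:
        genes = module["genes"]
        if module["support"] < min_support:
            continue  # not strongly supported by the data
        if not (min_size <= len(genes) <= max_size):
            continue  # extremely small or large module
        if module_density(genes, edges) < min_density:
            continue  # not densely connected
        mean_effect = sum(effects[g] for g in genes) / len(genes)
        if mean_effect <= 0.0:
            continue  # keep only positive loss-of-function modules
        kept.append(module)
    return kept


# Hypothetical significant edges and averaged perturbation effects.
edges = {frozenset(p) for p in [
    ("g1", "g2"), ("g1", "g3"), ("g2", "g3"), ("g2", "g4"), ("g3", "g4"),
    ("g7", "g8"),
]}
effects = {f"g{i}": (1.0 if i <= 4 else -0.5) for i in range(1, 11)}
modules = [
    {"genes": ["g1", "g2", "g3", "g4"], "support": 0.97},   # passes all filters
    {"genes": ["g5", "g6"], "support": 0.99},               # too small
    {"genes": ["g7", "g8", "g9", "g10"], "support": 0.98},  # too sparse
]
kept = filter_modules(modules, edges, effects,
                      min_support=0.95, min_size=3, max_size=50,
                      min_density=0.5)
```

In this example only the first module survives: it has five of six possible edges and a positive mean perturbation effect.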

Chromatin immunoprecipitations were performed as described in our accompanying paper


We thank Gunnar W. Klau at the Netherlands Centre for Mathematics and Computer Science and Lodewyk Wessels at the Netherlands Cancer Institute for suggestions on simulation studies, and Dr. Roland F. Schwarz at Cancer Research UK Cambridge Research Institute for suggestions and discussions on bioinformatic analyses.