Limited Agreement of Independent RNAi Screens for Virus-Required Host Genes Owes More to False-Negative than False-Positive Factors

Systematic, genome-wide RNA interference (RNAi) analysis is a powerful approach to identify gene functions that support or modulate selected biological processes. An emerging challenge shared with some other genome-wide approaches is that independent RNAi studies often show limited agreement in their lists of implicated genes. To better understand this, we analyzed four genome-wide RNAi studies that identified host genes involved in influenza virus replication. These studies collectively identified and validated the roles of 614 cell genes, but pair-wise overlap among the four gene lists was only 3% to 15% (average 6.7%). However, a number of functional categories were overrepresented in multiple studies. The pair-wise overlap of these enriched-category lists was high, ∼19%, implying more agreement among studies than apparent at the gene level. Probing this further, we found that the gene lists implicated by independent studies were highly connected in interacting networks by independent functional measures such as protein-protein interactions, at rates significantly higher than predicted by chance. We also developed a general, model-based approach to gauge the effects of false-positive and false-negative factors and to estimate, from a limited number of studies, the total number of genes involved in a process. For influenza virus replication, this novel statistical approach estimates the total number of cell genes involved to be ∼2,800. This and multiple other aspects of our experimental and computational results imply that, when following good quality control practices, the low overlap between studies is primarily due to false negatives rather than false-positive gene identifications. These results and methods have implications for and applications to multiple forms of genome-wide analysis.


Outline
Regarding the statistical model introduced in the main paper, this document provides additional details on the probability of observed patterns of detection and confirmation over four studies. It also describes our computational approach to likelihood-based inference via numerical optimization and Markov chain Monte Carlo (MCMC).
The assessment of agreements and disagreements among studies has long been a focus of modelbased statistical analysis, from seminal work by R.A. Fisher and colleagues on species abundance estimation in ecology (Fisher et al. 1943) to more recent and relevant precursors to our own calculations, including Raftery (1988), Craig et al. (1997), and Basu and Ebrahimi (2001). The rationale for this general approach is that the specific findings of any study are affected by numerous factors, some of which are systematic and shared in some predictable way among studies, and some of which are idiosyncratic. To capture the systematic effects we treat them as parameters in a stochastic process presumed to have generated the observed data, and we infer the parameter values by calculating the probability of observed data (the likelihood). The design and structure of RNAi experiments forces us to go beyond previously described probability models and propose a specification for multi-study, two-stage (detection/confirmation), genome-wide RNAi data.

Model specification and pattern probabilities
In their essential form, data consist of binary indicators D g,s = 1 [gene g is detected by the primary screen of study s ] , C g,s = 1 [gene g is confirmed by the secondary screen of study s ] for genes g = 1, 2, ..., G and studies s = 1, 2, 3, 4. We consider the genome to be the union of genes that are covered by the siRNA libraries used in the four studies, and assume the genome size G = 22000. We observe detection indicators {D g,s } for all g and s. We observe confirmation indicators {C g,s } only for cases g and s where D g,s = 1. That is, the primary screen of each study is viewed as scanning the full genome; the secondary screen aims to confirm those primary findings. It is technically convenient to allow C g,s to be defined even when D g,s = 0, though by the study designs such C g,s is unobserved and does not enter our computations. The four studies differ in details of their secondary screens. To simplify our analysis we model studies similarly, in terms of results C g,s,k from further assays k = 1, 2, 3, 4, wherein the kth assay entails the application of the kth siRNA from the pool of (typically) 4 siRNAs that targets gene g. Then we have confirmation C g,s = 1 if (and only if) at least two of these assays indicates a phenotype; i.e. if k C g,s,k ≥ 2. Several studies (e.g., U2OS) have this precise structure, although we have not used any assay-level data C g,k,s in subsequent computations (these data are not complete; we use the summary calls C g,s ). Not all studies have this structure (e.g., DL-1); the key consequence of modeling this way is that the secondary validation is more stringent for filtering non-involved genes.
There are three observable states of (D g,s , C g,s ) for a given gene g in a given study s: The proposed model entails gene-specific latent random effects, and thus marginally the (D g,s , C g,s ) is not independent from (D g,s , C g,s ) for any two studies s and s . Our model does entail independence among genes, and therefore the likelihood (probability of observed data) can be expressed as the probability of the multinomial count vector {N π } over the 3 4 = 81 possible multi-study observation states, or patterns, {π} (see Table 3), where (On the independence among genes assumption, this is conditional on involvement (see below) and expresses the fact that separate cells and assays are used for different genes within a given study.) In these terms, the log-likelihood is where pattern probabilities {P π } are defined by a smaller number of parameters through a stochastic model of genome-wide RNAi. The model is hierarchical, and is specified using latent random effects: T g = number of involved off-targets for gene g, relative to a pool of siRNAs that might be used to target gene g T g,s = size of the accessible subset of T g in study s V g,s,k = size of the accessible subset of T g,s in assay k of secondary screen in study s .
One could alternatively classify the {I g } as a high-dimensional parameter, but in doing Bayesian inference we would immediately cover it with a prior, so we treat it as a vector of latent factors in the notation.
A number system-level parameters are used to specify the probability structure of latent variables and observed data; they describe the basic system in terms of rates governing the latent variables as well as quantities affecting false-positive and false-negative detections and confirmations: The log-likelihood L in (1) is a function of these 12 parameters, which we collect in a vector ψ = (θ, α, β, γ, ω, ν), where β = {β s } 4 s=1 , γ = {γ s } 4 s=1 . Thus L = L(ψ). The stochastic model itself is: . of a gamma distribution with shape parameter 4 and scale parameter 1. Another system-level parameter we fix a priori and do not estimate from the data is K = number of siRNAs that target a gene, in total over studies.
We're modeling a typical gene targeted by 4 siRNA's in the detection screen of each study. If all studies happen to use the same siRNA library, then there would be only K = 4 siRNA's used alltogether for this gene across the studies. On the other hand, increasing K corresponds to a wider diversity of siRNA's targeting a given gene and ultimately to different siRNA's per study.
K controls the correlation in numbers of off-targets between studies, i.e. cor(T g,s , T g,s ) → 0, as K → ∞. In our case, we fix K = 12 since the 4 studies of interest use 3 different libraries.
Diagnostic computations showed little sensitivity to this setting. A full specification of conditional independence assumptions is in Figure 2. The various modeling elements have been introduced to address known features of genome-wide siRNA screening. For example, every additional siRNA applied to an involved gene increases the chance that the harboring cells exhibit a phenotype. The higher the rate of involved genes, the higher the rate of an off-target phenotype. There is heterogeneity among genes, owing to whether or not they are involved, and owing to varying amounts of off-targets associated with their targeting pools of siR-NAs, but there is among-gene independence in terms of siRNA detection/confirmation. The studies are heterogeneous, because they may entail different sets of accessible genes and these accessibility rates (γ s ) are study specific, and also there may be different false negative measurement errors (β s ) involved in each study due to individual experimental environment. (We had considered a single parameter γ and β, but saw substantial improvements when we allow the extra flexibility.) From study to study the data are not independent, owing to gene specific factors I g and T g , which get marginalized in our likelihood computation.
To further explain the model structure, the curious G 4,1 term and related terms enter because of our knockdown model. We suppose that each targeting or off-targeting event is associated with a uniformly distributed variable on (0, 1). For on-targets, we suppose that four hits reduce the expression of the target such that the amount left over equals the product of the four uniforms, and if this amount is less than ω, we would see an effect, in the absence of measurement error. This happens with probability G 4,1 (− log ω). We further assume that off-target events hit separate genes, and act in a parallel fashion, so that a phenotype effect occurs if any of the mRNA levels is reduced below ω; this happens with probability 1−(1−ω) t when there are t off-targeting events. By modeling this way we allow that multiple on-target hits are more effective than the same number of dispersed off-target hits ( Figure S1).
The 81 multi-study pattern probabilities {P π } (and thus the log-likelihood L(ψ)) in (1) are obtained as a function of the 12 system-level parameters ψ by summing out the discrete-valued latent variables. Considering among-gene independence, we focus on a single gene, and sum out values of the involvement indicator I g , the four accessibility indicators A g,s , and the off-target counts T g , the four T g,s , and the {V g,s,k } 4 k=1 for each study. (We model T g,s 's as subsets of a common T g to reflect the possibility that different studies share siRNAs.) All but the target counts are binary sums; more complicated is the elimination of the off-target counts. To investigate this calculation, write the vector a = {a s } and the conditional probability of data pattern π as, Each multi-study pattern probability P π is computed as a summation of these P π (i, a) over the 2 5 values of its arguments. The trickier computation is the evaluation of each P π (i, a), which requires marginalization of the off-target counts.
To marginalize the off-target counts, first recognize that each pattern π is an intersection of four study-specific patterns π = s π s . For example π = 3111 indicates that the gene is confirmed and detected in the first study and neither detected nor confirmed in any of the remaining three studies. The modeling assumptions give Pois(t) where Pois(t) = P (T g = t) = exp{−Kθν}(Kθν) t /t! by the Poisson assumption, B s (t, u) is the Binomial mass function at u with t trials and success probability 4γs K , and where each contribution Q s,i,as,u = P (π s |I g = i, A g,s = a s , T g,s = u) is computed from the stochastic model (2). Coming back to pattern π = 3111 for example, the four sub-pattern probabilities are: Q 1,i,a 1 ,u = P (D g,1 = 1|I g = i, A g,1 = a 1 , T g,1 = u) P (C g,1 = 1|I g = i, A g,1 = a 1 , T g,1 = u) Q 2,i,a 2 ,u = P (D g,2 = 0|I g = i, A g,2 = a 2 , T g,2 = u) P (C g,2 = 0|I g = i, A g,2 = a 2 , T g,2 = u) Q 3,i,a 3 ,u = P (D g,3 = 0|I g = i, A g,3 = a 3 , T g,3 = u) P (C g,3 = 0|I g = i, A g,3 = a 3 , T g,3 = u) Q 4,i,a 4 ,u = P (D g,4 = 0|I g = i, A g,4 = a 4 , T g,4 = u) P (C g,4 = 0|I g = i, A g,4 = a 4 , T g,4 = u) .
where P (C g,s = 1|I g = i, A g,s = a s , T g,s = u) = P 4 k=1 C g,s,k ≥ 2 I g = i, A g,s = a s , T g,s = u We make the simplifying approximation that C g,s,k are conditionally independent (and thus C g,s is governed by Binomial masses), though in fact they have some negative dependence attributable to the divvying up the off-target count T g,s among the four separate assays.
A key to simplifying the computation further is to recognize that with respect to the count variable u, each P (D g,s |I g , A g,s , T g,s ) is a polynomial of ξ 1 = 1 − ω, and each P (C g,s |I g , A g,s , T g,s ) is a polynomial of ξ 2 = 1 − ω 4 . Thus, each Q s,i,as,u is a bivariate polynomial in ξ 1 and ξ 2 , of degree at most u and 4u respectively. By careful book-keeping, we identify coefficients {b s,p,q } (depending on system parameters ψ and the pattern π) such that Thus the inner factor of (3) with the second-last line obtained from the moment generating function of a Binomial variable, and with e s,p,q = 1 − 4γs Incorporating this back into (3), we obtain for the conditional probability of a pattern given accessibility and involvement: Pois(t) s where the last line comes from the moment generating function of a Poisson distribution. Finally, the pattern probability P π is obtained by summing over the 2 5 states of i and a, as indicated previously. This provides a route to computing all 81 multi-study pattern probabilities required for likelihood evaluation.

Likelihood evaluation and maximization
Based on formulas for pattern probabilities {P π } the log-likelihood (1) was available numerically. We used some convenient facilities in the R system, including the polynom package, to organize the rather complex sums (R Core Development Team, 2011, version 2.13.1; Venables et al., 2009). To maximize the log-likelihood, we used the R function nlminb. We wrote a project specific R package, metaflu, to contain the data and main likelihood-based computations.

Posterior computation
The Metropolis-Hastings method was used to construct a Markov chain to simulate the joint posterior density Thus, the posterior is proportional to the likelihood times a flat prior. Importantly, we did not run MCMC over the high-dimensional space including latent variables, because we were able to solve these analytically. Our sampler produced a sequence ψ 1 , ψ 2 , . . . , ψ B of parameter vectors according to standard Metropolis-Hastings updating rules (e.g., Robert and Cassella, 1999, page 231.) We systematically scanned the 12 parameter values, and used a base set of local move types that modified one parameter at a time. We call this sampler local. The base proposal distribution for ν was exponential, with mean at the fixed value 1/50. All other parameters resided in (0, 1), and for each we used a uniform window proposal; window length 0.025 gave acceptance rates in the range 28% to 70%. Concerned about possible poor mixing caused by posterior correlation between θ and ν, we included a joint update involving (θ, ν) → (θc, ν/c) for a Gamma-distributed multiplier c (shape, rate both 50, so mean 1).
Extensive numerical experiments indicated that from random starting points local would converge to one of two distinct modes. One of these modes was degenerate from the biological and technical perspective, corresponding to implausible parameter settings: involvement θ and error rates α and β s all very close to unity. To address the lack of mixing between the degenerate and main modes, we modified the MCMC by adding to the local move types a global move type using an 'indpendence' sampler, wherein the proposed state does not depend on the current state. Specifically the proposal was an equal mixture of two 12-dimensional Gaussian distributions, with means and covariance matrices designed to match the structure of the evident posterior modes. Repeated runs of this enhanced sampler (called global) from random starts (uniform for rate parameters, standard exponential for ν) showed very good mixing properties (Results in Supplementary Text S2).
To address the implausibility of the degenerate posterior mode (in which θ and the error rates α and β s were all near unity), we post-processed the MCMC output through a prior constraint: namely, we retained sampled parameters only if all θ, α, and all β s were less than 0.8. Thus, posterior inference was based upon a flat prior, but subject to a plausibility constraint. We used marginal posterior means to estimate the parameters and the equi-tail percentile method to obtain confidence (equivalently credible) intervals.

Posterior distribution of N
The total number of influenza-involved genes is N = I g . By our approach we have marginalized the involvement indicators, and so the posterior of N needs to be obtained through further postprocessing of the MCMC output. We estimate N by G ·θ whereθ is the mean of posterior distribution of θ from MCMC. The posterior distribution of N is approximated as follows. For n = 0, 1, . . . , G, where {ψ b } constitute the MCMC output. A priori, N given ψ is distributed Binomial(G, θ), but in conditioning on the data we have a different distribution for N , even with ψ in hand. Being a sum of independent but differently-distributed Bernoulli trials, N has a Poisson Binomial distribution Thomas and Taub, 1982). For example, the one gene that is confirmed by all 4 studies is more likely to be truly involved than a gene confirmed just once. Thomas and Taub's recursion method is applied to evaluate the probability mass of N at each sampled parameter setting ψ b .

Error rate inference
Depending on the reference set of genes, there are different ways to measure false positive and false negative error rate. In any case, we are thinking of errors in a single study, s, and define four rates FDR(ψ) = P ( I g = 0| D g,s = C g,s = 1) FNDR(ψ) = P ( I g = 1| D g,s × C g,s = 0) FP(ψ) = P ( D g,s = C g,s = 1| I g = 0) FN(ψ) = P ( D g,s × C g,s = 0| I g = 1) .
Point estimates of error rates were obtained by plugging in an estimateψ of system parameters, using the DL-1 study as a reference. Bayesian confidence sets were obtained by percentiling error rate values computed across MCMC samples {ψ b }.

Code
The metaflu package and associated R source code is available at http://www.stat.wisc.edu/~newton/metaflu. Figure S1: Knockdown model properties. Four on target hits are more effective at producing an observed effect than four off-target hits. Shown is the case of no measurement error, as a function of the threshold for an effect.