Modeling Bi-modality Improves Characterization of Cell Cycle on Gene Expression in Single Cells

Advances in high-throughput, single cell gene expression are allowing interrogation of cell heterogeneity. However, there is concern that the cell cycle phase of a cell might bias characterizations of gene expression at the single-cell level. We assess the effect of cell cycle phase on gene expression in single cells by measuring 333 genes in 930 cells across three phases and three cell lines. We determine each cell's phase non-invasively without chemical arrest and use it as a covariate in tests of differential expression. We observe bi-modal gene expression, a previously-described phenomenon, wherein the expression of otherwise abundant genes is either strongly positive, or undetectable within individual cells. This bi-modality is likely both biologically and technically driven. Irrespective of its source, we show that it should be modeled to draw accurate inferences from single cell expression experiments. To this end, we propose a semi-continuous modeling framework based on the generalized linear model, and use it to characterize genes with consistent cell cycle effects across three cell lines. Our new computational framework improves the detection of previously characterized cell-cycle genes compared to approaches that do not account for the bi-modality of single-cell data. We use our semi-continuous modelling framework to estimate single cell gene co-expression networks. These networks suggest that in addition to having phase-dependent shifts in expression (when averaged over many cells), some, but not all, canonical cell cycle genes tend to be co-expressed in groups in single cells. We estimate the amount of single cell expression variability attributable to the cell cycle. We find that the cell cycle explains only 5%–17% of expression variability, suggesting that the cell cycle will not tend to be a large nuisance factor in analysis of the single cell transcriptome.


Introduction
With the advent of single cell expression profiling [1][2][3][4], the assessment of cell population heterogeneity and identification of cell subpopulations from mRNA expression is achievable [5][6][7]. However, at the single cell level, there is concern that cell cycle might interfere with the characterization of gene expression variability [8]. As many biological samples are prepared from asynchronous cell populations, where each cell is in an unknown phase of the cell cycle, it is imperative to understand the impact of cell cycle in order to account for its effect on observed expression patterns and downstream data analysis. Here, we have measured mRNA expression and cell cycle from 930 single cells derived from three cell lines in order to explore this hypothesis.
A distinctive feature of single-cell gene expression data is the bimodality of expression values. Genes can be on (and a positive expression measure is recorded) or off (and the recorded expression is zero or negligible) [9,10]. This dichotomous characteristic of the data prevents use of the typical tools of designed experiments such as linear modeling and analysis of variance (ANOVA). We develop a novel computational framework to overcome this problem. First, a probabilistic mixture model-based framework allows the separation of positive expression values from background noise using gene-specific thresholds. After signal separation by thresholding, we model separately the frequency of expression (the fraction of cells expressing a gene) and the continuous, positive expression values. Our semi-continuous framework combines evidence from the two salient parameters of single cell expression in a statistically appropriate manner, an approach dubbed the Hurdle model [11,12]. Extending our previous proposal of a twosample semi-continuous test akin to the two-sample t-test, our new framework allows for testing arbitrary contrasts and allows the use of variance components/mixed models, thus bringing to bear the full power of the general linear model.
The Hurdle model allows us to identify many genes with an archetypal cell cycle expression pattern despite a frequently bimodal distribution of expression. It also suggests that stochastic variation in single cell gene expression is relatively large compared to the effect of cell cycle. We find that even in the most tightly regulated gene, cell cycle explains only 27% of the variability, while in the median gene in our data set, cell cycle explains 5%-18% of the variability, depending on the assumptions we make regarding latent technical variability. The semi-continuous model also provides a framework for estimating co-expression networksin which edges connect genes whose partial correlations remain after removing the effect of all other geneswhile adjusting for population-level nuisance factors that could bias network inference. Applying this framework to our data, we show that only a subset of canonical cell cycle genes are highly co-expressed in single cells.

Periodic expression associated with cell cycle is observed at the singlecell level
In order to assess differential expression associated with actively cycling cells, expression of 333 genes was interrogated in 930 cells, across three cell lines: H9 (HTB-176), MDA-MB-231 (HTB- 26), and PC3 (CRL-1435) ( Figure 1A). Single cell expression was measured from flow-sorted cells and compared between cell cycle phases and cell lines via nCounter single cell profiling, a multiplexed hybridization-based detection technology that utilizes fluorescent barcodes to count individual target nucleic acid molecules [13]. This platform has been recently adapted to enable expression profiling from single cells via hybridization after a multiplexed target enrichment (MTE) in which mRNA is first converted to cDNA and then amplified [14].
Each cell was categorized as being in G0/G1, S or G2/M phase by measuring DNA content via flow-cytometry based on retention of Hoechst dye ( Figure 1B and S1) [15]. Probes were selected for cell cycle associated genes (n = 119). These genes provided coverage of the entire cell cycle (Data Set S1) based on peak expression and periodicity information obtained from Cyclebase, an integrated database of bulk cell cycle expression profiling experiments that scores and ranks genes based on strength of evidence for a cell cycle associated expression pattern [16]. Probes were also included for noncell-cycle associated genes with primary roles in the inflammatory response, and housekeeping controls without a Cyclebase ranking (n=214). We denote probes with a Cyclebase rank (i.e. genes with the strongest evidence for cell cycle associated periodic expression) as the ranked set. 253 genes were expressed and passed quality control (see Methods). Genes showed a bimodal expression pattern in log-transformed mRNA levels (Figure 2), consistent with a burstmodel of "on/off" transcription at the single cell level [17] and consistent with the kinetics of PCR amplification with low starting template concentrations, described by us and other authors [9,10].
Expression levels for each gene were most different between cell lines ( Figure 2). Many genes, including those in the ranked set showed cell line-specific expression patterns. For example, expression of TOP2A in G0/G1 varied from 70% of cells in MB-231 and PC3 to nearly universal in H9. This cell line effect was a nuisance factor we needed to adjust for in differential expression tests on cell cycle.
Nonetheless, many genes from the ranked set, such as KIF23, TOP2A, HJURP, NUSAP1, and TPX2 exhibited expression patterns consistent with cell cycle regulation ( Figure 2). Figure 2 also reveals that changes in both the positive expression mean (i.e. the mean over the cells expressing that gene; PEM), and changes in The positive mean and 95% confidence interval is depicted as a box with solid line. The Hurdle model (see Methods) combines evidence for changes in either of these parameters, after adjusting for cell line, to determine statistical significance.
the frequency of cells expressing a gene, occur throughout the cell cycle. The frequency and PEM in these genes also vary widely between cell lines, so it was important to adjust for cell line effects for accurate assessment of differential expression.
In order to test for significant differences in expression between cell cycle phases that were consistent across cell lines, we developed an ANOVA-like model (Hurdle model, see Methods) that permits adjustment for additive effects due to cell line. The Hurdle model improves the power to detect changes in singlecell expression by testing both the frequency of expression (corresponding to the relative distribution of cells between the two modes), and the PEM. Combining evidence from the discrete and continuous components of the data provides better sensitivity to changes in expression compared to test statistics based on frequencies of expression (discrete) or on the PEM (continuous) alone; or a union test (see Materials and Methods) while remaining competitive in specificity ( Figures S3, S4) Within the three cell lines tested here, significant differential expression (Bonferroniadjusted for 253 tests at P<0.05) was observed for 78 genes in the ranked set and 28 genes in the unranked set ( Figure 3A). Genes showing the strongest cell cycle associated expression patterns in bulk measurements were more likely to be identified as significant in the single-cell populations ( Figure 3A-B).
For each gene, peak time was determined based on the phase (G0/G1, S or G2/M) with maximum average expression across all cell lines. Despite large cell-line-specific expression variability, peak times were broadly consistent with Cyclebase annotations ( Figure 3C), and especially so within the subset of genes with strongest evidence of cycle regulation in our data (e.g. Bonferroni significant at P<0.05).
The majority of genes in the unranked set (115/143 or 80%) did not exhibit significant cell cycle effects, in concordance with their primary roles in functions unrelated to the cell cycle. Of the 28 unranked genes that exhibited a significant cell cycle phase association, we noted genes involved in cytoskeletal organization (PLAT), proliferation (PDGFA), and signaling pathways (IFNA1, IFNB1) that have been previously demonstrated to modulate progression through the cell cycle [18]. Experimentally observed peak times broadly match the times estimated from bulk data. Concordance in observed peak times is greater for genes with stronger evidence of differential expression. (C) Cumulative number of significant (red) or all (blue) ranked genes versus Cyclebase rank. Genes with lower Cyclebase rank, and hence stronger evidence of cycle regulation in bulk expression, are detected more often than genes with weaker evidence as shown by the minimal gap between significant (red) and all (blue) ranked gene lines at Cyclebase rank < 150.

Cell cycle explains a small portion of the gene expression variability
It has been argued that a substantial portion of the stochastic variability observed in single cell gene expression experiments may be caused by global changes in transcription due to cell cycling [19]. We explore this idea by examining the proportional change in the Hurdle model fit associated with inclusion and omission of cell cycle as an explanatory variable. Because the Hurdle model accounts for both the dichotomous (on/off) and continuous nature of single cell data, the change in deviance (generalized linear model log-likelihood) between nested models can be used to calculate the amount of variability explained by cell cycle. The total deviance can be partitioned into components corresponding to cell cycle effects, nuisance effects described below, and residual effects. The ratio of cell cycle deviance to the sum of cell cycle plus residual deviance can then be interpreted as the analog to the coefficient of determination in linear least squares.
We consider expression changes due to main effects and interactions of cell cycle by cell line and account for amplification efficiency and  (Figure 4). Within the ranked gene set, cell cycle phase explains 8% of the deviance in the median gene and 27% of the deviance in the top gene (TOP2A). In unranked genes, phase explains only 5% of the deviance in the median gene.
To derive these estimates, it is important to be able to account for the nuisance factors by using the Hurdle model. If cell-to-cell variation in amplification efficiency is not removed, we underestimate the explanatory power of cell cycle on in the median ranked gene by 26% since the unmodeled deviance would include this large additional component. Similarly, other unmeasured factors may inflate the residual deviance and attenuate the apparent role of cell cycle. These factors could include errors in inferring the cell cycle phase via FACS or imperfect modeling of changes in amplification or detection efficiency between samples. To guard against this attenuation, we set an upper bound on cell-cycle-dependent variation as follows: We suppose that transcription of the gene with the most deviance attributable to cell cycle (TOP2A, 27%) would be entirely regulated in a phase-dependent manner, and we characterize other genes' cell-cycledependent deviance relative to this maximum. For example, a gene with 13.5% cell-cycledependent deviance has half as strong a cell cycle effect as TOP2A, leading to the conclusion that at most 50% of this gene's deviance could be attributable to cell cycle. Even under these generous upper bounds, cell cycle phase explains only 18% (eg, .05/.27) and 29% (eg,.08/.27) of the deviance in the median gene in the unranked and ranked sets, respectively, suggesting that even when allowing for cell line-specific cell cycle effects, cycle is generally a small factor, compared to residual variability, in gene expression variability in the human transcriptome.

Network analysis reveals gene coexpression at the single-cell level
Single-cell gene expression data sets have the resolution to reveal not only differential expression in response to biological variables like cell cycle phase, but also to provide insight into co-expression between genes at the cellular level (e.g. the influence of one gene on another's expression or the sharing of upstream regulatory elements). In bulk-gene expression data (e.g. microarrays), apparent co-expression arises from tissue-level factors inducing shared marginal changes in genes. For example, different radiation doses in samples will induce correlation amongst all the genes affected by radiation, regardless of whether these genes interact or even participate in the same biological processes. In contrast, single cell data allow isolation of co-expression arising from cellular-level factors, giving access to more fundamental biological relationships. If two genes are correlated across cells drawn from the same environment, then the two genes are likely to share an intimate biological relationship: they may be regulated by the same transcription factor, or one gene may directly regulate the other. The distinction between cellular and marginal co-expression follows from a probabilistic identity on conditional covariances (see Materials and Methods). , and combining the top 30 edges from discrete and continuous components are shown (C,F). Panels A-C adjust for additive cell cycle effects, while panels D-F are unadjusted. The shape of the glyph corresponds to the cycle with peak expression from cyclebase, while the saturation of the glyph corresponds to the ranking. Blue and green edges are partial correlations detected from discrete expression and continuous expression, respectively. Red edges are detected in both discrete and continuous expression.
When cell cycle is not adjusted for (Figure 5 D-F), known cell cycle genes with strong evidence of marginal regulation comprise the majority of the network. These genes generally peak in phase G2/M, suggesting that the co-expression is mostly driven by the coincident peak in average expression. The networks adjusted for cell cycle at least partially remove marginal effects ( Figure 5 A-C). In some cell cycle genes, substantial evidence for co-expression remains, but now additional co-expression is detected in genes without a previously described cell cycle role. In the unadjusted estimates, marginal shifts in expression in canonical cell cycle genes overwhelm subtler co-expression in unranked genes. Even though cell cycle variability is modest compared to residual variability, cell cycle is a substantial source of biological variability in the ranked genes and is in a sense confounded with the co-expression patterns.
In an attempt to quantify the performance of the Hurdle model and the effect of cell-cycle adjustments, we examined network properties when varying the number of edges. We call an edge peaktime concordant if it connects nodes that have the same peaktime annotated in cycle base (eg, G0/G1-G0/G1 or S-S). Over a range of network densities (30-240 edges) the unadjusted Hurdle or Raw networks contain between 45%-80% peaktime concordant edges, while the adjusted Hurdle contains only 32%-38% peaktime concordant edges.
Cell cycle adjustment in networks estimated on the raw data is not very effective compared to the unadjusted, raw networks ( Figure S6). This is unsurprising, as this would occur when the model for the mean of the response is misspecified, as is true when ignoring the bimodality that the data exhibit (eg, Figures 2 and  S2). If the Hurdle model is correct and cell cycle is additive, then the identity link cannot recover this additivity. On the other hand, the Hurdle model can still recover an additive mean model under a linear link by taking the discrete coefficient estimates to be null. Overall, the adjusted and unadjusted Hurdle networks in Figure 5 are rather different, sharing 39% of nodes (Jaccard similarity) and 51% of edges (Hamming Distance / #edges).
Combining both discrete and continuous networks (with the top 30 edges from discrete and continuous networks) allows a richer set of genes to be characterized. When discrete expression is used alone, networks primarily consist of G2/M peaking genes and unranked genes ( Figure 5A). When positive, continuous expression is also used, S and G0/G1 peaking genes enter the networks ( Figure 5B-C).
The adjusted, semi-continuous network depicted in Figure 5C consists of two primary sub-networks, one consisting entirely of ranked genes, and another largely consisting of weakly ranked and unranked genes. While we cannot rule out that measurement error of the inferred cycle is not partially responsible for the persistence of a subset of ranked genes, previously described mutual regulation in RNAinterference experiments [20] of some of these genes suggests that this subset is coexpressed at the single cell level as opposed to being co-expressed on average at the population level. The sub-network of ranked genes contains the central node of NUF2, a highly-conserved protein required for stable kinetochore localization of centromereassociated protein E (CENP-E) [21]. NUF2 is connected to other actors in mitotic organization such as ANLN, KIF23, and CENPF, as well as the check-point genes CCNA2 and BUB1, reflecting the central role of these genes in mitosis.
The sub-network of primarily unranked genes contains two key nodes: TUBB and CCR3. The predominance of genes associated with cell growth, like TUBB, and transmembrane proteins, like CCR3, in the unranked cluster is likely related to the actively dividing nature of the profiled cells, i.e. dividing cells must generate new scaffolding and membranerelated materials to support growth. This relatively large sub-network of unranked and weakly ranked genes is largely missed by the unadjusted analysis that is biased by the population level cell-cycle effect.

Discussion
Stochastic, bimodal expression is a hallmark of single cell data [22][23][24]. Within a population of cells, detectable expression for any given gene typically resides in one of two modes, corresponding to an "on" or "off" state. Both technical and biological factors likely contribute to this bimodality. Quantities of some species of cDNA may be minute after reversetranscription, and in this case random variation in the number of template-primer-enzyme complexes that form during each annealing phase may dominate the kinetics of the PCR [25]. But regardless of its origin, modeling bimodality improves the power of differential expression tests.
Here, we show how the Hurdle model can be adapted to complex study designs, extending our previous results describing its use for twosample comparisons. We demonstrate the model's ability to identify many genes with a periodic expression pattern from asynchronously cultured cells utilizing a combination of FACS sorting and these new analytical techniques, including genes with little previous evidence of cell cycle associated periodic expression like MEF2D [26] and FAM189B. The Hurdle model is able to identify phase-dependent patterns of expression despite the fact that G2 and M phases are indistinguishable by DNA content. The similar rank ordering of differentially expressed genes in our single cell experiment as compared to bulk experiments and concordance in the phase of peak expression demonstrates the power of the Hurdle model. While we have applied the Hurdle model to our specific problem, the approach is general and can be applied to test any effect of interest in a single-cell gene expression dataset. We offer this modeling framework as an R package for other interested users at github.com/RGLab/SingleCellAssay.
Although we recommend the Hurdle model in general for testing for differential expression, it should be noted that its desirability is contingent on the frequency of the gene under consideration. For example, if a gene is highly expressed (eg, > 90% expression), then the information to be derived from the 10% of cells that do not express a gene may not be worth the cost of an extra degree of freedom in the chi-square null distribution of the test statistic. However, even when this is the case, the Hurdle model might be preferred for methodological simplicity, since it is poweredalthough perhaps not always optimallyregardless of expression frequency, and does not require extensive pre-test simulations of power to yield acceptable performance. The data set considered here offers a relatively stringent test of the relative sensitivity of the Hurdle model, owing to the high expression frequency of the genes in this experiment (interquartile range ranked genes: .7-.9; unranked genes: .56-.88).
Single cell data also allows unparalleled resolution of genes' co-expression patterns. While bulk expression data can reveal correlation induced by varying biological conditions, single-cell data has the possibility to reveal co-expression driven by shared regulatory elements within the cell. However, when inferring gene expression networks, it is important to adjust for population level covariates that could confound the network estimation, especially for genes that are marginally affected by such a population level covariate (like known cell cycle genes in our experiment.) By measuring a limited set of cell cycle associated genes, we are able to identify a network of co-expressed genes with known roles in cell cycle regulation even after adjusting for cell cycle phase. It should be noted that the unadjusted network estimate would be appropriate in some circumstances, for example when a summary of the co-expression occurring on average in the population of cells is desired, as opposed to inference of coexpression occurring conditionally within defined subsets.
Work remains to derive network estimators that optimally combine information from discrete and continuous portions. Our current approach is likely theoretically naïve, since it is essentially a union test of the discrete and continuous portions, rather than a summation of signal from the two domains. We also have left unresolved the asymptotic consistency of our proposed network procedure under dimensional scaling.
It is crucial to understand the relationship between cell cycle and the stochastic nature of single cell expression as it determines the magnitude of the cell cycle's distorting effect on single cell analyses. In contrast to earlier estimates of Zopf et al. [19] we find little evidence of periodic regulation of expression among non-cell cycle associated genes. Our results are consistent with genome-wide mRNA profiling efforts utilizing bulk expression methodologies in mammalian cells where genes with cycle-dependent periodic expression patterns are limited and wellcharacterized [16,27,28]. Disparity between our findings and those of Zopf et al. may arise from differences between yeast and mammalian cells. Moreover, Zopf et al. primarily focus on a single, synthetic promoter while we sample hundreds of transcripts presumably driven by many different promoters. Whether the substantial remaining variability is inherent to the human single cell, or due to thus far latent, unmeasured biological variables remains to be explored.

Cell Lines and Flow Cytometry
Three human cell lines H9 (HTB-176), MDA-MB-231 (HTB-26) and PC3 (CRL-1435) were commercially obtained and cultured as recommended by the supplier (ATCC). Cultured cells were re-suspended in culture media containing Hoescht 33342 (Sigma) and incubated at 37°C for 60 minutes prior to sorting.
Cultured cells were flow-sorted to isolate individual cells from each of the cell lines according to phase (G0/G1, M/G2 and S). Cells were isolated and sorted using the FACSJazz (Becton Dickinson) at 500 events per second using a 100 micron nozzle. Single cells were defined by gating on forward and side scatter area/width. Phase was inferred from Hoescht 3342 DNA-fluorescent dye, then cells were individually deposited and lysed in wells of a 96-well PCR plate containing 3uL of Cells-to-Ct lysis buffer (Life Technologies). The proportion of cells in G0/G1 phases varied from 54% of PC-3 cells to 73% of H9 cells (Supplementary Figure S1).

Genes Assayed
A set of 333 probes was designed. It contained cell cycle associated genes and provided coverage of the entire cell cycle based on peak expression and periodicity information derived from an integrated database of cell cycle expression profiling experiments [16]. Non-cell cycle associated genes had primary roles in the inflammatory response and included housekeeping controls without a Cyclebase ranking. Genes with a Cyclebase ranking < 1000 were placed in the ranked set (n = 119) and all other probes were considered part of the unranked set (n=214).

cDNA Conversion and Multiplexed Target Enrichment (MTE)
After lysis, RNA was converted to cDNA with SuperScript VILO (Life Technologies). Primers for 333 genes were pooled and cDNA was enriched in a multiplexed amplification (MTE) reaction according to the nCounter Single Cell Expression protocol (NanoString). The MTE samples were hybridized overnight at 65°C with an nCounter CodeSet containing probes for all enriched targets (cell cycle related, unrelated genes and controls) and internal controls as recommended by the manufacturer.

Statistical Analysis Dichotomization and Thresholding
The nCounter Analysis System reports the number of counts of each observed nucleic acid target. We transformed the counts with a shifted log-2 transformation so that   2 log count 1 lCount  . In examining histograms of the transformed data, lCount, we found evidence of bi-modality (e.g. Figures 1C,  2). It has been previously observed [2,9,10] in single cell gene expression that genes may appear "off" in a cell, lacking detectable transcript. Thus we hypothesize that in genes with two clusters apparent, the cluster of smaller lCount might represent background noise without detectable expression, and the cluster of larger lCount might correspond to bona fide signal. The distribution of lCount in positive controls, which were added at known concentrations, and negative control probes not occurring in human cDNA, additionally supported this hypothesis ( Figure S2).
We used an empirical Bayes, model-based clustering procedure to discriminate between signal and noise clusters. Via maximum likelihood estimation, we fitted a Gaussian mixture model to an omnibus of expression in all genes to insure that both signal and noise clusters were initially present. The parameter estimates from the omnibus were then used to form an empirical Bayes estimate of a prior distribution for Bayesian Gaussian mixture models fit to each gene separately. The function thresholdNanoString available in SingleCellAssay implements our thresholding framework, while mixture models are estimated with the flowClust R package [29].  evaluated at x. Observations with posterior probability >.5 of belonging to the noise cluster were truncated (set to zero) while observations that more likely belonged to the signal cluster were left unchanged.
We denoted the truncated, log-transformed value as the Expression Threshold (et) in which the value zero denotes no detectable expression, while positive values correspond to increasing values of log-expression. We model the zero value specially and separately from positive values.

Normalization
The log-count measurements were normalized to ensure that the mean signal was comparable across plates. This was done in a stepwise manner. First data were split into three experimental batches, corresponding to cells that were run on different dates, and preliminary signal and noise clusters for each gene were estimated using the above thresholding technique. Then letting lCount kip denote the clustered, but un-thresholded log Count for cell k, cluster  [30]. The normalized log counts were translated so that the minimum normalized value in each gene was zero. Since no gene was expressed 100% of the time under the preliminary clustering, this was a welldefined procedure. Finally, the normalized data were thresholded jointly to produce the data set used for filtering and testing.

Filtering
We adopted a previously published filtering approach [9] based on a robust z-scoring, removing wells with no expression or otherwise outlying in the number of transcripts expressed. We applied the SingleCellAssay function filter with parameters nOutlier=2, SigmaProportion=2, SigmaContinuous=5.
We removed non-variable genes (i.e. detectable expression < 1% in any cell line). 253 of 333 probes passed these filtering criteria and were carried forward in the analysis. After thresholding and filtering we found that the frequency of expression (the rate at which 0 et  in a gene) varied considerably between genes, with a range of .08-.99 and median value of .72.

Amplification Efficiency
We found in examining principle component plots that the first axis of variation corresponded to the number of genes expressed in a well. In cell k and gene g, let et kg be the thresholded 2 log count, and lCount kg be the un-thresholded 2 log count. Then we defined First, higher-order axes of variation corresponded to identifiable biological factors (e.g., phase, cell line). Orthogonality of k x to the biological axes of variation suggested that this factor was technical in nature [31]. Secondly, many steps in deriving cDNA from live cells could induce technical, cell-specific variation. These steps include incomplete lysis, variation in reverse transcription to generate cDNA and efficiency differences in the multiplexed, amplicon-specific pre-amplification step. Cell-tocell variability in any of these could appear downstream as a source of variability [32]. Lastly, we have observed similar phenomena in other cDNA-based single cell gene expression experiments, including multiplexed qPCR and single-cell RNA-seq. As k x contributed variability to our data and appeared to derive from technical rather than biological sources, we chose to adjust for it as a nuisance source of variability.
In fact,

Hurdle models for zero-inflated expression
In single cell gene expression, we have previously found that accounting for both changes in the frequency of expression and shifts in the PEM produces more sensitive measures of differential expression compared to using either the frequency or the positive values alone, or compared to t-tests on the zero-inflated values [9,33]. We sought to extend this framework to any model that permits a likelihood ratio test on parameters, e.g., generalized linear or generalized linear mixed models, in order to account for additive cell line effects. Let  inference about   proceeds conditional on 0 et  . The log likelihood is then additive in the  and   parameters. Classical hypothesis tests with chi-square asymptotic null distribution, such as Wald or likelihood ratio tests on specific components of  and   are null can be conducted separately. Then the test statistics are added together, combining and summarizing the evidence from the two processes, with the degrees of freedom in the null distribution doubled for the purpose of assigning significance. This approach is dubbed the "Hurdle" model and has been used in economics for several decades [34,35]. chi-square distribution with four degrees-offreedom, as there are three cycles, but with a linear constraint, hence two degrees in each of the discrete and continuous et components.

Union-Intersection Test of Cell Cycle Expression Regulation
Equivalently, where l and ¢ l are the discrete and continuous LRT test with rejection regions respectively. In other words, this means to reject the intersection hypothesis at level  if either discrete or continuous test reject at level

Proportion of Deviance Explained by Cell Cycle
In order to calculate the proportion of deviance explained by cell cycle, we compare our Hurdle model given by (1)

Network Estimation
We extend the conditional, neighborhood-based algorithm of Meinshausen-Bulmann [36] to estimate co-expression networks using the Hurdle model. The standard Meinshausen-Bulmann algorithm uses L1-penalized regressions to estimate partial correlations between vertices (genes) by treating each vertex as a dependent variable in a regression that includes all other vertices as independent variables. If the vertices are jointly Gaussian, non-zero coefficients correspond to statistical dependences between vertices, conditional on all other factors and so reflect a Gauss-Markov Random Field. Here, since the distribution of expression in single cells is not multivariate Gaussian, edges in our network correspond to conditional correlations (after possible application of the logit link). Although we do not attempt to show consistency of our proposed approach here, we note that Meinshausen-Bulmann-like methods have been shown to be consistent in estimating non-Gaussian graphical models under fairly general conditions [37,38].
Then for the k th cell, following equation (1) where kg et is the expression of the gth gene in the kth cell, and   separately, with distinct L1 penalties  and  for  and   using the R package glmnet [39].
Unpenalized vector parameters  and  adjust for pre-amplification effect k x ; cell line and cell cycle.

Combining Networks
We connect genes ,,, g g g g g g g g

 
is non-zero at their respective penalties thus take the union of the symmetrized sub-networks. To select the penalty parameters , we fix a number of edges e , then find  and  (constant across genes) so that /2 e edges enter from each of  and   . Other ratios of edges are easily attained by choosing  and  appropriately.

Cellular and Marginal Co-Expression
Even when expression is measured in single cells, co-expression estimates may reflect cluster-specific shifts in mean expression rather than cellular co-expression. Let 1 2 , G G be two genes, and let Z be a clustering factor that affects expression of at least one of 1 2 , G G .
Then an elementary probability calculation shows that    In panel A the number of discoveries in ranked genes is plotted against the number of discoveries in unranked genes as the level of the test varies. A discovery in a ranked gene, as it has been previously found to be cell-cycle regulated, is more biologically plausible than a discovery in an unranked gene, so the number discovered at a given level is plausibly related to the sensitivity of a test. Likewise, the number of discoveries in unranked genes may be plausibly related to the specificity of the test. In panels B and C the absolute number of discoveries in ranked and unranked gene sets are plotted for various P-values. In both panels, the binomial model uses logistic regression on dichotomized expression values, while the continuous model uses only values with positive expression. All models adjust for cell line and pre-amplification efficiency. The Hurdle, Union and continuous tests are largely equivalent when judged by their area under the curve of the panel A; however the Hurdle is more sensitive than the continuous or union when judged by absolute number of discoveries in panel B.  The hurdle adjusted for cell cycle selects half as many edges with the same peaktime compared to the adjusted or unadjusted raw models, while the unadjusted hurdle selects modestly more peaktime concordant edges than the raw models, especially for richer (>100 edges) networks. A similar phenomenon occurs when examining the distribution of nodes. The unadjusted hurdle tends to select more nodes with previous description of marginal expression regulation. After adjustment, it selects the fewest nodes out of the four models.