Double-CRISPR Knockout Simulation (DKOsim): A Monte-Carlo randomization system to model cell growth behavior and infer the optimal library design for growth-based double knockout screens

  • Yue Gu,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Department of Gastrointestinal Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America, Department of Biostatistics and Data Science, School of Public Health, University of Texas Health Science Center at Houston, Houston, Texas, United States of America

  • Traver Hart,

    Roles Investigation, Resources, Supervision, Validation, Visualization, Writing – review & editing

    Affiliation Department of Systems Biology, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America

  • Luis Leon-Novelo,

    Roles Data curation, Investigation, Methodology, Supervision, Validation, Writing – review & editing

    Jshen8@mdanderson.org (JPS); Luis.G.LeonNovelo@uth.tmc.edu (LLN)

    Affiliation Department of Biostatistics and Data Science, School of Public Health, University of Texas Health Science Center at Houston, Houston, Texas, United States of America

  • John Paul Shen

    Roles Conceptualization, Funding acquisition, Investigation, Resources, Supervision, Validation, Visualization, Writing – review & editing

    Affiliation Department of Gastrointestinal Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America

Abstract

Advances in functional genomic technology, notably CRISPR using Cas9 or Cas12, now allow for large-scale double perturbation screens in which pairs of genes are inactivated, allowing for the experimental detection of genetic interactions (GIs). However, as it is not possible to validate GIs in high-throughput, there is no gold standard dataset where true interactions are known. Hence, we constructed a Double-CRISPR Knockout Simulation (DKOsim), which allows users to reproducibly generate synthetic simulation data where the single gene fitness effect of each gene and the interaction of each gene pair can be specified by the investigator. We adapted Monte-Carlo randomization methods to extend single knockout simulation methods to double knockout designs, which simulate the gene-gene interactions between all possible combinations of the input genes. Using DKOsim, we generated simulated datasets that closely resemble real double knockout CRISPR datasets in terms of Log Fold Change (LFC), GI distribution, and replicate correlation. We further inferred optimal CRISPR library designs by systematically investigating critical experimental parameters including depth of coverage, guide efficiency, and the variance of initial guide distribution. This simulation scheme will help to identify optimal computational methods for GI detection and aid in the design of future dual knockout CRISPR screens.

Author summary

We designed DKOsim to simulate CRISPR double knockout screens by modeling cell division behavior with both single knockout (SKO) and double knockout (DKO) constructs via Monte-Carlo randomization samplers. Running DKOsim at large scale, we identified the asymptotic tuning points that optimize genetic interaction (GI) identification performance by the delta-LFC (dLFC) method compared to the simulated truth. We show that DKOsim is tunable to approximate actual dual-CRISPR knockout screening data. Comparing replicate correlations from DKOsim with those from experimentally generated data shows that DKOsim can be tuned by users to reproduce a level of randomness similar to that observed in a variety of CRISPR screening conditions.

Introduction

Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) were first identified in E. coli bacteria in Japan [1,2]. CRISPR knockout can be multiplexed into a high-throughput genetic screening method to systematically perturb genes and/or pairs of genes [3]. In 2017, combinatorial CRISPR-Cas9 screens were performed in cancer cells for the first time to allow for high-throughput identification of synthetic lethal gene pairs [4,5]. More recently, the Cas12a platform has provided highly efficient multiplex gene knockout, significantly increasing the efficacy of gene knockout and decreasing library size and thus the cost of genetic interaction screening [6,7]. Combinatorial CRISPR technology has revolutionized the discovery of gene-gene interactions [defined as genetic interactions (GIs)] by allowing large-scale screening in human cell lines, organoids, and mouse models. GIs occur when the fitness effect of a gene is modified by the functional status of other genes. A GI is measured by comparing fitness following the CRISPR knockout of a gene pair (double knockout, DKO) vs. the knockout of each gene alone (single knockout, SKO). The study of GIs reveals the functional relationships between genes and pathways, specifically, the synthetic interactions and compensatory pathways. The findings serve as the foundation for systematic gene network construction, which is valuable for novel drug development in cancer research [8].

With around 200 million possible interacting gene pairs for a mammalian cell, GIs are typically rare and hard to quantify accurately from noisy data [8,9]. Hence, there is limited consensus on the use of existing computational tools for detecting GIs. Moreover, experimental validation of GIs in high-throughput is difficult since there are no gold standard datasets that include the true interactions. To address these challenges, we developed a probabilistic simulation framework with simulated theoretical GI truth to emulate the real laboratory CRISPR screening procedures for both SKO and DKO designs. This approach allows us to efficiently test several interacting gene pairs for GIs and approximate the underlying true GI distributions.

Prior efforts in generating simulation frameworks for CRISPR screening mainly used SKO designs. In 2015, Stombaugh et al. proposed the Power Decoder Simulator [10] to generate in silico shRNA pooled screening experiments using short hairpin RNA (shRNA) to efficiently estimate the genotypic biological relevance of a set of genes based on their experimental phenotype. Building upon this, Nagy et al. (2017) developed the first CRISPR SKO discrete simulation tool, CRISPulator [11], which simulates both the Growth-based and the Fluorescence-Activated Cell Sorting (FACS)-based SKO screening to test the effects of different library designs. de Boer et al. (2020) designed MAUDE [12] (Mean Alterations Using Discrete Expression) utilizing CRISPulator to show the consistency in the optimal cell-sorting bins configurations by quantiles via simulation, and derive the mean expression of cells containing each guide. These studies connected the designed simulation framework with empirical assumptions from real experiments but lacked systematic profiling of the screening parameters for growth-based pooled CRISPR screens that assume exponential cell growth. Moreover, since they did not consider the combinatorial CRISPR screening (i.e., DKO) design, the effects of GIs are not articulated in their simulated CRISPR design.

To address the need for systemizing a CRISPR simulation scheme and the lack of a DKO simulation framework, we developed a Double-CRISPR Knockout Simulation (DKOsim) that will enable researchers to determine the DKO behavior, including the simulated theoretical true interactions. DKOsim will help identify the optimal GI detection methods and design future double-guided CRISPR experiments. The motivations of the study are visualized in Fig 1. We started by analyzing sets of CRISPR DKO screening data using two different computational approaches named CTG [13] and GEMINI [14]. While the SKO gene fitness scores showed high correlation, the GI scores were essentially random, with almost no overlap in the identified GIs between the two methods in the HeLa cell line. Without the GI ground truth, we could not determine which method was detecting the truly interacting gene pairs. Motivated by this, we designed a systematic synthetic data simulation scheme to simulate theoretical GIs that could be treated as the underlying truth.

Fig 1. Motivations of the study.

Two genetic interaction (GI) identification methods, CTG and GEMINI, were applied to the HeLa cell line in the Shen et al. 2017 Double-CRISPR Knockout (DKO) datasets. Computational results are visualized as scatterplots of CTG vs. GEMINI scores and as a Venn diagram of the GIs identified by CTG and GEMINI. Created in BioRender. Gu, Y. (2026) https://BioRender.com/elgwcy7.

https://doi.org/10.1371/journal.pcbi.1013510.g001

Methods

Overview and notations

We simulated the CRISPR knockout screens by modeling cell division, using Monte-Carlo randomization sampling. The common notations and conceptual methodology of the simulation scheme are summarized in Fig 2. We indexed genes with and used the notation to refer to a cell with only gene to be knocked out. We used the notation to refer to cells whose genes and were targeted for knock out. Since the order of knocking out the genes does not matter, we used and , with , to index the dual-targeted genes in a DKO cell.

Fig 2. Simulation methodology.

Main modules are conceptually visualized in the simulation schematic design, including KO target gene class initialization, cell division and growth modelling, and cell population transfection and selection simulation. Created in BioRender. Gu, Y. (2026) https://BioRender.com/tl5sg1s.

https://doi.org/10.1371/journal.pcbi.1013510.g002

We extended the notation to consider the guides targeting specific genes to be knocked out. We referred to the definition of guide in CRISPR screening as a synthetic RNA molecule named guide-RNA (gRNA) that directs CRISPR-associated nuclease, such as Cas9 or Cas12, to a target DNA sequence [15]; in SKO cells, we assumed a single-guide (sgRNA) disrupts one gene, whereas in DKO screens, two distinct guides were combined as one dual-guide (dgRNA) to perturb two genes simultaneously. We additionally defined the construct as either gene and guide or the combination of two genes and their corresponding guides. In this way, the guide targeting gene is indexed with and the notation refer to cells on construct with only gene targeted with guide . Note that in and , though the guides share the same index , are different. Similarly, and refer to guides targeting genes and simultaneously. denote cells on constructs whose genes and are targeted to be knocked out with guides and , correspondingly.

We initialized the knockout (KO) target gene into one of the four main classes: Negative, Wild-Type (WT), Non-targeting Control, or Positive. The theoretical phenotype of each class of genes was drawn from pre-specified distributions, where the mean and variance of each gene class can be tuned by users as desired. Treating the initialized genes as inputs, we derived the cell division and growth behavior in both SKO and DKO designs from multi-Bernoulli (multinoulli) distributions that model the exponential cell growth with KO target genes. To simulate the cell population transfection and selection procedure, we additionally incorporated the guide-efficacy effects, and defined the initial cell population at t0 as the population that contains all pre-specified SKO and DKO constructs with set of counts C1. After several cell doublings controlled by the user-specified bottlenecks, the final cell population at t2 contained the SKO and DKO constructs with set of counts C2. We calculated the Log2 Fold Change (LFC) at t2 vs. t0 to quantify the change in the relative abundance of the constructs over time. We also calculated the simulated true GI π by measuring the difference between the expected counts given DKO cell division probability vectors, with or without interactions.

We summarized the analytical framework of the study in Fig 3, based on users’ inputs. DKOsim mimics real CRISPR screening for both SKO and DKO designs, outputs the simulated growth-based screening data, and calculates the LFC for all constructs alongside the simulated theoretical GI truth. A standard analytical workflow for the simulated datasets includes visualization of SKO genes and LFC distributions, deconvolution of DKO gene combinations, and application of dLFC. Here, we use the term gene combination deconvolution to refer to the signal-stratification of aggregate LFC measurements into distributions corresponding to distinct gene-class pairings, thereby isolating overlapping phenotypic signals [16].

Fig 3. Study design.

Analytical frameworks of the study. Created in BioRender. Gu, Y. (2026) https://BioRender.com/fek72q6.

https://doi.org/10.1371/journal.pcbi.1013510.g003

Cell-behaviors multinoulli distribution derivations

Growth behavior derivations of SKO cells.

For one unit of the cellular population doubling cycle, for a single cell, we define

Specifically, the knockout of gene 1 will yield, in terms of cell division, one of the following three outcomes, in one unit of WT cell doubling time: (a) : Cell does not divide and loses viability; (b) : Cell divides once as WT; (c) : Cell divides twice.

These outcomes are simplified for further derivations and simulation programming. We adapted the ideas from CRISPulator; it is possible that the cell divides more than twice or that the cell starts division, but takes twice as long to divide as a WT cell. We did not choose the option “cell does not divide and remains viable”, allowing us to connect our discrete simulation approach with the continuous exponential growth-based model (S1 Text, Connection between the Discrete and the Continuous Model).

Given , the SKO cell will produce descendants. The value of depends on the theoretical phenotype , and we assume

(1)

where , and the refers to the multi-bernoulli distribution where with probability .

Based on the value of , there are three possible outcomes: (a) When , it is a negative phenotypic gene with , where , . (b) When , the cell behaves like a WT cell with . (c) When , , it is a positive phenotypic gene where .

Without loss of generality, the same deductions and notations are used for when we knock out gene 2 as in , but with parameter where

(2)

Growth behavior derivations of DKO cells.

For DKO gene-level outcomes, we considered the joint effects of both the variables and , with parameters and , respectively. Similar to the notations for SKO effects, we included a variable for DKO effects. More specifically, for one unit of the cellular population doubling cycle, for a specific single cell, we denote

in one unit of the cell doubling time, we assume will have one of the following outcomes: (a) : Cell does not divide and loses viability; (b) : Cell divides once as wildtype; (c) : Cell divides twice; (d) : Cell divides three times.

As such, given , under no gene-gene interaction . We define

(3)

The original single cell in the DKO design produces descendants per WT population doubling cycle. We chose this definition of so that if the first targeted gene is a non-targeting control (that is, , and there is no gene-gene interaction), then DKO cells behave like SKO cells based upon the second targeted gene, regardless of the behavior of the first targeted gene. Mathematically, follows the multinoulli distribution in and . This is shown below in (8).

The distribution of is

(4)

where is the cell division probability vector.

Deriving the values of the cell division probability vector.

Based on the definition of additive interactions in combination perturbation [17], we calculated as shown in (3). Its value as a function of and , is given in the following matrix:

Accordingly, the joint density of is given by the matrix

where So in (4) is

(5)

The can take any values in as long as they add up to 1. Under the condition of no gene-gene interactions, where and are independent (in math ), the matrix above becomes

Thus, under no interaction and induce the multinoulli distribution shown in (4), and for with , this is

(6)

With where

(7)

So when , and there is no interaction (i.e., when the first targeted gene is a non-targeting control),

(8)

That is the same distribution of in (2). Also note that the multinoulli distribution of in (1) is the same as the multinoulli distribution of in (6) with . In a later subsection, we describe the simulation of the number of divisions in an SKO or DKO cell without gene interaction from the multinoulli distribution shown in (6). We also used (6) to simulate cell divisions of a DKO cell, with interaction jiggling the values of and .
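The no-interaction case can be illustrated numerically. The sketch below is not the published implementation: it assumes that each SKO division count takes the values 0, 1, or 2, that the two counts are independent, and that the DKO count combines them as max(0, x1 + x2 - 1). This combination rule is a hypothetical reading of the additive definition, chosen only because it reproduces the property stated above (a non-targeting first gene leaves the distribution of the second gene unchanged). The function name, variable names, and probability values are illustrative.

```python
from itertools import product

def dko_division_probs(p1, p2):
    """Combine two SKO division-probability vectors into a DKO vector.

    p1, p2: probabilities of 0, 1, 2 divisions for each single knockout.
    Assumption (hypothetical): the genes act independently and combine
    through y = max(0, x1 + x2 - 1), so y ranges over 0..3.
    """
    q = [0.0, 0.0, 0.0, 0.0]
    for x1, x2 in product(range(3), repeat=2):
        y = max(0, x1 + x2 - 1)
        q[y] += p1[x1] * p2[x2]
    return q

# Assumed: a non-targeting control always divides once, like a WT cell.
non_targeting = [0.0, 1.0, 0.0]
# Illustrative values for a gene with a negative phenotype.
negative_gene = [0.3, 0.7, 0.0]

q = dko_division_probs(non_targeting, negative_gene)
print(q)
```

Under these assumptions the first three entries of the DKO vector equal the SKO vector of the second gene and the probability of three divisions is zero, which is the behavior described for (8).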

Genetic interaction derivations.

To simulate the gene-gene interactions, we define the theoretical GI based on the growth rate of the cells [18] as

(9)

where, denotes the number of descendants of the DKO cell, defined as shown in (3), after one unit of population doubling cycle. is defined in (7) and is calculated by defined as follows:

Based on the GI flag (no interaction coded as and interaction as ) from the initial cell library detailed in Simulation System Design I, we define the resampled theoretical phenotypes with interactions as

where , and indexes the first and second KO genes in the cell. We defined as the resulting when replacing by in (7), and the simulated GI as shown in (9). When , we defined for and then .

Simulation system design I: Cell library construction

Parameters specification.

Table 1 summarizes all tunable parameters used as inputs in our CRISPR KO simulation scheme. The glossary of each tunable component and its deduced products in our designed system is detailed in S1 Text, Glossary of tunable components.

Table 1. Summary table of tunable parameters in the simulation system1.

https://doi.org/10.1371/journal.pcbi.1013510.t001

SKO gene initialization.

As defined in (1), we first initialized the theoretical phenotypes for the SKO cells, , . Each corresponds to one of four classes: negative, positive, wild-type (WT), and non-targeting control. The proportion of s of each class is specified by the user. The distribution of within each class is also prespecified by the user. Specifically, we assumed the distribution for phenotypes , is:

  1. Negative:
  2. Positive:
  3. WT:
  4. Non-Targeting Control:

where denotes the normal distribution with mean and variance truncated to the interval .

Second, for each we defined that would later help us determine the initial relative frequency of each (SKO and DKO) construct in the library. We obtained by drawing

where is the standard deviation of the log10 normal distribution of . Following CRISPulator [11], by default, where the number 3.29 was chosen so that there is a 10-fold difference between the 95 and 5 percentiles of the initial SKO counts distribution. As a toy example, shown in S1 Table, we generated a library with genes, 1,2, and 3 belonging to negative, WT, and non-targeting control gene classes, respectively. After sampling from the above distributions, we obtained , and .
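The role of the constant 3.29 can be checked with a short calculation. The sketch below assumes that the default standard deviation on the log10 scale is 1/3.29, which is what the stated 10-fold spread between the 95th and 5th percentiles implies; the variable names are illustrative.

```python
from statistics import NormalDist

# Assumption: initial frequencies are log10-normal with sd = 1 / 3.29.
sigma = 1 / 3.29

# Standard normal quantiles at the 95th and 5th percentiles.
z95 = NormalDist().inv_cdf(0.95)
z05 = NormalDist().inv_cdf(0.05)

# Ratio of the 95th to the 5th percentile on the original (linear) scale.
fold = 10 ** (sigma * (z95 - z05))
print(round(fold, 2))
```

Because the two quantiles are about 3.29 standard deviations apart, the ratio comes out at approximately 10.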

DKO gene initialization.

To initialize the cell library containing both SKO and DKO cells, we generated all unique combinations of gene pairs that can be targeted for KO, with , . Then we row-bound the s and s to generate all indices for genes and their unique pairwise combinations.

We aimed to generate a set of preliminary (non-standardized) frequencies for SKOs and DKOs. We defined a preliminary non-standardized (i.e., they do not add up to 1) frequency of cells with SKOs and DKOs. At the initial timepoint , this frequency is defined as for and for . These frequencies will be used later to generate the initial library counts.

Genetic interaction index initialization.

For every an interaction indicator was generated. Within the set of DKOs not containing a non-targeting control, we randomly selected of DKOs and flagged their genes as interacting (coded as ) and the rest as not interacting (). Recall that , selected by the user, is the proportion of DKO constructs with no non-targeting controls whose genes interact. For each we defined and , that later helped us define interaction, as for if there is no interaction, ; and we draw , with if there is interaction, . The user-specified simulation parameter will control the magnitude of the GIs in the gene pairs. Continuing with our toy example, let’s assume that we are requesting that , i.e., 1st and 2nd targeted genes interact. S2 Table shows an example of 3 unique genes drawn from the above distributions, with all possible combinations without considering orders.
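The flagging step can be sketched as follows. This is a minimal illustration rather than the published code: the function name, the class labels, and the use of rounding to convert the requested proportion into a number of pairs are assumptions.

```python
import random
from itertools import combinations

def flag_interactions(gene_classes, p_gi, seed=0):
    """Flag a proportion p_gi of eligible gene pairs as interacting.

    gene_classes: dict mapping gene index -> class label (labels are
    illustrative). Pairs containing a non-targeting control are never
    flagged. Returns a dict mapping each pair to 0 or 1.
    """
    rng = random.Random(seed)
    pairs = list(combinations(sorted(gene_classes), 2))
    eligible = [
        (i, j) for i, j in pairs
        if "non-targeting" not in (gene_classes[i], gene_classes[j])
    ]
    n_flag = round(p_gi * len(eligible))  # assumed rounding rule
    flagged = set(rng.sample(eligible, n_flag))
    return {pair: int(pair in flagged) for pair in pairs}

# Toy library from the text: genes 1, 2, 3 are negative, WT, non-targeting.
classes = {1: "negative", 2: "WT", 3: "non-targeting"}
flags = flag_interactions(classes, p_gi=1.0)
print(flags)
```

With the toy library only the pair (1, 2) is eligible, so requesting that all eligible pairs interact flags genes 1 and 2 and leaves both pairs involving the non-targeting control unflagged.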

Guides initialization.

We initialized the guides targeting each gene and categorized them based on their KO efficiency. The efficacy of the guides targeting gene was denoted as . We added an index or , to differentiate between high () and low () efficiency guides. Guide-efficacy (high/low) of (i.e., # of genes × # of guides targeting each gene) guides was determined by randomly selecting guides to be high-efficacy and the rest to be low-efficacy. Recall that is the percentage of high-efficacy guides chosen by the user. Based on the simulated guide-efficacy category (high or low), the guide-efficacy value was simulated according to a tunable CRISPR model parameter, as summarized below:

  1. CRISPRn (CRISPR-nuclease):

There is no experimental evidence supporting the threshold of 0.6. Nevertheless, both mean efficiency of high and low-efficacy guides and are tunable by users, with default values 0.9 (= 0.1), and 0.05 (= 0.07). The guide efficacy for both high and low categories drawn from randomization almost never reaches 0.6 by default. This threshold is used to ensure that the overlapping probability is exactly zero, thereby guaranteeing that the high-efficacy guides always have higher efficacy than the low-efficacy guides.

  2. CRISPRn-100%Eff (CRISPR-nuclease with full-efficacy guides):
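The two modes can be sketched with a simple rejection sampler. This is an illustration under stated assumptions, not the published code: high-efficacy guides are assumed to be drawn from a normal distribution with mean 0.9 and spread 0.1 restricted to [0.6, 1], low-efficacy guides from a normal distribution with mean 0.05 and spread 0.07 restricted to [0, 0.6], and every guide is assumed to have efficacy 1 in the full-efficacy mode. The function names and the treatment of 0.1 and 0.07 as standard deviations are hypothetical.

```python
import random

def truncated_normal(rng, mean, sd, low, high):
    """Draw from a normal(mean, sd) restricted to [low, high] by rejection."""
    while True:
        x = rng.gauss(mean, sd)
        if low <= x <= high:
            return x

def sample_guide_efficacies(n_guides, frac_high, mode="CRISPRn", seed=0):
    """Return a list of (is_high, efficacy) tuples, one per guide."""
    rng = random.Random(seed)
    if mode == "CRISPRn-100%Eff":
        # Assumed: every guide is fully efficient in this mode.
        return [(True, 1.0)] * n_guides
    n_high = round(frac_high * n_guides)
    high = set(rng.sample(range(n_guides), n_high))
    guides = []
    for g in range(n_guides):
        if g in high:
            # Assumed defaults: mean 0.9, spread 0.1, kept at or above 0.6.
            guides.append((True, truncated_normal(rng, 0.9, 0.1, 0.6, 1.0)))
        else:
            # Assumed defaults: mean 0.05, spread 0.07, kept at or below 0.6.
            guides.append((False, truncated_normal(rng, 0.05, 0.07, 0.0, 0.6)))
    return guides

guides = sample_guide_efficacies(n_guides=20, frac_high=0.75)
high_values = [e for is_high, e in guides if is_high]
low_values = [e for is_high, e in guides if not is_high]
print(len(high_values), len(low_values))
```

Truncating both distributions at the same threshold is what keeps every high-efficacy guide at or above every low-efficacy guide, even though the untruncated normals would overlap with small probability.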

Constructs frequency and counts initialization.

We utilized Dirichlet distributions to randomly assign the initialized cell counts to guides as follows:

1. For SKO, we determine the non-standardized relative frequency of cells with guide targeting gene , i.e., for and :

(10)

s.t.

2. For DKO, we computed the non-standardized relative frequency of cells with a guide targeting gene and guide targeting gene , i.e., for , and :

(11)

s.t.

For each pair of genes, across the guides, the average relative frequency of the SKOs is the same as the average relative frequency of DKOs. Hence, the initial counts of SKO and DKO are similar. We also expect the distribution of Log2 fold change to be similar to the distribution of when the gene is a non-targeting control (i.e., no interaction).

Specifically, we chose for generating in (10) and in (11) to maintain a small variance in the relative frequencies of the SKO and DKO counts across the guides within each gene or gene-pair, respectively.

Similar to the DKO gene initialization, we calculated the relative frequency of each construct (i.e., combinations of the targeting SKO(DKO) gene(s) and the corresponding guides) at as

(12)

with , and . so that the relative frequencies sum up to .

The initial construct counts are set equal to for and for , respectively, where again denotes the rounding of the real number to the nearest integer, is the total number of unique single input genes, is the total number of guides corresponding to each unique KO construct, and is the requested initial library size specified by the user. Our initialized cell library with constructs from all pre-specified guides and genes is denoted as . S3 Table shows the toy example of the initialized cell library counts, with 18 rows, based on S2 Table, using a coverage yielding a library size of .
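The number of rows in the toy library can be checked by enumerating the constructs. The sketch below assumes two guides per gene, a value not stated in this section but one that reproduces the 18 rows reported for three genes; the function name and the tuple layout are illustrative.

```python
from itertools import combinations, product

def enumerate_constructs(n_genes, guides_per_gene):
    """List SKO and DKO constructs as (genes, guides) tuples."""
    genes = range(1, n_genes + 1)
    guides = range(1, guides_per_gene + 1)
    # One SKO construct per gene and guide.
    sko = [((g,), (a,)) for g in genes for a in guides]
    # One DKO construct per unordered gene pair and ordered guide pair.
    dko = [
        ((g1, g2), (a, b))
        for g1, g2 in combinations(genes, 2)
        for a, b in product(guides, repeat=2)
    ]
    return sko, dko

# Assumption: the toy example uses 2 guides per gene.
sko, dko = enumerate_constructs(n_genes=3, guides_per_gene=2)
print(len(sko), len(dko), len(sko) + len(dko))  # 6 12 18
```

Three genes give 3 x 2 = 6 SKO constructs and 3 gene pairs x 4 guide pairs = 12 DKO constructs, for 18 constructs in total.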

Cell growth behavior of SKO and DKO constructs incorporating guide-efficacy.

Following the simulation of cell behavior with KO genes, incorporating both the guide-efficacy and GI effects, we defined

(13)

for , and . The cell growth behavior of will be determined by while the behavior will be determined by both and . Recall that is the efficacy of the guide targeting gene . More specifically, by (1), (2) and (8), cell growth on a SKO or DKO depends on theoretical phenotypes , that determine the growth when . Equation (13) above decreases the cell growth rate of cells infected with low-efficacy guides compared to the cells infected with high-efficacy guides. For let and , and for let and , in (7) to compute to simulate the cell division behavior. In every cell doubling cycle, the cell will divide times with multinomial with in (4).

Computation of Simulation True Genetic Interaction.

With the initialized cell library specified above, and based on the methodology demonstrated in the cell-behaviors multinoulli distribution derivations subsection, the simulated truth cell behavior with interactions of genes and was defined as follows:

  1. Using and in (7), we computed the cell division probability based on KO gene effects without GI
  2. Using and in (7), we computed the cell division probability based on KO gene effects with GI
  3. The gene-level GI values were computed by plugging in the values and in (9).

The GIs were categorized as either negative, none, or positive. In the absence of an interaction , and (9) produces a 0 interaction. For example, the interaction of genes 1 and 2 in the toy example in S2 and S3 Tables , and in item 1 above; , and in item 2 above; producing a simulation true interaction of -0.1506 in item 3.

Simulation system design II: Cell population transfection and selection

This subsection profiles the methods used for simulating cell population transfection and selection, and simulation data processing.

Transfection and selection.

In the simulated cell population transfection and selection stage, we used as a cell doubling cycle counter and as a counter of bottleneck encounters. We initialized and . The number of bottleneck encounters, was specified by the user, and the maximum number of cell doubling cycles was set to 30. The setup was as follows:

1. At , we initialized the current library of cell counts for every construct equal to the initial library; for SKO and for DKO. The dimensional vector of cell counts was calculated by binding the current counts of SKO and DKO constructs as follows:

and

2. The parameters were stored to the matrix , wherein row contains the cell division probability vector for construct . Hence, construct has current counts and the cell division vector equal to row of .

When and , we iteratively run:

1. Compute the current library size and check the bottleneck condition:

where is the user-prespecified bottleneck size.

• If yes, let , and draw cells from the current cell library:

(14)

and set the current value of equal to . Here denotes the multivariate hypergeometric distribution, which gives the number of balls drawn of each color when balls are extracted without replacement from an urn containing balls of color ; here is a vector whose dimension is the number of different colors in the urn. If, for example, the current library has 50 cells, the new library will have at most 50 of these cells. Following CRISPulator, we do not model multiple infections and assume the simulated CRISPR screening is performed at a low multiplicity of infection (MOI) [11]. By default, we chose MOI λ = 0.3 and modeled transfection as a Poisson process to select the cells that have a single transfection occurrence by

where of the is set to be , the number of cells that we kept after reaching the bottleneck.

2. For each construct , grow to following the row of that defines the corresponding cell division probability vector for both if , and if . The current value of is set to to simulate the cell populations’ transfection and growth. In math, before growth we have construct cells, with each cell dividing according to (6), producing cells after one growth cycle with

So that And set .

Increase the iteration counter from to .
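The iteration above can be sketched as a short loop. This is a simplified illustration, not the published algorithm: it omits the Poisson transfection step, it assumes that a cell with k >= 1 divisions leaves 2**k descendants while a cell with no division loses viability and leaves none, and all function names, parameter names, and numeric values are hypothetical. The bottleneck uses `random.sample` with `counts` (Python 3.9 or later), which draws without replacement and therefore follows a multivariate hypergeometric distribution.

```python
import random
from collections import Counter

def bottleneck(rng, counts, size):
    """Draw `size` cells without replacement (multivariate hypergeometric)."""
    drawn = rng.sample(range(len(counts)), k=size, counts=counts)
    tally = Counter(drawn)
    return [tally.get(i, 0) for i in range(len(counts))]

def grow(rng, n_cells, probs):
    """One doubling cycle for n_cells cells sharing one division vector.

    Assumption: k >= 1 divisions leave 2**k descendants; k = 0 leaves none.
    """
    if n_cells == 0:
        return 0
    divisions = rng.choices(range(len(probs)), weights=probs, k=n_cells)
    return sum(2 ** k for k in divisions if k > 0)

def run_screen(counts, prob_rows, bottleneck_size, n_bottlenecks,
               max_doublings=30, seed=0):
    """Alternate bottleneck checks and growth until either limit is hit."""
    rng = random.Random(seed)
    t = b = 0
    while t < max_doublings and b < n_bottlenecks:
        if sum(counts) > bottleneck_size:
            counts = bottleneck(rng, counts, bottleneck_size)
            b += 1
        counts = [grow(rng, n, p) for n, p in zip(counts, prob_rows)]
        t += 1
    return counts, t, b

# Illustrative division vectors over 0, 1, 2, 3 divisions per cycle.
probs = [
    [0.3, 0.7, 0.0, 0.0],  # negative phenotype
    [0.0, 1.0, 0.0, 0.0],  # WT-like
    [0.0, 0.8, 0.2, 0.0],  # positive phenotype
]
final, t, b = run_screen([200, 200, 200], probs,
                         bottleneck_size=1000, n_bottlenecks=2)
print(final, t, b)
```

Sampling individual cells is adequate for a toy library; a large-scale run would need a vectorized sampler, which is outside the scope of this sketch.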

Simulation data processing.

After this iterative process (when we reach either or ), the cell library at this final timepoint was denoted as . We set the cell counts of referred as equal to , and computed the total library size . The relative frequency of each construct at was calculated as follows:

(15)

where , and .

Accordingly, we defined the dimensional vector of the relative frequency of constructs at , , by binding the relative frequency of SKO and DKO constructs at as shown below:

and

Similarly, using equation (12), we also defined the dimensional vector of the relative frequency of constructs at the initial timepoint by binding the relative frequency of SKO and DKO constructs at as shown below:

and

Based on the definition of log fold change and included in the parameters specification subsection, we calculated the Log2 -dimensional Fold-Change (LFC) vector at vs. as follows:

(16)

where is the pseudo-count added to each relative frequency of the constructs to avoid taking the logarithm of zero. The rest of the sections in simulation system design are detailed in S1 Text, Supplemental Methods and Materials.
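The LFC calculation can be sketched as follows. The function name, the argument names, and the pseudo-count value of 1e-6 are assumptions made for illustration; the only properties taken from the text are that the LFC is a base-2 log ratio of relative frequencies at the final vs. the initial timepoint and that a pseudo-count is added to each relative frequency.

```python
import math

def log2_fold_change(freq_t0, freq_t2, pseudo=1e-6):
    """Per-construct log2 ratio of final to initial relative frequency.

    pseudo: hypothetical pseudo-count added to both frequencies so that
    constructs that drop out (frequency 0) still give a finite value.
    """
    return [
        math.log2((f2 + pseudo) / (f0 + pseudo))
        for f0, f2 in zip(freq_t0, freq_t2)
    ]

# Illustrative frequencies: enriched, depleted to zero, and unchanged.
freq_t0 = [0.25, 0.25, 0.50]
freq_t2 = [0.50, 0.00, 0.50]
lfc = log2_fold_change(freq_t0, freq_t2)
print([round(x, 3) for x in lfc])
```

A construct whose relative frequency doubles has an LFC close to 1, an unchanged construct has an LFC of 0, and a construct that disappears receives a large negative but finite value.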

Algorithmic designs: Monte-Carlo simulation on large scales

Algorithm: Double CRISPR-Knockout Simulation (DKOsim).

The practical computational workflow of DKOsim for large-scale applications is summarized in S1 Text, Simulation Steps, and the computational modules are presented in S1 Fig. Using the notations defined in the parameter specifications subsection, we initialized the simulated cell library as desired by users. To model the competitive exponential cell growth and selection procedures after library transfection and transduction, we designed Algorithm 1 (Fig 4) to simulate the cell population adaptations through multinoulli resampling induced by the growth bottleneck.

Fig 4. Algorithm for double CRISPR knockout simulation.

Algorithmic design of Double CRISPR-Knockout Simulation (DKOsim) via Monte-Carlo samplers.

https://doi.org/10.1371/journal.pcbi.1013510.g004

Results

DKOsim is tunable to infer the asymptotic effects of laboratory CRISPR screening parameters

We systematically ran DKOsim to investigate the tunability of the scheme and validate its use for simulating the empirically expected CRISPR screening data pattern from laboratory experiments. From the tunable parameters summarized in Table 1, we chose six parameters to compare the simulation tunability: coverage, percentage of high-efficacy guides, GI magnitude, dispersion of the initial frequency of SKO counts, number of guides per gene, and cell doublings. To quantify the association between the GI identifications and the simulated GIs, we applied Delta Log Fold Change (dLFC), which calculates the deviation of the mean log fold change (LFC) of all constructs targeting a gene pair from the expectation given by the sum of the single mutant fitness (SMF) for the two genes [19], and measured the association by Pearson’s correlation r. Additionally, we calculated the precision and recall of the 80 strongest dLFC GI identifications among the top 100 negative simulated GI hits (Precision/Recall@80) to quantify the dLFC GI identification performance. Since we were simulating both SKO and DKO, we applied BAGEL [20,21] to the simulated SKO data to assess the validity of the simulated essential genes, defined as genes with negative theoretical phenotypes in our designed system. All parameters other than the one being tuned were held fixed across the compared runs. Specifically, for the simulated screenings that were compared, we summarized the default parameters of the baseline screening - Simulation (Systematic Run - Baseline) in Table 2. One simulation was run for each unique combination of parameters, and we visualized the analytical outputs in line diagrams (Fig 5).
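The two evaluation quantities can be sketched as follows. This is a minimal illustration with hypothetical function names and toy values; it assumes gene-level LFCs have already been averaged across constructs, and it reads Precision/Recall@80 as the overlap between the 80 most negative dLFC scores and the 100 most negative simulated GIs.

```python
def dlfc(lfc_pair, lfc_a, lfc_b):
    """Observed pair LFC minus the additive expectation from single knockouts."""
    return lfc_pair - (lfc_a + lfc_b)

def precision_recall_at_k(scores, truth, k, n_true):
    """Overlap of the k most negative scores with the n_true most negative truths.

    scores, truth: dicts mapping a gene-pair label to a value.
    """
    predicted = set(sorted(scores, key=scores.get)[:k])
    actual = set(sorted(truth, key=truth.get)[:n_true])
    hits = len(predicted & actual)
    return hits / k, hits / n_true

# Toy values only; pair labels are illustrative.
print(dlfc(lfc_pair=-3.0, lfc_a=-1.0, lfc_b=-0.5))  # -1.5

scores = {"A_B": -2.0, "A_C": -0.1, "B_C": -1.5, "A_D": 0.3}
truth = {"A_B": -1.0, "A_C": 0.0, "B_C": 0.0, "A_D": -0.5}
precision, recall = precision_recall_at_k(scores, truth, k=2, n_true=2)
print(precision, recall)  # 0.5 0.5
```

With k = 80 and n_true = 100, as in the text, recall under this reading cannot exceed 0.8, because at most 80 of the 100 simulated hits can be recovered.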

Table 2. Input parameters1 in DKOsim simulation (Systematic Run - Baseline).

https://doi.org/10.1371/journal.pcbi.1013510.t002

Fig 5. Tunability of the parameters in the simulation scheme.

Delta Log Fold-Change (dLFC) is applied to simulations to measure the genetic interaction (GI) scores. (A)-(C) Simulation Runs on Coverage Effects. We use the parameters in Table 2 except for the coverage C varying from 1x to 500x: (A) Changes of Pearson correlation r on dLFC vs. simulated GI, from Coverage varying runs. (B) Changes of AUC-PR for SKO Gene Essentiality on BAGEL identifications on the simulated essential (negative phenotypic) genes, from Coverage varying runs. (C) Changes of Precision and Recall of the 80 most negative dLFC identifications on the top 100 simulated negative GI hits, from Coverage varying runs. (D)-(F) Simulation Runs on Percentage of High-Efficacy Guides Effects. We use the parameters in Table 2 except for the percentage of guides with high-efficacy %heg varying from 10% to 100% with Mode on “CRISPRn”, stratified by coverage at 10x, 100x, and 200x: (D) Changes of Pearson correlation r on dLFC vs. simulated GI, from High-Efficacy Guides Percentage varying runs. (E) Changes of AUC-PR for SKO Gene Essentiality on BAGEL identifications on the simulated essential (negative phenotypic) genes, from High-Efficacy Guides Percentage varying runs. (F) Changes of Precision and Recall of the 80 most negative dLFC predictions on the top 100 simulated negative GI hits, from High-Efficacy Guides Percentage varying runs. (G)-(I) Simulation Runs on Dispersion of Initial Counts Effects. We use the parameters in Table 2 except for the dispersion of the initial frequency of SKO counts σf varying from to , stratified by coverage at 10x, 100x, and 200x: (G) Changes of Pearson correlation r on dLFC vs. simulated GI, from Initial Counts Dispersion varying runs. (H) Changes of AUC-PR for SKO Gene Essentiality on BAGEL identifications on the simulated essential (negative phenotypic) genes, from Initial Counts Dispersion varying runs. 
(I) Changes of Precision and Recall of the 80 most negative dLFC predictions on the top 100 simulated negative GI hits, from Initial Counts Dispersion varying runs. (J) Simulation Runs on Number of Guides per Gene Effects. We use the parameters in Table 2 except for the number of guides per gene ng tuned varying from 1 to 10 and the percentage of guides with high-efficacy %heg tuned from 10% to 100%: Changes of Pearson correlation r on dLFC vs. simulated GI, from Number of Guides per gene varying runs, colored by the percentage of high-efficacy guides. (K) Simulation Runs on Cell Doublings Effects. We use the parameters in Table 2 except for the bottleneck size nb and the number of bottleneck encounters ne tuned to control the cell doublings varying from 1 to 19: Changes in the fraction of reads in the top 5% of the guides, from Cell Doublings varying runs. (L) Simulation Runs on GI Magnitude Effects. We use the parameters in Table 2 except for the strength of the simulated GIs σGI varying from 0.1 to 5: Changes of Precision and Recall of the 80 most negative dLFC identifications on the top 100 simulated negative GI hits, from GI Magnitude varying runs.

https://doi.org/10.1371/journal.pcbi.1013510.g005

For experimental parameters that can be directly controlled in the experimental design, including coverage, guide quality, dispersion of the initial construct counts, and cell doublings, we compared and visualized the effects of each by systematically running DKOsim (Fig 5A-5K). When increasing the coverage of the CRISPR screening experiments, the Pearson correlation r between the detected GI and the simulated GI increased monotonically and reached an asymptote around 0.7 at 100x (Fig 5A). Results of AUC-PR from BAGEL SKO gene essentiality identification further demonstrated that identification of the simulated SKO essential genes asymptotically reaches its best performance starting at 100x coverage, with an AUC-PR of 0.98 (Fig 5B). We quantified the GI identification performance of dLFC and visualized the changes in Precision/Recall@80 in Fig 5C, while the changes in the AUC-PR for all negative GIs are shown in S2A Fig. Monotonically increasing patterns are found in both, with the asymptote at 100x. Precision@80 exceeds both Recall@80 and AUC-PR, reaching an asymptote around 0.86 at 100x. Beyond this 100x asymptotic point, increases in experimental coverage do not substantially improve either the association or the identification performance of dLFC against the simulated GI, indicating that the most cost-effective screening coverage is 100x.

We then compared the tunability for the percentage of high-efficacy guides. When the systematic runs were restricted to compare guide quality by tuning the percentage of high-efficacy guides at 100x coverage, the correlation between dLFC and simulated GIs increased as the percentage of high-efficacy guides rose from 10% to 100%, with r = 0.76 for 100% high-efficacy guides (Fig 5D). We additionally tested whether there are synergistic effects between the percentage of high-efficacy guides and coverage. Simulations were run with the percentage of high-efficacy guides increasing from 10% to 100%, stratified by coverage at 10x, 100x, and 200x. Results showed that, at the same percentage of high-efficacy guides, correlations increase with higher coverage overall. We observed that the effect of increasing the percentage of high-efficacy guides on correlation is synergistically enhanced by higher coverage, particularly at low-to-moderate efficacy ranges, where improvements in guide efficacy translate into disproportionately larger gains in correlation.

Fig 5E shows BAGEL essentiality identification results: while BAGEL AUC-PR in recovering SKO gene essentiality is high (AUC-PR > 0.75) across almost all conditions, the benefits of increasing the percentage of high-efficacy guides depend on coverage, with substantial improvements at low coverage and diminishing returns once coverage is sufficient. This suggests a weak, close-to-additive interaction between the percentage of high-efficacy guides and coverage for SKO gene essentiality identification. To evaluate GI identification performance from dLFC, we examined changes in Precision/Recall@80 for the top-ranked negative genetic interactions across increasing percentages of high-efficacy guides (Fig 5F), as well as AUC-PR for all negative GIs (S2B Fig). Both precision and recall increased with guide efficacy; however, the magnitude of improvement strongly depended on library coverage. Specifically, increases in the proportion of high-efficacy guides yielded the largest gains in both Precision@80 and Recall@80 under low-coverage conditions at 10x, whereas performance improvements diminished as coverage increased to 100x and 200x. At moderate to high coverage, Precision@80 approached 0.95, indicating that approximately 76 of the top 80 ranked interactions correspond to true simulated negative GIs. In contrast, recall improved more gradually and exhibited earlier saturation, reflecting the greater difficulty of exhaustively recovering all true interactions. Together, these results demonstrate a coverage-dependent, synergistic interaction between library coverage and guide efficacy, whereby improvements in guide efficacy disproportionately enhance GI detection performance when guide sampling is limited, with diminishing returns once coverage becomes sufficient.

To demonstrate the tunability of the initial counts dispersion, following Table 1, we systematically ran DKOsim by tuning based on the z-score resulting from a 40–90% confidence level. For example, at 90% confidence, the expected z-score is 3.29, and we constructed the SKO gene frequency following a normal distribution with so that there is a 10-fold difference between the 95th and 5th percentiles. As defined, with 100x coverage, lower confidence leads to higher , resulting in a higher dispersion of the initial count distribution. Based on this setting, we visualized the changes of the correlation r with increasing setting to the dispersion of initial frequencies of SKO counts (Fig 5G). The increments of the initial counts dispersion decrease the correlation r on dLFC vs. simulated GI overall, and we found that r drops to 0.58 when . On metrics of correlation r, we tested whether there are synergistic effects between the initial counts dispersion and coverage. Increasing dispersion in initial SKO counts’ frequencies consistently reduced the correlation between dLFC and simulated GIs; however, the magnitude of this effect depended on library coverage. In particular, high dispersion caused a disproportionate loss of correlation under low-coverage conditions, whereas higher coverage partially buffered against uneven starting frequencies. These results indicate a synergistic interaction between library coverage and the initial SKO counts dispersion, whereby limited sampling and skewed guide abundances jointly amplify stochastic noise and degrade GI detection performance from dLFC.
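The z-scores quoted in this section (3.29 at 90% confidence here, and 2.56 at 80% confidence in the later comparison with laboratory data) coincide with the width, in standard deviations, of the central interval of a standard normal distribution; at 90% this is the distance between the 95th and 5th percentiles. The sketch below reproduces those values under that reading, which is our inference from the quoted numbers and not a statement of how DKOsimR computes them. The function name is hypothetical.

```python
from statistics import NormalDist

def central_width_z(confidence):
    """Width, in standard deviations, of the central `confidence` interval of a normal."""
    upper_quantile = (1.0 + confidence) / 2.0   # e.g. 0.95 for a 90% central interval
    return 2.0 * NormalDist().inv_cdf(upper_quantile)

for level in (0.40, 0.80, 0.90):
    print(f"{level:.0%} -> {central_width_z(level):.2f}")
# 40% -> 1.05
# 80% -> 2.56
# 90% -> 3.29
```

A lower confidence level gives a smaller width, so a fixed fold difference between the two percentiles is spread over fewer standard deviations, which is consistent with the statement that lower confidence yields a more dispersed initial count distribution.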

We applied BAGEL essentiality identification on simulations in this setting. Increasing dispersion in initial SKO counts’ frequencies had only a modest effect on SKO gene essentiality identification performance, with AUC-PR remaining high (AUC-PR > 0.75) across most conditions (Fig 5H). While higher dispersion led to gradual performance degradation at low to moderate coverage, high coverage largely buffered against uneven starting frequencies, maintaining near-ceiling AUC-PR values. Unlike GI detection, SKO essentiality exhibited a saturating response to both coverage and initial frequency dispersion, indicating limited interaction and no strong synergistic effect between these parameters. Increasing dispersion in initial SKO counts’ frequencies led to a pronounced decline in both Precision@80 and Recall@80 for top negative GI detection (Fig 5I). Importantly, the magnitude of this degradation depended strongly on library coverage: under low-coverage conditions, even moderate dispersion caused a rapid collapse in both precision and recall, whereas higher coverage substantially mitigated these effects. The highly non-parallel performance trajectories indicate a strong synergistic interaction between coverage and initial frequency dispersion, whereby limited sampling and skewed guide abundances jointly amplify stochastic noise and severely compromise GI detection.

We further tested the effects of the number of guides per gene (Fig 5J) at different proportions (10%-100%) of high-efficacy guides. With only 10% high-efficacy guides, we observed a monotonically increasing trend in the Pearson correlation r between dLFC and simulated GI, peaking at 0.78 with 10 guides per gene. When 30% of the guides were highly efficient, the correlation was highest at 5 guides per gene with r = 0.8; when either 50% or 80% of the guides were highly efficient, we observed an asymptotic optimum of the correlation at 3 guides per gene with r = 0.79 and 0.77, respectively; and when all of the guides were highly efficient, the correlation reached its optimum of r = 0.76 at 2 guides per gene and, unexpectedly, showed a decreasing trend beyond this point, indicating that dLFC might not identify GIs well in a perfect guide-efficacy scenario. Based on the simulation results, 3 guides per gene would be enough in real laboratory screening, provided that guides of sufficient quality are ordered. Our simulation reflected the synergistic effects between the proportion of high-efficacy guides and the number of guides per gene in CRISPR screens. Taking a correlation of at least 0.75 as the threshold for GIs to be effectively identified, simulations with a proportion of high-efficacy guides lower than 50% need more guides per target gene (≥ 5) in the library design. For experiments with higher proportions of high-efficacy guides (≥ 80%), two guides are generally sufficient. This shows that the number of guides per gene and the proportion of high-efficacy guides interact synergistically, with more guides per target yielding disproportionately greater stability when higher fractions of guides achieve strong knockout efficiency.

For cell doublings, we compared and visualized the asymptotic trend of the fraction of reads in the top 5% of guides (Fig 5K). From 1 to 19 cell doublings, the fraction of reads in the top 5% of guides increased monotonically, because cell diversity decreases as simulated cells carrying constructs that target negative-phenotype genes die. With 19 doublings, 100% of the reads are represented by the top 5% of guides, indicating that cell diversity reaches a minimum where the cell counts are dominated by one specific construct type, possibly cells carrying dual-positive combinatorial KO gene constructs with positive GIs.
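The diversity metric used for cell doublings can be sketched as follows: rank guides by read count and report the share of all reads carried by the most abundant 5%. The function name and the toy count vectors are hypothetical, and rounding the number of top guides up to at least one guide is our assumption.

```python
def top_fraction_of_reads(counts, top_percent=5):
    """Share of all reads carried by the most abundant `top_percent` of guides."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no reads to summarize")
    # Integer ceiling of len(counts) * top_percent / 100, with at least one guide.
    n_top = max(1, (len(counts) * top_percent + 99) // 100)
    ranked = sorted(counts, reverse=True)
    return sum(ranked[:n_top]) / total

even_library = [100] * 40                 # every guide equally represented
collapsed_library = [4000] + [0] * 39     # one construct has taken over

print(top_fraction_of_reads(even_library))       # 0.05
print(top_fraction_of_reads(collapsed_library))  # 1.0
```

A perfectly even library returns the nominal 5%, and a population dominated by a single construct returns 1.0, which corresponds to the loss of diversity reported at 19 doublings.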

For the biological parameter, GI magnitude, which is determined intrinsically by genetic characteristics in CRISPR screening and is not directly controlled by the laboratory experimenter, we visualized its effects on the Precision/Recall@80 of dLFC on the top 100 negative GI hits (Fig 5L). Results showed that precision and recall for the top hits reached their asymptotes when σGI is 1, where Precision@80 is 0.875 and Recall@80 is 0.7, supported by an asymptotic AUC-PR of 0.792 (S2C Fig).
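Precision/Recall@80 as defined in this section can be sketched as below: the 80 most negative dLFC calls are intersected with the 100 most negative simulated GIs, precision is the overlap divided by 80, and recall is the overlap divided by 100. The function name and the toy data are hypothetical; the toy data are constructed so that 70 calls overlap, which reproduces the reported pair of values (0.875 and 0.7).

```python
def precision_recall_at_k(scores, true_gi, k=80, n_true=100):
    """scores, true_gi: dicts mapping a gene pair to its dLFC / simulated GI value."""
    called = set(sorted(scores, key=scores.get)[:k])          # k most negative calls
    truth = set(sorted(true_gi, key=true_gi.get)[:n_true])    # n_true most negative GIs
    hits = len(called & truth)
    return hits / k, hits / n_true

# Toy data: pairs 0..99 are the 100 most negative simulated GIs.
true_gi = {pair: pair - 200 for pair in range(200)}
scores = {}
for pair in range(200):
    if pair < 70:
        scores[pair] = -1000 + pair   # 70 true interactions, called strongly
    elif 100 <= pair < 110:
        scores[pair] = -500 + pair    # 10 false positives that enter the top 80
    else:
        scores[pair] = pair           # everything else is not called

precision, recall = precision_recall_at_k(scores, true_gi)
print(precision, recall)  # 0.875 0.7
```

Because the denominators are fixed at 80 and 100, the two metrics are tied together: Recall@80 can never exceed 0.8, and it always equals 0.8 times Precision@80.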

We prepared summary guidance on parameter selection for running simulations. The guidance can be found in the “Summary Guidance in Picking Suitable Parameters” section of the tutorial vignette of DKOsimR, the R package we built for DKOsim. See the Data Availability Statement for how to install and access DKOsimR.

DKOsim approximates patterns from actual laboratory Double-CRISPR Knockout screening data

We compared the actual experimental data to the synthetic data approximated by our simulation. We collected three sets of laboratory screening data for approximation, including the combinatorial CRISPR-Cas9 screens designed by Shen et al. [4] (Shen-2017), the “Big Papi” orthologous combinatorial CRISPR-Cas9 screens designed by Doench et al. [22] (Doench-2017), and the “SCHEMATIC” combinatorial CRISPR platform to map synthetic lethal interactions designed by Fong et al. [23] (Fong-2024). In this section, the data approximation is restricted to the A549 cell line, a human lung adenocarcinoma cell line with a KRAS gain-of-function mutation in the oncogenic background of cancer studies.

We investigated whether DKOsim can approximate the constructs’ count distribution at the initial timepoint, given the same number of perturbed genes. To approximate Shen-2017 design1, we initialized Simulation (mimicking Shen) with 120 uniquely perturbed single genes, 3 guides per targeted gene, and 3% GIs, and set 80% confidence (expected z-score of 2.56) on the dispersion of SKO genes for DKOsim (Fig 6A). Comparing simulation results with the Shen-2017 day 3 constructs’ counts in the A549 cell line, the histograms of the log10 counts are closely aligned between the laboratory Shen-2017 data and Simulation (mimicking Shen). To approximate the Doench-2017 design, we initialized Simulation (mimicking Doench) with 28 uniquely perturbed single genes, 5 guides per targeted gene, and 3% GIs, and set 90% confidence (expected z-score of 3.29) on the dispersion of SKO genes for DKOsim (Fig 6B). Compared with the Doench-2017 plasmid constructs’ counts in the A549 cell line, the histograms of the log10 counts overlapped closely between the laboratory Doench-2017 data and Simulation (mimicking Doench). To approximate the Fong-2024 design, we initialized Simulation (mimicking Fong) with 246 uniquely perturbed single genes, 3 guides per targeted gene, and 3% GIs, and set 80% confidence (expected z-score of 2.56) on the dispersion of SKO genes for DKOsim (Fig 6C). Similarly, when we compared simulation results with the Fong-2024 plasmid constructs’ counts in the A549 cell line, the histograms of the log10 counts overlapped closely between the laboratory Fong-2024 data and Simulation (mimicking Fong).

Fig 6. Simulation approximation on laboratory data.

Comparison of Distributions between Simulation and Laboratory Data. A maximum of 30 doubling cycles for each simulation with moi λ = 0.3 is assumed. (A) Histograms of log10-scaled constructs’ counts at the initial timepoint on Shen-2017 A549 vs. Simulation Run (mimicking Shen). Parameters table indicates values of each input parameter for DKOsim (mimicking Shen-2017 A549) and the total number of simulated cell doublings is 3. (B) Histograms of log10-scaled constructs’ counts at the initial timepoint on Doench-2017 A549 vs. Simulation Run (mimicking Doench). Parameters table indicates values of each input parameter for DKOsim (mimicking Doench-2017 A549) and the total number of simulated cell doublings is 3. (C) Histograms of log10-scaled constructs’ counts at the initial timepoint on Fong-2024 A549 vs. Simulation Run (mimicking Fong). Parameters table indicates values of each input parameter for DKOsim (mimicking Fong-2024 A549) and the total number of simulated cell doublings is 3. To associate gene labels from lab data, we initiate the theoretical phenotypes in unknown gene class by . (D) Histograms of overall LFC distributions on Fong-2024 A549 vs. Simulation Run (mimicking Fong). Parameters table indicates values of each input parameter for DKOsim (mimicking Fong-2024 A549) and the total number of simulated cell doublings is 3. To associate gene labels from lab data, we initiate the theoretical phenotypes in unknown gene class by . (E) Histograms of genetic interaction (GI) scores from Schematic on Fong-2024 A549 vs. zdLFC on Simulation (mimicking Fong). (F) Histograms of LFC by gene-gene combinations on Fong-2024 A549 vs. Simulation (mimicking Fong).

https://doi.org/10.1371/journal.pcbi.1013510.g006

Based on the aligned distributions of constructs’ counts at the initial timepoint, we investigated whether DKOsim could simulate actual cell growth and approximate real experimental data patterns on LFC. Specifically, we chose the Fong-2024 [23] design (data characteristics summarized in S4 Table) for simulation approximation, in which this recently developed combinatorial CRISPR platform comprises a panel of 246 genes with 67 frequently mutated genes, asymmetrically crossing another 176 druggable genes, with 3 additional non-targeting controls that do not affect the functions of the cells. Among these genes, 64 genes are treated as essential, where AAVS1 is known as a safe harbor locus that should not disrupt any cell function and is treated as the negative control. The cell line was infected at a multiplicity of infection (MOI) of 0.3 to ensure > 100x coverage in library production. Each gene was targeted by 3 independent guides in this asymmetric library. Within all possibly interacting 12282 gene pairs, 400 pairs were under FDR < 10%, and we treated 3% () of the gene pairs as truly interacting pairs. While screening, two biological replicates were included, each with independent viral transduction on low numbers of cell passages.

To approximate the laboratory design, in our Simulation (mimicking Fong) initialization (Fig 6D Parameters), we included 246 genes: 64 negative phenotypic genes to align with the number of essentials, 178 wild-type genes to align with the number of nonessentials, and 4 non-targeting controls to align with AAVS1 and the 3 non-targeting controls from Fong-2024. Each gene was targeted by 3 independent guides, and the asymmetric design in the lab was extended to a symmetric library where all the initialized genes could interact with each other. To further mimic the design, we set coverage = 1000x with MOI = 0.3 to align with the high laboratory coverage, and 3 cell doublings to align with the low number of passages. We simulated 2 biological replicates, each transduced independently. LFC was calculated for each replicate and aggregated by mean in the final output after all simulated cell growth, transfection, and selections to approximate Fong-2024’s data pattern.
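To illustrate what an MOI of 0.3 implies, the sketch below assumes that the number of viral integrations per cell follows a Poisson distribution with mean equal to the MOI. This is a standard modeling assumption that we adopt here for illustration; it is not a description of how DKOsim implements transduction, and the function name is hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k integrations when the mean number per cell is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

moi = 0.3
p_uninfected = poisson_pmf(0, moi)
p_single = poisson_pmf(1, moi)
p_multiple = 1.0 - p_uninfected - p_single
single_among_infected = p_single / (1.0 - p_uninfected)

print(f"uninfected: {p_uninfected:.3f}")
print(f"single integration: {p_single:.3f}")
print(f"multiple integrations: {p_multiple:.3f}")
print(f"single among infected: {single_among_infected:.3f}")
```

Under this assumption most cells receive no construct at all, and roughly 86% of the infected cells carry exactly one, which is the usual rationale for keeping the MOI low in pooled screens.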

We compared the overall LFC distributions from the Fong-2024 A549 data and from the simulation (Fig 6D). Within the same range, Simulation (mimicking Fong) approximates Fong-2024’s LFC pattern with almost perfect overlap. To compare GI distributions, we applied dLFC to Simulation (mimicking Fong) to identify the simulated GIs, followed by z-standardization of the identification scores, which we name zdLFC. We then compared the distribution of zdLFC with the SCHEMATIC interaction scores from Fong-2024 (Fig 6E). The overall shapes are aligned and most interaction scores fall within -2.5 to 2.5; zdLFC approximates the SCHEMATIC scores with a slightly right-shifted distribution and a few more spikes, possibly due to the inclusion of simulated positive GIs in DKOsim. In contrast, SCHEMATIC is specifically designed for identifying actionable synthetic lethal (negative) interactions.
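The z-standardization step that turns dLFC into zdLFC can be sketched as centering the scores on their mean and scaling by their standard deviation. The use of the sample standard deviation, the function name, and the toy scores are assumptions made for illustration; the source does not state which estimator is used.

```python
from statistics import mean, stdev

def z_standardize(values):
    """Center on the mean and scale by the sample standard deviation (an assumption)."""
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        raise ValueError("all scores are identical; z-scores are undefined")
    return [(v - mu) / sd for v in values]

dlfc_scores = [-3.0, -1.0, 0.0, 1.0, 3.0]   # hypothetical dLFC values
zdlfc = z_standardize(dlfc_scores)
print([round(z, 2) for z in zdlfc])
```

After standardization the scores have mean 0 and unit standard deviation, which places them on a scale comparable to other standardized interaction scores.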

We then compared LFC distributions by unique gene combination. Utilizing the 64 gene essentiality labels from Fong-2024, we deconvoluted both laboratory LFC and simulated LFC by unique gene combos. To align with the laboratory design, we treated our simulated negative genes as essentials, kept the same number of non-targeting controls as Fong-2024, and categorized the rest of the simulated genes, including wild-type and positive genes, as unknown. The deconvolution results (Fig 6F) demonstrated that the LFC of the simulated co-essential, co-unknown, and essential-with-unknown combos approximates the laboratory data well. The alignment is most prominent among the co-unknown genes, which represent the predominant construct category in Simulation (mimicking Fong), where we found almost perfect alignment between the simulation and laboratory data. However, LFC for gene combos that include the simulated non-targeting control genes tends to have smaller variability than in Fong-2024, mainly due to our strict definition of non-targeting controls as having a theoretical phenotype of 0. Since the non-targeting controls are explicitly defined to not affect the exponential cell growth in DKOsim, this results in markedly reduced variability in simulation compared to actual laboratory experimental data.

DKOsim simulates the noise of existence from laboratory experimental replicates and is reproducible from randomness

While DKOsim can approximate data from laboratory CRISPR experiments in real-world settings, we also designed this simulation system to achieve a high degree of reproducibility. Reproducibility, as defined in our context, primarily encompasses two aspects: DKOsim maintains stochastic consistency across Monte-Carlo randomizations, and users across many disciplines can obtain consistent simulation results.

We measured the reproducibility of DKOsim using the Pearson correlation r between two replicates of simulations in DKOsim, on the asymptotic effects of the experimental parameters coverage, percentage of high-efficacy guides, and initial counts dispersion. For coverage from 1x to 500x, DKOsim asymptotically gained higher reproducibility between replicates, shown by the increasing correlations: at 100x the correlation approached 0.95, and it reached 0.99 at 500x (Fig 7A). A monotonically increasing trend was seen in the replicates’ reproducibility as the percentage of high-efficacy guides increased from 10% to 100%. When all guides were 100% efficient in knocking out the target genes, the correlation between the replicates was 0.87, indicating a highly consistent LFC between the two replicates (Fig 7B). As empirically expected, improved guide quality contributes to greater reproducibility of the experiments. We observed a monotonically decreasing trend in the replicates’ reproducibility with larger dispersion of initial counts (Fig 7C), with the correlation r dropping from 0.95 to 0.91 at higher dispersion.
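The replicate comparison above can be sketched in two steps: compute a per-construct LFC for each replicate, then correlate the two LFC vectors. The log base 2, the pseudocount of 1, the normalization to relative abundance, all function names, and the toy counts are assumptions made for illustration and are not taken from DKOsimR.

```python
import math

def log2_fold_change(initial, final, pseudocount=1.0):
    """Per-construct log2 ratio of final to initial relative abundance."""
    n = len(initial)
    init_total = sum(initial) + pseudocount * n
    final_total = sum(final) + pseudocount * n
    return [
        math.log2(((f + pseudocount) / final_total) / ((i + pseudocount) / init_total))
        for i, f in zip(initial, final)
    ]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical counts for five constructs in two replicates sharing one initial library.
initial = [100, 100, 100, 100, 100]
rep1_final = [10, 60, 100, 150, 400]
rep2_final = [14, 50, 110, 140, 380]

lfc1 = log2_fold_change(initial, rep1_final)
lfc2 = log2_fold_change(initial, rep2_final)
replicate_r = pearson_r(lfc1, lfc2)
mean_lfc = [(a + b) / 2 for a, b in zip(lfc1, lfc2)]  # replicates aggregated by mean

print(round(replicate_r, 3))
```

The mean across replicates corresponds to the aggregation step described for Simulation (mimicking Fong), and the correlation between `lfc1` and `lfc2` is the reproducibility measure plotted in Fig 7A-7C.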

Fig 7. Reproducibility of the simulated CRISPR experiments.

(A)-(C) Systematic Tunability Effects on the simulation reproducibility: (A) Changes of Pearson correlation r on LFC between two replicates of the simulation runs, from Coverage varying runs. (B) Changes of Pearson correlation r on LFC between two replicates of the simulation runs, from High-Efficacy Guides Percentage varying runs. (C) Changes of Pearson correlation r on LFC between two replicates of the simulation runs, from Initial Counts Dispersion varying runs. (D)-(E) Comparison of reproducibility on dual-CRISPR screening experiments between laboratory and simulation data: (D) Barplots of Pearson correlation r on the relative frequency of guides and LFC between laboratory replicates with dual-vector designs on two independent viral transduction vs. Simulation (mimicking Fong). (E) Barplots of Pearson correlation r on the relative frequency of guides and LFC between laboratory replicates with single-vector design on one pooled-viral transduction vs. Simulation (Systematic Run - Baseline).

https://doi.org/10.1371/journal.pcbi.1013510.g007

We compared the reproducibility between the laboratory experimental replicates and the simulated replicates in DKOsim across different experimental screening designs. We additionally collected laboratory data from the combinatorial CRISPR-Cas9 metabolic screens designed by Zhao et al.[24] (Zhao-2018), and the “in4mer” CRISPR-Cas12a multiplex knockout screens designed by Hart et al.[6] (Hart-2024). Following the original combinatorial CRISPR screening design of Shen et al.[4], Zhao-2018 and Fong-2024 [23] reproduced the dual-vectorized DKO setup with independent lentiviral transductions per context, where each replicate was independently initiated to ensure reproducibility. This approach offers greater flexibility in scaling the interacting gene pool and captures full biological variability signals in replication from guide integration and infection noise, but introduces more stochastic dropouts depending on guide quality, and higher labor complexity from dual transduction in practical experimental runs. We compared the correlation of the replicates with the dual vectorized design using Simulation (mimicking Fong) which was designed to approximate the Fong-2024 A549 data (Fig 7D). Two sets of the correlations were compared, first on the relative abundance of constructs at final timepoint, and on the LFC between final timepoint vs. initial timepoint from independent transductions. We found increasing replicate correlations from relative abundance compared to the original design of Shen-2017, and the close agreement with Fong-2024 implies that the simulation accurately models and captures the biological noise arising from cell growth and the selection process. While DKOsim approximated the noise signals from replicates, it maintained the highest LFC correlations when compared with other dual-vectorized DKO laboratory data.

Both Doench et al. and Hart et al. built single-vector lentiviral delivery designs. The Doench-2017 [22] design encoded the dual-sgRNA in a single lentiviral construct and conducted one viral transduction per replicate, yielding high precision and reproducibility with cleaner delivery but sacrificing true biological independence. In contrast, the Hart-2024 design had one lentiviral construct encoding up to 4 guide RNAs using Cas12, with a one-time viral transduction of the pooled in4mer library. Within the same transduction, sequencing was conducted multiple times to measure the constructs’ counts, resulting in two technical replicates at the final timepoints. This approach ensures high precision and reproducibility on pre-defined KO gene combos and minimizes possible dropout and noise. But as with the Doench design, the one-time lentiviral transduction biases the replicates’ reproducibility towards technical consistency rather than full biological independence. Under this single-vector experimental design, we compared the replicates’ correlation with Simulation (Systematic Run - Baseline), which was designed to systematically infer the optimal CRISPR library designs (Fig 7E). Two sets of correlations were compared: first on the relative abundance of constructs at the final timepoint, and second on the LFC between the final timepoint and the plasmid library. Results showed that while all screens yielded strong reproducibility, the simulated replicates’ correlation closely agrees with Hart-2024 and is identical to Doench-2017, indicating that DKOsim can also be used to capture noise in single-vector DKO designs for both biological and technical replicates. This conclusion is further supported by the LFC correlations, where the simulated LFC correlates better than the technical replicates from Hart-2024 and falls between the Doench-2017 A549 and A375 cell lines.

Discussion

GIs are rare, and many challenges remain in systematically profiling them without the possibility of validating all gene pairs in high-throughput experiments and in constructing gold standard datasets with true interaction values. On the one hand, laboratory experimental scientists invest a lot of time and resources in performing multiplexed CRISPR screening, which presents both biological and technical difficulties. On the other hand, many existing computational tools have devoted much effort to minimizing the CRISPR screening data noise in order to perform the most robust estimation of gene-gene interactions. But without the underlying truths of the GI values, and systematic quantitative views of the experimental CRISPR screening schemes and parameters, it is only possible to validate the partial interaction detections in restricted cell lines by gathering a large amount of screening data across multiple platforms with different library designs and varying data quality, resulting in a tremendous amount of uncertainty.

To address the aforementioned problems, we designed a Double-CRISPR Knockout Simulation Scheme (DKOsim), which systematizes a Monte-Carlo simulation methodology applicable to mimic both the SKO and DKO laboratory CRISPR knockout screening experiments and data patterns, while ensuring high tunability and reproducibility. Utilizing DKOsim on a large scale, users can simulate desired CRISPR screening datasets with the underlying true values of GIs as input, which serves the goals of both inferring the optimal experimental design for CRISPR knockout screening and supporting the statistical rigor in method development to perform inference on interaction values.

Accordingly, to better demonstrate the working logic of our scheme, we probabilistically derived the distributions of cell growth behavior for both SKO and DKO cells, and defined simulated GIs based on the growth rates of the cells. We incorporated the guide-efficacy design into our simulation and summarized all tunable components in DKOsim. To make large-scale implementation practically feasible, we designed an algorithm for DKOsim that compiles the entire theoretical framework to generate a simulated cell population mimicking the practical laboratory experimental process. As evident from the simulation results, the simulated data align closely with data patterns from practical laboratory experiments in construct count, LFC, and replicate correlation distributions, and, as expected, the distribution of the simulated GI values recapitulates the laboratory-derived interaction scores. This shows the potential of applying DKOsim to generate analyzable synthetic data for downstream analysis.

Furthermore, the DKOsim framework was designed to achieve both stochastic consistency and real-world fidelity. Specifically, it aims to reproduce stable and coherent outcomes across repeated parallel runs, despite the inherent randomness of Monte Carlo processes, while also capturing the biological and technical variability characteristic of many CRISPR screening experimental designs. In addition to modeling experimental noise realistically, DKOsim emphasizes user-level reproducibility: the platform provides a modular simulation R package named DKOsimR, so that future researchers from diverse backgrounds can reliably reproduce results by following the same defined workflow and tutorials. This dual reproducibility, combining algorithmic robustness and user consistency, makes DKOsim a versatile tool for benchmarking and evaluating CRISPR-based double-knockout screening strategies.

The current scheme has limitations, which we will address in future work. First, we model only the on-target effects of the screen; off-target effects may be an important consideration to incorporate. Second, we simplified the cell growth assumptions for the derivations and the simulation program, whereas real cells may divide more than twice and at unequal time intervals. Incorporating continuous-time effects into the current model might better approximate true biological variability. Third, we rely mainly on the Pearson correlation coefficient to validate DKOsim's reproducibility, but this measure may not account for context differences when comparing with real experimental data. Future comparisons on context-specific replicates are needed for a thorough comparison between simulations and real datasets. Lastly, DKOsim is designed mainly to simulate DKO CRISPR screens, though it also simulates SKO effects. Its scalability with respect to the number of initialized single genes has not yet been optimized.

Supporting information

S1 Text. Supplemental methods and materials.

https://doi.org/10.1371/journal.pcbi.1013510.s001

(DOCX)

S1 Fig. Overview of double-CRISPR knockout simulation schematic workflow.

Created in BioRender. Gu, Y. (2026) https://BioRender.com/93cmnmy.

https://doi.org/10.1371/journal.pcbi.1013510.s002

(TIF)

S2 Fig. AUC-PR for all Negative GIs on asymptotic runs of dLFC identifications.

Delta log fold-change (dLFC) is applied to simulations to measure the genetic interaction (GI) scores. Simulated GIs are restricted to negative values for calculating AUC-PR. Line plots show the changes of AUC-PR for all simulated negative GIs on asymptotic runs of: (A) Coverage; (B) Percentage of high-efficacy guides effects; (C) GI Magnitude; (D) Confidence level for initial counts dispersion (%); (E) Number of guides per gene.

https://doi.org/10.1371/journal.pcbi.1013510.s003

(TIF)

S1 Table. Toy example: Single Gene KO parameters.

https://doi.org/10.1371/journal.pcbi.1013510.s004

(DOCX)

S2 Table. Toy example: Single & Double KO Genes Initialization.

https://doi.org/10.1371/journal.pcbi.1013510.s005

(DOCX)

S3 Table. Toy Example: Initial cell library.

https://doi.org/10.1371/journal.pcbi.1013510.s006

(DOCX)

S4 Table. Summary Table for Fong 2024 data characteristics.

https://doi.org/10.1371/journal.pcbi.1013510.s007

(DOCX)

S1 Data. Statistics of DKOsim systematic runs.

https://doi.org/10.1371/journal.pcbi.1013510.s008

(XLSX)

S2 Data. DKOsim Run 95 (Systematic Run - Baseline) data with dLFC scores.

https://doi.org/10.1371/journal.pcbi.1013510.s009

(ZIP)

S3 Data. DKOsim Run 139 (mimicking Fong 2024) data with dLFC scores.

https://doi.org/10.1371/journal.pcbi.1013510.s010

(CSV)

References

  1. Ishino Y, Shinagawa H, Makino K, Amemura M, Nakata A. Nucleotide sequence of the iap gene, responsible for alkaline phosphatase isozyme conversion in Escherichia coli, and identification of the gene product. J Bacteriol. 1987;169(12):5429–33. pmid:3316184
  2. Gostimskaya I. CRISPR-Cas9: a history of its discovery and ethical considerations of its use in genome editing. Biochemistry (Mosc). 2022;87(8):777–88. pmid:36171658
  3. Shalem O, Sanjana NE, Hartenian E, Shi X, Scott DA, Mikkelson T, et al. Genome-scale CRISPR-Cas9 knockout screening in human cells. Science. 2014;343(6166):84–7. pmid:24336571
  4. Shen JP, Zhao D, Sasik R, Luebeck J, Birmingham A, Bojorquez-Gomez A, et al. Combinatorial CRISPR-Cas9 screens for de novo mapping of genetic interactions. Nat Methods. 2017;14(6):573–6. pmid:28319113
  5. Du D, Roguev A, Gordon DE, Chen M, Chen S-H, Shales M, et al. Genetic interaction mapping in mammalian cells using CRISPR interference. Nat Methods. 2017;14(6):577–80. pmid:28481362
  6. Esmaeili Anvar N, Lin C, Ma X, Wilson LL, Steger R, Sangree AK, et al. Efficient gene knockout and genetic interaction screening using the in4mer CRISPR/Cas12a multiplex knockout platform. Nat Commun. 2024;15(1):3577. pmid:38678031
  7. Hsiung CC-S, Wilson CM, Sambold NA, Dai R, Chen Q, Teyssier N, et al. Engineered CRISPR-Cas12a for higher-order combinatorial chromatin perturbations. Nat Biotechnol. 2025;43(3):369–83. pmid:38760567
  8. Mani R, St Onge RP, Hartman JL 4th, Giaever G, Roth FP. Defining genetic interaction. Proc Natl Acad Sci U S A. 2008;105(9):3461–6. pmid:18305163
  9. Horlbeck MA, Xu A, Wang M, Bennett NK, Park CY, Bogdanoff D, et al. Mapping the Genetic Landscape of Human Cells. Cell. 2018;174(4):953-967.e22. pmid:30033366
  10. Stombaugh J, Licon A, Strezoska Ž, Stahl J, Anderson SB, Banos M, et al. The Power Decoder Simulator for the Evaluation of Pooled shRNA Screen Performance. J Biomol Screen. 2015;20(8):965–75. pmid:25777298
  11. Nagy T, Kampmann M. CRISPulator: a discrete simulation tool for pooled genetic screens. BMC Bioinformatics. 2017;18(1):347. pmid:28732459
  12. de Boer CG, Ray JP, Hacohen N, Regev A. MAUDE: inferring expression changes in sorting-based CRISPR screens. Genome Biol. 2020;21(1):134. pmid:32493396
  13. Shen JP, Munson B, Fong S, Birmingham A, Sasik R, Kreisberg JF. CTG: Compositional and time-course aware genetic analysis. 1.
  14. Zamanighomi M, Jain SS, Ito T, Pal D, Daley TP, Sellers WR. GEMINI: a variational Bayesian approach to identify genetic interactions from combinatorial CRISPR screens. Genome Biol. 2019;20(1):137. pmid:31300006
  15. Moon SB, Kim DY, Ko J-H, Kim JS, Kim YS. Improving CRISPR genome editing by engineering guide RNAs. Trends in Biotechnology. 2019;37:870–81.
  16. Zhao J, Tang Z, Selvaraju M, Johnson KA, Douglas JT, Gao PF, et al. Cellular Target Deconvolution of Small Molecules Using a Selection-Based Genetic Screening Platform. ACS Cent Sci. 2022;8(10):1424–34. pmid:36313155
  17. Roohani Y, Huang K, Leskovec J. Predicting transcriptional outcomes of novel multigene perturbations with GEARS. Nat Biotechnol. 2024;42(6):927–35. pmid:37592036
  18. Collins SR, Roguev A, Krogan NJ. Quantitative Genetic Interaction Mapping Using the E-MAP Approach. Methods in Enzymology. Elsevier; 2010. pp. 205–31.
  19. Dede M, McLaughlin M, Kim E, Hart T. Multiplex enCas12a screens detect functional buffering among paralogs otherwise masked in monogenic Cas9 knockout screens. Genome Biol. 2020;21(1):262. pmid:33059726
  20. Hart T, Moffat J. BAGEL: a computational framework for identifying essential genes from pooled library screens. BMC Bioinformatics. 2016;17:164. pmid:27083490
  21. Kim E, Hart T. Improved analysis of CRISPR fitness screens and reduced off-target effects with the BAGEL2 gene essentiality classifier. Genome Med. 2021;13(1):2. pmid:33407829
  22. Najm FJ, Strand C, Donovan KF, Hegde M, Sanson KR, Vaimberg EW, et al. Orthologous CRISPR-Cas9 enzymes for combinatorial genetic screens. Nat Biotechnol. 2018;36(2):179–89. pmid:29251726
  23. Fong SH, Kuenzi BM, Mattson NM, Lee J, Sanchez K, Bojorquez-Gomez A, et al. A multilineage screen identifies actionable synthetic lethal interactions in human cancers. Nat Genet. 2025;57(1):154–64. pmid:39558023
  24. Zhao D, Badur MG, Luebeck J, Magaña JH, Birmingham A, Sasik R. Combinatorial CRISPR-Cas9 metabolic screens reveal critical redox control points dependent on the KEAP1-NRF2 regulatory axis. Molecular Cell. 2018;69:699-708.e7.