A statistical simulation model to guide the choices of analytical methods in arrayed CRISPR screen experiments

doi:10.1371/journal.pone.0307445

Fig 1.

Overall simulation workflow.

Experiment: (A) Plate-based high-throughput arrayed CRISPR screen in which each genetic perturbation is applied in a separate well. (B) High-content imaging in which the endpoint readouts combine multiple phenotypic and molecular biomarkers, measured at the single-cell level, that are relevant to the mechanism of action under study. (C) Downstream data analysis of the endpoint readouts, including spatial normalization and hit calling. Simulation: the process consists of two main steps. (D) Single-cell-level simulation incorporating genetic knockout effects, biological and technical variation, and spatial bias. (E) Well-level summarization of the simulated endpoint values of the cells in each well (e.g., mean, median, or sum). (F) The simulated endpoint values have the same format as the output of an arrayed CRISPR screen and can therefore be used as input to the downstream data analysis pipeline.
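
Steps (D) and (E) can be illustrated with a minimal sketch. It assumes a simple additive model with Gaussian noise, which the caption does not specify; the function names (`simulate_well`, `summarize_well`) and parameter names are hypothetical and are not taken from the paper.

```python
import random
import statistics


def simulate_well(n_cells, basal, gene_effect, cell_sd, measurement_sd,
                  spatial_bias, rng):
    """Simulate single-cell endpoint values for one well.

    Assumption (not from the paper): effects are additive and both
    between-cell variation and measurement error are Gaussian.
    """
    cells = []
    for _ in range(n_cells):
        # Biological value: basal level shifted by the knockout effect,
        # with between-cell variation.
        biological = rng.gauss(basal + gene_effect, cell_sd)
        # Technical measurement error for this cell.
        technical = rng.gauss(0.0, measurement_sd)
        # Spatial bias is applied equally to every cell in the well.
        cells.append(biological + technical + spatial_bias)
    return cells


def summarize_well(cells, method="median"):
    """Collapse single-cell values into one well-level endpoint value."""
    if method == "mean":
        return statistics.fmean(cells)
    if method == "median":
        return statistics.median(cells)
    if method == "sum":
        return sum(cells)
    raise ValueError(f"unknown summary method: {method}")


rng = random.Random(1)
cells = simulate_well(n_cells=200, basal=10.0, gene_effect=-2.0,
                      cell_sd=1.0, measurement_sd=0.5,
                      spatial_bias=0.0, rng=rng)
well_value = summarize_well(cells, method="median")
```

The well-level value produced this way has the same shape as a real per-well readout, which is the property panel (F) relies on.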


Fig 2.

Benchmark workflow.

The simulation model requires the following inputs: plate layout, number of true hits, average number of cells per well, spatial bias, basal expression level, average gene-editing effect, between-gene variation, between-cell variation, and measurement error. The simulated data set is then used as input for various computational workflows, each consisting of a normalization step and a hit-calling step. Example 1 shows how the simulation model helps to choose a spatial normalization method. Example 2 shows how the simulation model helps to choose a data analysis workflow.
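
The two-step structure of a workflow (normalization, then hit calling) can be sketched as follows. The specific methods used here, plate median centering and a robust z-score cutoff, are illustrative assumptions and are not claimed to be among the workflows benchmarked in the paper; all function names are hypothetical.

```python
import statistics


def median_center(plate):
    """Normalization step: subtract the plate median from every well."""
    plate_median = statistics.median(plate.values())
    return {well: value - plate_median for well, value in plate.items()}


def robust_z_hits(plate, cutoff):
    """Hit-calling step: flag wells whose |robust z-score| exceeds cutoff."""
    values = list(plate.values())
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    if scale == 0:
        return set()  # no spread, so nothing can be called
    return {well for well, v in plate.items()
            if abs(v - center) / scale > cutoff}


def run_workflow(plate, normalize, call_hits, cutoff):
    """A workflow is a normalization step followed by a hit-calling step."""
    return call_hits(normalize(plate), cutoff)


# Toy plate: well A5 carries a strong knockout effect.
plate = {"A1": 10.0, "A2": 11.0, "A3": 9.0, "A4": 10.0, "A5": 2.0}
hits = run_workflow(plate, median_center, robust_z_hits, cutoff=3.0)
```

Because each step is passed in as a function, alternative normalization or hit-calling methods can be swapped in and compared on the same simulated data.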


Fig 3.

Comparison of simulated and real arrayed CRISPR screening data.

(A) Distribution of the real data. The red dashed line corresponds to the cutoff for hit calling. (B) The x-axis shows the values of σ_tech and the y-axis shows the absolute difference between the mean values of the real data and of the data simulated with the given σ_tech. The red dashed line marks the optimum value of σ_tech. (C) The x-axis shows the values of σ_α and the y-axis shows the absolute difference between the kurtosis values of the real data and of the data simulated with the given σ_α. The red dashed line marks the optimum value of σ_α. (D) Distributions of the simulated (grey) and real (red) data sets. For the simulation, the input parameters estimated in panels (A-C) were used to generate 20 simulated data sets. (E) Histogram of the mean values from 100 simulated data sets, with the red line indicating the mean value of the real data. (F) Histogram of the kurtosis values from 100 simulated data sets, with the red line indicating the kurtosis value of the real data.
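
The parameter search in panels (B) and (C) amounts to choosing the candidate value whose simulated data best match a summary statistic of the real data. The sketch below assumes a plain grid search and Pearson (non-excess) kurtosis; the paper may use a different kurtosis definition or search procedure. `best_parameter` and the `simulate` callback are hypothetical names.

```python
import statistics


def kurtosis(values):
    """Pearson kurtosis: fourth central moment over squared variance.

    Assumption: non-excess definition (a normal distribution gives 3).
    """
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(v - mean) ** 2 for v in values])
    m4 = statistics.fmean([(v - mean) ** 4 for v in values])
    return m4 / (m2 ** 2)


def best_parameter(candidates, simulate, real_values, statistic):
    """Return the candidate minimizing |statistic(simulated) - statistic(real)|.

    `simulate` is any function mapping a parameter value to a list of
    simulated well-level values.
    """
    target = statistic(real_values)
    return min(candidates,
               key=lambda p: abs(statistic(simulate(p)) - target))
```

With this helper, panel (B) would correspond to calling `best_parameter` with `statistics.fmean` as the statistic, and panel (C) to calling it with `kurtosis`.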


Fig 4.

Assessment of spatial normalization methods.

(A) Visualization of simulated data sets using heatmaps (first column) and jitter plots (second column). The first row shows the simulated data set without spatial bias. The second row shows the simulated data set with spatial bias. Rows three and four show the same bias-affected data set after correction with LOESS and B-score, respectively. (B) The pre-defined spatial bias added for the simulation, in which 3 columns were manually selected. (C) Distribution of SSMD values between the simulated values of hits and non-hits. Each dot corresponds to a single simulation, and the simulation was repeated 100 times.
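
The SSMD (strictly standardized mean difference) reported in panel (C) measures how well hits are separated from non-hits. A minimal sketch, assuming the standard formula for two independent groups with sample variances; the function name `ssmd` is illustrative.

```python
import math
import statistics


def ssmd(hits, non_hits):
    """SSMD for independent groups:

    (mean_hits - mean_non_hits) / sqrt(var_hits + var_non_hits)
    """
    mean_diff = statistics.fmean(hits) - statistics.fmean(non_hits)
    pooled = statistics.variance(hits) + statistics.variance(non_hits)
    return mean_diff / math.sqrt(pooled)


# Toy example: knockouts lower the signal relative to non-hits.
separation = ssmd(hits=[1.0, 3.0], non_hits=[5.0, 7.0])
```

A larger absolute SSMD indicates clearer separation, so a normalization method that removes spatial bias without distorting the knockout effect should move the SSMD back toward its value in the bias-free data.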


Fig 5.

Assessment of hit calling pipelines.

(A) A single batch of simulated arrayed CRISPR screening data with 4 plates. Plates 3 and 4 were chosen to be affected by plate effects and spatial bias. The plate effect was generated by increasing the overall signal by a pre-defined amount. The spatial bias was generated with the same approach as in Fig 4. Each dot represents the endpoint value of one well, and blue dots correspond to wells with spatial bias. (B-D) Distributions of the true positive rate (TPR), positive predictive value (PPV), and F1 score for 6 workflows. Each dot corresponds to the TPR, PPV, or F1 value for one simulated screen data set of 4 plates, and the simulation was repeated 100 times.
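
Because the true hits are known in a simulation, the three metrics in panels (B-D) can be computed directly from the set of true hits and the set of called hits. The sketch below uses the standard definitions; the function name and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
def hit_calling_metrics(true_hits, called_hits):
    """Return (TPR, PPV, F1) for one simulated screen.

    TPR = TP / number of true hits
    PPV = TP / number of called hits
    F1  = harmonic mean of TPR and PPV
    """
    true_hits = set(true_hits)
    called_hits = set(called_hits)
    tp = len(true_hits & called_hits)
    # Assumed convention: a metric with an empty denominator is 0.0.
    tpr = tp / len(true_hits) if true_hits else 0.0
    ppv = tp / len(called_hits) if called_hits else 0.0
    f1 = 2 * tpr * ppv / (tpr + ppv) if (tpr + ppv) > 0 else 0.0
    return tpr, ppv, f1


# Toy example: 4 true hits, 3 calls, 2 of them correct.
tpr, ppv, f1 = hit_calling_metrics(
    true_hits={"g1", "g2", "g3", "g4"},
    called_hits={"g1", "g2", "g5"},
)
```

Repeating this calculation over many simulated screens gives one dot per simulation and per workflow, as plotted in the figure.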


Table 1.

Performance of 6 example workflows.
