CRISPR-Analytics (CRISPR-A): A platform for precise analytics and simulations for gene editing

doi:10.1371/journal.pcbi.1011137

Fig 1.

CRISPR-A capabilities for analysis and simulation of CRISPR-based experiments.

A) Diagram of the CRISPR-A analysis pipeline. The analysis algorithm is composed of three mandatory steps: read pre-processing for quality assessment, read alignment against the reference amplicon, and edit calling. The processes inside dashed squares are optional: UMI clustering, reference discovery, size bias correction, and noise subtraction based on an empirical model from negative control samples. B) Diagram of the CRISPR-A simulation pipeline. Simulation is based on fitting multiple parameters that describe the distribution of edits, their characteristics, and their proportions. Once the number of edited sequences is determined from the predicted protospacer efficiency, other probability distributions are applied to decide the number of each kind of edit (NHEJ deletions, MMEJ deletions, insertions, and substitutions). C) Heatmaps visualizing the hierarchical clustering of real samples and their simulations from the validation data set. The clustering distance is the Jensen-Shannon (JS) divergence between the two subsets. Data values are mapped to the color scale depicted on the right: dark blue corresponds to a distance of 0 (identical samples), while red corresponds to the greatest distance. The plot on the left shows the distances between the variant position and the cut site, while the plot on the right shows the variant sizes. For these plots, we randomly selected 5 pairs of real samples from the T cell validation data set (k = 100): SRR7737126, SRR7737722, SRR7737698, SRR7736744, and SRR7736723. Samples are labeled with the last two digits of the sample name, followed by "real" for the real sample and "sim" for its simulation. For instance, SRR7737722 and SRR7737698, which cluster together, are the real sample and its simulated sample for two replicates.
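
The distance used for the clustering in panel C can be illustrated with a minimal sketch, assuming that JS divergence refers to the Jensen-Shannon divergence computed in base 2 over discrete distributions of variant sizes. The function names and the example indel sizes are hypothetical and are not taken from CRISPR-A.

```python
import math
from collections import Counter


def to_distribution(values, support):
    """Convert observed values into probabilities over a shared support."""
    counts = Counter(values)
    total = sum(counts[v] for v in support)
    return [counts[v] / total for v in support]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    # Mixture distribution; it is non-zero wherever p or q is non-zero.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence, skipping zero-probability terms.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical indel sizes (negative = deletion, positive = insertion).
real_sizes = [-1, -1, -3, 1, -7, -1, 1]
sim_sizes = [-1, -3, -3, 1, -7, -1, 1]

support = sorted(set(real_sizes) | set(sim_sizes))
p = to_distribution(real_sizes, support)
q = to_distribution(sim_sizes, support)
distance = js_divergence(p, q)
print(f"JS divergence between real and simulated sizes: {distance:.4f}")
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (no overlap), which matches the reading of the color scale where 0 means identical samples.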

Fig 2.

Indels characterization algorithm development and benchmark with simulated data and real data.

A) Accuracy of indel detection after alignment with different alignment tools. We used 139 samples simulated with SimGE from different reference sequences and cut site locations. The accuracy, defined as the number of correctly classified reads divided by all classified reads, was calculated after characterizing the indels and wt sequences of each sample aligned with 6 different methods. The plot shows the mean and standard deviation of the accuracy across the characterized samples for each alignment method. B) Optimization of the alignment penalty matrix. The best parameters were determined through a Monte Carlo approach to obtain alignments that are optimal for CRISPR-based indel characterization. The PCA shows how the four parameters of the alignment penalty matrix should be combined to achieve higher accuracy. C) Benchmarking of indel characterization across 6 different tools. CRISPR-A is compared with 5 other tools using a simulated data set of 139 samples. All samples contain the same percentage of indels (red dashed line), and the violin plot shows the dispersion of the editing percentages reported by each tool. D) Reported editing of edited T cells. 1656 unique edited genomic locations within 559 genes are characterized with 6 different tools. The heatmap shows the percentage of editing reported by each tool for each sample. E) Error characterization from the most discrepant values in T cell edited samples. CRISPR-A results are compared with those of the tools whose results differ most (example on the left; explored samples are circled in red). Errors are classified by their source.
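
The accuracy used in panel A (correctly classified reads over all classified reads) can be written as a short sketch. The read labels, the convention that an unclassified read is marked `None`, and the use of the sample standard deviation are assumptions made for illustration, not details taken from CRISPR-A or SimGE.

```python
import statistics


def classification_accuracy(true_labels, called_labels):
    """Fraction of classified reads whose call matches the simulated truth.

    Reads with a call of None are treated as unclassified and excluded
    from the denominator (an assumption of this sketch).
    """
    classified = [(t, c) for t, c in zip(true_labels, called_labels) if c is not None]
    if not classified:
        raise ValueError("no classified reads")
    correct = sum(1 for t, c in classified if t == c)
    return correct / len(classified)


# Hypothetical labels for one simulated sample.
truth = ["wt", "del3", "ins1", "del3", "wt"]
calls = ["wt", "del3", "del1", None, "wt"]
sample_accuracy = classification_accuracy(truth, calls)

# Hypothetical per-sample accuracies for one alignment method.
accuracies = [sample_accuracy, 0.90, 0.96]
mean_accuracy = statistics.mean(accuracies)
sd_accuracy = statistics.stdev(accuracies)
print(f"accuracy = {mean_accuracy:.3f} +/- {sd_accuracy:.3f}")
```

Repeating the last step for each alignment method would give the per-method mean and standard deviation that the plot reports.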

Fig 3.

Template-based edits and substitutions characterization benchmark with simulated data and real data.

A) Benchmarking of template-based editing characterization across 3 different tools. CRISPR-A is compared with 2 other tools using a simulated data set of 136 samples. All samples contain the same percentage of template-based reads (red dashed line). The plot shows the mean and standard deviation of the template-based percentage among all characterized reads in each sample, as reported by each tool: CRISPR-A, CRISPR-GA, and CRISPResso2. Differences between the three samples are significant by ANOVA (p-value <0.0001), and all pairwise comparisons between tools are also significant by TukeyHSD (adjusted p-value <0.05). B) Characterization of different modifications obtained by HDR. CRISPR-A and CRISPResso2 are used to characterize the results of 27 experiments with different HDR-mediated modifications: substitutions of one nucleotide by another, substitutions of 4 nucleotides by 4 other nucleotides, and insertions of 3 nucleotides. The modifications are found in 3 different targets. C) Characterization of substitutions made by BE. In this case, instead of the overall percentage of editing, we examined the per-position substitution information given by two different tools: CRISPR-A and CRISPResso2. There are 4 different targets and 3 replicates for each target. D) Characterization of PE substitutions. We compared a total of 29 samples of PE editing of the FANCF gene using templates with different PBS lengths. With CRISPResso2, the analysis of samples with a PBS shorter than 13 base pairs ends in an error, and no results are obtained. E) Extended analysis of PE substitutions. 23 different samples edited in 5 different targets (HEK3, EMX1, FANCF, RNF2, and HEK4) are analyzed with CRISPR-A and CRISPResso2. The plot on the left compares the PE percentage reported by each of the two tools. HEK3 edited samples (in orange) have a higher PE percentage reported by CRISPR-A than by CRISPResso2. The plot on the right shows the count of reads classified in each class. CRISPResso2 reports a high number of PE with modifications or reference with modifications due to a SNP (rs1572905; G>A).

Fig 4.

HCT116, HEK-293, and K562 edited in 96 different targets analyzed by CRISPR-A.

A) Ratio of MMEJ deletions among all deletions. We calculated the Euclidean distance between samples for the ratio of MMEJ deletions among all characterized deletions. The heatmap represents the hierarchical clustering of the mean of the Euclidean distances calculated for each of the 96 different targets. The three replicates of each cell line cluster together, showing that the MMEJ ratio is a feature that depends on the cell line. B) Differential expression analysis of two genes associated with MMEJ in the two cell lines that differ most in the ratio of MMEJ to NHEJ indel patterns in gene editing outcomes. Differences are significant (p-value < 0.05). C) Insertions against total edits. The proportion of insertions among all reported edits in three different cell lines. D) Percentage of editing. Indels among all characterized reads in 96 different targets, with 3 replicates for each target, in three different cell lines. E) Insertions considering the nucleotide upstream of the cut site. Percentages of inserted nucleotides as a function of the free nucleotide in three different cell lines. F) Variant diversity by sample. On the left is the percentage of the most abundant variant among all kinds of edits. On the right are the distributions of the variants that are above 50% (dashed line in the left plot), which are the variants with the highest abundance.
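
One possible reading of the distance in panel A is sketched here: the MMEJ ratio is computed per target for each sample, the per-target Euclidean distance between two samples is taken (in one dimension this is the absolute difference), and the distances are averaged over the targets. The function names, the target identifiers, and the counts are hypothetical, and the exact procedure used by the authors may differ.

```python
def mmej_ratio(mmej_deletions, total_deletions):
    """Ratio of MMEJ deletions among all characterized deletions."""
    if total_deletions == 0:
        return None  # ratio undefined when no deletions were characterized
    return mmej_deletions / total_deletions


def mean_target_distance(ratios_a, ratios_b):
    """Mean over shared targets of the per-target Euclidean distance.

    For a single ratio per target, the Euclidean distance reduces to the
    absolute difference between the two ratios.
    """
    shared = [
        t for t in ratios_a
        if ratios_a[t] is not None and ratios_b.get(t) is not None
    ]
    if not shared:
        raise ValueError("no shared targets with defined ratios")
    return sum(abs(ratios_a[t] - ratios_b[t]) for t in shared) / len(shared)


# Hypothetical (MMEJ deletions, total deletions) counts per target.
counts_a = {"target_1": (30, 100), "target_2": (10, 40)}
counts_b = {"target_1": (50, 100), "target_2": (20, 40)}

ratios_a = {t: mmej_ratio(*c) for t, c in counts_a.items()}
ratios_b = {t: mmej_ratio(*c) for t, c in counts_b.items()}
d = mean_target_distance(ratios_a, ratios_b)
print(f"mean per-target distance: {d:.3f}")
```

A matrix of such pairwise distances between all samples is the kind of input that a hierarchical clustering routine would take to produce a heatmap like the one described.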

Fig 5.

Enhanced precision with spikes, UMIs, and mock characterization.

A) Spike-in counts in the Illumina experiment. Count of spike-in synthetic sequences with different deletion sizes. The same number of molecules of each spike-in was added to the edited samples. On the left, linear regression of the mean percentage of each spike-in sequence among all spike-in sequences at 30 cycles of amplification and a low number of molecules. On the right, count of spike-ins in the original sample and after correction by spike-ins. B) Size bias correction using the spike-in model. On the left, the distribution of deletions by position in an edited sample after correction by spike-ins (blue) against the original distribution (gray). At the bottom right, the count difference between the original and corrected distributions as a function of deletion size, for the deletion distribution shown on the left. C) Noise reduction by UMI cluster filtering. Standardized distribution of deletions (left) and insertions (right) in the Lama2 target without taking UMIs into account (gray) and after clustering by UMIs with a minimal identity of 0.95 and filtering by UMI bin size (UBS) >50 and <130 (blue). The red dashed line corresponds to the cut site position. D) Mock-based noise correction. Samples with a lower percentage of editing tend to have a higher correction, since the noise represents a higher proportion of the indel reads. The two plots comparing treated and mock files show that this subtraction is always specific, regardless of the editing percentage. E) Difference between the manually curated editing percentage of 30 samples (10 different targets and 3 replicates for each) and the editing percentage reported by 8 tools and the 4 options of CRISPR-A.
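
The UMI bin size filter described in panel C (UBS >50 and <130) can be sketched as follows, assuming that reads have already been assigned to UMI clusters by a separate clustering step at 0.95 identity. The function name and the cluster identifiers are hypothetical; only the two thresholds come from the caption.

```python
from collections import Counter


def filter_umi_clusters(read_cluster_ids, min_size=50, max_size=130):
    """Keep UMI clusters whose bin size is strictly between the thresholds.

    read_cluster_ids holds one cluster identifier per read; the clustering
    itself is assumed to have been done beforehand.
    """
    sizes = Counter(read_cluster_ids)
    return {cid: n for cid, n in sizes.items() if min_size < n < max_size}


# Hypothetical cluster assignments: four UMI clusters of different sizes.
reads = ["umi_1"] * 60 + ["umi_2"] * 10 + ["umi_3"] * 200 + ["umi_4"] * 129
kept = filter_umi_clusters(reads)
print(kept)
```

In this example the small cluster (10 reads) and the very large one (200 reads) are discarded, while the clusters of 60 and 129 reads are kept.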
