AlleleProfileR: A versatile tool to identify and profile sequence variants in edited genomes

Gene editing strategies, such as zinc-finger nucleases (ZFNs), transcription activator-like effector nucleases (TALENs), and clustered regularly interspaced short palindromic repeat/Cas9 (CRISPR/Cas9), are revolutionizing biology. However, quantitative and sensitive detection of targeted mutations are required to evaluate and quantify the genome editing outcomes. Here we present AlleleProfileR, a new analysis tool, written in a combination of R and C++, with the ability to batch process the sequence analysis of large and complex genome editing experiments, including the recently developed base editing technologies.


Example data
To illustrate the use and potential application of AlleleProfileR, we will analyse a few example datasets. These synthetic samples were generated by inserting mutations into the genomic sequences of four genes relevant to cardiomyocytes (NKX2.5, TBX5, PLN, and CAMK2D, Table A) and simulating reads using wgsim. AlleleProfileR provides a wrapper function linking to wgsim to simulate paired-end reads: AllelePro-fileR.simulate(). This function simulates paired-end reads from fasta files. If an array of filenames is given as filenames parameter, then the reads originating from all files in the array will be pooled.

Configuration File structure
AlleleProfileR requires a particular file structure for operation. In the working directory, a folder 'files' with 'config', 'index', 'input', 'output' as subfolders is required. The config folder must contain a csv file with details on the genes of interest (Table B). Reads are selected from the .bam-file for analysis if they span the region of interest as defined by the 'Start' and 'Stop' locations in the genes table. To assess whether the coding frame is altered, information on the theoretic start and stop codons of the exon need to be reported. 'StartType' or 'StopType' equals to N for normal codons: ATG, and TAG, TGA, and TAA respectively. 'PCRStart' and 'PCRStop' are the locations of the PCR primers used for amplification. 'CutSites' denotes the predicted cut positions of the editing strategy. If there are multiple cut sites, they should be separated by ';'.
The index folder must contain the reference genome in fasta format. It is not necessary to have the entire reference genome available. It is sufficient to provide the relevant chromosomes or the relevant regions within those chromosomes. The input folder must contain subfolders per sample comprising the fastq files or the bam files. The AlleleProfileR.read.folders() command scans the files/input folder for subfolders and .fastq.gz (type = "fastq") or .bam (type = "bam") files. The output folder will contain all the generated output, such as data tables and log files, but preprocessed fastq files and generated bam files will be saved into the sample input folders.

Configuration object
The analysis functions obtain their settings from a configuration list object, set using the AlleleProfileR.setup() command. By default, the sequence analysis scope of AlleProfileR is limited to the Start and Stop positions as defined in the gene configuration. The scope can be further limited by setting the cut.range value. This will restrict AlleleProfileR's analysis to this range around the cut site(s). Morover, a cutoff value (percent value, 0 to 1) can also be set for calling variants, such that infrequent variants or sequencing noise will be ignored. Variants occuring only once can be ignored by setting the ignore.single parameter to TRUE, SNPs will be ignored by setting ignore.snp to TRUE, and chimeric reads will be ignored by setting ignore.chimeric to TRUE. Finally, a list of alternate objects can be supplied to the alternate parameter to determine proportion homology-directed repair (HDR) in gene editing experiments.

Basic analysis example
The basic analysis operation comprises following steps: (1) pre-process the reads, (2) align reads to reference, (3) determine allelic variants, and (4) plot the results and generate summary statistics.
Step (1) and (2) can be completed using external software, or by using AlleleProfileR.

Determine allelic variants
The AlleleProfileR.batch() function executes the variant determination algorithm on all data as set by the configuration or its parameters. The computation speed can be enhanced by increasing the cores value, which will split the task across multiple CPUs. The subset parameter allows the user to alter the queued files for analysis: all samples will be analysed if subset is set to NULL. Alternatively, to analyse a subset of samples for a selection of genes, subset should be set to a list containing a vector with the indices of the samples as first element, and a vector containing the indices of the genes of interest as second element.

Summaries and plots
The variant calling algorithm writes all output to .csv-files which can be imported into R or any other software for further processing or plotting. AlleleProfileR contains several tools to visualize the results, as discussed in the sections below.
The AlleleProfileR.batch.summary() command plots an overview of the percentage WT sequence per embryo and per gene (default: param = "wt"), and lists the number of alleles detected ( Figure A). Allele distributions, characterisation of the alleles and alignments for individual samples and genes can be plotted using AlleleProfileR.plot() command ( Figure B). Allele characterization is reported using a boolean tile plot: blue indicates TRUE, white indicates FALSE, and gray indicates NA (Table C). Alignments can be plotted using AlleleProfileR.plot.alignment(), whereas AlleleProfileR.sample.distribution() and AllelePro-fileR.sample.distribution.boolean() will plot only the distributions or characterization, respectively. Finally, the read depth for a certain characterization can be plotted using AlleleProfileR.plot.readdepth() ( Figure C).
# general overview AlleleProfileR.batch.recompute(crispr_config, cutoff = 0.005, ignore.single = T, top = 0) AlleleProfileR.batch.summary(crispr_config, table=F, plot=T, subset = list(c(3:10),c(1:4))) # plot alignments AlleleProfileR.plot(crispr_config, 3, 1)    Sequence alignment and characterization plot generated by AlleleProfileR.plot(). This sample has an in-frame deletion of 9 bps downstream of the start codon. Hence, expression of this variant allele will result in a truncated protein. As a result, AlleleProfileR determined that the allele is coding (CODING true) and has a small indel (SM true). STOP was set to NA as the stop codon falls outside of the region of interest and hence no determination was made.

Notes and applications
Gene editing technologies, such as clustered regularly interspaced short palindromic repeat/Cas9 (CRISPR/Cas9), offer several applications such as introducing random indels by non-homologous end-joining (NHEJ), or altering the genomic code using homology-directed repair (HDR) with a template. In addition, the genomic sequence can also be editing in a cut-independent fashion by base-editing enzymes 7 .

NHEJ and chimeric reads
Several other tools have been developed to analyze NHEJ data, including CrispRVariants 8 . Here we compared the performance of CrisprVariants and AllelProfiler by analyzing the Synthetic data set 1 supplied by the CrispRVariants package. AlleleProfileR detects the same on-target alleles as CrispRVariants ( Figure F), but additionally detects a chimeric variant which was excluded in CrispRVariants ( Figure G).

Gene correction and alternative references
To quantify the degree of gene correction by HDR, the HDR template or alternative reference needs to be provided: # generate configuration information for an alternate reference alternateinfo <-AlleleProfileR.alternatereference(crispr_config, alternate = "files/index/alternate.fastq") By adding the parameter 'alternate' to the plotting functions, the HDR allele(s) will be marked ( Figure H).

Base editing
Base editing technologies, where a targeted DNA base is irreversibly converted into another without DNA cleavage or HDR template, have recently been reported. To illustrate AlleleProfileR's capabilities, we reanalyzed some of the data from the Gaudelli et al. study 9 describing adenine base editors (ABEs) that mediate the conversion of A•T to G•C in genomic DNA. Following code reproduces the display figure in the AlleleProfileR paper. Briefly, we will retrieve the NGS data for site 6 deposited by the authors in the NCBI Sequence Read Archive 10 database (Table D), preprocess the sequencing reads, reanalyze the reads using AlleleProfileR, generate summary statistics and plot the results ( Figure M).