Genome-wide identification and analysis of highly specific CRISPR/Cas9 editing sites in pepper (Capsicum annuum L.)

The CRISPR/Cas9 system is a genome editing tool that combines simplicity with high efficiency. Genome-wide identification and specificity analysis of editing sites is an effective approach for mitigating the risk of off-target effects of CRISPR/Cas9; it has been applied in several plant species but has not yet been reported in pepper. In the present study, we first identified genome-wide CRISPR/Cas9 editing sites based on the 'Zunla-1' reference genome and then evaluated the specificity of these sites through whole-genome alignment. A total of 603,202,314 CRISPR/Cas9 editing sites, including 229,909,837 (~38.11%) NGG-PAM sites and 373,292,477 (~61.89%) NAG-PAM sites, were detected in the pepper genome, and their composition and distribution were systematically characterized. Furthermore, 29,623,855 highly specific NGG-PAM sites were identified through whole-genome alignment analysis. Of these, 26,699,387 (~90.13%) were located in intergenic regions, which was 9.13 times the number in genic regions, but the average density in genic regions was higher than that in intergenic regions. More importantly, 34,251 (~96.93%) of the 35,336 annotated genes exhibited at least one highly specific NGG-PAM site in their exons, and 90.50% of the annotated genes exhibited at least four highly specific NGG-PAM sites, indicating that the set of highly specific CRISPR/Cas9 editing sites identified in this study is widely applicable and conducive to minimizing the off-target effects of CRISPR/Cas9 in pepper.


Introduction
Mutants are of great significance for both gene function analysis and crop genetic improvement, and their allelic variation mainly results from naturally occurring or artificially induced mutation. Compared with natural variation, the most prominent advantage of artificially induced mutation is its high mutation frequency. The main methods currently used to induce mutations artificially include physical mutagenesis, chemical mutagenesis, random transposon insertion, and targeted gene editing technologies. Among these approaches, targeted gene editing introduces nucleotide variation at a chosen site, so that the desired mutations are obtained accurately and efficiently; it thereby speeds up the functional identification of target genes and genetic improvement through breeding, and is an ideal method for artificially inducing mutations [1]. A variety of targeted gene editing techniques, including zinc-finger nucleases (ZFNs), transcription activator-like effector nucleases (TALENs) and the CRISPR/Cas system, have been developed to date [2]. The CRISPR/Cas system, which has entered the mainstream in recent years and has been widely used in humans [3], animals [4], microorganisms [5] and plants [6], is simpler and more efficient than the other two techniques. According to the number and functional characteristics of the Cas genes, CRISPR/Cas systems can be divided into two categories comprising six types (I to VI) [7][8][9]. The first category, which includes types I, III and IV, requires multiple Cas proteins to act together to interfere with the target gene, whereas the second category requires only one Cas protein. The type II CRISPR/Cas system, namely the CRISPR/Cas9 system, belongs to the second category and is now the most widely used gene editing system.
The CRISPR/Cas9 gene editing system is mainly composed of one Cas9 protein and one single guide RNA (sgRNA). The Cas9 protein from Streptococcus pyogenes (SpCas9) was the first to be used in the CRISPR/Cas9 system [10]. SpCas9 recognizes the protospacer adjacent motif (PAM) sequence 5'-NGG-3' (where "N" can be any nucleotide) in the target DNA and then cleaves the target DNA 3 nt upstream of the PAM site, generating blunt ends. Gene editing is finally achieved through nucleotide insertion, deletion or substitution at the cleavage site, mediated by the DNA repair machinery of the recipient cell, including the nonhomologous end joining (NHEJ) and homology-directed repair (HDR) pathways [11]. The sgRNA of the CRISPR/Cas9 system, artificially designed based on crRNA (CRISPR RNA) and the core sequence of trans-acting crRNA (tracrRNA), is a short single-stranded RNA that guides the Cas9/sgRNA complex to cleave 3 nt upstream of the PAM site through complementary base pairing between the 5' end (~20 nt) of the sgRNA and the protospacer sequence of the target DNA; this pairing determines the specificity of gene editing [12].
Previous studies have found that even if the sgRNA imperfectly matches the protospacer, the Cas9 protein can still cleave 3 nt upstream of the PAM site, making gene editing possible in nontarget regions; thus, off-target effects can occur [13][14][15]. To reduce or eliminate the risk of off-target effects, the identification of candidate editing sites with high specificity is a prerequisite for the application of the CRISPR/Cas9 system. To date, a variety of tools based on whole-genome sequence similarity analysis have been developed for target site design and off-target risk assessment, such as CrisprGE [16], Cas-OFFinder [17], Cas-Designer [18], CRISPRdirect [19] and CRISPOR [20]. However, most of those tools have been applied mainly in humans and animals. Based on whole-genome reference sequences, the distribution and specificity of genome-wide CRISPR/Cas9 editing sites in Arabidopsis thaliana, Medicago truncatula, soybean (Glycine max), tomato (Solanum lycopersicum), Brachypodium distachyon, rice (Oryza sativa), Sorghum bicolor, maize (Zea mays) and grape (Vitis vinifera) have been systematically analysed and compared [12,21], providing an important reference for choosing highly specific editing sites in related species.
Pepper (Capsicum spp.) belongs to the family Solanaceae and has a cosmopolitan distribution and considerable economic importance [22]. The reference genome sequences of pepper were first released in 2014 [23,24], marking the transition of pepper research from structural genomics to functional genomics. The identification and functional analysis of important genes controlling agronomic traits have become a significant direction in molecular genetics research in pepper. With the development and continuous improvement of technologies for in vitro regeneration and genetic transformation of pepper [25,26], the CRISPR/Cas9 gene editing system will become a powerful tool and will be widely used for the functional analysis of pepper genes. In this study, we first identified CRISPR/Cas9 editing sites at the genome-wide level in pepper and then evaluated their specificity through whole-genome sequence alignment. The purpose of this study was to provide a reference for the selection of highly specific CRISPR/Cas9 editing sites and to facilitate the application of CRISPR/Cas9-mediated gene editing in pepper.

Genomic data and CRISPR/Cas9 editing site identification
The 'Zunla-1' (v2.0) pepper reference genome sequence and related genome annotations [23] were used for CRISPR/Cas9 editing site identification. Two PAM sequences recognized by the CRISPR/Cas9 system, 5'-NGG-3' and 5'-NAG-3', were identified by using EMBOSS software [27] on both the forward and reverse strands of the Zunla-1 reference genome sequence. The 20-nt sequences immediately upstream of all 5'-NGG-3' and 5'-NAG-3' sites were extracted to form two protospacer sets, referred to as the GG_spacer set and the AG_spacer set, respectively.
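The extraction step described above can be illustrated with a minimal Python sketch. It is not the EMBOSS-based procedure used in the study; the function names, the tuple layout and the decision to skip protospacers containing ambiguous bases are assumptions made only for illustration.

```python
import re

# Overlapping search: the lookahead lets adjacent PAMs (e.g. within "GGG")
# all be reported. [ACGT][AG]G covers both NGG and NAG.
PAM_PATTERN = re.compile(r"(?=([ACGT][AG]G))")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_spacers(seq, spacer_len=20):
    """Collect protospacers upstream of NGG and NAG PAMs on both strands.

    Returns (gg_spacers, ag_spacers); each entry is a tuple
    (strand, pam_start, spacer, pam). For the "-" strand, pam_start is a
    coordinate on the reverse-complemented sequence (illustrative choice).
    """
    gg_spacers, ag_spacers = [], []
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for match in PAM_PATTERN.finditer(s):
            pam_start = match.start()
            if pam_start < spacer_len:
                continue  # not enough sequence upstream for a full protospacer
            spacer = s[pam_start - spacer_len:pam_start]
            if set(spacer) - set("ACGT"):
                continue  # assumption: skip protospacers with N or other codes
            pam = match.group(1)
            target = gg_spacers if pam[1] == "G" else ag_spacers
            target.append((strand, pam_start, spacer, pam))
    return gg_spacers, ag_spacers
```

For a whole genome, the same function would be applied to each chromosome sequence in turn; a dedicated pattern-matching tool is likely to be far faster than this pure-Python loop on a genome of this size.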

Identification of highly specific CRISPR/Cas9 editing sites
The specificity of CRISPR/Cas9 editing sites in pepper was evaluated based on a previously reported method [12]. Class 0.0 and Class 1.0 spacers were expected to provide high specificity in CRISPR/Cas9 gene editing [12] and were thus classified as highly specific sites in this study. Since the sgRNA/Cas9 complex showed much less affinity and tolerance toward mismatches at the NAG-PAM site [5], we only assessed the specificity of the GG_spacers, and the AG_spacer set was used to evaluate the possibility of off-target effects at NAG-PAM sites. The method is outlined as follows:
1. The hard-masking function of USEARCH [28] was used to mask and remove GG_spacers containing low-complexity sequences.
2. GG_spacers with identical sequences in the 6~20-nt region were removed.
3. GASSST [29] and UBLAST [28] were used to generate pairwise alignments for the remaining GG_spacers. According to the mismatch position and the minimum number of mismatches (minMM_GG, including InDels and SNPs) between each GG_spacer and the other GG_spacers, the GG_spacers were graded into three classes: Class 0 spacers shared no significant matching sequence with other GG_spacers; Class 1 spacers showed no fewer than four mismatches (minMM_GG ≥ 4) or three mismatches adjacent to PAM sites; Class 2 included the other GG_spacers.
4. For Class 0 and Class 1 GG_spacers, pairwise alignments were performed with AG_spacers, and the GG_spacers were further graded into four classes according to the mismatch position and the minimum number of mismatches (minMM_AG, including InDels and SNPs) between each GG_spacer and the AG_spacers: Class 0.0 spacers exhibited no fewer than three mismatches with AG_spacers (minMM_AG ≥ 3) or shared no significant matching sequence with AG_spacers; Class 0.1 spacers exhibited fewer than three mismatches with AG_spacers; Class 1.0 spacers exhibited no fewer than three mismatches with AG_spacers (minMM_AG ≥ 3) or shared no significant matching sequence with AG_spacers; Class 1.1 spacers exhibited fewer than three mismatches with AG_spacers.
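The grading rules above can be summarized as a small decision function. The following Python sketch is only an illustration under stated assumptions: `None` stands for "no significant matching sequence", the flag `three_mm_near_pam` is a hypothetical stand-in for the positional criterion ("three mismatches adjacent to PAM sites"), and the alignment step that would produce these values is not shown.

```python
from typing import Optional


def grade_gg(min_mm_gg: Optional[int], three_mm_near_pam: bool = False) -> int:
    """Grade a GG_spacer against the other GG_spacers (Class 0, 1 or 2).

    min_mm_gg is the minimum number of mismatches to any other GG_spacer,
    or None when no significant match was found (assumed encoding).
    """
    if min_mm_gg is None:
        return 0
    if min_mm_gg >= 4 or (min_mm_gg == 3 and three_mm_near_pam):
        return 1
    return 2


def grade_spacer(min_mm_gg: Optional[int],
                 min_mm_ag: Optional[int],
                 three_mm_near_pam: bool = False) -> str:
    """Return the final class label: "0.0", "0.1", "1.0", "1.1" or "2"."""
    gg_class = grade_gg(min_mm_gg, three_mm_near_pam)
    if gg_class == 2:
        return "2"  # Class 2 spacers are not compared with AG_spacers
    # Suffix .0: at least three mismatches to every AG_spacer, or no match.
    ag_specific = min_mm_ag is None or min_mm_ag >= 3
    return f"{gg_class}.{0 if ag_specific else 1}"


def is_highly_specific(label: str) -> bool:
    """Class 0.0 and Class 1.0 are treated as highly specific sites."""
    return label in {"0.0", "1.0"}
```

For example, a spacer with no GG match and no AG match is graded "0.0", whereas a spacer with five GG mismatches but only one AG mismatch is graded "1.1" and is excluded from the highly specific set.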

PCR verification and sequence analysis
Primer pairs flanking the selected target sites were designed by using the Primer3web ( ; and a final extension at 72˚C for 10 min. PCR amplification of each site was repeated three times, and the products were then directly sequenced and assembled. Each sequence was aligned to the reference genome by using local blastn (v2.9.0+).
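As an illustration of how the blastn results could be screened for unique genomic locations, the following Python sketch parses the standard 12-column tabular BLAST output (`-outfmt 6`). The output format, the identity and length thresholds, and the function name are assumptions for this example and are not taken from the study.

```python
from collections import defaultdict


def unique_hit_queries(lines, min_identity=98.0, min_length=100):
    """Map each query to True if it has exactly one confident hit location.

    `lines` are rows of BLAST tabular output with the default columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore. Thresholds are illustrative, not from the paper.
    Queries without any hit passing the thresholds are absent from the result.
    """
    locations = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident, length = float(fields[2]), int(fields[3])
        sstart, send = int(fields[8]), int(fields[9])
        if pident >= min_identity and length >= min_length:
            # Normalize coordinates so minus-strand hits compare consistently.
            locations[qseqid].add((sseqid, min(sstart, send), max(sstart, send)))
    return {query: len(locs) == 1 for query, locs in locations.items()}
```

A sequenced amplicon that maps to a single location would be reported as `True`; one with two or more confident hits would be reported as `False` and would warrant closer inspection.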

PLOS ONE
Identification and analysis of highly specific CRISPR/Cas9 editing sites in pepper
but the density of NGG-PAM in pepper was much less than that in monocot species such as rice (101.69/Kb) and maize (119.22/Kb) [12]. With respect to the composition of the PAM sites, the TGG and CGG types accounted for the highest (~38.88%) and lowest (~7.44%) proportions of total NGG-PAM sites, respectively (Fig 1A), similar to the composition pattern found in the grape genome [21]. For NAG-PAM sites, the AAG type was the most abundant, with a proportion of ~36.07%, followed by TAG, GAG and CAG, accounting for 29.55%, 19.54% and 14.84% of the total NAG-PAM sites, respectively (Fig 1B).

and P8 exhibited the most and least CRISPR/Cas9 editing sites, respectively (Table 1). The number of NGG-PAM and NAG-PAM sites on the pepper chromosomes was significantly positively correlated (R² = 0.997) with chromosome length (Fig 3). (~1.49%) NGG-PAM sites were located in intron and exon regions, respectively, and the rest (~0.32%) were located in UTRs and splicing regions (Table 2). Regarding the distribution pattern in different genomic regions, the pattern of NAG-PAM sites was similar to that of NGG-PAM sites (Table 2). The density of CRISPR/Cas9 editing sites in genic regions (including UTRs, exons, introns and splicing sites) was lower than that in intergenic regions for NGG+NAG-PAM (159.03/Kb versus 180.68/Kb, Fig 4A), NGG-PAM (60.55/Kb versus 68.87/Kb, Fig 4B) and NAG-PAM (98.49/Kb versus 111.81/Kb, Fig 4C), which differs from the situation in grape [21].

Content of highly specific NGG-PAM sites in pepper genome
Through filtering and alignment analysis, 30,402,397 (~13.22%) NGG-PAM sites were successfully graded based on their specificity (Table 3). The total number of highly specific NGG-PAM sites in pepper, including those belonging to Class 0.0 and Class 1.0, was 29,623,855, which was 4.50 times higher than that in tomato and accounted for ~12.88% of the total NGG-PAM sites (Table 3). This is in line with the general rule that the number of specific gRNA spacers is positively correlated with genome size in eudicot species [12]. On average, there were 8.81 highly specific sites per Kb in the pepper genome, which is comparable to the density in the tomato genome (8.42/Kb, Table 3).
Fig 5. PCR amplification of 19 highly specific target sites. M, DL2000 plus; 1 to 10 represent A1 to A10, belonging to Class 0.0; 11 to 19 represent B1 to B9, belonging to Class 1.0 (S1 Table). https://doi.org/10.1371/journal.pone.0244515.g005
To validate the specificity of target sites belonging to Class 0.0 and Class 1.0, a random set of 19 sites was chosen for PCR amplification, and the PCR products were then directly sequenced and assembled. After alignment back to the Zunla-1 reference genome, every product matched one unique location in the genome (Fig 5, S1 Table and S1 Fig), indicating that the target sites of Class 0.0 and Class 1.0 had a low risk of off-target effects.
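The proportions reported at the start of this section can be reproduced from the site counts given in the text. The short Python check below uses only numbers stated in the paper (the total of 229,909,837 NGG-PAM sites comes from the abstract); the helper name `pct` is hypothetical.

```python
# Site counts as reported in the text.
total_ngg_sites = 229_909_837      # all NGG-PAM sites in the pepper genome
graded_sites = 30_402_397          # NGG-PAM sites graded after filtering
highly_specific_sites = 29_623_855  # Class 0.0 + Class 1.0


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to 2 decimals."""
    return round(100 * part / whole, 2)


graded_share = pct(graded_sites, total_ngg_sites)            # 13.22
specific_share = pct(highly_specific_sites, total_ngg_sites)  # 12.88
print(graded_share, specific_share)
```

Both values agree with the reported ~13.22% and ~12.88%.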

Characterization of highly specific NGG-PAM sites' distribution in pepper genome
The highly specific NGG-PAM sites were evenly distributed on all 12 chromosomes (P1~P12) of pepper (Fig 2). With the exception of P0, chromosomes P1 and P2 contained the maximum and minimum number of highly specific NGG-PAM sites, respectively (Table 3). The number of highly specific NGG-PAM sites in different genomic regions is shown in Table 4. Similar to the distribution of all NGG-PAM sites, a total of 26,699,387 (~90.13%) highly specific NGG-PAM sites were located in intergenic regions, which was 9.13 times the number in genic regions (Fig 4D). However, the average density of highly specific NGG-PAM sites in genic regions was higher than that in intergenic regions, both overall (13.80/Kb versus 8.47/Kb, Fig 4D) and separately for Class 0.0 (0.015/Kb versus 0.008/Kb, Fig 4E) and Class 1.0 (13.79/Kb versus 8.46/Kb, Fig 4F). The same phenomenon occurs in the grape genome [21].
We calculated the percentage of annotated genes that contained highly specific NGG-PAM sites identified in this study and found that 34,251 (~96.93%) of the 35,336 annotated genes exhibited at least one highly specific NGG-PAM site in their exons, and 90.50% of annotated genes exhibited at least four highly specific NGG-PAM sites (Fig 6 and S2 Table), indicating that the set of highly specific CRISPR/Cas9 editing sites identified in this study is widely applicable and should contribute to minimizing the off-target effects of CRISPR/Cas9 in pepper.
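The shares and ratios reported in this section can be checked with simple arithmetic. The Python sketch below uses the counts given in the text; the genic count is not stated directly and is obtained here by subtraction, under the assumption that every highly specific site is either genic or intergenic.

```python
# Counts as reported in the text.
highly_specific_total = 29_623_855
highly_specific_intergenic = 26_699_387
genes_total = 35_336
genes_with_exonic_site = 34_251

# Assumption: genic = total - intergenic (no third category).
highly_specific_genic = highly_specific_total - highly_specific_intergenic

intergenic_share = round(
    100 * highly_specific_intergenic / highly_specific_total, 2)  # 90.13
intergenic_to_genic = round(
    highly_specific_intergenic / highly_specific_genic, 2)        # 9.13
gene_coverage = round(100 * genes_with_exonic_site / genes_total, 2)  # 96.93

print(highly_specific_genic, intergenic_share, intergenic_to_genic, gene_coverage)
```

The derived genic count (2,924,468) reproduces the reported 9.13-fold ratio, which suggests the subtraction assumption is consistent with the published figures.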
Supporting information
S1 Table. BLAST results of a random set of 19 highly specific editing sites.