In silico analysis enabling informed design for genome editing in medicinal cannabis; gene families and variant characterisation

Background
Cannabis has been used worldwide for centuries for industrial, recreational and medicinal purposes; however, to date no successful attempts at editing genes involved in cannabinoid biosynthesis have been reported. This study proposes and develops an in silico best practices approach for the design and implementation of genome editing technologies in cannabis to target all genes involved in cannabinoid biosynthesis.

Results
A large dataset of reference genomes was accessed and mined to determine copy number variation and associated SNP variants for optimum target edit sites for genotype independent editing. Copy number variance and highly polymorphic gene sequences exist in the genome, making genome editing using CRISPR, Zinc Fingers and TALENs technically difficult. Whether a sequence represented an allele or an additional gene copy was evaluated through nucleotide and amino acid alignments with comparative sequence analysis. Based on the determined gene copy number and presence of SNPs, multiple online CRISPR design tools were used to design sgRNA targeting every gene, accompanying allele and homolog throughout all involved pathways to create knockouts for further investigation. Universal sgRNA were designed for highly homologous sequences using MultiTargeter and visualised using Sequencher, creating unique sgRNA that avoid SNP and shared nucleotide locations and target optimal edit sites.

Conclusions
Using this framework, the approach has wider applications to all plant species regardless of ploidy number or highly homologous gene sequences.

Significance statement
Using this framework, a best-practice approach to genome editing is possible in all plant species, including cannabis, delivering a comprehensive in silico evaluation of the cannabinoid pathway diversity from a large set of whole genome sequences. Identification of SNP variants across all genes could improve genome editing, potentially leading to novel applications across multiple disciplines, including agriculture and medicine.

Introduction
with high THC and low CBD have B_T/B_T and B_D/B_D genotypes respectively and contain high levels of CBD with little to no THC, and B_T/B_D genotypes similar concentrations of THC and CBD. More recently, Grassa et al. (2018) completed the chromosome genome sequence assembly of cannabis, finding that cannabinoid biosynthesis genes are not located at a single locus but are pericentromeric, nested in repeats, leading to low levels of recombination.
Biosynthesis of cannabinoids is complex, with numerous enzymatic steps and interactions. Fatty acid and isoprenoid precursors are synthesised via the hexanoate, methylerythritol 4-phosphate (MEP) and geranyl diphosphate (GPP) pathways. Hexanoyl-CoA is produced via the hexanoate pathway and acts as the substrate for olivetolic acid synthase (OLS), yielding OLA [21]. Prenyl side chains are synthesised via the MEP pathway and supply the substrate for geranyl diphosphate synthesis. GPP and OLA are joined by an aromatic prenyltransferase (PT), creating CBGA [22]. Finally, the THC and CBD oxidocyclases catalyse the production of THCA and CBDA [23,24] (Fig 1). Identification of all genes encoding biosynthetic enzymes now allows biotechnological approaches to control cannabinoid content through genomically informed decisions on molecular breeding with tools such as genome editing. Genome editing technologies, such as Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR/Cas9), Zinc Finger Nucleases (ZFNs) and Transcription activator-like effector nucleases (TALENs), utilise sequence-specific nucleases to induce a double strand break (DSB) at a specific genomic location through homology-based binding of a guide RNA or DNA-binding protein domains [25]. The predominant repair pathway in plants is non-homologous end joining (NHEJ), with homologous recombination (HR) used less often [26]. NHEJ repairs the cut DNA without a homologous DNA template; however, NHEJ can be error-prone, causing mutations such as base pair deletions, insertions or rearrangements [26,27]. HR requires the provision of a DNA template, with homologous flanking regions used as a guide, to repair the break either correctly or by incorporating desired alterations at the DNA break point [28]. The use of genome editing techniques to manipulate gene function in a range of plant species has allowed for the generation of improved crop varieties, improved resistance and increased yield [29][30][31].
Within the CRISPR/Cas9 system, the single-guide RNA (sgRNA), which carries a 20-nt sequence complementary to the gene of interest, guides the Cas9 endonuclease to a target adjacent to the protospacer-adjacent motif (PAM) site, where Cas9 binds and cleaves the DNA strand [32]. Online tools available for sgRNA design and plasmid construction have been extensively reviewed [33], with CRISPR/Cas9 being broadly implemented in plants such as Arabidopsis, tobacco, rice and sorghum [34,35]. ZFNs contain a tandem array of Cys2-His2 finger domains linked to the FokI catalytic domain, with each finger domain recognising 3 bp of DNA [36]. The finger arrays are fused to the catalytic domain of FokI, which functions as a dimer. Binding of the zinc fingers to the target loci brings the two FokI monomers into close proximity, causing them to dimerise and create a DSB [37]. Similar in mode of action to ZFNs, TALENs comprise a nonspecific FokI nuclease domain fused to a DNA binding domain containing highly conserved repeats from the transcription activator-like effectors (TALEs) secreted by Xanthomonas spp. [38].
Off-target mutations caused by inefficient guide design and FokI monomer dimerisation could disrupt the functions of unintended genes, causing genetic instability and unintended cytotoxic effects. Single nucleotide polymorphisms (SNPs) in genomic DNA across large, diverse populations will disrupt the homology-based binding of sgRNA, ZFs and TALEs used with CRISPR/Cas9, ZFNs and TALENs. Target specificity is tightly controlled by sequence homology; as the number of tolerated mismatches increases, off-target cleavage also increases [39]. Avoiding off-target effects is critically important for effective and efficient genome editing, and genomically informed designs based on thorough deep-read genome sequencing are more important than ever. If these tools are to be regulated and used in product design, absolute confidence in design based on homology is needed.
In this study we outline the best practice workflow for identifying target sequences and their corresponding design using sgRNA in cannabis for the manipulation of the entire pathway of THC and CBD synthesis. Through genomically informed decisions based on previously published cannabis pangenome, generic and specific sgRNA can be designed using online tools to successfully target genes of interest with no in silico detected off-targets. The workflow here can help make informed decisions on gene targeting in cannabis, leading to novel cannabinoid production by targeting cannabis biosynthesis genes, accelerating the understanding of the relationships of genes in cannabinoid production.

Cannbio-2 and pangenome gene analysis
Cannabinoid biosynthesis genes were accessed from a variety of sources and public databases (Table 1) to annotate Cannbio-2. Sequences were downloaded and used as queries for BLAST analysis against the Cannbio-2 genome assembly with an e-value threshold set at <10^-10. Identified regions of interest from the reference genome were annotated using NCBI nBLAST, to confirm sequence identity, and MEGANTE [42], and coding sequences (CDS) were visualised using FGENESH [43]. Sequences are available in S1 Table in S1 Data.
Publicly available cannabis genomes were downloaded (as described above) and BLAST analysed using the Cannbio-2 gene sequences described here, with an e-value threshold set at <10^-10, to determine copy numbers within each respective genome (Table 1).
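The copy number step above can be sketched in code. The following is a minimal illustration, not the pipeline used in this study: it assumes BLAST results exported in the standard 12-column tabular format (outfmt 6), applies the <10^-10 e-value threshold, and counts hit loci per query gene. The function name `count_copies` and the `max_gap` parameter (hits on the same scaffold closer than this are merged into one locus, for example exons of one gene) are hypothetical choices made for the example.

```python
from collections import defaultdict

def count_copies(blast_tsv, evalue_cutoff=1e-10, max_gap=5000):
    """Count putative gene copies per query from BLAST tabular (outfmt 6) text.

    Columns assumed: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    max_gap is an illustrative assumption, not a value from the study.
    """
    hits = defaultdict(list)
    for line in blast_tsv.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if float(fields[10]) >= evalue_cutoff:
            continue  # keep only hits below the e-value threshold
        start, end = sorted((int(fields[8]), int(fields[9])))
        hits[(fields[0], fields[1])].append((start, end))

    copies = defaultdict(int)
    for (query, _scaffold), intervals in hits.items():
        intervals.sort()
        last_end = None
        for start, end in intervals:
            if last_end is None or start > last_end + max_gap:
                copies[query] += 1  # far from the previous hit: new locus
                last_end = end
            else:
                last_end = max(last_end, end)  # same locus, extend it
    return dict(copies)
```

Counts obtained this way are only a first pass; as discussed later in the text, fragmented assemblies and nested repeats mean each locus still needs to be checked by alignment and annotation.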

SNP discovery
SNP discovery was performed by Braich et al. [44], with a brief summary given here. Genomic DNA was extracted from fresh leaf material from a range of 660 mixed cultivars (high CBD, high THC, balanced THC:CBD, male and female plants) using the DNeasy 96 Plant Kit (QIAGEN, Hilden, Germany) according to the manufacturer's instructions. Each library was prepared with an in-house library prep protocol using enzymatic shearing with MspJI (NEB, MA, USA) and sequenced on a HiSeq3000 instrument (Illumina Inc., San Diego, CA, USA). The resulting sequence data were reference aligned to the Cannbio-2 genome assembly previously described, using the BWA MEM algorithm [45]. Variants were identified using SAMtools [46], and a bed file was created with the scaffold regions of interest matching the sequences of cannabinoid biosynthesis genes. Alignments were sorted and used for variant calling with an adjusted mapping quality (-C 50) and a minimum read depth of 5, generating a consensus sequence. Consensus sequences for the CDS of genes of interest are available in S2 Table in S1 Data.

Determination between allele or gene
The presence of an allele, or of extra copies of a gene, was determined based on genomic nucleotide multiple sequence alignments using MUSCLE [47]. Sequences of similar length with alignment similarity between 80-98% that produced identical translated proteins were determined to be alleles. Where large variation existed in genomic nucleotide sequence length or content, or where nucleotide sequences were <1000 bp, predicted mRNA sequences from FGENESH [43] were used for alignment; in this case, alleles were determined if similarity was >98%.
Additional gene copies were determined if more than two haplotypes were found with similarities >90% but <98%, because cannabis is an outbreeding species and the Cannbio-2 genome sequence assembly is based on a heterozygous plant.
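The mRNA-based thresholds above can be expressed as a small decision rule. This is a hedged sketch only: it assumes the two sequences have already been aligned (for example with MUSCLE) so that they have equal length with `-` for gaps, it ignores gapped columns when computing identity, and it treats exactly 98% as an additional copy, a boundary the text does not define. The names `percent_identity` and `classify_pair` are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [(x, y) for x, y in zip(aligned_a.upper(), aligned_b.upper())
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def classify_pair(identity, allele_min=98.0, copy_min=90.0):
    """Apply the >98% (allele) and >90% to 98% (additional copy) thresholds."""
    if identity > allele_min:
        return "allele"
    if identity > copy_min:
        return "additional copy"  # boundary handling at 98% is an assumption
    return "other"
```

For example, `classify_pair(percent_identity(seq1, seq2))` labels one aligned pair; the additional requirement in the text that more than two haplotypes be present would have to be checked across the whole alignment.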

sgRNA design and confirmation
CHOPCHOP [48], CRISPR MultiTargeter [49], Crispor [50] and ZiFit [51] were used for the selection of sgRNAs for use with CRISPR-Cas9. Entire CDS regions, as predicted by FGENESH [43] and MEGANTE [42], were used as search queries. The sgRNA on- and off-target parameters suggested by each online tool were used. For visual confirmation of SNP avoidance, sgRNAs were manually aligned to Cannbio-2 and consensus sequences using Sequencher [52]. sgRNA designs are available in S3 Table in S1 Data.
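The idea of SNP-avoiding guide selection can be illustrated with a short script. This is a simplified sketch and not a substitute for the scoring performed by the online tools named above: it assumes an NGG PAM (as for SpCas9), a 20-nt protospacer, and a list of known SNP positions given as 0-based coordinates on the forward strand of the CDS. Any candidate whose protospacer or PAM overlaps a SNP is discarded. The function names are hypothetical.

```python
def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def find_snp_free_guides(cds, snp_positions=(), guide_len=20):
    """List (start, strand, guide) for NGG-adjacent guides that avoid all SNPs.

    start is the 0-based forward-strand coordinate where the guide+PAM
    footprint begins; snp_positions are 0-based forward-strand coordinates.
    """
    cds = cds.upper()
    snps = set(snp_positions)
    n = len(cds)
    guides = []
    for strand, seq in (("+", cds), ("-", revcomp(cds))):
        for i in range(n - guide_len - 2):
            pam = seq[i + guide_len:i + guide_len + 3]
            guide = seq[i:i + guide_len]
            if pam[1:] != "GG" or "N" in guide:
                continue
            # Convert the guide+PAM footprint to forward-strand coordinates.
            if strand == "+":
                lo, hi = i, i + guide_len + 3
            else:
                lo, hi = n - (i + guide_len + 3), n - i
            if any(pos in snps for pos in range(lo, hi)):
                continue  # footprint overlaps a known SNP
            guides.append((lo, strand, guide))
    return guides
```

Excluding the whole footprint, including the variable first PAM base, is a deliberately conservative choice for the example; candidates that pass would still need on- and off-target scoring and visual confirmation against the consensus sequences.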

Genome mining for cannabinoid biosynthesis genes
To locate all the genes involved in cannabinoid biosynthesis, query references were downloaded from publicly available databases (Table 1) and BLAST analysis was performed against the Cannbio-2 genome assembly. All genes in the MEP, GPP, hexanoate and cannabinoid pathways were identified (Table 1). Two 1-deoxy-D-xylulose 5-phosphate synthase (DXS) genes were discovered in the MEP pathway alongside single copies of 1-deoxy-D-xylulose 5-phosphate reductoisomerase (DXR), 4-diphosphocytidyl-2C-methyl-D-erythritol synthase (MCT), 4-diphosphocytidyl-2C-methyl-D-erythritol kinase (CMK), 2C-methyl-D-erythritol 2,4-cyclodiphosphate synthase (MDS), 4-hydroxy-3-methylbut-2-en-1-yl diphosphate synthase (HDS) and 1-hydroxy-2-methyl-2-(E)-butenyl 4-diphosphate reductase (HDR). Single genes for isopentenyl diphosphate isomerase (IPP/IPI) and for the small and large subunits of geranyl pyrophosphate synthase (GPP) were identified in the GPP pathway. In the hexanoate pathway, four copies of fatty-acid desaturase (FAD2) were identified using the Purple Kush (PK) desaturase gene sequence as the query. Translated proteins from all FAD2 homologs were tBLASTn analysed to confirm correct annotation, and all are believed to be involved in cannabinoid biosynthesis. Lipoxygenase (LOX) and hydroperoxide lyase (HPL) were identified using the associated PK gene sequences as queries, with very low (<1%) sequence variation. Acyl-activating enzyme (AAE1) was found using previously published sequences (Table 1) amongst the AAE superfamily, which contains 15 AAE homologs. The translated AAE1 annotation was confirmed using tBLASTn and the gene was isolated from the large superfamily of highly homologous gene sequences. In the cannabinoid pathway a single copy of olivetol synthase (OLS) was discovered with >98% identity to OLS sequences deposited in NCBI. Two copies of olivetolic acid cyclase (OAC) were discovered.
The CDS of the set of alleles and of the single additional copy of OAC were aligned, and 14 SNPs exist between them. All OAC sequences were annotated using MEGANTE and tBLASTn to confirm copy number. Two complete, identical, functional CBDAS-like genes were discovered (CBDAS-like#1 and #2), with three closely related homologs also existing (CBDAS-like#3-5). The CBDAS-like homologs contain several SNPs causing sequence variation in the translated protein sequences. Four truncated CBDAS homologs were also discovered (CBDAS-truncated#1-4), each containing stop codons resulting in truncated protein sequences. Two complete copies of cannabichromenic acid synthase (CBCAS) were found (CBCAS#1 and #2) with identical sequences except for a C to T SNP at base pair 662, though identical proteins are predicted. One closely related truncated homolog of CBCAS was also discovered (CBCAS-truncated), producing a substantially shorter predicted protein sequence. A single copy of THCAS was also discovered.

Pan-genome copy number variance comparison
The assembled gene set was then used to query the publicly available cannabis genome sequences for gene copy number and to identify potential homologs. Differences in gene copy number exist between the datasets due to the resolution of the sequence data, genetic mapping, scaffolding technologies and natural variation between genomes. Variations in gene presence and copy number, using the assembled reference gene list, exist for DXS1, DXS2, DXR, IPP/IPI, GPP_SSU, FAD2, AAE1, OLS, OAC, CBDAS, THCAS and CBCAS (Table 1). Within the Finola genome, DXS1, DXS2, GPP_SSU and AAE1 were not discovered, and copy number variation exists for FAD2, OLS and OAC when compared to Cannbio-2 (Table 1). Within the CBDRx genome, no copy of IPP/IPI was discovered, which is confirmed by the most recent release of the CBDRx genome. Copy number variation exists for FAD2 compared to Cannbio-2, with 4 FAD2 genes discovered in Cannbio-2 and 5 in CBDRx. The updated PK genome had at least one copy of each gene, with variations in copy number existing for DXR, FAD2, OLS, OAC and the synthase genes compared to Cannbio-2.

Analysis of SNPs and informed sgRNA design
To assess gene variation, the six hundred and sixty whole genomes that were sequenced were used to establish a resource of SNP locations (consensus sequence) (S3 Table in S1 Data), which were then overlaid onto the identified genes integral to cannabinoid biosynthesis. With the exception of FAD2, which belongs to a large, diverse family of desaturases, the cannabinoid biosynthesis genes are highly conserved with little variation within their sequences (Table 2). Each consensus sequence containing SNP locations was then used for guide designs that avoid all known nucleotide variations, creating universal sgRNA which can be broadly used on any cannabis genotype and, in the case of highly similar gene sequences, unique sgRNA designed to target only a specific gene of interest (Fig 2). Sequences from the reference genome were entered into the online design tools CHOPCHOP, CRISPR MultiTargeter, Crispor and ZiFit to generate sgRNA based on their preferred scoring matrices, followed by manual and visual comparison. Taking the highest-ranking designs from each online tool, whose scores predict off-targeting potential and binding affinity, each sgRNA was visualised using Sequencher to identify the regions it would target, whether in regions of sequence homology across the pan-genome or in regions containing SNPs. A total of 145 sgRNAs were designed targeting every gene in the combined pathways (S1 Table in S1 Data). The sgRNA generated consist largely of a pool of universal sequences which, regardless of the cultivar used, can target each gene in the combined pathways through the use of the consensus sequence generated. Multiple Cannbio-2 specific sgRNA were also designed in regions where sequence heterogeneity towards the 5' translated regions meant that universal sgRNA design was not possible.
All sgRNA were re-BLAST analysed against the reference genome for detection of off-site targeting, with results confirming no complete 20-nt sgRNA had potential off targets outside their respective gene sets.
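As an illustration of the off-target check, the sketch below scans a sequence for sites that match a guide with a limited number of mismatches. It is an assumption-laden stand-in for the BLAST search described above, not a reproduction of it: it uses a simple sliding-window mismatch count on both strands, ignores the PAM requirement and gaps, and would be too slow for a full genome without indexing. The function names are hypothetical.

```python
def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def near_matches(guide, target, max_mismatches=1):
    """List (position, strand, mismatches) for sites within max_mismatches.

    position is the 0-based start of the matching window in target;
    strand "-" means the reverse complement of the guide matched there.
    """
    guide, target = guide.upper(), target.upper()
    k = len(guide)
    hits = []
    for strand, query in (("+", guide), ("-", revcomp(guide))):
        for i in range(len(target) - k + 1):
            mismatches = 0
            for a, b in zip(query, target[i:i + k]):
                if a != b:
                    mismatches += 1
                    if mismatches > max_mismatches:
                        break
            else:
                # Loop finished without exceeding the mismatch budget.
                hits.append((i, strand, mismatches))
    return hits
```

A guide would pass this check when its only hits fall inside the intended gene set; hits with one or two mismatches elsewhere flag candidate off-target sites for closer inspection.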

Discussion
Phytocannabinoids are of particular interest for their pharmacological applications in a growing number of medical conditions. Knowledge and understanding of the gene interactions and their relationship to final cannabinoid concentration can facilitate improved cannabis strains with desired novel cannabinoid levels. Creating a pangenome consensus of each gene in the contributing pathways allows for genomically informed decisions, based on known SNP location and frequency as well as presence absence variations (PAV), for crop improvement by means of genome editing. Using publicly available sequence information, at least one full length transcript was found for every gene involved in cannabinoid biosynthesis, agreeing with previous genome sequencing and genome mining reports [24,53]. Gene copy number in the MEP pathway also agrees with previously published analysis [53]. Two DXS genes were discovered, with previous reports showing DXS1 having elevated expression levels in photosynthetic tissues, underlining its importance in isoprenoid production [54]. DXS2 accumulates in the roots with expression patterns suggesting synthesis of specific isoprenoids; however, its role in cannabinoid biosynthesis is yet to be determined. Multiple genes for DXR [55], HDR [56] and IPI/IPP [57] have been previously reported; however, only single copies of these genes were discovered in the Cannbio-2 genome. It is possible that multiple copies of these genes could be responsible for the accumulation of cannabinoid precursors, leading to novel cannabinoid levels. Fatty acid desaturase enzymes belong to two large multifunctional classes, either membrane bound or soluble. The desaturase of interest in cannabinoid production, FAD2, is involved in the hexanoate pathway, leading to the production of hexanoyl-CoA, the first precursor in the cannabinoid pathway.
Despite the complexity and number of FAD2 gene sequences, it is believed that the correct version was identified. Our data show four copies of this gene, whereas previous comparative studies discovered seven gene copies in the Finola genome [58] and only 2 copies in the CBDRx genome [59]. Further evidence of gene copy number variance across published genomes exists for OLS and OAC, posing the question of whether gene copy number directly influences chemovar determination. Previous studies have utilised short read sequence data to identify gene sequences; given the anticipated degree of sequence similarity between duplicate gene copies, inferring CNVs from such data with a reference-alignment approach would be inaccurate. However, long read sequencing technology can generate sequence data through extended repetitive regions and describes genome architecture and gene sequence and structure at a much higher level, making it a reliable platform for the determination of CNVs. The THC-rich PK cultivar has two copies of OLS and OAC, whereas the CBD-rich cultivar CBDRx has just one copy of each from our BLAST search results, though 2 copies of OLS and no copies of OAC are reported. OAC is a polyketide synthase enzyme that catalyses the formation of olivetolic acid, which forms the polyketide nucleus of cannabinoids [21]. This suggests that OAC was omitted from the reported CBDRx genome, though it is considered essential for cannabinoid biosynthesis. The Cannbio-2 cultivar, with relatively equal (1.8:1) THC and CBD cannabinoid concentrations, contains a single copy of OLS and 2 copies of OAC. The exact relationship between gene copy number and cannabinoid production needs to be further studied through metabolic engineering in heterologous hosts or through genome editing.
Using the synthase genes discovered in the Cannbio-2 genome sequence as queries against the CBDRx, Finola and PK genomes, the total number of synthase genes varies considerably between the cultivars. In the CBDRx genome [59] 16 synthase genes are reported; however, only 11 were discovered in CBDRx using sequences from Cannbio-2 as queries. Determining which synthase genes were missed is difficult due to the nested, repeating nature of synthase genes around the centromere.
As long read sequencing is error prone, the correct assembly of CBDAS in the Cannbio-2 assembly has proven problematic, potentially exacerbated by the hybrid nature of the genotype. It is therefore likely that the CBDAS gene has been incorrectly assembled, either as a chimeric version of the functional and non-functional gene alleles or as the non-functional allele only, most likely as the gene that is referred to as CBDAS-truncated#3. The Cannbio-2 genome clearly has a functional CBDAS allele, as a 100% identity sequence has been identified from the transcriptome data set [60] (Cannbio_016865). Grassa et al. (2021) identified the total number of potential synthase genes based on sequence alignment to THCAS mRNA at >82%. The variation in synthase genes is most likely due to PAV across different cultivars, which is common in the case of maize [61]. Total synthase gene number for Finola and PK is not given in the original genome [62]; however, 9 and 14 genes were found when querying with Cannbio-2 sequences. Grassa et al. (2021) identified 5 and 16 synthase genes within PK and Finola using their respective approach to discovering copy numbers.
THCAS and CBDAS CNV have recently been reported from multiple cannabis cultivars, with similar findings that this CNV partially explains variation in cannabinoid content [63,64]. Multiple gene copies are a known means of increasing production of secondary metabolites [65], which could suggest that an increased copy number of synthase genes would in turn increase cannabinoid production. However, possibly a greater explanation of increased cannabinoid potency was discussed by Grassa et al. (2018), with the discovery that separate QTLs, not linked to synthase gene clusters, were responsible for up to 17% of the variation in cannabinoid quantity. This could possibly help explain the gene copy number variation observed in the genes mentioned.
Sequence data are completely absent for specific genes in the CBDRx, Finola and PK genomes, posing the question of whether genome assembly or actual PAV mechanisms are responsible. Within the Finola genome, 4 genes could not be identified. Both forms of DXS are absent; since previous studies demonstrate that DXS knock down lines produce reduced levels of isoprenoids and display more severe phenotypes [66,67], this suggests that the fragmented genome assembly failed to capture these genes. GPP SSU and AAE1 were also not identified; however, previous reports show both these genes are critical for isoprenoid and cannabinoid production, indicating they were missed in the genome assembly. AAE1 was found to be the gene which synthesises hexanoyl-CoA from hexanoate, supplying the cannabinoid pathway [68], and since Finola still produces cannabinoids, it is concluded that its absence was also an assembly error. GPP is a heterodimer requiring both subunits, large and small, for optimum activity. GPP has been shown to remain active, at lower levels, when the small subunit was inactive [69]; however, both subunits were still present, suggesting the absence of GPP SSU in the Finola genome is also due to assembly error. The absence of IPP/IPI in the CBDRx genome is also strongly suggested to be due to assembly error, since previous studies on Arabidopsis showed that double mutant knockdown of IPP/IPI produced dwarfism and male sterility [70].
The SNP location resource revealed that some genes are more highly conserved than others. This variable conservation indicates continuing evolution through recombination and divergence. Comparative analysis of SNPs present in genes of variable copy number in the Cannbio-2, CBDRx, Finola and PK genomes was performed (excluding results of no gene presence). Through multiple sequence alignments of coding sequences, it was observed that SNPs occurred in the extra gene copy where homozygous alleles were present. This suggests that either sequencing error has occurred, or there is in fact an extra copy of the gene and a set of alleles. Within the Cannbio-2 genome, OAC produced three sequence similarity matches, with two sequences determined to be alleles and an extra copy of the gene existing as a truncated version. When gene sequences were aligned, SNPs occurred in all genes and, when translated, nearly identical protein sequences (>99%) were produced, confirming that an extra copy of the gene was present, potentially in a hemizygous condition. Within the PK genome, copy number variation exists for OLS and OAC. In a similar way to OAC in the Cannbio-2 genome, OLS produced three hits, two of which were determined to be alleles and one to be an extra copy. SNPs existed in all three sequences when coding regions were aligned, with similar results obtained from protein sequence alignment. Initial alignment of both OAC hits in PK found a 98.5% similarity in genomic sequences; however, no gene prediction was possible on one of the sequences, possibly due to a premature stop codon from a SNP rendering this gene inactive, potentially indicating that it exists as a pseudogene.
How this copy number variation contributes to differential cannabinoid production is yet to be fully elucidated; however, using the known SNP locations for each extra gene copy in Cannbio-2, sgRNA could be designed to help understand this relationship. Using multiple online tools for the design of sgRNA ensured that all possible guide designs could be assessed for in silico off-targeting. Each tool implements different scoring rules based on off-targets, mismatches, efficiency score, existence of self-complementary regions, GC content, location of guide and multiple sequence alignments [48,49]. Due to the diversity in gene content and sequence variation and the absence of a well characterised pan-genome for cannabis, analysis with multiple tools was essential. The presence of a PAM site is necessary for Cas9 to bind and cleave at the sgRNA target, and even though these tools scanned the gene sequence for PAM sites, results occasionally varied between the online tools. Visualisation of sgRNAs was clearer in CHOPCHOP than in the other tools, and CHOPCHOP regularly provided the best guide designs. However, when highly homologous sequences were used, MultiTargeter was able to perform sequence alignments and produce unique sgRNA for each sequence, a feature not available in the other tools. sgRNA for the unique synthases were first designed using MultiTargeter and further verified using CHOPCHOP for visualisation. sgRNA were targeted to the earliest possible exon for maximum likelihood of a frame shift mutation. The error prone nature of NHEJ often results in small deletions or insertions at the DSB, leading to protein misfolding and thus production of a knockout gene. Each identified gene, with its accompanying allele where applicable, was analysed and sgRNAs were designed to be either universal, inactivating both related genes, or, where sequence heterozygosity exists, specific (S1 Table in S1 Data).
Mutational studies identifying differential expression in isoprenoid biosynthesis genes, including DXS [67], DXR [71], IPP/IPI [70] and MDS [72] have previously been reported. Mutational studies on the unique synthase genes are yet to be reported, potentially due to the high homology between enzymes. Using genome editing, sequence homogeneity between synthase genes could potentially lead to off-target editing, with targets suggested to have at least several nucleotides different for discrimination [73]. Where possible, each synthase gene, and accompanying homologs, had universal and specific sgRNA designed that could be used regardless of cultivar, strain or population chosen as the target. The reported sequence similarity between THCAS, CBDAS and CBCAS, up to 95% [62], requires precise, intelligent design, using multiple online tools and a large consensus population to improve the likelihood of correct gene knock down. Potential off targeting predictions given by sgRNA online tools currently use the previously fragmented genome of PK [24]. To circumvent this, each sgRNA was used as a query to BLAST against the Cannbio-2 genome for potential off-targets. From the BLAST results no sgRNA had an unexpected sequence match elsewhere in the genome, however singular nucleotide mismatches do occur. How these mismatches are tolerated during directed genome editing is yet to be determined, however it is expected that off-targeting will be more prevalent with more highly homologous gene sets.
Applying this logical workflow in silico is the benchmark standard, essential to ensure that the correct genes and associated SNPs are identified before genome editing can begin. This approach has wider applications in all genome editing efforts within species that have paleopolyploidy, large PAV gene populations or high levels of variation within the genome. The workflow explains each step taken and the tools to use to obtain universal or specific sgRNA for any gene of choice quickly and effectively, identifies where each step can encounter issues, and describes how to correct them, making this approach critical for effective genome editing with minimal off-targeting. The same approach can easily be applied to the more recent CRISPR-Cas12a system, which has been gaining popularity for editing plant genomes. The availability of fully sequenced genomes and pangenomes and the ability to accurately predict potential off-target effects and edits make this method applicable to all plant gene editing applications regardless of species. The ability to analyse the cannabis genome has only recently become available, and this work shows that, with the technologies currently available, the approach can be used quickly and effectively. Even with the limited literature and resources available for completed cannabis genomes, quick, intelligent design for genome editing in cannabis is now possible. Understanding the effect of gene copy number, PAV and SNP location and density on cannabinoid production can help create unique cannabinoid profiles for medicinal purposes.