Broad-Scale Genetic Diversity of Cannabis for Forensic Applications

Cannabis (hemp and marijuana) is an iconic yet controversial crop. On the one hand, it represents a growing market for the pharmaceutical and agricultural sectors. On the other hand, plants synthesizing the psychoactive THC produce the most widespread illicit drug in the world. Yet, the difficulty of reliably distinguishing between Cannabis varieties based on morphological or biochemical criteria impedes the development of promising industrial programs and hinders the fight against narcotrafficking. Genetics offers an appropriate alternative to characterize drug vs. non-drug Cannabis. However, forensic applications require rapid and affordable genotyping of informative and reliable molecular markers for which a broad-scale reference database, representing both intra- and inter-variety variation, is available. Here we provide such a resource for Cannabis, by genotyping 13 microsatellite loci (STRs) in 1 324 samples selected specifically for fibre (24 hemp varieties) and drug (15 marijuana varieties) production. We showed that these loci are sufficient to capture most of the genome-wide diversity patterns recently revealed by NGS data. We recovered strong genetic structure between marijuana and hemp and demonstrated that anonymous samples can be confidently assigned to either plant type. Fibres appear genetically homogeneous whereas drugs show low (often clonal) diversity within varieties, but very high genetic differentiation between them, likely resulting from breeding practices. Based on an additional test dataset including samples from 41 local police seizures, we showed that the genetic signature of marijuana cultivars could be used to trace crime scene evidence. To date, our study provides the most comprehensive genetic resource for Cannabis forensics worldwide.


Introduction
Cannabis is one of humanity's oldest cultivated plants. It is thought to have originated in central Asia and was domesticated as early as 8 000 BP for food, fibre, oil, medicines and as an inebriant. The crop has since spread across the world over the last two millennia and, due to its recent legalization in several countries, is increasingly exploited by several industrial sectors (hemp) and as a recreational drug (marijuana). The taxonomic status of Cannabis has always been disputed, as it encompasses multiple cultural, geographic, historical and functional aspects (reviewed in [1][2][3][4]). Whereas most authors now consider it a monotypic panmictic taxon, Cannabis sativa, three species or subspecies (sativa, indica and ruderalis) are often mentioned but without a comprehensive taxonomic grouping so far. The nomenclature may thus differ depending on whether it refers to morphological or chemical variation, geographic distribution, ecotype, or crop-use characteristics and intoxicant properties resulting from human selection [4][5][6][7]. Cannabis presumably diversified following selection for traits enhancing fibre and seed production ("hemp") or psychoactive properties ("drug"). Importantly, Cannabis types differ in their absolute and relative amounts of terpenophenolic cannabinoids, notably Δ1-tetrahydrocannabinol (THC), the well-known psychoactive compound of marijuana, and the non-psychoactive cannabidiol (CBD). In this context, drug-type Cannabis (marijuana) is broadly characterized by a higher overall cannabinoid content than fibre-types.
However, the most widely recognized criterion to assign a Cannabis plant to either "drug" or "hemp" type is the THC:CBD ratio, according to which three main chemical phenotype (chemotype) classes are recognized: hemp-type plants with a low ratio (THC:CBD < 1), drug-type plants with a high ratio (THC:CBD > 1), and intermediate-type plants with a ratio close to one [6,8]. The informal designations sativa and indica may have various, controversial meanings. Morphologically, the name sativa designates tall plants with narrow leaves, while indica refers to short plants with wide leaves. Among the marijuana community, however, sativa rather refers to equatorial varieties producing stimulating psychoactive effects (THC:CBD ≈ 1), whereas indica-type plants from Central Asia are used for relaxing and sedative drugs (THC:CBD > 1) [8].
Commercial interest in Cannabis declined during the 20th century, due for example to the development of synthetic fibres and to stringent policies regarding its exploitation, but this iconic weed has recently regained attention in many countries for its high medicinal, industrial and agricultural potential (reviewed in [9]). However, its usage is still controversial, in particular from agro-economic, public health and forensic perspectives. Due to its intoxicant properties, the cultivation and possession of Cannabis is under strict legal regulation. High-THC:CBD varieties are prohibited in many countries but remain the most frequently used illicit drug worldwide [10] (~180 million consumers in 2013, [11]), in the form of marijuana (dried inflorescences) or hashish (resin). In contrast, low-THC:CBD hemp crops can be exploited under licensed control for the seed oil, fibre and pharmaceutical industries. For instance, quantitative measures of THC content are currently considered by the EU for approval as a licensed hemp cultivar (below 0.2% THC weight per weight in the mature dry inflorescences; http://ec.europa.eu/food/plant_en). Yet, hemp and marijuana varieties are hardly distinguishable morphologically, and discrimination of drug vs. non-drug chemotypes by quantitative THC dosage has also proven inadequate because THC content depends on environmental factors and varies strongly during the plant's life cycle as well as between individual plants [12][13]. In addition, the qualitative assessment of the THC:CBD ratio is also problematic for an unequivocal discrimination between fibre and drug types, due to the presence of a largely variable intermediate chemotype class, the occurrence of several exceptions (e.g. hemp accessions with a THC-predominant chemotype; [14][15][16]) and the common practice among drug breeders of producing hybrid varieties.
This issue largely impedes crop improvement and full-scale industrial development; it even causes a security risk, as licensed crops may be used as a cover for illegal drug production. Moreover, it significantly limits the ability of law-enforcement agencies to trace drug seizures and link illegal producers to organized crime syndicates supplying the black market of Cannabis drugs. In addition, Cannabis is capable of long-distance dispersal [17], and fibre crops might face cryptic contamination by pollen from drug varieties.
Genetic tools offer a promising avenue to overcome these issues, especially to distinguish between drug and non-drug plants [18]. Importantly, genetics requires only small amounts of tissue as a DNA source, whereas chemical analyses necessitate inflorescences. A promising approach has been to genotype loci directly linked to THC synthesis [8,19] in association with chemotype profiling. However, this association is not ubiquitous [14][15], genotyping may be compromised by complex gene duplications and pseudogenes [20][21][22], and only a limited number of varieties among the tremendous Cannabis diversity has been validated [15]; moreover, chemotypes seem to vary greatly even among genotypes [20].
A parallel, complementary approach is to discriminate drug vs. hemp plants from their nonadaptive genetic variation. Until the recent past, the genetic diversity of Cannabis has remained surprisingly under-investigated, partly due to the important restrictions imposed by anti-drug policies, even for scientific inquiries. In the last few years, a draft genome of Cannabis was published [22] and high-density Single-Nucleotide-Polymorphism (SNP) data obtained from Next-Generation-Sequencing (NGS) techniques evidenced genome-wide differentiation between hemp and marijuana plants [23]. However, genetic resources applicable for forensics remain under-developed. Forensic investigations require sets of sufficiently informative loci that can be genotyped in large batches of samples in a rapid and affordable manner, such as microsatellites (Short-Tandem-Repeats, STRs). Another prerequisite is that the species' diversity is exhaustively represented in reference databases, both within and among varieties, so that investigated samples of unknown origin can be identified with statistical confidence. In Cannabis, these two aspects are challenging given the diversity of varieties, their complex breeding histories, as well as the rapid shifts of the drug varieties available on black markets. In addition, hemp and marijuana diverged during the human era and still largely share a common pool of genetic variation [23].
Several microsatellite analyses were previously performed on Cannabis. Some loci became available in the early 2000s [24][25][26] but were seldom tested at the individual or population level. The first STR multiplex kit for forensics was validated years later [27], and subsequently trialed to distinguish fibres from confiscated drug seizures in Australia, with moderate success [28]. Another STR kit was developed by Köhnemann et al. [29], although without reference data. Using transcriptomic sequences (EST), Gao et al. [30] isolated >100 STRs, allowing them to discriminate between Chinese and European hemp samples according to their geographic origin. Other studies genotyped Cannabis, notably from police seizures, using new or published markers [31][32][33][34][35]. However, although these studies are relevant to their region and time, they rely on limited sample sets (i.e. few varieties and few individuals per variety, and/or only representing plants available on a regional black market at the time of confiscations), thus hardly accounting for the different levels of genetic variation of Cannabis stocks. So far, no comprehensive database of Cannabis diversity exists for broad-scale forensic enquiries.
Considering these limitations, we developed a new STR resource for Cannabis forensics. We analyzed intra- and inter-populational variation at 13 published STR markers in >1 300 Cannabis samples from 48 fibre and drug accessions, broadly representative of known hemp and marijuana varieties (S1 Table), and characterized unknown samples of various origins, notably police seizures. We aimed at (i) showing that these loci fully recover the genetic structure between marijuana and hemp; (ii) demonstrating that anonymous samples can be confidently assigned to either plant type; and (iii) documenting the genetic diversity among and within samples and its potential for forensic investigations.

Results and Discussion
The selected STR markers (detailed in S2 Table) consistently recovered the strong structure between fibre and drug Cannabis samples. This is clearly depicted by a Principal Component Analysis (PCA, Fig 1A), genetic distances between accessions (FST, S1 Fig) and genotype clustering by STRUCTURE (Fig 1B), where two groups appear as the best clustering solution (ΔK = 1205.6 for K = 2). As recently evidenced from NGS data [23], this pattern reflects differentiation between hemp and marijuana over the entire genome, not only at genes underlying THC and fibre synthesis. Some drugs and fibres show weak signs of genetic admixture (intermediate PCA scores and STRUCTURE probabilities, Fig 1; lower FST, S1 Fig), which might stem from introgressive crossbreeding, as reported elsewhere [23]. Interestingly, except for RI (indica/ruderalis hybrid), all drug varieties closely related to hemps are of sativa ancestry (HMW, HA, SWA, MS; based on available information from suppliers). This would support the common assumption that hemp varieties selected for fibre and seed production derived from sativa, although this view has been challenged by other studies that found more similarities between hemp and indica [7,23,36]. Alternatively, sativa drugs, which are nowadays distributed in more equatorial regions, may be frequently crossbred with indica and agricultural varieties to facilitate their cultivation in temperate countries. In any case, marijuana genetic diversity seems weakly associated with the documented breeding history: we also performed a PCA solely on drugs, which only marginally clustered according to their main sativa and indica pedigree (S2 Fig). Some cultivars of the same appellation appear genetically distinct (e.g. Alpine Rocket, ARa and ARb, FST = 0.36) whereas others bearing different names are genetically identical (e.g. PM, T44, BS, FST = 0.00; identical clones shared by ARa and B52, S1 Table).
Overall, these observations are in line with the general conclusions of Sawler et al. [23] that the reported identity of drug varieties is often unreliable due to the clandestine nature of Cannabis breeding over the last century, and that names do not necessarily reflect a meaningful genetic identity. In addition, hemp varieties grouped according to reproductive characteristics, as expected (dioecious versus monoecious; S1 Table), as a result of their breeding history (illustrated on the PCA, Fig 1; FST tree, S1 Fig).
Intra-variety diversity was relatively similar among hemps (Fig 2). Allelic richness (AR, the average number of alleles per population, scaled to eight individuals) and observed heterozygosity (HO) averaged 4.0 ± 0.8 and 0.59 ± 0.10, respectively (Fig 2). All varieties had positive inbreeding coefficients (FIS = 0.19 ± 0.05), potentially reflecting bottlenecks linked to current breeding practices. The overall differentiation among hemps was relatively low (FST = 0.15 ± 0.07; S1 We detected identical genotypes (clones) and a strong excess of heterozygosity among several breeds (all of indica or mixed origin, S1 Table), which translates into AR of 2, HO of 0.5 and FIS reaching -1 (Fig 2), resulting from clonal breeding from hybrids of two different parental strains. Interestingly, sativa drugs featured more hemp-like patterns of diversity. Overall, the homogeneous gene pool of hemps suggests more frequent crossbreeding compared to drugs [23], especially those of indica content, and/or that a wider genetic base has been sourced by the hemp industry. Marijuana is often propagated clonally for practical reasons as well as to protect the genetic identity of varieties from contamination by wind-dispersed pollen, thus reducing diversity and triggering strong heterozygosity in F1 cross-breeds. Moreover, all Cannabis drug forms are dioecious, and males, which produce lower amounts of THC than females, are discarded by breeders, which further reduces diversity.
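To make the link between clonal propagation of hybrids and a negative inbreeding coefficient concrete, the following minimal sketch computes observed heterozygosity, expected heterozygosity and FIS = 1 - HO/HE from diploid genotypes. It is only an illustration under simplifying assumptions: the function names and toy genotypes are hypothetical, and the estimators are the textbook ones, without the sample-size corrections or the rarefaction of allelic richness applied by FSTAT.

```python
from collections import Counter

def locus_heterozygosity(genotypes):
    """Return (HO, HE) for one locus; genotypes is a list of (allele, allele) pairs."""
    n = len(genotypes)
    # Observed heterozygosity: share of individuals carrying two different alleles.
    ho = sum(a != b for a, b in genotypes) / n
    # Expected heterozygosity under random mating: 1 - sum of squared allele frequencies.
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    he = 1.0 - sum((c / (2 * n)) ** 2 for c in counts.values())
    return ho, he

def population_summary(loci):
    """Average HO and HE over loci and derive FIS = 1 - HO/HE (uncorrected)."""
    stats = [locus_heterozygosity(genotypes) for genotypes in loci]
    ho = sum(s[0] for s in stats) / len(stats)
    he = sum(s[1] for s in stats) / len(stats)
    fis = 1.0 - ho / he if he > 0 else float("nan")
    return ho, he, fis

# Hypothetical clonal F1 variety: eight copies of one genotype that is
# heterozygous at the first locus and homozygous at the second.
clonal_f1 = [
    [(150, 160)] * 8,
    [(200, 200)] * 8,
]
# Hypothetical outcrossed variety at Hardy-Weinberg proportions for one locus.
outcrossed = [
    [(150, 150), (150, 160), (150, 160), (160, 160)],
]

clonal_ho, clonal_he, clonal_fis = population_summary(clonal_f1)
out_ho, out_he, out_fis = population_summary(outcrossed)
print(clonal_ho, clonal_he, clonal_fis)  # 0.5 0.25 -1.0
print(out_ho, out_he, out_fis)           # 0.5 0.5 0.0
```

In the toy clonal variety every polymorphic locus is heterozygous in all individuals, so observed heterozygosity is twice the expected value and FIS reaches -1, whereas the outcrossed toy population sits at FIS = 0.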
The diversity captured by our STR markers appears well representative of the genomic background of Cannabis: our results are overall very concordant with high-density SNP data [23]. Our STR database thus seems appropriate for broad-scale forensic applications, in particular to discriminate between drug and non-drug samples, one of the main tasks of Cannabis forensics. To demonstrate this ability, we performed genetic assignment tests (direct or resampling-based) on random subsets of drug and fibre samples, using the remainder of the dataset as reference (detailed in the Methods section). The direct test always correctly assigned every sample to its plant type (Table 1). The more conservative resampling approach never misassigned any specimen (Table 1). However, many individuals were not assigned to any group (even the correct one) because this conservative analysis did not consider their genotypes statistically informative enough. We further evaluated the database by genotyping 340 additional Cannabis samples of various origins (bird food, drug and fibre specimens, uncertain industrial cultivars and police seizures). Known specimens (n = 8) were all correctly assigned with high confidence (Table 2). All but one of the industrial cultivars (n = 37) consisted of hemps, with few getting assignment probabilities below 0.95 (Table 2). Confiscated samples (n = 295, from 41 different seizures) could be unambiguously assigned except for three specimens (Table 2).
These results clearly illustrate the relevance of our new database for forensics. Notably, it outperforms the reference published by Howard et al. [28] for Australian seizures, which suffered from substantial mis-assignment risks yet was so far the only available resource properly tested by statistical assignments. Moreover, compared to previous studies, our sampling scheme has the advantage of covering a broad range of Cannabis varieties and accounting for their intra-variety variation. The latter seems important to consider, as some marijuana (sativa) and hemp cultivars share closely related gene pools, sometimes making their discrimination difficult.
In addition, the strong genetic structure between drug cultivars may provide opportunities for police investigations of narcotrafficking. One challenge for law-enforcement agencies is to trace evidence collected at crime scenes in order to connect and convict acting members of crime syndicates. Most marijuana individuals/germlines show unique genetic profiles at our markers (Fig 1A, S2 Fig), so they could be suitable for this task. We screened for identical genotypes among the seized Western-Swiss samples of our test dataset, where the probability of identity PI-sib is 8.9 × 10⁻⁵. We could thus establish five groups of related seizures (some even matched by several germlines) with 99.991% confidence (S3 Table); the remaining 25 seizures were genetically different (S3 Table). Given the high resolution at such a narrow regional scale, this approach could also be applied at national or international levels. The illegal trade of Cannabis is one of the most developed illicit industries in the world (> 7 000 tons seized in 2013, [11]), yearly generating enormous profits used to finance other criminal activities. Exploiting the genetic heterogeneity of marijuana should be the focus of further forensic development to aid the international fight against narcotrafficking.
To date, our STR database is the most powerful resource suitable for routine forensic analyses of Cannabis. Yet, it remains limited in several respects. First, drug vs. non-drug discrimination can be ambiguous for some samples, given the lack of differentiation and/or crossbreeding practices between a few hemp and marijuana varieties. Second, the plant type of our reference samples relies on the information provided by the suppliers, which could be confirmed by chemotyping analyses. Third, more sensitive applications such as tracing drug evidence might require a finer resolution. In both cases, updating the database with additional markers and reference populations, especially new drug varieties, seems a worthy investment. Further development would benefit from international collaborations. An array of genetic studies has been conducted on Cannabis in just a few years by different research teams (see Introduction), each contributing specific sets of samples and markers. Given the tremendous diversity of marijuana and the legal difficulty of accessing samples, joint efforts between Cannabis genetics experts worldwide would

Ethics statement
This study does not involve any endangered or protected species.

Sample collection
We built a collection of 1,324 Cannabis samples from 30 accessions of fibres (n = 972 from 24 different varieties) and 18 accessions of drug (n = 352 from 15 varieties). These accessions broadly cover the legal European hemp varieties (landraces, cultivars selected from landraces and cross-bred cultivars) and marijuana diversity (identified a priori as sativa, indica and hybrids by breeders). In order to also capture intra-variety variation, we included large population samples for each accession (27 samples on average, from 9 to 50). Seeds and leaves were obtained from agronomic companies, germplasm collections, police seizures or commercial stores; seeds were germinated at the University of Lausanne (Switzerland). S1 Table provides sample origin and reported breeding history, given available documentation and information provided by the suppliers.
To evaluate our reference database, we further considered 340 additional test samples of uncertain (police seizures, industrial cultivars) or known type (fibre and drug samples not included in the reference database). Confiscated plants (n = 295) represented 41 police seizures across Western Switzerland from 2005 to 2010. Details are provided in S3 Table.
Table 1. Database auto-evaluation by assignment tests of random subsets of fibre and drug samples. Values indicate the probabilities P of assignment (direct method) and inclusion to either group (resampling method), as well as their standard deviations among replicate subsets (n = 10). Columns: direct method, probability of assignment to the correct group; resampling-based method, probability of inclusion to fibres and to drugs.
Table 2. Assignment trial (direct method) of 340 test samples of known (bird food, known fibres and drugs) and unknown nature (industrial cultivars and police seizures). We considered assignments "safe" where the probability of assignment P was above 0.95.

DNA extraction and microsatellite genotyping
DNA was extracted from approximately 25 mg of dried plant leaves using the FastDNA Kit (Qbiogene, Carlsbad, CA) following the manufacturer's instructions. Thirteen published microsatellite loci were analyzed [24][25], including the ten from Howard et al.'s forensically validated kit [28]. DNA amplifications were performed according to their STR multiplex system (M1 and M4), slightly modified to include ANUCS202 in multiplex M4. In addition, we integrated a new multiplex M5 to amplify loci ANUCS201 and H09-CANN2. Detailed information on markers and multiplexes is available in S2 Table. PCR conditions were as follows: 95˚C for 5 min (initial denaturation); 10 cycles consisting of 30" at 95˚C, 30" at 66˚C down to 54˚C (-3˚C/2 cycles) and 45" at 72˚C (touchdown PCR); 30 regular PCR cycles consisting of 30" at 95˚C, 30" at 50˚C and 45" at 72˚C; 90" at 72˚C (final elongation). Amplicons were run on an ABI PRISM 3130 Genetic Analyzer (Applied Biosystems) and genotypes were scored using GeneMapper v3.2 (ABI).
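As a quick check of the cycling scheme described above, the sketch below expands the decreasing annealing temperatures (66˚C down to 54˚C, -3˚C every 2 cycles) into a per-cycle list. The function name and its parameters are hypothetical and serve only to illustrate the arithmetic; this is not thermocycler software.

```python
def decreasing_annealing_schedule(start=66.0, stop=54.0, step=3.0, cycles_per_step=2):
    """List the annealing temperature (in degrees C) of each cycle of the first PCR phase."""
    temperatures = []
    current = start
    while current >= stop:
        # Each temperature is held for a fixed number of consecutive cycles.
        temperatures.extend([current] * cycles_per_step)
        current -= step
    return temperatures

first_phase = decreasing_annealing_schedule()
regular_phase = [50.0] * 30  # 30 regular cycles annealing at 50 degrees C
total_cycles = len(first_phase) + len(regular_phase)
print(first_phase)   # [66.0, 66.0, 63.0, 63.0, 60.0, 60.0, 57.0, 57.0, 54.0, 54.0]
print(total_cycles)  # 40
```

Five temperature steps of two cycles each give the 10 cycles stated in the protocol, for 40 amplification cycles in total.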

Population genetic analyses
We analyzed the genetic structure and diversity of Cannabis by three different approaches. First, we performed a Principal Component Analysis (PCA) on individual genotypes using the R packages ade4 and adegenet [37]. Second, we conducted Bayesian clustering of genotypes into groups with STRUCTURE [38]. We used the admixture model without prior on sample origin, and tested from 1 to 11 groups (K), with 10 replicates per K. Each run consisted of 100,000 iterations following a burn-in of 10,000. We applied the Evanno method [39] to determine the most likely number of groups summarizing the data, as implemented in STRUCTURE HARVESTER [40]. Replicates were combined using CLUMPP [41] and graphical displays of admixture proportions (barplots) were built with DISTRUCT [42]. Third, we conducted population-based analyses with FSTAT [43], by calculating pairwise genetic distances between accessions (FST) as well as the following diversity indices for each accession: observed heterozygosity (HO), inbreeding coefficient (FIS) and allelic richness (AR, scaled to 8 individuals).
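The Evanno criterion mentioned above can be sketched as follows. This is a minimal illustration, assuming the common formulation in which ΔK is the absolute second-order difference of the mean log-likelihoods across replicate runs, divided by the standard deviation of the log-likelihoods at K. The function name and the toy log-likelihood values are hypothetical and are not taken from the study; the actual analysis relied on STRUCTURE HARVESTER.

```python
from statistics import mean, stdev

def evanno_delta_k(lnp_by_k):
    """Compute Delta K for every K that has both neighbours K-1 and K+1.

    lnp_by_k maps K to the list of ln P(D) values from replicate runs.
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    delta_k = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in means or k + 1 not in means:
            continue  # Delta K is undefined at the smallest and largest K
        second_difference = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        spread = stdev(lnp_by_k[k])
        if spread > 0:
            delta_k[k] = second_difference / spread
    return delta_k

# Hypothetical replicate log-likelihoods (three runs per K) for illustration only.
toy_runs = {
    1: [-5000.0, -5002.0, -4998.0],
    2: [-4200.0, -4201.0, -4199.0],
    3: [-4150.0, -4170.0, -4130.0],
    4: [-4120.0, -4160.0, -4080.0],
}
delta_k = evanno_delta_k(toy_runs)
best_k = max(delta_k, key=delta_k.get)
print(delta_k)  # {2: 750.0, 3: 1.0}
print(best_k)   # 2
```

In the toy data the likelihood rises sharply from K = 1 to K = 2 and then plateaus with noisier replicates, which is the pattern that yields a pronounced ΔK peak at K = 2.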

Genotype specificity and assignment tests
We used GenAlEx 6 [44] to compute, within and among accessions, the number of private alleles (PA) and probabilities of identity (PI), i.e. the probability that two individuals share identical genotypes by chance. For the latter, we considered the conservative estimate PI-sib, which is appropriate when the data potentially include siblings, as is the case for Cannabis samples. We also used GenAlEx to match identical genotypes, notably to identify clones (function "match").
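For readers unfamiliar with these statistics, the sketch below implements the standard single-locus formulas for the probability of identity among unrelated individuals and among siblings, and multiplies them across loci assuming independent loci. The function names and toy allele frequencies are hypothetical, and the code is an illustration rather than a reproduction of GenAlEx. The last lines show how a PI-sib of 8.9 × 10⁻⁵ translates into the 99.991% confidence reported in the Results.

```python
from math import prod

def pi_unrelated(freqs):
    """Single-locus probability of identity for two unrelated individuals."""
    homozygotes = sum(p ** 4 for p in freqs)
    heterozygotes = sum(
        (2 * freqs[i] * freqs[j]) ** 2
        for i in range(len(freqs))
        for j in range(i + 1, len(freqs))
    )
    return homozygotes + heterozygotes

def pi_sibs(freqs):
    """Single-locus probability of identity for two full siblings."""
    sum_p2 = sum(p ** 2 for p in freqs)
    sum_p4 = sum(p ** 4 for p in freqs)
    return 0.25 + 0.5 * sum_p2 + 0.5 * sum_p2 ** 2 - 0.25 * sum_p4

def multilocus(per_locus_values):
    """Combine per-locus probabilities, assuming independent loci."""
    return prod(per_locus_values)

# Hypothetical allele frequencies at three loci (each list sums to 1).
toy_loci = [[0.5, 0.5], [0.4, 0.3, 0.3], [0.25, 0.25, 0.25, 0.25]]
toy_pi = multilocus(pi_unrelated(f) for f in toy_loci)
toy_pi_sib = multilocus(pi_sibs(f) for f in toy_loci)

two_allele_pi = pi_unrelated([0.5, 0.5])   # 0.375
two_allele_pi_sib = pi_sibs([0.5, 0.5])    # 0.59375

# Confidence that two matching profiles are not identical by chance,
# using the PI-sib value reported in the text.
confidence_pct = (1 - 8.9e-5) * 100
print(round(confidence_pct, 3))  # 99.991
```

PI-sib is always at least as large as the unrelated estimate, which is why it is the conservative choice when close relatives may be present in the sample.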
To assess the power of discrimination between hemp and drug types, assignment analyses were performed with GeneClass2 [45]. First, we auto-evaluated our database by assigning ten re-sampled random subsets (representing about 10% of the total dataset, n = 100 for fibres, n = 40 for drugs) using the rest of the data as reference. To this end, two different methods proposed by the software were applied, using Bayesian criteria [46]. The first approach (direct method) estimates the proportion of samples correctly assigned to the most likely population of origin. The second approach (resampling-based method) computes the probability that samples belong to each reference population, and aims at minimizing the risk of mis-assignment, i.e. when individuals feature genotypes that can occur in the "wrong" reference population (type I error). This was achieved by simulating the likelihood distribution of 10,000 independent genotypes for each reference population (with a Monte-Carlo resampling algorithm, [47]), against which the genotypes to assign can then be compared. Rejection or inclusion is then decided upon a threshold (P < 0.01). This approach does not assume that all source populations have been sampled. Second, we assigned our 340 test samples, which consist mostly of unknown varieties, using the direct method.
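The logic of the direct assignment method can be illustrated with a simplified, frequency-based sketch: the likelihood of a multilocus genotype is computed under the allele frequencies of each reference group, and the sample is assigned to the group with the highest relative likelihood. Everything here is an assumption made for illustration: the function names, the toy genotypes and the pseudocount used to avoid zero frequencies are hypothetical, and the code does not reproduce the Bayesian criterion or the resampling procedure implemented in GeneClass2.

```python
from collections import Counter

def allele_frequencies(individuals, alleles_per_locus, pseudocount=1.0):
    """Estimate allele frequencies per locus, adding a pseudocount to every allele."""
    frequencies = []
    for locus, alleles in enumerate(alleles_per_locus):
        counts = Counter()
        for individual in individuals:
            counts.update(individual[locus])
        total = sum(counts.values()) + pseudocount * len(alleles)
        frequencies.append({a: (counts[a] + pseudocount) / total for a in alleles})
    return frequencies

def genotype_likelihood(individual, frequencies):
    """Likelihood of a diploid multilocus genotype under Hardy-Weinberg proportions."""
    likelihood = 1.0
    for (a, b), freq in zip(individual, frequencies):
        likelihood *= freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]
    return likelihood

def assign(individual, reference):
    """Return the most likely reference group and the relative score of each group."""
    alleles_per_locus = []
    for locus in range(len(individual)):
        alleles = set(individual[locus])
        for group in reference.values():
            for ref_individual in group:
                alleles.update(ref_individual[locus])
        alleles_per_locus.append(alleles)
    likelihoods = {
        name: genotype_likelihood(individual, allele_frequencies(group, alleles_per_locus))
        for name, group in reference.items()
    }
    total = sum(likelihoods.values())
    scores = {name: value / total for name, value in likelihoods.items()}
    return max(scores, key=scores.get), scores

# Hypothetical two-locus reference genotypes (allele sizes in base pairs).
toy_reference = {
    "fibre": [[(150, 150), (200, 202)], [(150, 152), (200, 200)], [(150, 150), (200, 200)]],
    "drug": [[(160, 160), (210, 210)], [(160, 162), (210, 210)], [(160, 160), (210, 212)]],
}
unknown_sample = [(160, 160), (210, 212)]
best_group, group_scores = assign(unknown_sample, toy_reference)
print(best_group)                    # drug
print(group_scores["drug"] > 0.95)   # True
```

With the 0.95 threshold used in Table 2, this toy sample would count as a "safe" assignment to the drug group; with 13 loci and large reference samples, as in the study, the likelihoods of the two groups usually differ far more strongly.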