Development of Molecular Markers for Determining Continental Origin of Wood from White Oaks (Quercus L. sect. Quercus)

To detect and avoid illegal logging of valuable tree species, identification methods for the origin of timber are necessary. We used next-generation sequencing to identify chloroplast genome regions that differentiate the origin of white oaks from the three continents; Asia, Europe, and North America. By using the chloroplast genome of Asian Q. mongolica as a reference, we identified 861 variant sites (672 single nucleotide polymorphisms (SNPs); 189 insertion/deletion (indel) polymorphism) from representative species of three continents (Q. mongolica from Asia; Q. petraea and Q. robur from Europe; Q. alba from North America), and we identified additional chloroplast polymorphisms in pools of 20 individuals each from Q. mongolica (789 variant sites) and Q. robur (346 variant sites). Genome sequences were screened for indels to develop markers that identify continental origin of oak species, and that can be easily evaluated using a variety of detection methods. We identified five indels and one SNP that reliably identify continent-of-origin, based on evaluations of up to 1078 individuals representing 13 white oak species and three continents. Due to the size of length polymorphisms revealed, this marker set can be visualized using capillary electrophoresis or high resolution gel (acrylamide or agarose) electrophoresis. With these markers, we provide the wood trading market with an instrument to comply with the U.S. and European laws that require timber companies to avoid the trade of illegally harvested timber.


Introduction
Illegal logging is a serious issue not only for tropical rainforests and tropical trees, but it is also a concern for tree species in temperate latitude forests. White oaks from the genus Quercus sect. Quercus (Fagaceae) provide a relevant example of illegal logging in a temperate zone tree, and they highlight the challenge facing importers and regulatory agencies responsible for validating the taxonomic and geographic sources of timber products. White oaks account for a significant percentage of the hardwood flooring and furniture trade in Europe and the USA, and To develop a panel of polymorphisms for North American white oaks, we sequenced chloroplast genomes from additional specimens representing the following species: Q. alba, Q. bicolor Willd., Q. garryana Douglas ex. Hook., Q. lyrata Walter, Q. macrocarpa Michx., Q. michauxii Nutt., Q. prinoides Willd., and Q. stellata Wangenh.
None of the oak specimens used in this study has been sampled in protected areas that require permission of any authority. Some of the samples were done on private land and all owners of the lands gave permission for sampling. The field studies did not involve endangered or protected species.

Next-generation sequencing analyses
Aliquots of DNA (~0.5-1 μg) were sheared to a median length of~300 bp and converted into sequencing libraries using Illumina TruSeq v.2 kits at the USDA Forest Service (Corvallis, OR; individual samples, each indexed with dual-index adapters) or GATC Biotech AG (Konstanz, Germany; pooled samples). Sequencing was performed using three approaches: (A) using the Illumina MiSeq with 2x150 bp paired-end reads for individual de novo genome reference assemblies; (B) using the Illumina MiSeq with 2x300 bp paired-end reads for pooled samples and reference-guided mapping; and (C) using the Illumina HiSeq with 100 bp single-end reads for individual North American species and reference-guided mapping. All reactions used version 3 sequencing chemistry. Information on raw clusters, sequence yield, and approximate target sequence (chloroplast genome) coverage depth is provided in Table 1.
De novo reference construction. Raw read quality filtering was accomplished using Trimmomatic v0.30 [20], and we removed reads with a mean Phred score less than 33. Reads were digitally normalized to a coverage of 20 using, khmer v0.7.1 [21], and kmer-filtered sequences were assembled using Velvet Optimizer v2.2.5 (https://github.com/Victorian-Bioinformatics-Consortium/VelvetOptimiser.git) and Velvet v1.2.10 [22]. K-mer lengths ranging from 21 to 121 were evaluated, with a final k-mer length of 121 selected for assembly. De novo contigs 100 bp in length were screened for homology to the Quercus rubra chloroplast Table 1. Results of next-generation sequencing for oak reference assembly and polymorphism screening. Indexed individuals of oaks were sequenced using 150 bp paired-end reads and evaluated using de novo assembly (reference assembly). Pooled individuals were sequenced using 300 bp paired-end reads and evaluated using reference-guided assembly (polymorphism screen).

Reference Assembly
Polymorphism Screen genome (NCBI NC_020152) using BLAT [23]. Contigs showing high similarity were retained for reference assembly and ordered against the Q. rubra chloroplast genome. Reference sequences were constructed to include the large single-copy (LSC) region, one of two inverted repeats (IR), and the small single-copy (SSC) region, equivalent to positions 1-135,502 from Q. rubra NC_020152.
Reference-guided identification of SNP and indel variation. Reference guided read mapping and polymorphism detection was performed using CLC Genomics Workbench version 7.5.1 (CLC-bio, a Qiagen company; Aarhus, Denmark). The reference chloroplast sequence of the Q. mongolica individual QUMO5_CH_1 generated by de novo reference construction (see above) was used as reference for read mapping. The trimmed Illumina data of the two pools (Q. mongolica, Q. robur) and the trimmed HiSeq data of representative North American individuals were mapped to the reference scaffold using a length fraction of 0.9 and a similarity fraction of 0.94. Variants detected by CLC Genomics Workbench included SNPs and small indels, and these were exported to tab-delimited files and processed using an inhouse script (Variant Tools, see below) to identify species-specific polymorphism. Because the aim of this study was to develop markers that differentiate between the continents, the data set was reduced to these variants appearing with a frequency between 95 and 100%. For marker development, we focused in indels due to their simplicity of analysis and their easier handling using DNA extracted from timber.
Maximum likelihood (ML) analysis was performed so that variants could be understood within the phylogenetic context of New and Old World oak chloroplast genome evolution. We performed ML analyses using RAxML-HPC2 version 8.2.8 through the Cipres Science Gateway (http://www.phylo.org/) using the GTRGAMMA substitution model for determining the final best ML tree, and the GTRCAT model to conduct 1000 rapid bootstrap replicates [24]. For this analysis, de novo contigs were aligned to the Q. rubra chloroplast genome (NC_020152), Q. rubra was specified as the gougroup, and gaps were treated as missing information.

Post processing of identified SNPs and indels
To merge the SNP and indel tables and find common variants present in two or more individuals/pools, we developed Variant Tools, a command line program implemented in Ruby. This program merges individual sample SNP and indel tables (CSV format) produced by CLC GWB to create a multi-individual SNP and indel matrix. Required input options include the reference sequence (fasta format), an input directory containing the variant CSV tables, and an option specifying the input data type (SNP or indel). The reference fasta can contain one reference sequence or multiple reference contigs. Optionally, coverage tables (produced by CLC from read mappings to a reference) of every individual can be included in the analysis by specifying a directory containing coverage files (CSV format). Furthermore several filtering options are available to reduce the output according to user-provided thresholds. The output from Variant Tools is stored in a CSV file and contains several data columns: the reference sequence name, the reference position, the variant length, the calling type (SNP, MNP, deletion or insertion), the reference base(s), the alternative base(s) for every individual, the coverage for every individual at the reference position, different summary statistics, and sequences flanking the called variant.
The flanking sequences are calculated based on two given distance thresholds. An upper and a lower threshold define minimum distances in base pairs between two called variants on the genomic scale. If a variant occurs within the genomic range of the lower threshold no flanking sequence is created. If a variant resides within the genomic range of the upper threshold the length of the flanking sequence created is defined by the lower threshold. If no variant is found within the range of both thresholds a flanking sequence with the length of the upper threshold is calculated. The default thresholds are 75 bp and 50 bp.
The Variant Tools create different summary statistics while the variant matrix is generated. The number of individual alleles deviating from the reference is a count for all found variants in all individuals at a specific genomic position. The number of alleles matching the reference with minimal coverage is a count for all positions in all individuals where no variant has been called and that are supported by a minimum coverage. The threshold for the minimum coverage is specified by the user. The default threshold is set to a minimum coverage of 8. Critical forward reverse balance is an indicator for systematic sequencing errors and describes how many forward and reverse reads are supporting the called variant. The value is averaged over all individuals showing the variant.
The Variant Tools are open source software under ongoing development. They are available under the terms of the ICS license and can be obtained from https://github.com/ThuenenFG/ varianttools.

DNA extraction, PCR, restriction, and genotyping
Leaves. One cm 2 of a single leaf was ground to powder in liquid nitrogen. Total DNA was extracted, following a modified ATMAB protocol by [25]. PCR reactions for leaf-derived DNA contained~30 ng template DNA, 10x PCR buffer, 1.5 or 1.75 mM MgCl 2 , 200 μM dNTPs, 0.4 unit AmpliTaq Gold DNA polymerase (ThermoFisher Scientific, Darmstadt, Germany), and 0.05 to 0.13 μM of each primer in a total volume of 15 μl. PCR was carried out in a Sensoquest Thermocycler (Göttingen, Germany) with a pre-denaturation step at 94°C for 10 min, followed by 25 to 30 cycles of 94°C for 45 sec (30 sec trnCD), suitable annealing temperature for each primer combination (between 52°C and 57°C) (details of primer conditions are given in Table 2) for 45 sec (30 sec trnCD), 72°C for 45 sec (1 min trnLF) and a final elongation at 72°C for 10 min. PCR amplification products were checked relative to a 100 bp ladder (Life Technologies, Martinsried, Germany) on a 1% agarose gel stained with Roti-Safe GelStain (Carl Roth GmbH & Co. KG, Karlsruhe, Germany); afterwards, PCR products were run on an ABI3730 capillary sequencer. Fragment analysis was performed using GeneMarker™ software v. 2.4.0 (Softgenetics, State College, PA, USA).
Timber. For genotyping analysis of timber-derived DNA, mostly a special DNA extraction protocol has been developed and patented [26] based on the CTAB method. Exceptionally the innuPREP Plant DNA Kit from Analytik Jena (Germany) was used. Due to the small amount Table 2. List of primers for the amplification and resequencing of the newly developed markers. Fluorescent-labeling of the primers is given in column "sequences": FAM = blue, VIC = green, PET = red. In the last column, the accession numbers of the related markers for the three species Q. robur, Q. mongolica and Q. alba are given. "Length" means sequence length. of total DNA extracted from timber, the DNA quantity wasn't measured; rather, a standard dilution of 1:10 (DNA:water) was used for all PCR reactions. PCR conditions were similar to those used for leaf material, but with slight modifications (described in Results). For one marker (trnDT), a restriction enzyme digest was used to reveal a SNP polymorphism in the amplified fragment. The restriction digestion reaction contained 10μl PCR product, 2μl 10x CutSmart 1 buffer, 0.5μl enzyme (FastDigest 1 HinfI, New England Biolabs, Ipswich, MA) in a final volume of 20μl. The reaction lasted 15 min at 37°C followed by an inactivation at 80°C for 20 min. Restriction products were either visualized relative to a 50 bp ladder (Life Technologies, Germany, Martinsried) using an 8% polyacrylamide gel stained with ethidium bromide, or using an ABI3730 capillary sequencer.

Probabilities for fixation of gene markers
The Thünen-Institute of Forest Genetics possesses a large collection of reference samples that contains oak species from Europe (Quercus robur, Q. petraea), North America (Q. alba, Q. macrocarpa, and others) and Asia (Q. mongolica, Q. dentata). Based on the numbers in this collection of white oaks from different continents, we computed the maximal potential frequency of variants that were not observed using 95% confidence intervals [27]. This can be described as a method to determine the risk (potential error rate) that a genetic variant (allele/haplotype) assumed to be exclusive to one continent is found in one or more individuals originating from another continent. The calculations were carried out using the online forma at http:// vassarstats.net/prop1.html based on 962 European, 325 Asian and 61 American white oak individuals for the gene markers psaI-ycf4, psbE-petL, trnLF, and trnCD. For the gene marker trnDT the sample sizes were 115 European, 425 Asian and 19 American white oak individuals.

Results
Next-generation sequencing, reference genome assembly, and identification of cpDNA length variants in white oaks Next-generation sequencing of four indexed oak exemplars (Q. alba, Q. mongolica, Q. petraea, Q. robur) yielded between nearly 1 and 1.75 Gbp per individual (Table 1). De novo contig assembly with Velvet produced between 872 and 7,085 contigs, and the longest contigs from each assembly were from the chloroplast genome. Our best de novo chloroplast assembly derived from Q. mongolica QUMO5_CHI_1, which was represented by a single large contig totalling 135.6 kb, and it spanned the three main chloroplast genome regions (large single copy, inverted repeat, small single copy) in their entirety. Alignment of these contigs against the published Q. rubra chloroplast genome yielded an alignment of 135,603 nucleotides (alignment excludes one inverted repeat).
Maximum likelihood analysis of these selected white oaks yielded a topology similar to the topology previously established with chloroplast DNA restriction site analysis [28] (Fig 1), with the Old World white oaks Q. mongolica, Q. petraea and Q. robur resolving as a sister group to the New World Q. alba (Fig 1). The best maximum likelihood tree resolved Q. mongolica and Q. petraea as sister, but bootstrap support for this resolution was low (40%) and supported by a small number of characters. Across white oak chloroplast genomes, we identified 672 single nucleotide variants and 189 indels, the vast majority of which discriminate white oaks from the outgroup Q. rubra, and New World from Old World oaks (Fig 1).
For production of a best maximum likelihood tree, we used all these variants (Fig 1) but indicated in Fig 1 only the indels and one SNP that were selected for a marker set due to our earlier described aim of this study. Nevertheless, the whole dataset is publically available via NCBI (Data Accessibility).
Next-generation sequencing of the Q. mongolica and Q. robur DNA pool produced 17.4 and 28.4 million paired-end 300 bp sequences (10.4 and 17.1 Gbp of sequence data respectively; Table 1). Mapping of paired-end reads from the Q. mongolica and Q. robur pools against the Q. mongolica QUMO5_CHI_1 chloroplast genome reference revealed 346 variant positions for the Q. robur pool, and 789 variant positions for the Q. mongolica pool (Table 1). After read mapping, the two variant tables were compared to filter those variants that showed fixed differences between the continents. The next step was to reduce the dataset to these variants appearing with a frequency between 95 and 100%. This analysis left five indels and 15 SNPs. For marker development, we focused in indels. Checking of these indels within the mapping revealed a (T) N microsatellite, two indels with a difference in the fragment length of two bp, and one indel with one bp difference. These were removed from the further analyses and only the two longest indels in two spacer regions (psaI-ycf4, psbE-petL), one with four and one with six bp difference, remained.
Short read sequences from eight North American species were also mapped to the QUMO5 reference, and we specifically searched for indels differentiating North American species from the reference QUMO5 with a frequency of 95 to 100%. We identified three indels that consistently differentiated North America from Asia. These length polymorphisms were found in three spacer regions (trnLF, trnCD, and trnDT) and they ranged in length from 2 bp to 8 bp.

Primer design, marker validation and resequencing
For the five indel-including cpDNA regions, primers were designed using the reference QUMO5 to amplify fragments ranging from 110 bp to 190 bp. A preliminary validation performed with three individuals each of Q. robur, Q. petraea, Q. mongolica, Q. alba, and Q. macrocarpa revealed that all five cpDNA regions could successfully be amplified by PCR. Subsequent Sanger sequencing validated the sequence of the intervening region, the repeat type, and BLASTN analysis confirmed annotations (Table 2).
Two indels differentiate European (Q. petraea, Q. robur) and Asian (Q. mongolica) white oaks, and these are located in the psaI-ycf4 linker (4 bp difference) and the psbE-petL linker (6 bp difference; Table 2), and these are inferred as mutations (deletions, specifically) are restricted to Asian white oaks. The further validation of these indels was conducted by screening the amplification products of 10 additional individuals of North American species, 50 individuals of two European species (Q. robur, Q. petraea), and two Asian species (Q. mongolica, Q. dentata). This validation revealed that white oaks from North America and Europe showed the same fragment length (Table 3, S1 Fig), and confirmed that the Asian species shared deletions that resulted in shorter, diagnostic fragment lengths.
Three indels differentiate Old World (Q. petraea, Q. robur, Q. mongolica) and North American white oaks, and these are located in the trnL-trnF (trnLF) linker (5 bp difference), the trnC-petN linker of the broader trnC-trnD (trnCD) linker region (8 bp) and in the trnE-trnT linker of the broader trnD-trnT (trnDT) linker region (2 bp; Table 2). These indels were validated with the same individuals of all above mentioned species. The validation revealed that white oaks from Asia and Europe showed identical fragment lengths ( Table 3), and that the North American species shared mutations that resulted in diagnostic fragment lengths.
Resequencing (Sanger) of the trnDT region identified a single nucleotide polymorphism (SNP) differentiating Asia from Europe and North America. The SNP lies within a HinfI restriction site, and restriction digestion of this region was predicted to yield three fragments in Asian white oaks and two fragments in European and North American species (all oaks share Table 3. Details for used species, individuals and markers. Given are number of individuals per species and continent tested with the five markers, and fragment length based on sequencing for each marker and species. Consensus sequences of the five markers are given in S1 Fig 1 The number of individuals tested includes numbers for all loci except trnDT, which is given in parentheses.
2 Fragment lengths for trnDT show the lengths after restriction digestion with HinfI. Asian species contain an additional internal restriction site that yields one additional unlabeled fragment after HinfI digestion; these are shown in brackets. doi:10.1371/journal.pone.0158221.t003 one HinfI site). By labeling the forward and reverse amplification primer, restriction digestion of the PCR fragment with HinfI allows the visualization of two of the three fragments, one of which is diagnostic for Asian white oaks due to its truncated length. Since this single region offers the possibility to differentiate all three continents with one assay, we decided to include this SNP into the marker set. In total, the five markers have been evaluated with sample sizes ranging from 559 to 1241 (in detail: trnCD 1241, trnLF 1078, psaI_ycf4 1230, psbE_petL 1183, trnDT 559), with samples representing 13 oak species from the three continents (Table 3, genotype information per specimen is given in S1 File). The nucleotide variations shown to be characteristic for Asian white oaks (Q. mongolica and Q. dentata) are also present in the complete chloroplast genome of one individual of Q. aliena, another Asian white oak species (GenBank accession KP301144.1) [29].
We computed the probabilities of fixation for the gene markers. Since we require a minimum of two independent markers for continent assignment, we performed the calculations for the risk of not identifying a rare variant in the reference samples based on a combination of two markers. By this means the risk of not identifying a rare variant in the reference samples was calculated to be less than 0.022% for Europe, 0.0051% for Asia and less than 0.098% for America. Thus, it is extremely unlikely that studied gene markers are not fixed to just one variant in the different groups.

Marker set design and optimization for timber
All above described analyses were performed using single PCR reactions and fragment analyses to optimize each marker separately. Subsequently, the markers were successfully multiplexed for fragment analysis using the fluorescence labeling as given in Table 2 (Fig 2).
For the development of the markers and multiplexing, DNA from fresh leaves was used. The protocol was later optimized for DNA from timber. From our experience, DNA from and Europe (bottom). The sequence sizes for each peak as given in Table 3 are shown beneath the peaks. The first blue peaks appear smaller (112, 120) than the sequenced length (115, 123) given in Table 3. The color code of the peaks is as described in Table 2. timber is more sensitive to all PCR parameters, thus, all markers were singly tested with DNA from timber and the PCR was optimized for the DNA from timber. Differences in the PCR conditions used for the two different materials are given in Table 4. Due to the sensitivity of timber DNA in PCR, multiplexing of the PCRs of the five markers is not advisable, but multiplexing of the markers on the sequencer worked as well with timber DNA as with DNA from leaves. The only difference is that the PCR product from leaf DNA is diluted 1:50 and the PCR product from timber DNA 1:10 for use on the sequencer.
Our analyses used a capillary sequencer to visualize length polymorphisms of these fragments. However, due to the large size differences of these indels, all markers can be distinguished on a polyacrylamide gel, even for differences as small as two base pairs, as shown for the fragment trnDT (Fig 3). In this way, polymorphisms can be screened in laboratories where no sequencer is available.
The functionality of the optimized PCR protocols and the multiplexed sequencer runs has been tested by means of orders worked on in the "Thünen Centre of Competence on the Origin of Timber". Timber from these orders included highly processed wood as flooring or parquet, as well as treated solid wood samples as different parts of furniture, barrels or boards, and unused solid wood as firewood. In total, over 80 processed timber samples and 130 treated solid wood samples have been evaluated (data not shown). Based on our experiences so far we have a success rate of sufficient DNA amplification for our gene markers for 58% for solid wood samples.

Discussion
A set of five chloroplast markers have been developed and optimized to analyze DNA from timber to identify the continental origin of white oak wood. Small fragment sizes (< 200 bp) were chosen because genotyping success with DNA from timber is highest when fragments under 200 bp are targeted. This has recently also been shown for DNA from old and dried insect specimens of museum collections when using mitochondrial barcoding regions [30]. For the identification of haplotypes within oak species from wood samples, Deguilloux et al. [31] similarly developed chloroplast microsatellite and SNP markers that targeted small DNA fragments.
The sequencing data revealed no specific indels to differentiate oak species within the classical barcoding regions matK, rbcL or the linker trnH-psbA [6,11,[32][33][34]. Recently, the barcode regions matK and trnH-psbA were evaluated for their power to discriminate select species from oak sections Cerris, Heterobalanus (= "Group Ilex"), Lobatae, and Quercus [34]. In this study, the matK region proved to have too low resolution for the differentiation within the genus; interestingly, for trnH-psbA, the variability was too high to identify fixed interspecific differences [35]. The intergenic linkers trnLF, trnCD and trnDT we found valuable in this work have been widely tested in population and evolutionary genetic studies of plants, and they show wide variation in their ability to discriminate species and lineages [36]. For example, the intergenic linker trnLF proved to be not variable enough for overall barcoding approaches [6]. For the trial of phylogenetic reconstructions this trnLF linker lacked variation in closely related species [37]. Similarly, differentiation within the genus Populus failed using trnLF [38]. Nevertheless, there are other examples for successful use of this marker in molecular systematics [39,40] and citations therein) and for unraveling of the phylogeny of different plant species [41]. Similarly, trnCD and trnDT have been used in comprehensive studies of chloroplast DNA diversity in European white oaks [42][43][44][45][46][47][48][49][50][51]. In Japan, four oak species have been differentiated using trnDT among other chloroplast markers [52]. Hence, as for many regions within the chloroplast, the applicability of these spacers to questions of species identity depends on the specific phylogenetic and geographic context where they are used.
Forensic applications of molecular markers are already established with regard to illegal wildlife trade (parrots: [17]; sea turtles: [53]) or for identification of products made of endangered animal species ('whale meat': [54]; horn: [55]). The barcode of wildlife project (http:// www.barcodeofwildlife.org/) has been originated especially for this purpose. Further on control of the seafood market in different countries is well supported by usage of barcoding markers  [18,56]. For identification of illegal logging of tropic tree species the use of molecular markers is already widespread [57][58][59], but the methods are less established for tree species from temperate zones. Thus, the presented markers should be applied to give commercial vendors of white oak wood the possibility to exercise 'due diligence' when placing timber on the European market and the public authorities to control timber imports should questions emerge on the correct declaration of wood.
Supporting Information S1 Fig. Alignment for five cp regions with marked indels and SNPs used in the marker set. For each of the described markers an alignment of a different number of individuals (between three and 50) of three species from the three continents has been made. Thus, the sequence for each species given in the figure is a consensus sequence. Further details of the markers and the related accession numbers of single individuals are given in Table 2. QUALB: Quercus alba (USA), QUMON: Q. mongolica (Asia), QUROB: Q. robur (Europe). (TIF) S1 File. Information details including genotypes of used Quercus individuals as references. (XLSX)