DNA barcoding evaluation and implications for phylogenetic relationships in Lauraceae from China

Lauraceae are an important component of tropical and subtropical forests and have major ecological and economic significance. Owing to the lack of clear-cut morphological differences between genera and species, this family is an ideal case for testing the efficacy of DNA barcoding in the identification and discrimination of species and genera. In this study, we evaluated five widely recommended plant DNA barcode loci (matK, rbcL, trnH–psbA, ITS2 and the entire ITS region) for 409 individuals representing 133 species in 12 genera from China. We tested the ability of DNA barcoding to distinguish species and to serve as an alternative tool for correcting species misidentification. We also used the rbcL+matK+trnH–psbA+ITS loci to investigate the phylogenetic relationships of the species examined. Among the gene regions and their combinations, ITS was the most efficient for identifying species (57.5%) and genera (70%). DNA barcoding also had a positive role in correcting species misidentification (10.8%). Furthermore, based on the results of the phylogenetic analyses, Chinese Lauraceae species formed three supported monophyletic clades, with the Cryptocarya group strongly supported (PP = 1.00, BS = 100%) and the clade including the Persea group, Laureae and Cinnamomum also receiving strong support (PP = 1.00, BS = 98%), whereas the Caryodaphnopsis–Neocinnamomum clade received only moderate support (PP = 1.00 and BS = 85%). This study indicates that molecular barcoding can assist in screening difficult-to-identify families like Lauraceae, in detecting errors of species identification, and in helping to reconstruct phylogenetic relationships. DNA barcoding can thus help with large-scale biodiversity inventories and rare species conservation by improving accuracy and by reducing the time and costs associated with species identification.


Introduction
The aims of this study were to: (1) evaluate barcode universality in Chinese Lauraceae species; (2) assess DNA barcoding performance relative to species identification; and (3) determine whether these barcodes can also allow for the reconstruction of phylogenetic relationships within the Lauraceae, relative to previously recognized subdivisions and affinities.

Ethics statement
Collection of these species was conducted in compliance with existing regulations for plants defined as non-commercial, as determined by local government offices. In addition, these sample collections were performed in China with the written approval from the National Forest Bureau and relevant local governments, complying with Chinese and international regulations for the collection of native plant samples.

Sampling
A total of 409 individuals of 133 species from 12 genera of Lauraceae were included in this study (S1 and S2 Tables, Supporting Information), distributed across eight provinces: Chongqing, Guangdong, Guangxi, Hainan, Hunan, Sichuan, Yunnan and Zhejiang, representing much of the diversity of this family in China. Materials for this study were collected in the field from 2002 to 2012, with 22 species represented by a single individual and 111 species represented by two to nine individuals (an average of three samples per species). The Lauraceae expert at KUN, Hsi-Wen Li, who is one of the co-authors, identified the vouchers (S1 Table) based on the reproductive or vegetative characters available. All vouchers were stored at the Herbarium of Xishuangbanna Tropical Botanical Garden (HITBC).

DNA isolation, amplification and sequencing
Total genomic DNA was extracted from silica gel-dried leaf tissue or herbarium specimens using a modified CTAB method [49]. The plastid markers rbcL, matK, trnH-psbA and nuclear markers ITS and ITS2 were amplified using multiple primers, following the suggestions of Dunning and Savolainen [50] and Yu et al. [51]. For example, four primer sets were tested for matK because of its generally poor amplification and sequencing performance [52]. DMSO and BSA were also added to enhance the PCR performance for matK and ITS [53,54]. PCR was performed in 20 μL reaction mixtures containing 0.2 μL of Taq polymerase (5 U), 2.0 μL of 10× PCR buffer, 2.0 μL of 25 mM MgCl2, 2 μL of 2.5 mM dNTPs, 1 μL of each primer (10 μM), 1 μL of DMSO, 2 μL of 1 mg/mL BSA and 2 μL of template DNA. For primer combinations, PCR thermal conditions and references, see Supporting Information (S3 Table). All PCR products were sequenced at the Beijing Genomics Institute (BGI).
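The stock concentrations and volumes listed above imply the final concentration of each reagent in the 20 μL reaction through the dilution relation C1 × V1 = C2 × V2. The short Python sketch below works through that arithmetic. The make-up volume of water is an assumption, since the text does not state how the reaction was brought to 20 μL, and the variable names are illustrative only.

```python
# Final reagent concentrations in the 20 uL PCR mix, from C1 * V1 = C2 * V2.
TOTAL_UL = 20.0

# name: (stock concentration, unit, volume added in uL), as listed in the text
reagents = {
    "MgCl2": (25.0, "mM", 2.0),
    "dNTPs": (2.5, "mM", 2.0),
    "each primer": (10.0, "uM", 1.0),
    "BSA": (1.0, "mg/mL", 2.0),
}


def final_concentration(stock, volume_ul, total_ul=TOTAL_UL):
    """Concentration after diluting volume_ul of stock into total_ul."""
    return stock * volume_ul / total_ul


final = {
    name: (final_concentration(stock, vol), unit)
    for name, (stock, unit, vol) in reagents.items()
}

# DMSO is added neat, so it is expressed as percent by volume.
dmso_percent = 100.0 * 1.0 / TOTAL_UL

# Sum of all listed volumes: Taq, buffer, MgCl2, dNTPs, two primers,
# DMSO, BSA and template DNA.
listed_ul = 0.2 + 2.0 + 2.0 + 2.0 + 1.0 + 1.0 + 1.0 + 2.0 + 2.0

# ASSUMPTION: the remainder is made up with water (not stated in the text).
water_ul = round(TOTAL_UL - listed_ul, 1)

for name, (conc, unit) in final.items():
    print(f"{name}: {conc:g} {unit}")
print(f"DMSO: {dmso_percent:g}% v/v; water to volume: {water_ul} uL")
```

Under these assumptions the mix contains 2.5 mM MgCl2, 0.25 mM dNTPs, 0.5 μM of each primer, 0.1 mg/mL BSA and 5% DMSO, and the 10× buffer is diluted to 1×.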

Sequence editing and alignment
Raw sequences were assembled and edited using Sequencher 4.14 (GeneCodes Corp., Ann Arbor, Michigan, USA) and deposited in GenBank (see S2 Table for GenBank accession numbers). Edited sequences were then aligned using Geneious 6.1.2 (Biomatters Ltd.), Clustal W [55] and MUSCLE [56], with final manual adjustment undertaken with Geneious 6.1.2 and BioEdit 7.0.9.0 [57]. All variable sites were rechecked on the original trace files for final confirmation. For the rbcL and matK markers, a global multiple sequence alignment was used. The rbcL sequences were unambiguous, due to the absence of insertions or deletions, but alignment of matK was more difficult due to the insertion of triplet codons, so the alignment results were checked visually. The trnH-psbA and ITS sequences were highly variable and very difficult to align with Geneious, so these markers were aligned several times by Clustal W and MUSCLE and then a supermatrix was created by concatenating them with the aligned sequences of the remaining markers.
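The concatenation step can be illustrated with a minimal Python sketch. It is not the authors' procedure, which used Geneious and related tools; the function name, the toy sequences and the choice of "?" to pad samples that lack a marker are all assumptions made for illustration.

```python
def concatenate_alignments(alignments, marker_order):
    """Concatenate per-marker alignments into a supermatrix.

    alignments: {marker: {sample_id: aligned_sequence}}
    marker_order: markers in the order they should be joined
    Returns ({sample_id: concatenated_sequence}, {marker: (start, end)}),
    with 1-based, inclusive partition coordinates.
    """
    samples = sorted({s for m in marker_order for s in alignments[m]})
    pieces = {s: [] for s in samples}
    partitions = {}
    start = 1
    for marker in marker_order:
        seqs = alignments[marker]
        lengths = {len(seq) for seq in seqs.values()}
        if len(lengths) != 1:
            raise ValueError(f"{marker}: sequences are not aligned to one length")
        length = lengths.pop()
        for s in samples:
            # ASSUMPTION: samples missing a marker are padded with '?'.
            pieces[s].append(seqs.get(s, "?" * length))
        partitions[marker] = (start, start + length - 1)
        start += length
    return {s: "".join(p) for s, p in pieces.items()}, partitions


# Toy data with hypothetical sample names; real alignments are far longer.
toy = {
    "rbcL": {"A": "ATGC", "B": "ATGA"},
    "ITS": {"A": "GG-T", "C": "GGAT"},
}
matrix, parts = concatenate_alignments(toy, ["rbcL", "ITS"])
for sample, seq in matrix.items():
    print(sample, seq)
print(parts)
```

Recording the partition boundaries alongside the matrix is useful because later analyses that treat each marker separately need to know where one marker ends and the next begins.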

Data analysis
Two widely applied methods (tree-based and similarity-based) were used to evaluate species discrimination success, following Huang et al. [22]. Five single markers and all possible combinations were applied. For the tree-based method, we used Geneious 6.1.2 to construct Neighbour-Joining (NJ) trees. For the similarity-based method, we used BLAST [58] for building local reference databases against which all sequences were then queried using the blastn program. The 22 species with only a single individual were excluded in NJ trees and BLAST (n ≥ 2) analyses. Species discrimination was considered successful only when all conspecific individuals formed a single clade supported by bootstrap values greater than 50% in the NJ tree [59], and when all individuals of the species or genus only had a top matching hit with a conspecific/congeneric individual in BLAST (the query sequence itself was excluded from the list of top hits when there were multiple individuals).
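The similarity-based success criterion can be expressed compactly in code. The following Python sketch assumes the BLAST results have already been reduced to a list of top-scoring hits per query with the query itself removed; the function name, data layout and sample identifiers are hypothetical and do not reproduce the authors' scripts.

```python
from collections import defaultdict


def blast_discrimination(sample_species, top_hits):
    """Apply the 'all top hits are conspecific' rule.

    sample_species: {sample_id: species_name}
    top_hits: {sample_id: [sample_ids tied for best hit, self excluded]}
    Returns (set of successfully discriminated species, success rate in %).
    Species with fewer than two individuals are not evaluated.
    """
    by_species = defaultdict(list)
    for sample, species in sample_species.items():
        by_species[species].append(sample)

    eligible = {sp for sp, members in by_species.items() if len(members) >= 2}
    resolved = set()
    for sp in eligible:
        ok = True
        for member in by_species[sp]:
            hits = top_hits.get(member, [])
            # Fail if there is no hit or any tied top hit is heterospecific.
            if not hits or any(sample_species[h] != sp for h in hits):
                ok = False
                break
        if ok:
            resolved.add(sp)

    rate = 100.0 * len(resolved) / len(eligible) if eligible else 0.0
    return resolved, rate


# Hypothetical example: species A resolves, species B does not because
# sample b1 ties between a conspecific and a heterospecific hit.
species_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
hits = {"a1": ["a2"], "a2": ["a1"], "b1": ["b2", "a1"], "b2": ["b1"], "c1": ["a1"]}
resolved, rate = blast_discrimination(species_of, hits)
print(resolved, rate)
```

The same function can be reused at the genus level by passing genus names instead of species names as the labels.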
In detecting identification errors, a two-step procedure of reciprocal illumination was used. We evaluated errors in the initial morphology-based identifications combining morphology and DNA sequence data to uncover and correct mistakes in Lauraceae identification. A schematic illustration is used to show the identification process in the present study (Fig 1). Firstly, our initial morphological delimitations were identified by the Lauraceae expert and defined as morphospecies. Then we compared the specimens with herbarium specimens from HITBC, KUN and PYU. Finally, we combined DNA sequences with existing morphological characters. Potential errors were identified through examination of the NJ trees (using rbcL, matK and the combination of rbcL+matK+trnH-psbA+ITS) and BLAST. If the result indicated that the sample did not belong to an a priori assigned taxon, it was flagged as a possible error and the sample was then compared with descriptions and herbarium specimens of the species involved, using morphological characteristics in order to confirm whether an error had been made.
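The flagging step of this procedure amounts to comparing each sample's a priori name with the taxon suggested by the sequence data. The Python sketch below is illustrative only: the function name and data layout are assumptions, and a flag marks a candidate for morphological re-examination rather than a confirmed error. The first sample uses the L061 case reported in this study; the second is hypothetical.

```python
def flag_possible_errors(a_priori, sequence_match):
    """Flag samples whose sequence-based match differs from the a priori name.

    a_priori: {sample_id: 'Genus species'} from the morphological identification
    sequence_match: {sample_id: 'Genus species'} suggested by NJ tree or BLAST
    Returns {sample_id: (assigned, matched, level)}, where level is 'genus'
    when the genus differs and 'species' when only the epithet differs.
    """
    flagged = {}
    for sample, assigned in a_priori.items():
        matched = sequence_match.get(sample)
        if matched is None or matched == assigned:
            continue  # no sequence evidence, or no conflict
        same_genus = matched.split()[0] == assigned.split()[0]
        flagged[sample] = (assigned, matched, "species" if same_genus else "genus")
    return flagged


a_priori = {
    "L061": "Cryptocarya calcicola",
    "X001": "Litsea cubeba",  # hypothetical sample with no conflict
}
sequence_match = {
    "L061": "Beilschmiedia purpurascens",
    "X001": "Litsea cubeba",
}
flags = flag_possible_errors(a_priori, sequence_match)
print(flags)
```

Each flagged sample would then be checked against descriptions and herbarium specimens, as described above, before its identification is revised.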
In phylogenetic analyses, combined data sets are often able to generate more resolved and better-supported phylogenies [41,60], so this approach was also used for Lauraceae. In this study, phylogenetic relationships were inferred from sequence variation in the four-locus combination of rbcL+matK+trnH-psbA+ITS. Bayesian Inference (BI) and Maximum Parsimony (MP) phylogenetic analyses were conducted to reconstruct phylogenetic relationships using PAUP* 4.0b10 [61] and MrBayes 3.1.2 [62], with gaps coded as simple indels using the program Gapcoder [63]. For the Bayesian analysis, the dataset was partitioned by markers. Modeltest 3.7 [64,65] was used to select the best-fit evolutionary model for each partition according to the Akaike Information Criterion (AIC) [66]. The Markov chain Monte Carlo (MCMC) algorithm was run with one cold and three heated chains for 5,000,000 generations, starting from random trees and sampling one tree every 500 generations. Inspection of the log-likelihood values suggested that stationarity was reached well within the first 25% of samples, which were discarded as burn-in (the default value); the remaining 75% were used to construct the consensus tree, with the proportion of bifurcations found in this consensus tree given as posterior probabilities (PP). MP analysis was conducted using the following heuristic search options: tree bisection-reconnection (TBR) branch swapping, collapse of zero-length branches and MulTrees on, with 1000 random taxon additions, saving 100 trees from each random sequence addition [66]. All character states were regarded as unordered and equally weighted. Bootstrap support values (BS) for internal nodes were estimated with 100 heuristic bootstrap replicates. The reliability of clades as judged by the posterior probability in Bayesian analysis was generally higher than that as judged by the bootstrap probability in MP analysis [67].
Based on known phylogenies and simulations, bootstrap values of 50%, corresponding to posterior probabilities of 90%, are generally considered moderate support for true clade probabilities, and bootstrap values of 70%, corresponding to posterior probabilities of 95%, are generally considered strong support [68,69]. Three species of Monimiaceae, plus Gomortega nitida Ruiz & Pav. (Gomortegaceae), were selected as outgroups, based on their sister relationship to Lauraceae in a previous study [7]. A sample of the monotypic African genus Hypodaphnis was also included, as the genus is considered to be sister to the remainder of Lauraceae [7], with ITS sequences for these five species downloaded from GenBank.
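The MCMC settings given above determine how many trees contribute to the Bayesian consensus. The Python sketch below works through that arithmetic for a single run; it ignores the starting tree at generation 0 and makes no assumption about the number of independent runs, which is not stated here. The function and constant names are illustrative.

```python
GENERATIONS = 5_000_000   # chain length stated in the text
SAMPLE_EVERY = 500        # one tree sampled every 500 generations
BURNIN_FRACTION = 0.25    # first 25% of samples discarded


def mcmc_sample_counts(generations, sample_every, burnin_fraction):
    """Return (trees sampled, trees discarded as burn-in, trees retained)."""
    sampled = generations // sample_every
    burnin = int(sampled * burnin_fraction)
    return sampled, burnin, sampled - burnin


sampled, burnin, retained = mcmc_sample_counts(
    GENERATIONS, SAMPLE_EVERY, BURNIN_FRACTION
)
print(sampled, burnin, retained)  # 10000 2500 7500
```

Under these settings, each posterior probability on the consensus tree is a proportion estimated from 7,500 retained trees per run.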

Mistakes in taxonomic identification
After combining DNA sequences with existing morphological characters, various putative species were found to comprise 1-4 individuals that were divergent from the majority of individuals sequenced for their species and that were nested within other species. In these cases, a detailed reanalysis of voucher specimens combined with NJ Tree analyses and BLAST examinations was needed. The results showed that the divergent individuals had been identified incorrectly. In total, 44 individuals (10.8%) had been misidentified by the expert (Table 2, Fig 2; S1 and S2 Figs), 34 at the generic level and 10 at the species level. Following these corrections, we recognised 133 OTUs for the study. The misidentified samples and their identification after revision are listed in Table 2.

Discrimination efficiency in Lauraceae
After morphological error correction, the resolution rates of species (8.2-57.5%) and genera (25-70%) were calculated, both for individual barcode sequences, as well as for various combinations (Table 1 and Fig 3). For single barcodes, ITS showed the highest discriminatory power of the five markers (Figs 3 and 4), but the discrimination rate was only 57.5% at the species level in BLAST (n ≥ 1) (see Fig 3A). At the genus level, ITS was again the most accurate (70%) in BLAST (n ≥ 2) (see Fig 3B). ITS2 showed lower sequence variation and species discrimination than ITS (see Fig 3A, 44.7% at species level; Fig 3B, 63.6% at genus level), despite its sequence recovery being more or less double that of ITS (Table 1). The discrimination rates of rbcL were the lowest (see Fig 3B, 8.2% at species level; 25% at genus level).
Overall, the tree-based method (NJ Tree) and the similarity-based method (BLAST) provided unsatisfactory discrimination rates.

Universality of DNA barcodes
Primer universality is an important criterion for a useful DNA barcode [27]. In this regard, the core barcodes (rbcL and matK) had the best performance in PCR amplification and sequencing among the five regions in Lauraceae (successfully amplifying and sequencing 92.5% of individuals), consistent with a previous study [70]. Compared to these core barcodes, ITS had a relatively low sequencing success rate of 39.1%, because of the lack of universal primers (either published or potentially developable from current information) and the poor success of existing primers [25]. This poor success is probably due largely to secondary structure formation, which results in poor-quality sequence data, and to multiple copy numbers, among other problems [29,32,33,71,72]. Thus, this region is probably unsuitable as a universal barcode, although it may be useful in particular cases.

Detecting identification mistakes
Characters such as phyllotaxis, perianth, inflorescence type, size of tepals, or fate of tepals in fruit have been used to delimit the species of Lauraceae [1,3,5,41]. Some of these characters are polymorphic and are considered useful for distinguishing genera, but they are rarely all present on a single specimen at the time of sampling. In Cryptocarya, the fruit completely enclosed in the accrescent receptacular is a remarkable character distinguishing it from other genera; however, some species were only in flower when sampled. Hence, Beilschmiedia purpurascens L061 was wrongly recognized as Cryptocarya calcicola (Table 2). Likewise, the persistent and spreading to reflexed tepals in the fruit of Machilus are important morphological characters for generic delimitation from the closely related genus Phoebe, in which tepals are leathery to woody, conspicuously thickened and clasping the base of the fruit [41]. These characters are also obviously different, but some of these species were also only flowering when sampled, resulting in identification errors, such as Phoebe tavoyana CXQ0426 (Table 2). There are also some morphological identification errors due to scant information about the species. For example, Cinnamomum chago B.S. Sun et H.L. Zhao [73] is not included in Flora of China; had the expert seen the topotype (which has an axillary panicle and short perianth tube) prior to this study, the identification errors might not have occurred. Furthermore, some genera, such as Lindera, Litsea, Neolitsea and Actinodaphne, which form the Laureae, are not well defined. All the above factors hampered the accurate identification of Lauraceae. Although each sample in the current study is represented by a voucher that was compared to a reference collection, some species often cannot be distinguished in the absence of complete flowering and fruiting material. DNA barcoding can act as a tool for detecting errors in species identifications [23].
The tree-based and similarity-based approaches using DNA barcoding in combination with morphology are thus very useful to address identification mistakes based only on morphology [22,74-77]. Examination of the initially misidentified samples showed that misidentifications were most likely to occur when the samples were only flowering or fruiting and their morphological characters and geographical distributions were similar. Once morphology-based errors listed above were taken into account, mistakes in individual identifications were then only detectable through DNA sequencing.
Revision of morphological identifications based mainly on the core barcodes, or the combination of rbcL+matK+trnH-psbA+ITS, supplemented by BLAST analyses, determined that 10.8% of individuals had been misidentified a priori based on morphology (Table 2). This error rate is higher than those reported in some other studies (5.6-10.5%; Archaux et al. [78], Huang et al. [22]), suggesting that the Lauraceae require careful interpretation of the characters used for specific and generic definition. In particular, accurate recognition of Lauraceae would be very useful because it is the most diverse family in China and is known to be taxonomically problematic.

Evaluation of DNA barcodes for Lauraceae
Our study gives a reliable assessment of barcoding efficacy in the family Lauraceae based on a large sample size, comparable to the results of studies for other diverse angiosperm groups (e.g. [81]). An ideal DNA barcode must combine conserved regions for universal primer design, which show high rates of PCR amplification and sequencing [28] and should also provide a high rate of success for species discrimination and identification [25,30,82].
In the present study, the five barcodes performed differently across samples (Table 1 and Fig 3); of all regions tested, ITS performed best, showing the greatest level of species discrimination. However, other studies have described inherent difficulties with this marker [29,32,33,71,72], and some researchers have advocated using ITS2 alone as a replacement for ITS because this subset of the marker is easier to amplify and sequence [32,33]. In our study, by contrast, ITS2 showed lower sequence variation and species identification ability than ITS, even though its sequence recovery rate was about twice that of ITS. We did not observe the other difficulties usually associated with ITS as a barcode marker, so the marker appears to have potential for Lauraceae as long as the low sequencing success rate can be addressed.
ITS was proposed as a DNA barcode for seed plants because of its high species identification ability [25,33] and in this study ITS provided the highest species resolution, agreeing with the results of recent studies in other plant groups (e.g., Poaceae: Cai et al. [83]; Schisandraceae: Zhang et al. [84]; Orchidaceae: Li et al. [85]). The other four barcoding regions investigated here (rbcL, matK, trnH-psbA and ITS2 alone) have all been proposed as core or supplementary regions for plant barcoding [25,28,29,32,82,86], but in our study they exhibited low species-level resolution and only Cryptocarya and Beilschmiedia were distinguished clearly from the other genera. This suggests that ITS is the best candidate for Lauraceae when using a single barcode.
Combining DNA barcodes is generally considered to improve species identification [28,33,87,88] and in this study, the discrimination rates of the combinations varied from 10.6% to 32.6% with rbcL+matK < rbcL+matK+trnH-psbA < rbcL+matK+trnH-psbA+ITS2 < rbcL+matK+trnH-psbA+ITS at the species level (Fig 3). However, the discrimination rates of rbcL+matK are higher than those of rbcL+matK+trnH-psbA and rbcL+matK+trnH-psbA+ITS2 at the genus level. The utility of a marker is not only affected by its discriminatory power, but also by its rate of sequence recovery (Figs 2-5).
Species delimitation in Lauraceae is often complicated by a lack of unique qualitative morphological characters that can be used to define them. DNA barcode data can therefore provide useful additional information for evaluation of observed morphological diversity [89]. Efficient species identification is also important for customs and other authorities to prevent the illegal export and commercial use of protected or rare species [90]. Thus, it is suggested here that using ITS as single barcode, or a combination of barcode markers that included ITS, would be the most suitable approach for barcoding in Lauraceae.

Relationships among major clades
The BI and MP analyses provided relatively good phylogenetic resolution for Lauraceae at both generic and intrageneric levels (Fig 5), especially in basal lineages, with the Cryptocarya group, the Caryodaphnopsis-Neocinnamomum group and the Persea group plus Laureae and Cinnamomum corresponding to our Clades 1, 2 and 3 respectively. Within the Cryptocarya group, which is basal within Lauraceae [7,47], Cryptocarya is sister to the non-cupulate clade of Beilschmiedia. Cryptocarya has a deeply urceolate floral hypanthium that develops into a deep cupule enclosing the drupe at maturity, except for a small terminal orifice [7,46], but Beilschmiedia lacks these characters; a synapomorphy that separates Beilschmiedia and related genera (Endiandra and Syndiclis) from the rest of the Cryptocarya group.
Caryodaphnopsis and Neocinnamomum are associated in the present study and have been found previously to have a relatively close relationship [47,91,92]. They share triplinerved venation and four-locular anthers with the loculi arranged in a shallow arc [7], sometimes two-locular in Caryodaphnopsis, or in a horizontal row, such as in Neocinnamomum delavayi (Lecomte) H. Liu.
The remaining clade (the Persea group, Laureae and Cinnamomum), with Machilus and Phoebe as subsets of the Persea group, received moderate support, agreeing with the studies of Chanderbali et al. [7], Li et al. [41] and Rohwer et al. [44]. However, as with these earlier studies, there was poor resolution for species relationships within Machilus, and its presently accepted sections and subsections (e.g. Li et al. [93]) remain questionable. Nevertheless, the present study does suggest that M. fasciculata H. W. Li belongs in Phoebe. Cinnamomum was divided into two clades corresponding to sect. Camphora Meissn. and sect. Cinnamomum [63], reflecting morphological traits such as leaf arrangement, leaf venation pattern, and presence or absence of perulate buds or domatia.
The remaining sampled Laureae were poorly resolved, even though a close relationship between Actinodaphne, Lindera, Litsea and Neolitsea has been recognized in almost all Lauraceae classifications [7]. All of these genera are dioecious and most have umbellate inflorescences subtended by involucral bracts [7], but further character evolution study is needed to determine if these features actually represent synplesiomorphies. This suggests that although multilocus molecular markers still do not give well-resolved phylogenies for all Lauraceae, DNA barcoding is nevertheless useful for resolving phylogenetic relationships at the generic or species level within some groups in the family.

Conclusions
The barcodes used here produced positive results for correcting species identification errors and reconstructing phylogenetic relationships of Lauraceae, even though identification rates were not high. Furthermore, because DNA barcoding plays an important role in the conservation of rare species and for forest crime prosecutions, we advocate the use of DNA barcodes, in combination with other techniques, in order to develop adequate management strategies for the long term conservation of Lauraceae. In particular, barcodes such as ITS show promise for large-scale biodiversity assessment and inventory, particularly for tropical tree species, where the use of a single barcode could significantly reduce the time and costs involved with species identification. However, our study also indicates the critical need for additional data from both more taxa and more sequence regions to help resolve issues in Lauraceae taxonomy and conservation, as there is clearly no simple one-size-fits-all barcoding solution for the family.