DNA Barcoding the Dioscorea in China, a Vital Group in the Evolution of Monocotyledon: Use of matK Gene for Species Discrimination

Background
Dioscorea is an important plant genus in terms of food supply and pharmaceutical applications. However, its classification and identification are controversial. DNA barcoding is a recent aid to taxonomic identification that uses a short, standardized DNA region to discriminate plant species. In this study, we tested the applicability of three candidate DNA barcodes (rbcL, matK, and psbA-trnH) for identifying species within Dioscorea.
Methodology/Principal Findings
One hundred and forty-eight individual plant samples of Dioscorea, encompassing 38 species, seven varieties and one subspecies and representing the majority of the species of this genus distributed in China, were collected from its main areas of distribution. The candidate barcodes were assessed by PCR amplification, sequence quality, extent of specific genetic divergence, DNA barcoding gap, and ability to discriminate between species. matK successfully identified 23.26% of all species, compared with 9.30% for rbcL and 11.63% for psbA-trnH. Therefore, matK is recommended as the best DNA barcoding candidate. Combinations of two or three loci achieved a higher success rate of species discrimination than any single locus. However, the experimental cost would be much higher if two or three loci, rather than a single locus, were assessed.
Conclusions
We conclude that matK is a strong, although not perfect, candidate as a DNA barcode for Dioscorea identification. This assessment takes into account both its ability to discriminate species and the cost of experiments.


Introduction
Dioscorea is a genus of more than 600 plant species in the family Dioscoreaceae [1], which contains approximately 70 sections. These species are mainly found in Southeast Asia, Africa, Central America, South America, and in other tropical or subtropical regions where some Dioscorea species are an economically important supply of starch in the staple diet. The genus is also a favored source of medicinal plants used to extract precursors of cortisone and other steroid hormones [2][3][4]. The importance of Dioscorea in terms of food supply and pharmaceutical use, together with the controversy over classification [5][6][7][8], has given impetus to improve the identification of this genus.
In this study, we aimed to establish a high-quality system for taxonomic identification to meet the requirements of agriculture and the pharmaceutical industry. Since the early 20th century, morphology, cytology, palynology, and other traditional means of identifying this genus have been explored successively [9][10][11][12][13][14][15][16]. With the development of molecular biology, DNA sequences such as those of rbcL, matK and trnL-F have been used to solve complicated taxonomic problems and to infer phylogenetic relationships among organisms, including members of the Dioscoreaceae [17][18][19].
Little research has been carried out to investigate the applicability and effectiveness of different DNA regions as barcodes to identify species within Dioscorea. In particular, characterisation of species found in China, one of the most likely centers of origin [34,35], was rarely included in previous studies. This study focuses on Dioscorea species distributed in China, and three candidate DNA barcode regions (matK, rbcL and psbA-trnH) in the plastid genome were evaluated for identification. We aimed to address several questions: for example, which of these three regions is the most useful as a barcode and how effective are these three regions and their combinations for this discrimination?

Sequence analysis and amplification efficiency
The sequence information for the three candidate DNA barcode markers, matK, rbcL and psbA-trnH, is provided in Table 1. For individual regions, aligned sequence lengths ranged from 535 bp for psbA-trnH to 752 bp for matK. rbcL was the most conserved gene (522/553 nucleotides), based on sequence length and number of conserved sites. matK had the greatest nucleotide variation (110/752), followed by psbA-trnH (74/535), based on sequence length and number of variable sites. psbA-trnH had the highest percentage of parsimony-informative sites (70/535), followed by matK (81/752). These counts suggest that psbA-trnH and matK are the best regions for use as DNA barcodes and for phylogenetic reconstruction, whereas rbcL is the least suitable marker for Dioscorea.
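The site counts quoted above can be turned into percentages to make the comparison between regions explicit. The following is a minimal sketch that uses only the counts stated in the text; the dictionary layout and the function name `percent` are illustrative assumptions, not part of the original analysis.

```python
# Site counts as quoted in the text (Table 1 itself is not reproduced here).
regions = {
    "matK": {"length": 752, "variable": 110, "parsimony_informative": 81},
    "psbA-trnH": {"length": 535, "variable": 74, "parsimony_informative": 70},
    "rbcL": {"length": 553, "conserved": 522},
}


def percent(count, length):
    """Share of aligned sites in percent, rounded to two decimals."""
    return round(100.0 * count / length, 2)


# For every region, convert each site count into a percentage of aligned length.
summary = {
    name: {
        key: percent(value, counts["length"])
        for key, value in counts.items()
        if key != "length"
    }
    for name, counts in regions.items()
}

for name, shares in summary.items():
    print(name, shares)
```

Expressed this way, matK has the larger share of variable sites, while psbA-trnH has the larger share of parsimony-informative sites, which matches the ordering given in the text.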
The efficiency of PCR amplification is an important indicator of the applicability of a DNA barcode. The amplification efficiency of matK using universal primers was 70.23% (the lowest efficiency found), while the amplification rates of rbcL and psbA-trnH were 83.33% and 74.07%, respectively. An amplification efficiency of 81.4% was achieved with another matK universal primer set, 3F_Kim/1R_KIM. Samples that failed to amplify with universal primers were successfully amplified using specific primer sets that we designed from the Dioscorea sequences available in GenBank. Samples that amplified successfully with universal primers were randomly selected for amplification with the in-house primers to verify their scope of use; the amplification success rates were all 100%.

Intra-specific variation and inter-specific divergence
The maximum intra-specific divergence and the minimum inter-specific divergence of the three candidate barcodes and their combinations (matK+rbcL, matK+psbA-trnH, rbcL+psbA-trnH and matK+rbcL+psbA-trnH) were estimated using six metrics [36]. The non-coding region (psbA-trnH) showed greater intra- and inter-specific divergence than the coding regions (matK and rbcL; Table 2). psbA-trnH had the highest inter-specific divergence, followed by rbcL+psbA-trnH and matK+psbA-trnH, and the inter-specific divergence of rbcL was the lowest (Table 2). matK had the maximum intra-specific variation, while rbcL had the minimum. Furthermore, all seven barcodes showed higher genetic variability between species than within species.

Statistical comparison of divergence
The Wilcoxon signed-rank tests showed that the inter-specific divergence of matK was higher than that of rbcL, and that rbcL exhibited a higher inter-specific divergence than psbA-trnH (Table 3). The associated P-values indicated that these differences were highly significant. These results suggest that matK would be a strong candidate for identifying Dioscorea.
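To illustrate the kind of comparison involved, the following is a minimal sketch of a two-sided Wilcoxon signed-rank test on paired divergence values. It is not the procedure or software used in this study: the normal approximation without tie or continuity correction is a simplifying assumption, the function name `wilcoxon_signed_rank` is hypothetical, and the paired values are invented for illustration.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test using a normal approximation.

    Zero differences are dropped and tied absolute differences share their
    average rank. No tie or continuity correction is applied (assumption).
    Returns (w_plus, w_minus, n_used, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank the absolute differences, averaging ranks within tie groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w_plus, w_minus, n, p_value


# Invented paired inter-specific distances for two loci (illustration only).
locus_a = [0.012, 0.021, 0.015, 0.030, 0.025, 0.018, 0.022, 0.027]
locus_b = [0.004, 0.006, 0.010, 0.012, 0.009, 0.011, 0.005, 0.013]
w_plus, w_minus, n_used, p_value = wilcoxon_signed_rank(locus_a, locus_b)
print(w_plus, w_minus, n_used, p_value)
```

A P-value is a probability and therefore lies between 0 and 1; very small values are usually reported as falling below a stated threshold.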

DNA barcoding gap assessment
We examined the distributions of intra-specific versus inter-specific divergence for the seven barcodes at a scale of 0.001 distance units. Although none of the loci showed a distinct barcoding gap of the kind typical of CO1, each distribution had a clearly defined range in which intra-specific variation was considerably lower than inter-specific divergence (Fig. 1). Among them, matK showed a relatively well-separated distribution: its intra-specific distances fell mainly in the interval 0.000-0.010, whereas its inter-specific distances fell mainly in the interval 0.050-0.060. For matK, congeneric species with a genetic distance of zero accounted for only 4.914% of the total samples (8.209% for rbcL and 10.57% for psbA-trnH). We therefore propose that matK could be used to discriminate most species in this study. The combination matK+rbcL+psbA-trnH could also be used for species identification in Dioscorea, as it had the lowest proportion of samples (0.486%) with an inter-specific distance of zero. For matK+rbcL+psbA-trnH, the intra-specific distances fell mainly in the interval 0.000-0.010 and the inter-specific distances mainly in the interval 0.060-0.070. Furthermore, Wilcoxon two-sample tests confirmed that the inter-specific divergences of all seven barcodes were significantly higher than the corresponding intra-specific variations. The most significant difference was observed for matK among the single loci and for matK+rbcL+psbA-trnH among the combinations (Table S1).
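The barcoding gap assessment described above amounts to binning the two sets of pairwise distances and checking how far they overlap. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the bin width of 0.001 follows the text, and the distance values are invented for illustration rather than taken from Fig. 1.

```python
from collections import Counter


def bin_distances(distances, width=0.001):
    """Count distances per bin; bin k covers [k * width, (k + 1) * width).

    A small epsilon guards against floating-point results such as
    0.003 / 0.001 evaluating to just under 3.
    """
    return Counter(int(d / width + 1e-9) for d in distances)


def zero_fraction(distances):
    """Fraction of pairwise distances that are exactly zero."""
    return sum(1 for d in distances if d == 0) / len(distances)


def gap_summary(intra, inter):
    """Summarize whether intra- and inter-specific distances are separated."""
    return {
        "max_intra": max(intra),
        "min_inter": min(inter),
        # A strict barcoding gap needs every inter-specific distance to
        # exceed every intra-specific distance.
        "has_strict_gap": max(intra) < min(inter),
        "zero_inter_fraction": zero_fraction(inter),
    }


# Invented distances (illustration only).
intra = [0.000, 0.001, 0.003, 0.008]
inter = [0.000, 0.004, 0.052, 0.055, 0.058]

print(sorted(bin_distances(intra).items()))
print(sorted(bin_distances(inter).items()))
print(gap_summary(intra, inter))
```

In this toy example the bulk of the inter-specific distances lies well above the intra-specific ones, yet a single inter-specific distance of zero is enough to rule out a strict gap, which is the situation the text describes for the loci studied here.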

Applicability for species discrimination
BLAST1 searches and the nearest genetic distance method were used to test the applicability of the three loci and four combinations for species identification (Table 2). Our results revealed that matK had the highest identification efficiency of the three loci, whereas the rates of successful species identification using psbA-trnH were the lowest. In addition, with both methods, the success rates of the combined barcodes were higher than those of any single locus. matK+rbcL+psbA-trnH had the highest authentication capability, correctly identifying 53.49% of the species by both the BLAST1 search and the nearest genetic distance method.

Discussion
Assessment of the applicability of the three candidate barcodes
An ideal DNA barcode must have high PCR amplification efficiency while containing enough variability to be used for species identification and sufficiently conserved regions for the design of universal primers [37]. In this study, the chloroplast matK gene was found to be a promising candidate for authenticating Dioscorea species based on amplification efficiency, barcoding gaps, and success rate of identification. An amplification efficiency of 100% was obtained using specific primers, and its identification efficiency (23.26%) was the highest of the three loci. The chloroplast rbcL gene did not have enough inter-specific divergence, although its amplification efficiency was not low. The success rate of identification of psbA-trnH was too low to be useful for this purpose. rbcL and psbA-trnH each had individual advantages despite their poor capability for authenticating Dioscorea species. rbcL had high amplification efficiency, but the overlap between intra-specific and inter-specific divergence was too substantial for it to be of use for discrimination, and its identification rate was only 9.30%. This situation arises because the rbcL gene does not have sufficient variation at the species level to be used as a DNA barcode [31][38][39][40][41]. When approximately 10,300 rbcL sequences in GenBank were compared using a distance-based method, rbcL was not capable of discriminating between all species, but it was able to distinguish some taxa at the genus and species levels [42].
The amplification efficiency of psbA-trnH in Dioscorea was moderate and its identification accuracy was only 11.63%. It is therefore not a suitable DNA barcode candidate for Dioscorea, although it has been found suitable for other species [43,44]. In addition, the presence of a poly-A/T tract in this region often reduces the success rate of DNA sequencing.
Insertions or deletions appear to be a common characteristic of this genetic region, even in closely related species [45][46][47][48][49][50]. The variable length of this region makes sequence alignment difficult. Large insertions or deletions were also found among different populations of Dioscorea. For example, Dioscorea zingiberensis C.H. Wright collected from Yichang, Madao and Enshi in China had a 234 bp insertion at position 183 bp compared with other populations. The mechanism that generates indels in psbA-trnH remains unclear. Aldrich et al. [45] hypothesized that deletion of an insert often occurs between the imperfect AT-rich repeats that flank it, which is supported by our detection of imperfect AT-rich repeats flanking the indel in Dioscorea zingiberensis. Although indels complicate sequence alignment, they ultimately enrich the information available for species discrimination [30]. This indel indicates a trend of divergence between two groups of Dioscorea zingiberensis separated by the Yangtze River [51]: the insertion was detected only in the three populations of the northern group, whereas the other populations carried the deletion.
The combinations psbK-psbI+atpF-atpH, matK+atpF-atpH+psbK-psbI and matK+atpF-atpH+psbA-trnH were used to discriminate 101 individual plants belonging to 31 species and 18 families, and achieved a high success rate [40]. When combinations of two or three loci were used in Dioscorea, a much higher identification efficiency was achieved than with any single locus.

Application of matK in discrimination of Dioscorea species
matK is a recommended DNA barcode candidate gene because of its high evolution rate [23,52,53]. In 2008, Lahaye and colleagues successfully amplified matK in 398 samples using primers 390F/1326R, and more than 90% of species could be identified. However, these results have been questioned by some researchers. Kress and colleagues [32] doubted whether amplification efficiency would remain high in plants from other families, as 96% of the samples in Lahaye's study were orchids. The main criticism of matK as a DNA barcode is its poor primer universality [52,54]. A 39.3% amplification rate was reported for 96 species belonging to 48 genera, although 10 pairs of primers were used. Fazekas and colleagues [39] amplified 251 individual plants from 32 genera and 92 species, but their success rate was only 87.6%. In Dioscorea, the amplification efficiency was 70.23% and 81.40% with primers intF/intR and 3F_Kim/1R_KIM, respectively, and a 100% success rate was achieved using our in-house designed specific primers. An ideal barcode requires a distinct gap with no overlap [23,36]. In this study, however, no distinct barcoding gap was found, even though the intra-specific and inter-specific divergences were mainly non-overlapping (Fig. 1). Nevertheless, based on the barcoding gap histograms and the species identification results, matK performed better than the other loci in our study.
Table 3. Wilcoxon signed rank test of the inter-specific divergences among the three loci.
ITS and ITS2 have also been proposed as the most promising universal DNA barcodes in plants [55,56]. Unfortunately, sequencing success for the ITS and ITS2 regions was low in our study because of serious endophyte interference, so these regions were not included in further analysis.
In conclusion, our study shows that matK is a strong, although not perfect, candidate for Dioscorea identification. Further research is still needed on other, more variable DNA barcodes, such as psbK-psbI and atpF-atpH, for species identification in this genus. (Table S2).

DNA extraction, amplification, sequencing
Genomic DNA was extracted following a cetyl trimethylammonium bromide (CTAB) protocol modified from Paterson et al. [57]. The universal primers intF and intR (recommended by RBG Edinburgh), 1F and 724R [58], and psbAF and trnH2 [59] were used to amplify the matK, rbcL and psbA-trnH regions of the cpDNA, respectively. For samples that failed to amplify with universal primers, specific primers were designed with the aid of OLIGO primer design software (Molecular Biology Insights, Inc., Cascade, Colorado, USA), based on Dioscorea sequences deposited in the GenBank database. For example, the matK sequence of D. alata L. (AB040208), the rbcL sequence of D. alata L. (AY667098) and the psbA-trnH region of D. elephantipes (L'Her.) Engl. (EF380353.1) were used. In addition, the universal primer set 3F_Kim and 1R_KIM currently recommended by CBOL (http://barcoding.si.edu) was used to evaluate the efficiency of PCR amplification in a sample pool composed of one randomly selected sample from each species. Detailed sequences of all the primers and the reaction conditions are listed in Table S3.
Polymerase chain reaction (PCR) amplification of the three candidate barcodes was carried out using the following program: a premelt of 3 min at 94 °C, followed by 35 cycles of 45 s denaturation at 94 °C, 30 s annealing at 53-58 °C, and finally a 1.5 min (1 min 30 s) extension at 72 °C. Each 20-µl reaction mixture contained 30 ng of genomic DNA template, 2.5 mmol/L MgCl2, 1× Mg-free DNA polymerase buffer, 0.12 mmol/L dNTPs, 0.3 µmol/L of each primer, and 1 U of Taq DNA polymerase. PCR products were examined electrophoretically using 0.8-1.2% agarose gels. Purification and bidirectional sequencing were completed by Beijing Genomics Institute (BGI) using the amplification primers.

Sequence alignment and data analysis
Sequences were aligned and adjusted manually using Sequencher v.4.5 software (Gene Codes, Ann Arbor, MI, USA). The nucleotide sequence data of the three regions were deposited in the GenBank database (Table S2). All genetic distances were calculated using MEGA version 4.0.
Average intra-specific distance, mean theta and coalescent depth were calculated to determine intra-specific variation [36,55], and average inter-specific distance, theta prime and the minimum inter-specific distance were calculated to determine inter-specific divergence [36,55,60]. Wilcoxon signed-rank tests were performed as previously described [23,31]. The distribution of intra-specific versus inter-specific variability was evaluated by assessing the presence of DNA barcoding gaps [31,36]. Two methods of species identification, the BLAST1 similarity search and the nearest distance method, were carried out as described previously [61]. BLAST1 searches were conducted on a local reference library constructed for each region. The barcode sequence of each species was queried against the local library with the "blastn" command. The identity of a sample was based on the best hit, and the E-value for the match had to be lower than the cutoff value. For the nearest genetic distance method, the identity of a sample was determined from the subject sequence with the smallest genetic distance, and that distance had to be less than a distance threshold. The traffic light approach was used to assess identification by combinations of markers [62]. A combination was considered to have identification power as long as the sequences could be identified by any of the markers in the combination; it was considered incapable of identifying sequences if none of the markers in the combination could identify them successfully.
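The nearest genetic distance method and the rule for combining markers described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in this study: an uncorrected p-distance stands in for the distances computed in MEGA, ties between species at the smallest distance are treated as a failed identification, and all function names, species labels, sequences and the threshold value are hypothetical.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Positions where either sequence has a gap or an ambiguous base are
    skipped (a simplifying assumption).
    """
    pairs = [
        (x, y)
        for x, y in zip(a.upper(), b.upper())
        if x in "ACGT" and y in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def nearest_species(query, library, threshold):
    """Return the species of the closest reference, or None.

    library is a list of (species, aligned_sequence) tuples. The smallest
    distance must be below the threshold, and it must point to one species.
    """
    scored = [(p_distance(query, seq), species) for species, seq in library]
    best = min(d for d, _ in scored)
    if best >= threshold:
        return None
    nearest = {species for d, species in scored if d == best}
    return nearest.pop() if len(nearest) == 1 else None


def identify_with_combination(queries, libraries, threshold):
    """Succeed if any single marker in the combination identifies the sample."""
    for marker, query in queries.items():
        hit = nearest_species(query, libraries[marker], threshold)
        if hit is not None:
            return hit
    return None


# Hypothetical reference libraries and query sequences (illustration only).
libraries = {
    "matK": [("species_A", "ACGTACGTAC"), ("species_B", "ACGTTCGTTC")],
    "rbcL": [("species_A", "GGGCCCAAAT"), ("species_B", "GGGCCCAAAT")],
}
queries = {"rbcL": "GGGCCCAAAT", "matK": "ACGTACGTAC"}

print(nearest_species(queries["rbcL"], libraries["rbcL"], 0.05))
print(nearest_species(queries["matK"], libraries["matK"], 0.05))
print(identify_with_combination(queries, libraries, 0.05))
```

In this toy case the rbcL references of the two species are identical, so rbcL alone cannot assign the sample, whereas matK can; the combination therefore succeeds, which mirrors why combined loci reach higher success rates than single loci.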

Supporting Information
Table S1 Wilcoxon two-sample tests for distribution of intra- vs. inter-specific divergences. (DOC)