Two New Potential Barcodes to Discriminate Dalbergia Species

DNA barcoding enables precise identification of species by analyzing a unique DNA sequence of a target gene. The present study was undertaken to develop barcodes for different species of the genus Dalbergia, an economically important timber genus that is widely distributed in the tropics. Ten Dalbergia species selected from the Western Ghats of India were evaluated using three regions of the plastid genome (matK, rbcL, trnH-psbA), the nuclear internal transcribed spacer (nrITS) and their combinations, in order to discriminate them at species level. Five criteria were used for species discrimination: (i) inter and intraspecific distances, (ii) Neighbor Joining (NJ) trees, (iii) Best Match (BM) and Best Close Match (BCM), (iv) character based rank test and (v) Wilcoxon signed rank test. Among the evaluated loci, rbcL had the highest success rate for amplification and sequencing (97.6%), followed by matK (97.0%), trnH-psbA (94.7%) and nrITS (80.5%). The inter and intraspecific distances, along with the Wilcoxon signed rank test, indicated a higher divergence for nrITS. The BM and BCM approaches revealed the highest rate of correct species identification (100%) with the matK, matK+rbcL and matK+trnH-psbA loci. These three loci, along with nrITS, were further supported by the character based identification method. Considering the overall performance of these loci and their ranking with different approaches, we suggest matK and matK+rbcL as the most suitable barcodes to unambiguously differentiate Dalbergia species. These findings will potentially be helpful in delineating the various species of the genus Dalbergia, as well as other related genera.


Introduction
In DNA barcoding, the sequence of a short stretch of DNA is used for accurate species identification [1], supplementing the classical taxonomic methods [2]. Although DNA barcoding has been successfully used for discriminating animal species, applying this approach to plant species is more difficult due to several challenges [3]. Plant mitochondrial genomes exhibit low rates of nucleotide substitution and high rates of chromosomal rearrangements [4], while extensive gene duplication occurs in the nuclear genome [5]. Initial DNA barcoding studies in plants have proposed a few plastid coding as well as non-coding regions, such as rbcL and trnH-psbA [6], matK, rpoB, rpoC1 and trnH-psbA [7] and atpF/H, matK, psbK/I and trnH-psbA [8] as promising candidates. However, the slowly evolving coding regions of plastid genomes might not possess enough variation to discriminate closely related plant species and this could lower their potential as effective barcodes [9]. This can be overcome by analyzing the selected loci either individually or in combination [10,11]. A recently evolved nuclear region, the nuclear internal transcribed spacer of the ribosomal gene (nrITS), has also been proposed as a potential barcode [12].
Dalbergia Linn. F. (Family: Fabaceae) is a genus of shrubs, lianas and trees. It is confined to the tropical regions of the world with Amazonia, Madagascar, Africa and Indonesia as the centers of diversity [13,14]. About 200 species comprise the genus, of which nearly 35 are found in India with 10-15 species in the Western Ghats (WG) alone [14,15]. The overall species diversity is high in WG and seven species are endemic to this region (http://wgbis.ces.iisc.ernet.in/biodiversity/sahyadri_enews/newsletter/issue38/article/index.htm); hence, we selected WG as our study area. The Dalbergia genus is economically important for its quality timber. The wood of different Dalbergia species is used for specific purposes such as making furniture (D. latifolia, D. sissoo), boat building (D. sissoo) and manufacturing musical instruments (D. melanoxylon) [15]. Studies on tropical dry evergreen forests (TDEF) of India have indicated indiscriminate logging as one of the major factors responsible for the loss of commercial tree species and biodiversity. This is particularly the case for the species listed in Appendix II of the CITES (Convention on International Trade in Endangered Species of Wild Fauna and Flora) document [16]. The Red List of IUCN (International Union for Conservation of Nature) has more than 30 Dalbergia species under the endangered category (http://www.iucnredlist.org), including D. cochinchinensis and D. latifolia as vulnerable species. Similarly, APFORGEN (Asia Pacific Forest Genetic Resource Programme) has identified D. latifolia as a prime concern from a conservation point of view. Moreover, as the wood of Dalbergia species is illegally traded in some countries, it is difficult to prove its identity and take legal action in the absence of accurate tools and methods for species identification [16]. This has facilitated fraudulent marketing and sale of poor quality wood of other tree species in place of Dalbergia. In this context, DNA barcoding can serve as a quick way of authenticating the wood of Dalbergia, even for legal purposes if needed.
Dalbergia species are morphologically variable and possess a wide range of habitat preference. This makes it difficult to classify the New World and the Old World species into natural groups [17,18]. Over the past several decades, many revisions based on morphological characters have made the taxonomic delimitation of species in Dalbergia quite challenging [12,17,19-23]. Moreover, very limited information is available on the molecular taxonomy of the genus. There is only one report [14] describing the phylogeny of Dalbergia species, indicating its monophyletic origin. The genus was included in the evolutionary study of Leguminosae [24] to analyze the relationship of Machaerium and Aeschynomene using trnL and nuclear ribosomal DNA sequences [25]. Very few studies have reported on the molecular analysis of Indian Dalbergia species [15,26-29], making it imperative to study various aspects of the genus, including phylogeny, diversity and end-use quality, using DNA markers and sequence based polymorphism in suitable genomic regions.
In the present study, the primary focus was to develop an accurate species identification method for Dalbergia genus and this was addressed by developing potential DNA barcodes for the genus. We have evaluated 37 primer pairs from plastid and nuclear genomes of which four loci (rbcL, matK, trnH-psbA and nrITS) were shortlisted and various statistical parameters were employed to demonstrate their potential as barcodes to unambiguously discriminate Dalbergia species.

Ethics statement
The locations involved in the study were not part of any protected area, reserve forests or national parks except for Chinar wildlife sanctuary and Parambikulam wildlife sanctuary. The samples from these areas were collected by Kerala Forest Research Institute (KFRI), Peechi, Kerala, which is a government organization having the requisite permissions. The exact GPS coordinates for the collection sites are not available. Further, none of these species are endangered or protected species.

Sample collection
The study included 166 accessions from ten Dalbergia species representing three sections, section Sissoa (Dalbergia latifolia, D. melanoxylon, D. sissoo, D. rubiginosa, D. horrida and D. tamarindifolia), section Dalbergia (D. volubilis, D. paniculata and D. lanceolaria) [15] and section Selenolobia (D. candenatensis) [20]. We focused on the locations in WG, which is one of the most important biodiversity hotspots in India (Fig 1 and S1 Dataset). Between 5 and 25 accessions of each species were collected from different locations to understand the effect of geographical isolation on intraspecific variation in barcoding. The samples were authenticated by KFRI and the Botanical Survey of India (BSI, Western Circle, Pune, India) and the voucher specimens from each species were deposited in their respective herbaria. Pterocarpus marsupium, which falls outside the Dalbergia clade and is native to WG, was used as an out-group in the present study [14].

DNA extraction, PCR amplification and sequencing
Total genomic DNA was extracted from fresh or dried leaf samples using the modified cetyltrimethylammonium bromide (CTAB) method [30]. Since no specific region had been recommended as a universal plant barcode when this study was initiated, we selected genomic loci based on the available literature: matK (7 primer pairs), rpoC (4 primer pairs), rpoB (5 primer pairs), accD (6 primer pairs), ndhJ (3 primer pairs), ycf5 (4 primer pairs), trnH-psbA (5 primer pairs), nrITS (2 primer pairs) and rbcL (single primer pair). As sequence information for most of these loci was not available for Dalbergia species, we tested multiple sets of primers to amplify the respective loci from all ten species.
Thirty seven primer pairs were tested to identify the loci satisfying the set criteria for DNA barcoding. Four primer pairs (S2 Dataset) corresponding to matK, rbcL, trnH-psbA and nrITS produced highly specific amplifications (sharp bands on agarose gel) and gave good quality DNA sequences; these were therefore selected for further study. PCR amplifications were performed in a final volume of 20 or 25 μL (S3 Dataset) and the amplicons were resolved on a 1% agarose gel. Most of the PCR reactions yielded specific amplifications (i.e. sharp single bands on agarose gel), and these products were used directly as templates for sequencing reactions. In the samples that generated multiple PCR products, bands corresponding to the expected size were eluted from the gel using the PureLink® Quick Gel Extraction Kit (Invitrogen, USA) and used as templates in sequencing reactions. Sequencing was performed with Sanger chemistry from both ends of the DNA fragment using the MegaBACE DYEnamic ET dye terminator kit with the MegaBACE 1000 DNA Analysis System (GE Healthcare, USA).

Sequence analysis
For each sequence, the chromatograms were inspected and poor quality 5′ and 3′ sequence ends were trimmed. The post-trimming length was required to be at least 60% of the original read length, with a minimum average quality score of Q20. Sequences failing these criteria were rejected and re-sequenced. All the nucleotide variations were evaluated and confirmed by aligning the chromatograms from forward and reverse sequencing results. Sequences with 70% or more overlap were considered for creating the consensus sequence for each amplicon [31]. Good quality sequences from all individuals were assembled and aligned using CLUSTALW 1.83 [32]. Conserved, variable and parsimony informative sites were determined using MEGA 5.0 [33]. Distance matrices and Neighbor-Joining (NJ) trees were established in MEGA using the best fit nucleotide substitution model (chosen with AICc) [34].
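The read acceptance criteria above can be illustrated with a minimal Python sketch. It assumes per-base Phred scores are available as a list of integers and that "poor quality ends" means bases below Q20 at either end; the text does not state the trimming rule actually used, so the function names and the per-base cutoff are hypothetical.

```python
def trim_ends(quals, min_q=20):
    """Return (start, end) indices after dropping low-quality bases at both ends.

    Assumption: a base is 'poor quality' if its Phred score is below min_q.
    """
    start, end = 0, len(quals)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return start, end


def passes_filter(quals, min_q=20, min_fraction=0.6):
    """Check the two criteria from the text on the trimmed read:
    length at least 60% of the original and mean quality at least Q20."""
    start, end = trim_ends(quals, min_q)
    kept = quals[start:end]
    if not kept:
        return False
    long_enough = len(kept) >= min_fraction * len(quals)
    mean_quality = sum(kept) / len(kept)
    return long_enough and mean_quality >= min_q
```

A read failing `passes_filter` would correspond to a sequence that was rejected and re-sequenced.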

Data analysis
Genetic distance was calculated using Kimura-2-Parameter (K2P) model [35]. The interspecific divergence between the species was studied using the following three parameters: (i) average inter specific distance; (ii) average theta prime (θ'), where θ' is the mean pairwise distance within species, thus eliminating the biases associated with different individual count among species; and (iii) minimum inter specific distance. Three additional parameters were studied for the intraspecific divergence: (i) average intraspecific divergence, (ii) theta (θ) and (iii) average coalescent depth [36].
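For reference, the K2P distance between two aligned sequences is d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences. The following Python sketch computes it for a single pair; skipping gapped or ambiguous sites (pairwise deletion) is an assumption here and may differ from the settings used in MEGA for this study.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped
    (an assumption made for this sketch).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in BASES or b not in BASES:
            continue
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are too divergent (saturation)
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

Averaging such pairwise values within and between species gives the intraspecific and interspecific summaries listed above.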
Wilcoxon signed rank tests were performed to check for significant differences in inter and intraspecific variability between pairs of barcoding loci [11]. Consensus sequences were generated for all the ten Dalbergia species using TaxonDNA [37] with 1000 bootstraps. To analyze inter and intraspecific variation, sequence variants were generated with DnaSP 5.0 [38] using consensus sequences. Further, NJ trees were constructed in MEGA 5.0 with 1000 bootstraps. Species identification was performed with TaxonDNA/SpeciesIdentifier 1.7.7 [37], based on K2P distances and a minimum sequence overlap of 300 bp, using two approaches: (i) Best match (BM) and (ii) Best close match (BCM). In these approaches, each sequence from the dataset was used as a query against the remaining sequences from the same dataset. With BM, a query sequence was identified by searching the reference sequences for the best match, i.e. the one with the smallest genetic distance to the query. The BCM approach additionally required a threshold value, which was calculated for each locus from the pairwise summary. The threshold was the value below which 95% of all intraspecific distances were observed, and it set an upper bound on the distance accepted for a barcode match [37]. If both the query and the best matching sequence were from the same species, the identification was considered successful. If sequences from more than one species matched the query equally well, the identification was considered ambiguous. A character based analysis method, Barcoding with LOGic Formulae (BLOG), was also employed [39]. This method selected the unique nucleotide positions of the sequences and derived a formula to differentiate among species. It also provided concise and meaningful classification rules [40].
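The BM and BCM decision rules can be sketched in a few lines of Python. This is a simplified reading of the description above, not the TaxonDNA implementation: it assumes a precomputed symmetric distance matrix, treats ties by exact equality, and uses hypothetical function names.

```python
import math


def intraspecific_threshold(labels, dist, fraction=0.95):
    """Distance at or below which `fraction` of intraspecific distances fall."""
    n = len(labels)
    intra = sorted(
        dist[i][j]
        for i in range(n)
        for j in range(i + 1, n)
        if labels[i] == labels[j]
    )
    if not intra:
        raise ValueError("no intraspecific pairs available")
    index = max(0, math.ceil(fraction * len(intra)) - 1)
    return intra[index]


def classify_queries(labels, dist, threshold=None):
    """Query each sequence against all others.

    threshold=None gives Best Match; a numeric threshold gives Best Close
    Match, where a best hit farther than the threshold is 'no match'.
    """
    outcomes = []
    for i, species in enumerate(labels):
        others = [j for j in range(len(labels)) if j != i]
        best = min(dist[i][j] for j in others)
        if threshold is not None and best > threshold:
            outcomes.append("no match")
            continue
        best_species = {labels[j] for j in others if dist[i][j] == best}
        if best_species == {species}:
            outcomes.append("correct")
        elif species in best_species:
            outcomes.append("ambiguous")   # equally good hit in another species
        else:
            outcomes.append("incorrect")
    return outcomes
```

The percentage of "correct" outcomes per locus corresponds to the identification success rate reported for each approach.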

Amplification success
The success rate for PCR amplification and sequencing of bidirectional reads was the highest for rbcL (97.6%), followed by matK (97.0%) and trnH-psbA (94.7%), while nrITS exhibited the lowest rate (80.5%). Nucleotide sequences of analyzed loci from all individuals were deposited in NCBI database (S1 Dataset; accession numbers-matK: KM276475-KM276412; rbcL: KM100059-KM099987; trnH-psbA: KM276322-KM276250 and nrITS: KM276165-KM276104). Using BLAST analysis, all the loci correctly identified 100% of the samples at genus level; while at species level, nrITS had the highest identification rate i.e. 60% followed by rbcL (50%), matK (20%) and trnH-psbA (10%). The low rate of species level identification might be due to the absence of species records in NCBI database and high percentage of in-dels especially in the case of trnH-psbA sequences.

Nucleotide variation
The percentages of polymorphic informative (Pi) sites and variable sites were comparable for the respective loci. For nrITS, aligned length was 637 bp, with 29.83% sites variable and 28.89% polymorphic informative, which was the highest among all the loci (single locus as well as combination of loci). Based on the percentage of conserved sites, the most conserved loci were rbcL followed by matK and matK+rbcL (Table 1).

Inter and intraspecific divergence
Distance analysis and Wilcoxon signed rank test. The nrITS locus showed greater interspecific divergence than the plastid loci (matK, rbcL and trnH-psbA and their combinations) using both average inter specific distance and θ' parameters. However, in case of intraspecific divergence, nrITS and rbcL showed the highest and the lowest value, respectively. Thus, no single locus revealed the highest interspecific but the lowest intraspecific divergence (Table 2 and Fig 2). When the Wilcoxon signed rank test was used to compare the loci, nrITS exhibited the highest interspecific divergence followed by trnH-psbA, whereas rbcL displayed the lowest intraspecific divergence (Tables 3 and 4).
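To make the pairwise locus comparison concrete, the sketch below computes the Wilcoxon signed rank sums W+ and W- for two loci. It assumes the two input lists hold distances for the same ordered set of sequence pairs, one list per locus; the pairing scheme, the function name and the omission of a p-value are choices made for this illustration only.

```python
def signed_rank_sums(x, y):
    """Return (W_plus, W_minus) for paired samples x and y.

    Zero differences are discarded and tied absolute differences
    receive the average of the ranks they span.
    """
    if len(x) != len(y):
        raise ValueError("x and y must be paired")
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda k: abs(diffs[k]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        end = pos
        # extend the block while the next absolute difference is tied
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]])):
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = average_rank
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus
```

A W+ much larger than W- indicates that the first locus tends to give larger distances than the second; significance would then be read from the signed rank distribution.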
Barcode gap. Barcode gap represents the absence of overlapping regions between inter and intraspecific distances. The barcode gap was absent for all the marker loci used in the present study, indicating overlaps between inter and intraspecific distances (Fig 3). However, the mean interspecific divergence was significantly higher than that of the corresponding intraspecific divergence for each of the loci. This was further confirmed by analysis carried out using TaxonDNA.
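As an illustration of this definition, the following Python sketch checks for a barcode gap given two lists of pairwise distances for one locus. The function name and the returned fields are hypothetical; the check simply compares the largest intraspecific distance with the smallest interspecific distance.

```python
def barcode_gap_summary(intra, inter):
    """Summarize overlap between intra- and interspecific distances."""
    if not intra or not inter:
        raise ValueError("both distance lists must be non-empty")
    max_intra = max(intra)
    min_inter = min(inter)
    return {
        # a gap exists only if the two distributions do not overlap
        "gap_present": min_inter > max_intra,
        "max_intra": max_intra,
        "min_inter": min_inter,
        # interspecific distances that fall inside the intraspecific range
        "inter_within_intra_range": sum(1 for d in inter if d <= max_intra),
        "mean_inter_exceeds_mean_intra":
            sum(inter) / len(inter) > sum(intra) / len(intra),
    }
```

With overlapping distributions, `gap_present` is false even when the mean interspecific distance clearly exceeds the mean intraspecific distance, which is the pattern described in the paragraph above.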
Tree based analyses. The sequence variants of each marker locus were determined using DnaSP 5.0 and MEGA 5.0 as mentioned previously. Among all loci, nrITS exhibited the maximum number of sequence variants (Table 5). By including all the sequence variants, seven NJ trees were constructed with matK, rbcL, trnH-psbA and nrITS either alone (Fig 4) or in combinations (Fig 5). All of them except rbcL revealed a separate cluster for each species; rbcL could not differentiate between D. rubiginosa, D. candenatensis and D. tamarindifolia. Interestingly, all loci except trnH-psbA (matK, rbcL, nrITS and matK+rbcL), either alone or in combination, were capable of grouping together all three species-clusters from the section Dalbergia (D. volubilis, D. lanceolaria and D. paniculata). This agrees with a previous report on genome size variation and evolution of Dalbergia species which found that D. lanceolaria and D. paniculata were closely related [15]. These observations indicated that matK, nrITS, rbcL and matK+rbcL could correctly identify the reported relationships among the Dalbergia species and hence, they could most likely be successful as barcodes for this genus.
Similarity based approach. To evaluate the accuracy of these potential barcodes in species assignments, the BM and BCM parameters from TaxonDNA analysis were used (Table 6). Finding a standard threshold for the BCM approach is difficult as there is a large variation in inter and intraspecific divergence across all loci in different plant systems [9]. Moreover, our use of multiple accessions of each species, as suggested by Pettengill and Neel [9], ensured that this basic requirement was fulfilled and therefore, we chose to use calculated thresholds. The calculated threshold value per locus varied from 0.12% in rbcL+trnH-psbA to 1.2% in nrITS. With the BM and BCM approaches, the rate of correct identification was 100% for matK, matK+trnH-psbA and matK+rbcL, with no ambiguous or incorrect identifications (Table 6).
Character based approach. The data analysis yielded logic formulae and revealed which species were correctly classified, wrongly classified and not classified. Only the analyses using the matK, nrITS, matK+rbcL and matK+trnH-psbA loci could assign characteristic nucleotide positions for all the species with 100% correct classification (Table 7).

Overall performance of the loci
The different parameters used for screening potential barcode loci were ranked based on their performance on a scale of 1-10. In case of NJ trees, the ranking was done based on clustering of the species. Those loci which separated all the species irrespective of intraspecific variation were given ten marks, while for the remaining loci, the scale was determined based on the number of species clubbed together. For inter-and intraspecific distances, the difference between the maximum and minimum distance was calculated to determine the scale for each locus. For BM and BCM methods, the percent values corresponding to correct, ambiguous and incorrect classification were used to rank the loci. A similar methodology was also applied for BLOG. Finally, for Wilcoxon signed rank test, the locus which performed the best in a pair in both, inter and intraspecific distance determinations, was ranked the highest (Table 8).

Discussion
Paul Hebert's research in 2003 on species identification using short stretches of DNA from a well characterized region of the genome gave birth to the concept of DNA barcoding [41]. Initial efforts proved the reliability of the mitochondrial cytochrome c oxidase 1 (cox1) gene as an impressive barcode in animals [42]. However, initial research on plant DNA barcoding suggested that species discrimination in plants with a single universal locus is difficult. This is primarily due to phenomena such as polyploidy, hybridization and heteroplasmy, which result in a continuous range of variable characters and make delineation difficult. In addition, sufficient time is required for the mutations that separate closely related species to accumulate, and the lack of such genetic variation hampers species level discrimination of plants by DNA barcoding [8]. This problem is exacerbated in woody plants because of their longer generation time and lower mutation rate. It is also difficult to differentiate species in taxonomically complex groups where species are narrowly defined. Additionally, large ancestral population sizes and low levels of within species gene flow for plastid markers create difficulty in barcode based identification [3,8]. In order to resolve these problems, several attempts have been made to establish DNA barcodes using multiple genes from different plant genomes for specific families such as Myristicaceae [43], Lemnaceae [44], Zingiberaceae [45], Podocarpaceae [46] or genera such as Paeonia [47], Acacia [48], Paphiopedilum [49], Parnassia [50] and Gossypium [51]. However, different studies suggest that finding a universal barcode, or even a barcode at family level, is difficult and that it may be possible to establish a discriminating barcode only at genus level [52].
There are few reports on DNA barcoding of tropical tree species [16,31,53], which include Amazonian as well as Indian forest trees. These studies have used the nrITS, matK, rbcL and trnH-psbA loci. However, reports on DNA barcoding of trees exclusively from WG of India are scarce. A study on 143 tree species from tropical dry evergreen forests in India covering 114 genera and 42 families revealed that the combination of matK and rbcL loci gave the highest success in accurate identification [16]. Similarly, DNA barcoding of medicinal plants from the family Fabaceae revealed 80% and 96% success at species and genus level, respectively, using the matK locus, while the ITS2 locus gave more than 80% success at species level and 100% success at genus level [54]. However, none of the above mentioned studies included Dalbergia. A recent study on tropical tree species from India (149 species from 82 genera and 38 families) included three Dalbergia species and suggested that ITS and trnH-psbA might not be highly successful [31]. Efforts to resolve the sister species complex of Acacia from Fabaceae using rbcL, trnH-psbA (same primer sequence as used in our study) and matK recommended all three regions for barcoding [48]. On the contrary, studies on Aspalathus using ITS (different primers than the ones used in our study), psbA-trnH and trnT-trnL concluded that all three loci were unable to resolve the species [55]. The outcome of matK analysis has been observed to vary with the plant system as well as with the combination of primers used for analysis. However, the Consortium for the Barcode of Life (CBOL) proposed 90% success with matK for plants. Our study also identified matK as one of the potential loci for DNA barcoding. Thus, matK, nrITS and rbcL, individually or in combination, could be explored as potential DNA barcodes in various plant genera [53].

Assessment of the four candidate barcodes in Dalbergia genus
In the present study, the amplification and sequencing success rate in Dalbergia ranged from 80.5% (for nrITS) to 97.6% (for rbcL). The rbcL locus is reported to be easy to amplify and sequence across a broad range of plant taxa but offers low species resolution, whereas the rapidly evolving matK locus is known for its high discriminatory power but low universality [56]. Hence, matK is popular for species discrimination in angiosperms [3]. However, mixed results ranging from high success rate [56,57] to poor discrimination [3,11] have been reported for matK. In the present study too, matK showed good resolving power. Although trnH-psbA showed good universality and higher discrimination, it is known for variable length, homopolymers, inversions and insertion of the rps19 gene [58-60]. Similarly, while the nrITS locus is a commonly used nuclear marker for phylogenetic studies [5], it was initially not preferred for barcoding studies because of fungal contamination, paralogous gene copies and problems in recovery [8]. In our study, similarity search using BLAST did not reveal any problem of fungal contamination in nrITS sequences; however, the sequencing success was low (80.5%), which might be due to the presence of divergent gene copies as reported earlier [5]. In the case of trnH-psbA, which gave 94.7% sequencing success, our data revealed the presence of T and A repeats, without any insertion of the rps19 gene when checked by BLAST. The overall interspecific distances were high compared to intraspecific distances and no significant barcode gap was observed in the present study. Usually in closely related plant species, plastid regions such as rbcL and matK do not generate a barcode gap [57]. Several studies have also revealed the absence of a barcode gap in different plant systems such as Agalinis [9], Parnassia [50], Gossypium [51], medicinal plants [12] and Dioscorea [61].
Furthermore, in the NJ tree based analysis, nrITS, matK and trnH-psbA and their combinations formed separate clusters for each species. However, rbcL could not differentiate D. rubiginosa, D. candenatensis and D. tamarindifolia, which could be because of the conserved nature of the gene [62]. Similar behavior of rbcL was also reported in Carex [58]. Together, this suggested that rbcL alone might not serve as a good barcode but can be utilized in combination with other loci. A recent report on DNA barcoding of eight Dalbergia species from Vietnam recommended the ITS locus as a potential barcode based on UPGMA analysis and nucleotide diversity [63]. It has been reported that, being a multigene family, 18S-26S rDNA is subjected to concerted evolution. In certain cases, ITS1 [64,65] and ITS2 [12,60,65,66] have been used as separate loci for DNA barcoding. However, point mutations displayed by ITS1 and ITS2 also contribute to high intraspecific variations [67]. We used the complete ITS region (ITS1-5.8S-ITS2) as a single barcoding locus. In our study, nrITS showed high intraspecific variation with high species discrimination, leading to incorrect identification with BM and BCM. However, the DNA barcoding study of eight Dalbergia species from Vietnam [63] did not use the species from the current study. A reanalysis of the data from NCBI for the species used in the Vietnam study along with the dataset from our study revealed a high number of sequence variants for most of the species (S1 Fig). Moreover, from the available sequence data in NCBI for the Vietnam study [63], we could find only one nrITS sequence each for D. dialoides, D. entadoides and D. hancei, making it difficult to assess the intraspecific variation. It was therefore not possible to comment on either the intraspecific diversity of these species, which is an important factor for DNA barcoding, or the suitability of nrITS as the potential barcode for Dalbergia species. It is essential to sample a sufficient number of accessions for each of these species, ideally from different geographical locations, to capture the intraspecific variation across the entire distributional range [53].

Conclusions
In the present study, 7-26 accessions each of ten Dalbergia species, collected from different geographic locations in the WG region of India, were screened using 37 primer pairs from nuclear and plastid genes. Four loci (rbcL, matK, trnH-psbA and nrITS) and their combinations were further evaluated with five different analyses and ranked based on their performance. These analyses revealed the matK and matK+rbcL loci as the most suitable barcodes to discriminate Dalbergia species.

Supporting Information
S1 Fig. NJ tree. Combined analysis of nrITS sequences submitted by Phong et al. [63] with those generated in this study, revealing high intraspecific variation and several sequence variants for most species.