Identification of Amazonian Trees with DNA Barcodes

Background
Large-scale plant diversity inventories are critical to develop informed conservation strategies. However, the workload required for classic taxonomic surveys remains high and is particularly problematic for megadiverse tropical forests.

Methodology/Principal Findings
Based on a comprehensive census of all trees in two hectares of a tropical forest in French Guiana, we examined whether plant DNA barcoding could contribute to increasing the quality and the pace of tropical plant biodiversity surveys. Of the eight plant DNA markers we tested (rbcLa, rpoC1, rpoB, matK, ycf5, trnL, psbA-trnH, ITS), matK and ITS had a low rate of sequencing success. More critically, none of the plastid markers achieved a rate of correct plant identification greater than 70%, either alone or combined. The performance of all barcoding markers was noticeably low in a few species-rich clades, such as the Laureae and the Sapotaceae. A field test of the approach enabled us to detect 130 molecular operational taxonomic units in a sample of 252 juvenile trees. Including molecular markers increased the identification rate of juveniles from 72% (morphology alone) to 96% (morphology and molecular) of the individuals assigned to a known tree taxon.

Conclusion/Significance
We conclude that while DNA barcoding is an invaluable tool for detecting errors in identifications and for identifying plants at juvenile stages, its limited ability to identify collections will constrain the practical implementation of DNA-based tropical plant biodiversity programs.


Introduction
The Neotropics hold an estimated 78,800 flowering plant species, over a third of the world's total [1]. Yet, tropical forests are being degraded at a fast pace [2,3], and over half of the estimated 11,000 Amazonian tree species may face a direct risk of extinction [4]. Thus large-scale biodiversity inventories are critically needed in order to develop informed conservation strategies for these diverse ecosystems [5,6]. Significant progress in mapping the distribution of Neotropical plants has been achieved over the past decades [7][8][9][10][11], but many areas are still under-collected and species identification remains a challenging task in many plant families. An example was recently provided by Pitman et al. (2008), who conducted a tree species diversity survey along a 700-km transect that cuts across one of the most diverse parts of the Amazon, between Ecuador and Brazil [12]. Based on traditional botanical sampling, they were able to identify 97% of the sampled stems to the genus, and counted a total of 435 tree genera. Yet, in their statistical analyses, they decided to conservatively exclude the genera that were difficult to identify in the field when only sterile material was available. Their choice of excluding no less than 20.7% of the genera, and 15.7% of the sampled stems resulted in loss of information, the influence of which on their conclusions is unknown.
With the advent of high-throughput DNA sequencing, it has been suggested that universally amplified, short, and highly variable DNA markers (DNA barcodes) may help identify organisms to species with a high confidence, which would be useful in a wide array of applications, including biodiversity surveys [13][14][15]. DNA barcodes should be both variable enough to discriminate among closely related species and yet possess highly conserved regions so as to be easily sequenced with standard protocols. The mitochondrial marker cytochrome c oxidase I (CO1) has met with some success for animal groups [13,16], but see [17][18][19]. In plants, the search for suitable genomic regions has proven more challenging. Several regions in the plastid genome (e.g. rbcL, rpoC1, rpoB, ycf5, psbA-trnH, trnL, atpF-atpH, psbK-psbI) as well as the internal transcribed spacer (ITS) of the ribosomal nuclear DNA have emerged as good candidates for plant DNA barcoding [20][21][22][23][24][25][26][27]. A consensus has recently emerged among the members of the Consortium for the Barcoding of Life (CBoL) Plant Working Group for using only two of these markers to barcode land plants, namely rbcL and matK [28], yet these authors point out that this combination will lead to a species-level identification in 72% of the cases only, and this resolution is unlikely to be evenly distributed across land plant species.
Echoing Chase et al. (2007) [29], the CBoL Plant Working Group pointed out that plant DNA barcoding should be useful in discriminating among forest seedlings, or undertaking large-scale biodiversity surveys in situations where taxonomic expertise is limiting. Yet, we are unaware of any application in this research area thus far, and the present work fills this gap. Tropical plants present challenges to DNA barcoding that are much more pronounced than those encountered when barcoding temperate plants, and applications of plant DNA barcoding in the tropics are still uncharted territory (the only exceptions being applications on genus Compsoneura in the Myristicaceae, see Newmaster et al. 2008; genus Inga in the Fabaceae [30]; and the orchid family [26]). DNA extraction is expected to be more difficult in tropical plants, due to the greater abundance of secondary metabolites [31], and this may compromise the overall performance of DNA barcoding [32]. In addition, the rate of lineage diversification is often high in the tropics, leading to the frequent occurrence of explosive radiations [33][34]. For recent lineages with great numbers of species, we thus expect that DNA barcoding will be less efficient, because species will tend to have many close relatives, reducing levels of interspecific divergence, as recently confirmed in genus Inga [35], and as should be expected in other groups [36]. Finally, it has been shown that woody plant lineages show consistently lower rates of molecular evolution as compared with herbaceous plant lineages [37], suggesting that the application of DNA barcoding concepts should be more difficult for tree floras than for non-woody floras [26,38].
In the present study, we use a plot-based sampling strategy to test the applicability of the currently proposed DNA barcoding scheme. Specifically, we examine whether consensus barcodes are sufficiently variable and universal to reliably identify co-occurring Amazonian tree species, and we apply this scheme to the identification of tropical juvenile plants.

Study site and sampling
This study was conducted at the Nouragues Research Station, central French Guiana, in pristine lowland tropical rainforest (4°05′ N, 52°40′ W; [39]). Rainfall is 2824 mm y⁻¹ (average 1988-2008) with a dry season averaging 2.5 months, from late August to early November, and a shorter dry season in March. The plant diversity of this area is high, with a local flora exceeding 1700 angiosperm species.
We sampled all trees ≥10 cm in diameter at breast height (dbh) in two 1-ha plots. Large trees were sampled by professional tree climbers while smaller trees (less than 35 cm dbh) were collected using French climbing spikes (Fonderies Lacoste, Excideuil, France, [40,41]). A total of 1073 trees were sampled in the two plots. Voucher specimens were matched against the reference vouchers available at the Herbier de Guyane, Cayenne (CAY), and they were deposited there. Of the 301 tree morphospecies, 254 could be matched to a reference voucher with an accepted species name (94% of the inventoried individuals). These encompassed 143 genera and 54 angiosperm families, and they spanned the most common woody plant families in Amazonia (Table S1). Individuals from the most taxonomically difficult families, such as Lauraceae, Myrtaceae, Elaeocarpaceae (Sloanea), or Sapotaceae (Pouteria), were kept as morphospecies.
For each sampled plant, we collected 1-10 cm² of leaf tissue. Samples collected for DNA analysis were stored in 10 g of silica gel. We also collected ca. 1 cm² of cambium tissue using a leather punch of 1 cm in diameter to test whether DNA could be extracted efficiently from this tissue [42]. Total DNA extraction was of comparable concentration with cambium and leaf tissue (results not shown), and both were used for sequencing.

DNA extraction and sequencing
Up to 30 mg of dry material was ground for 2 min in a TissueLyser mixer-mill disruptor (Qiagen, California, USA) using tungsten beads. Lysis incubation was carried out at 65 °C for 2 h for cambium tissue and 1 h for leaf tissue using CTAB 1% PVP buffer. Total DNA extraction was performed using a Biosprint 15 workstation (Qiagen, CA) following the manufacturer's protocols.
PCR amplification was performed for the coding plastid regions rbcLa (first part of the rbcL gene), rpoC1, rpoB, matK, ycf5, the noncoding regions trnL and psbA-trnH, and the nuclear region ITS. The PCR reaction mix included 0.2 µl of GoTaq® 5 U/µl (Promega), 10 µl of 5× buffer, 1 µl of each primer at 20 µM, 1 µl of dNTP 10 mM, 1 µl of DNA template and H2O for a final volume of 50 µl. For primer combinations, PCR thermal conditions, and references, see Table S2.
PCR products were purified with a MinElute PCR Purification Kit (Qiagen, CA). Cycle sequencing reactions were performed in 10 µl reactions using 1 µl of BigDye® Terminator cycle sequencing chemistry (v3.1; ABI; Warrington, Cheshire, UK) and run on ABI sequencers. The markers were sequenced in both directions. DNA fragments were visually inspected and assembled with Sequencher™ 4.8 (Gene Codes Corp., Ann Arbor, Michigan, USA). In about 10% of the cases, the marker psbA-trnH proved difficult to sequence from the 3′ end (trnH), due to long poly-A and poly-T regions [43]. If and only if the single strand had a high-quality read, a single-direction sequence was used. All of the sequences are deposited in GenBank (see Table S1 for the accession numbers).
We did not sequence all 1073 individuals for all candidate markers, but selected 285 individuals so as to represent all the taxonomic groups and to facilitate interspecific and congeneric comparisons. For a few markers (rbcL, rpoC1, and psbA-trnH), we increased the sequencing effort.

Test of the barcoding approach on tropical saplings
Having assembled a large database of plant DNA barcodes for tropical tree species, we tested whether it could be used to identify juveniles in the same plots, which often lack the morphological characters used to identify mature plants [29]. We established two 4×4 m sapling plots within each of our two tree plots. All woody plants above 30 cm in height and <1 cm dbh (n = 252) were marked, measured, and mapped. Because it is often difficult to tell apart tree, shrub and liana saplings, we included all woody plants within the size limits, and subsequently used our identifications to infer the life form of these individuals. Based on morphology, 27% of the individuals could be reliably identified to the species, another 45% could be assigned to a clear morphotype, and 11% could be assigned to a known genus.

Data analyses
We tested whether the species were recovered as monophyletic groups with the different markers. The sequences were aligned using ClustalX version 2.0.11 with default parameters [44], and alignments were visually inspected. For each marker, we generated neighbour-joining (NJ) trees based on sequence divergence estimated with Kimura's 2-parameter (K2P) nucleotide evolution model [45], using ClustalX and the software Mega 4.0 [46]. Node support was assessed via 1000 bootstrap replicates. Trees were also constructed for each coding marker with PhyML [47], using the most general time-reversible model of nucleotide evolution with Gamma-distributed errors on mutation rates (GTR+G). In PhyML, node support was estimated using the approximate likelihood-ratio test (alrt), a much faster method for estimating branch support than either the bootstrap or Bayesian posterior probabilities [48]. We present results based on NJ and ML trees only, because these methods have the greatest potential for computationally intensive analyses of large datasets, and because other studies have shown that the choice of the phylogeny reconstruction algorithm did not significantly alter the tests of DNA barcode performance [19,26]. In preliminary runs, we discovered that the performance of all plastid markers in recovering species as monophyletic was poor in two important groups that are easily recognized in the field: the Sapotaceae [49], and the Laureae clade in Lauraceae [50]. We therefore also computed the fraction of supported clades excluding these two groups. We assumed that clades were supported when the bootstrap values exceeded 70%, or when the alrt values exceeded 80%.
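As an illustration of the distance underlying the NJ trees, the K2P distance between two aligned sequences can be sketched as below. This is a minimal sketch and not the ClustalX or Mega implementation: the function name `k2p_distance` is hypothetical, and skipping sites with gaps or ambiguous bases is an assumption made here (software packages differ in how they treat such sites). With P the proportion of transitions and Q the proportion of transversions, the distance is d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)].

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned sequences.

    Sites where either sequence has a gap or an ambiguous base are
    skipped (an assumption of this sketch).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    n = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in BASES or b not in BASES:
            continue  # gap or ambiguity code: site not comparable
        n += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if n == 0:
        raise ValueError("no comparable sites")
    p = transitions / n
    q = transversions / n
    arg1 = 1 - 2 * p - q
    arg2 = 1 - 2 * q
    if arg1 <= 0 or arg2 <= 0:
        return math.inf  # sequences too divergent for the K2P correction
    return -0.5 * math.log(arg1 * math.sqrt(arg2))
```

For closely related sequences the corrected distance is close to the raw proportion of differing sites, which is why small thresholds such as 0.1% or 1% translate into a handful of substitutions per marker.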
Assessing monophyly using DNA barcodes has been criticized because it assumes that tree reconstruction is reliable, and that the minimal threshold on support value is a reliable criterion for clade support. Meier et al. (2006) have proposed an alternative criterion ('best close match') [17]. A threshold T is computed below which 95% of all intraspecific distances are found. If a query sequence has no match below T, it is left unidentified. Otherwise, if all best matches of the query sequence are conspecific, the barcode assignment is considered to be correct. If the best matches of the query sequence are equally good but correspond to a mixture of species (including the correct one), the assignment is considered ambiguous. The test fails if the best match is not conspecific. This test is implemented in TaxonDNA (version 1.6.2, [17]).
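The decision rule described above can be sketched in a few lines. This is an illustrative reading of the criterion, not TaxonDNA's code: the function names, the nearest-rank definition of the 95% threshold, and the treatment of a distance exactly equal to T as a match are all assumptions of this sketch.

```python
import math


def percentile_threshold(intraspecific_distances, fraction=0.95):
    """Smallest observed distance T such that at least `fraction`
    of the intraspecific distances are <= T (nearest-rank rule)."""
    ordered = sorted(intraspecific_distances)
    if not ordered:
        raise ValueError("need at least one intraspecific distance")
    index = max(0, math.ceil(fraction * len(ordered)) - 1)
    return ordered[index]


def best_close_match(query_species, matches, threshold):
    """Classify one query sequence.

    `matches` is a list of (species, distance) pairs giving the distance
    from the query to every other sequence in the reference set.
    Returns 'correct', 'ambiguous', 'incorrect' or 'no match'.
    """
    if not matches:
        return "no match"
    best = min(distance for _, distance in matches)
    if best > threshold:
        return "no match"  # nothing close enough: left unidentified
    best_species = {sp for sp, distance in matches if distance == best}
    if best_species == {query_species}:
        return "correct"
    if query_species in best_species:
        return "ambiguous"  # equally good matches from several species
    return "incorrect"
```

Under this rule, a species represented by a single sequence can never be scored as correct, since it has no conspecific in the reference set.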
Methods used to cluster DNA sequences into MOTUs fall into three categories: (1) tree-based, unsupervised (non-parametric) methods [51][52][53], (2) parametric methods that assume the choice of a threshold in sequence divergence prior to the clustering procedure and that require global sequence alignments [17], (3) alignment-free parametric clustering methods [54,55]. Although we analyzed our data using all three methods (see Supporting Information S1), the results reported in the main text are based on the alignment-based parametric clustering software TaxonDNA, and on the alignment-free method implemented in blastclust (package version 2.2.20 downloaded from ftp://ftp.ncbi.nih.gov/blast/executables/release). The quality of the parametric clustering methods in reference to the morphological taxonomy was assessed by counting, for each threshold sequence distance, the fraction of MOTUs corresponding to more than one taxon (lumping fraction), and the fraction of taxa split into more than one MOTU (splitting fraction). The lumping fraction should increase with the threshold sequence divergence, while the splitting fraction should decrease. The total number of taxa assigned to a unique MOTU (correct assignment rate) was also reported.
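The lumping and splitting fractions defined above can be computed from any sequence-to-MOTU assignment. The sketch below pairs them with a simple single-linkage threshold clustering, which stands in for the parametric methods and is not claimed to reproduce TaxonDNA or blastclust. All function names are hypothetical, `distance` is a user-supplied function, and counting a taxon as correctly assigned only when it maps one-to-one onto a MOTU is an assumption of this sketch.

```python
from collections import defaultdict


def single_linkage_motus(ids, distance, threshold):
    """Group ids into MOTUs: two ids share a MOTU when a chain of
    pairwise distances <= threshold connects them (union-find)."""
    ids = list(ids)
    parent = {i: i for i in ids}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for x in range(len(ids)):
        for y in range(x + 1, len(ids)):
            if distance(ids[x], ids[y]) <= threshold:
                parent[find(ids[x])] = find(ids[y])
    return {i: find(i) for i in ids}


def clustering_errors(motu_of, taxon_of):
    """Return (lumping fraction, splitting fraction, correct rate)."""
    taxa_in_motu = defaultdict(set)
    motus_in_taxon = defaultdict(set)
    for seq_id, motu in motu_of.items():
        taxon = taxon_of[seq_id]
        taxa_in_motu[motu].add(taxon)
        motus_in_taxon[taxon].add(motu)
    # MOTUs that contain more than one taxon
    lumping = sum(len(t) > 1 for t in taxa_in_motu.values()) / len(taxa_in_motu)
    # taxa spread over more than one MOTU
    splitting = sum(len(m) > 1 for m in motus_in_taxon.values()) / len(motus_in_taxon)
    # taxa matching exactly one MOTU that holds no other taxon
    correct = sum(
        len(m) == 1 and len(taxa_in_motu[next(iter(m))]) == 1
        for m in motus_in_taxon.values()
    ) / len(motus_in_taxon)
    return lumping, splitting, correct
```

Running `clustering_errors` over a range of thresholds gives the two error curves as a function of sequence divergence, with lumping rising and splitting falling as the threshold grows.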

Results
Depending on the selected marker, we obtained sequences for up to 430 of the sampled individuals, including up to eight markers (a total of 2198 sequences). We obtained high-quality sequences in over 90% of the samples for the rpoC1, rbcLa, rpoB and trnL markers (Table 1). Sequencing success was lower for psbA-trnH and ycf5 (over 80%). A taxonomic bias in sequencing success was detected for ycf5, which amplified poorly in the Gentianales (Apocynaceae and Rubiaceae; 7%) and in the Myristicaceae (33%), whereas rpoB amplified poorly in the Moraceae (33%). The sequencing success of matK was only ~70%, even after using two different pairs of primers. The lowest sequencing success was obtained with ITS, which amplified in only 41% of our samples. The markers varied significantly in mean sequence divergence (Table 1). The highest variability was obtained for ITS, followed by psbA-trnH, trnL and matK.
We assessed the number of monophyletic species recovered in the tree reconstructions for each marker (Fig. 1a). We found little difference between the two methods of phylogenetic tree reconstruction (NJ and ML), and Table 2 reports only the results obtained with the maximum likelihood phylogenetic reconstruction algorithm. When considering all species, the best marker was psbA-trnH, which recovered 64% of monophyletic species, followed by matK, rpoB, and rbcLa (Table 2). The poorest performance was obtained with ycf5 (40%) and rpoC1 (46%). Ignoring the Sapotaceae and Laureae, the three markers psbA-trnH, rpoB, and rbcLa had a similar performance (67%). ITS had an excellent performance in recovering monophyletic species, but this represents a biased sample, as we could amplify ITS for less than half of the individuals. Using rbcLa or psbA-trnH, 77% of the genera were found to be well-supported, while with ycf5 and trnL, this percentage dropped to 63%. The 'best close match' test as implemented in TaxonDNA yielded comparable results (Fig. 1b, Table 2). The rate of correct assignment of a randomly selected sequence was maximal for psbA-trnH (55%), followed by trnL (49%), and rbcL (48%). These low values reflect the fact that a large number of sequences were included from the Sapotaceae and Laureae, and these yielded ambiguous assignments.
Not all eight markers could be sequenced for exactly the same individuals. Hence, the markers were also compared two by two, based on shared individuals only. This pairwise test of the markers yielded results consistent with the previous analyses (Table S4). In addition, we tested whether combining two markers into a single barcode to discriminate species increased the performance of the tested markers, and found that this did not greatly improve the overall performance in comparison with single markers (Table S4).
We then tested the performance of each marker in clustering data into MOTUs. With coding cpDNA markers, fewer MOTUs were found than the real number of taxa in our sample (Table 3). To compare the accuracy of assignment into MOTUs, we used the 'cluster' option of TaxonDNA, and found that TaxonDNA returned a mean correct assignment rate of 62% at 0.1% sequence divergence (Table 3, including coding plastid markers and trnL). Blastclust provided slightly better results than TaxonDNA both in terms of overall number of MOTUs and correct assignment rate (Table 3). With blastclust, the rate of correct assignment varied from 80.2% for ITS to 53% for rpoC1 (mean 65.5%). Irrespective of the clustering algorithm, the best rate of correct assignment was obtained for ITS followed by matK, psbA-trnH, rpoB, rbcL and trnL. The worst rate of correct species-level assignment was consistently obtained by rpoC1.
At the genus level, coding chloroplast DNA markers were useful to assign clusters to the correct genus (Fig. S1). For instance, rpoC1 and rbcL reached the best rate of correct genus-level assignment at about 1% in sequence divergence (Fig. S1).
Finally, we attempted to identify tropical saplings by DNA barcoding. First, we clustered the saplings together using psbA-trnH, and we then attempted to assign the clusters to recognized species using psbA-trnH combined with another marker with a slower rate of molecular evolution (rpoC1). This last marker was chosen at the time of the study because it had the highest amplification success. By clustering the psbA-trnH sequences, we could define 130 MOTUs (assuming a 1% threshold in sequence divergence, see Table 3). Combining this information with the rpoC1 marker, we were able to assign 32% of the individuals to a known species, and 25% to a known genus. Lianas and shrubs were quite abundant in the sapling layer, and these lack representatives in our reference database. Restricting our sample to the 152 juveniles of tree species, and based on DNA barcodes only, we detected 86 MOTUs, and we were able to assign 46% of the individuals to a known species, and 29% to a known genus. Finally, combining the morphological and molecular data, we could identify 59% of the individuals to the species, and 37% to the genus. The remaining 4% of the individuals were at least identified to the family level.

Figure 1. Panel (a): Percentage of monophyletic species (black bars) and excluding the Sapotaceae and Laureae (grey bars) using the eight tested markers (see Table 2). Panel (b): Fraction of sequences correctly (black), ambiguously (dark grey), and incorrectly (light grey) assigned to species. Some sequences could not be assigned when their sequence diverged too much from the other species (Table S3). doi:10.1371/journal.pone.0007483.g001

Table 2. Percentage of monophyletic species and percentage of monophyletic genera recovered using the eight tested markers. [Table body not recovered; surviving column headers: Marker, Nb tested species.]

Discussion
We examined whether plant DNA barcoding candidates matched taxonomic species delimitations in a large plant biodiversity survey of an Amazonian forest. Our working assumption was that the rate of species discrimination would exceed 72%, as recently found by the CBoL Plant Working Group [28]. In principle, by restricting the scope of the reference database to species known to occur in a specific habitat or region, a much greater degree of discrimination should be possible, since not all close relatives of a given species occur in the area under study [56]. We collected representatives of truly co-occurring species in order to provide a robust test of in situ applications of DNA barcodes. Using a large dataset, all attached to a voucher specimen, we were able to show that correct matching between barcodes and taxonomic species did not exceed 70%. Failure to reach a higher rate of species discrimination was due to the low plastid sequence variation in a few species-rich clades.
We confirmed that the markers rpoC1, rbcLa, trnL and to a lesser extent rpoB, could all be sequenced easily from leaf or cambium tissue. Being able to extract DNA directly from the cambium is important because it will prove useful in routine tropical forestry monitoring programs. The other markers showed a lower performance either because they failed in some groups or because they showed a low overall sequencing success. For instance, matK could be sequenced in only 68% of our samples, using two primer pairs. CBoL has reported a sequencing success of 90% for the matK region [28]. This difference could be explained in two ways. The first is the use of several combinations of primers. Fazekas et al. (2008) did report an 88% sequencing success for this marker, but they also emphasized that they had used up to 10 primer pairs, entailing a 'considerable effort' [57]. The second is the use of a more sophisticated chemistry at the amplification stage. Ford et al. (2009) reported an 85% success for matK using a combination of standard and nested multiplexed-tandem PCR (MT-PCR) [27]. The additional cost of testing a large number of primer combinations or of implementing non-standard PCR methods should be considered when implementing a DNA barcode project.
Despite much effort, ITS does not seem to compete as a universal DNA barcode for tropical forest inventories given the limited sequencing success observed in this study. Yet, ITS could be helpful in the identification of species in some particular target groups, such as the Sapotaceae (unpublished results). Of all coding plastid markers, ycf5 consistently had the worst performance as a DNA barcode, followed by rpoC1. According to the test of monophyly, matK and rpoB were good barcodes, but not according to the 'best close match' test. The rbcL marker was intermediate in both tests, but it is both easily sequenced and well-represented in existing sequence repositories, and the consensus for this marker appears natural [28]. The marker matK has been found to provide valuable information in selected groups of plants (genus Compsoneura [30]; Orchidaceae [26]). However, because obtaining sequences for this marker from field-collected plant tissue remains challenging, we suspect that it will be difficult to implement large-scale barcoding projects based on matK (see also [27] for a thoughtful discussion). The trnL UAA intron ranked second in the 'best close match' test, and fifth in the monophyly test and in the clustering test (Table 3). It was twice as variable as rbcLa, and its variability was comparable to that of matK, but it is much easier to sequence. Hence, it remains an interesting option for barcoding projects [25]. Indeed, the only ecological application of the plant DNA barcoding program thus far is the study of Jurado-Rivera et al. (2009), who used the trnL intron to explore the diet of leaf beetles in the Chrysomelinae subfamily [58]. Finally, the use of the psbA-trnH marker has been much criticized because it is prone to read errors at the sequencing stage [43]. Yet, in our study, psbA-trnH had the best performance as a DNA barcode, ranking first in both monophyly and 'best close match' tests, and being universally amplifiable.
Table 3 (notes). TaxonDNA is an alignment-based method based on sequence distance matrices, and blastclust is a method based on blast similarity scores of unaligned sequences. Percentage of correct assignment of a taxon to a MOTU (in parentheses). Given the length of the sequences (<1000 bp), 0.1% generally corresponds to less than 1 bp substitution. doi:10.1371/journal.pone.0007483.t003

Irrespective of the test or of the marker, a remarkable fact is that none of the rates of correct identification exceeded 70%. Part of this limited performance is due to the plant DNA barcoding strategy itself. Most of the markers proposed thus far are located in the chloroplast genome, and as such they do not evolve independently. Species-rich genera, the ones that would benefit the most from molecular identification techniques (Pouteria, Inga, Eschweilera, Ocotea), showed little to no variation in the plastid markers. Also, many of our botanical identifications were based on sterile morphological characters, as in all other tropical tree biodiversity surveys. While each individual had a voucher, which was compared to a reference collection, closely related species often cannot be distinguished based on sterile morphology alone. This is the case for Trichilia cipo/T. pallida, Eschweilera coriacea/E. pedicellata, and several species in genus Ocotea, to cite but a few. A different but equally important problem is that several important tropical tree families still lack a comprehensive systematic treatment. For instance, recent work on the Lecythidaceae based on morphology and molecular data showed that several generic delimitations needed to be recircumscribed [59]. Likewise, large genera such as Pouteria are probably not monophyletic [49]. Thus it remains critical for future DNA barcoding projects to keep improving existing repositories through fieldwork and descriptive taxonomy [15].
We used our dataset as a benchmark to assess the performance of several statistical methods to cluster sequences into molecular operational taxonomic units. TaxonDNA performed well with all of our markers, and the alignment-free method (blastclust) compared well with it. These methods may be scaled up to very large datasets. This is of considerable current interest, given the development of high-throughput sequencing technologies [60,61]. These approaches should be of considerable help in accelerating the pace of ecological research and biodiversity monitoring [62].
So far we have ignored the fact that the markers may display a high level of intraspecific geographical structure [63,64]. To truly test the performance of a putative DNA barcode, it will be essential to sample widely scattered populations for each species to assess the hypothesis that a locally defined reference of DNA barcodes does characterize a species throughout its distributional range. To our knowledge this test has not been performed yet.
It has been argued that plant DNA barcodes could be especially useful to identify juvenile individuals and plant debris [29]. Here, we tested this idea for the first time, using a two-tiered approach: we first clustered the individuals into MOTUs using the most variable marker, psbA-trnH. We then assigned the MOTUs to known taxonomic categories using the reference database we had constructed for trees. This enabled us to identify 86 MOTUs within a sample of ca. 152 tree saplings, 96% of which could be identified to the species or at least to the genus. Thus, DNA barcoding does show much potential for accurate identification of species at life stages which have been particularly difficult to investigate using morphological identification only. The coding plastid markers were often not variable enough to identify species. However, they efficiently assigned individuals to higher taxonomic ranks. Even though this differs from the stated goal of DNA barcoding (assigning individuals to species), it will have important implications for ecological applications, such as tropical plant diversity surveys [11,12,65].

Table S4
Pairwise comparison of the markers to the samples for which both sequences are available. Reported is the percentage of best close match as reported in TaxonDNA for the two markers independently, and also for the combined markers. The rate of correct assignment was less than 50% in most of the cases, and combining two markers did not improve much the rate of correct assignment (+14% on average). Found at: doi:10.1371/journal.pone.0007483.s005 (0.08 MB DOC)

Figure S1
Types of error in the parametric assignment of sequences to MOTUs. Left panel: Error made during the construction of species-level MOTUs. Two types of errors are reported as a function of sequence divergence: splitting of valid taxa into two or more clusters (splitting fraction: squares), and lumping of two or more valid taxa into the same cluster (lumping fraction: circles). Right panel: same as left panel, but using genus-level MOTUs as the reference taxonomic level. Found at: doi:10.1371/journal.pone.0007483.s006 (3.93 MB TIF)