Assessing DNA Barcoding as a Tool for Species Identification and Data Quality Control

In recent years, the number of sequences from diverse species submitted to GenBank has grown explosively, and not infrequently the data contain errors. This problem is widely recognized, but less so where it arises from invalid or incorrectly identified species, sample mix-ups, and contamination. DNA barcoding is a powerful tool for identifying and confirming species, and one very important application involves forensics. In this study, we use DNA barcoding to detect erroneous sequences in GenBank by evaluating deep intraspecific and shallow interspecific divergences to discover possible taxonomic problems and other sources of error. We use the mitochondrial DNA gene encoding cytochrome b (Cytb) from turtles to test the utility of barcoding for pinpointing potential errors. This gene is widely used in phylogenetic studies of this speciose group. Intraspecific variation is usually less than 2.0%, and in most cases it is less than 1.0%. In comparison, most species in our dataset differ by more than 10.0%. Overlapping intra- and interspecific percentages of variation mainly involve problematic identifications of species and outdated taxonomies. Further, we detect the same kinds of problems in Cytb from Insectivora and Chiroptera. Upon applying this strategy to 47,524 mammalian CoxI sequences, we identify a suite of potentially problematic sequences. Our study reveals that erroneous sequences are not rare in GenBank and that DNA barcoding can serve to confirm sequencing accuracy and discover problems such as misidentified species, inaccurate taxonomies, contamination, and potential errors in sequencing.


Introduction
Publicly available, GenBank (http://www.ncbi.nlm.nih.gov/sites/entrez) provides an annotated suite of open-access nucleotide sequences and, when applicable, their amino acid translations. GenBank relies on direct submissions from individual laboratories. Because of the increasing efficiency of sequencing and molecular research, the volume of data is growing explosively. The sheer volume of new information necessarily translates into the accumulation of errors. For example, more than half of all published human mtDNA studies have errors [1], and a 5.0% error rate occurs in mitochondrial 16S rRNA sequence data in public repositories [2]. Although attention focuses on the quality of the human mtDNA database [3][4][5], little effort focuses on the extent of erroneous sequences arising from the misidentification of species, sampling error, and contamination, especially in phylogenetic analyses. Unfortunately, the 'garbage in, garbage out' rule applies. If the data are not reliable, forensic analyses will have limited repeatability, phylogenies will introduce confusion, and in both cases errors may even lead to irreproducible results.
DNA barcoding usually relies on a fragment of the mitochondrial gene cytochrome c oxidase subunit I (Cox1, mt-co1, COI), but other genes are also employed, with varying levels of success [6,7]. Among its many applications, the method is an efficient means of identifying species because levels of divergence are usually much lower among individuals of the same species than between closely related species [8][9][10][11][12][13][14]. Barcoding successfully identifies a great diversity of species [15][16][17][18][19][20][21][22][23][24][25][26][27]. Consequently, a sequence from a misidentified species will result in a high level of intraspecific Kimura two-parameter (K2P) divergence [28]. In this study, we use divergence values to detect potential errors in sequences in GenBank to assess and improve the quality of the data.
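As an illustration of the divergence measure referred to above, the following is a minimal sketch of the K2P distance between two pre-aligned sequences, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences. The function name `k2p_distance` and the pairwise-deletion handling of gaps and ambiguity codes are assumptions made for this example; the study itself computed distances in MEGA 4.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura two-parameter distance between two aligned sequences.

    Hypothetical helper for illustration; sites with gaps or ambiguity
    codes in either sequence are skipped (pairwise deletion).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared
    q = transversions / compared
    term = (1 - 2 * p - q) * math.sqrt(max(1 - 2 * q, 0.0))
    if term <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(term)
```

Under this sketch, two conspecific sequences would typically yield a small value, whereas a sequence filed under the wrong species name would yield a large value against its supposed conspecifics.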
Phylogenetic/genealogical analyses commonly use cytochrome b (Cytb) sequences. Thus, we use a dataset of 2555 Cytb sequences of turtles to test the power of DNA barcoding to confirm species identities and pinpoint problems. If this approach proves to be a powerful means of identifying errors, we can expect it to detect potential flaws in other groups. We therefore further analyze 3516 and 6269 Cytb sequences from the Insectivora and Chiroptera, respectively. Because CoxI is the most widely used marker for DNA barcoding, we also analyze 47,524 mammalian CoxI sequences in GenBank.

Results and Discussion
The compiled dataset of Cytb sequences from turtles was used to evaluate the ability of DNA barcoding to detect erroneous sequences in GenBank. The lengths of available Cytb sequences vary, and consequently a clear tradeoff exists between maximizing the length of the alignments and taxonomic coverage. The final dataset consists of 1686 fragments of 924 bp. When we set Cytb GenBank accession NC_015986 as the standard for all comparisons, the available fragments ranged from 75 bp to 998 bp. Given that the goal is to identify erroneous species and data, we use neighbor-joining (NJ) trees as an efficient means of summarizing divergence between the sequences. Not surprisingly, the topology of the NJ phenogram is almost identical to trees obtained using morphology [29], nuclear genes [30][31][32], and mitochondrial genes [31], although the bootstrap values are smaller, as expected, and some branching orders remain unresolved (Figure S1). Nucleotide diversity averages 16.0%, and transitions are saturated at about 15.0% when all codon positions are compared (Figure 1).
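The kind of saturation check summarized above can be sketched as follows. This is a simplified illustration, not the DAMBE procedure used in the study: it tabulates observed (uncorrected) proportions of transitions and transversions for every sequence pair, optionally restricted to chosen codon positions, so that they can be plotted against overall divergence. The function names and the assumption that the alignment starts at codon position 1 are hypothetical.

```python
from itertools import combinations

TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def ts_tv_counts(seq_a, seq_b, codon_positions=(1, 2, 3)):
    """Count transitions, transversions, and compared sites.

    Assumes the alignment is in frame (index 0 is codon position 1).
    """
    ts = tv = sites = 0
    for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        if (i % 3) + 1 not in codon_positions:
            continue
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        sites += 1
        if a != b:
            if frozenset((a, b)) in TRANSITIONS:
                ts += 1
            else:
                tv += 1
    return ts, tv, sites


def saturation_points(alignment, codon_positions=(1, 2, 3)):
    """Return (p_distance, ts_fraction, tv_fraction) for every sequence pair.

    alignment: dict mapping a sequence name to its aligned sequence.
    """
    points = []
    for (_, a), (_, b) in combinations(alignment.items(), 2):
        ts, tv, sites = ts_tv_counts(a, b, codon_positions)
        if sites:
            points.append(((ts + tv) / sites, ts / sites, tv / sites))
    return points
```

Plotting the transition and transversion fractions against the first value of each tuple would show saturation as a plateau in the transition curve at higher divergences; calling `saturation_points(alignment, codon_positions=(3,))` restricts the check to third codon positions.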
No particular level of divergence can serve to identify species. Rather, such data can point to taxa that need additional study. K2P distances between Rhinoclemmys diademata, R. punctularia, and R. melanosterna range from 1.4% to 2.3%. The low levels of divergence indicate either recent divergences or perhaps a taxon-specific slowing of the molecular clock. More importantly, only one sequence is available for each species and the result indicates a need for further study using more samples. Similarly, newly described Emys trinacris [38] forms an independent lineage that is the sister group of E. orbicularis. However, interspecific divergences are very low (0.7-2.3%) and intraspecific divergences of E. orbicularis range from 0.0 to 2.0%.
Many currently recognized taxonomic names are composites of cryptic species complexes [39]. Testudo graeca (six subspecies) and Geochelone pardalis (two subspecies) have complex relationships. Intraspecific divergence in the former species ranges from 0.0 to 8.1% and in the latter from 0.0 to 12.4%. Thus, these two species complexes require further attention as they may be polytypic. DNA barcoding has accelerated the rates of taxonomic discovery and description to meet or exceed rates of biodiversity loss [40][41][42]. In contrast to such great variation, 16 samples of Indotestudo forstenii share one haplotype. This endangered species has a critically low level of diversity, and greater attention must be paid to its conservation status.
Overlapping intra- and interspecific levels of divergence indicate not only natural variation but also potential errors in GenBank and taxonomic conundrums. Among the several new species of turtles described during the last 20 years based on morphology, most were controversial. Our study affirms that DNA barcodes can provide critical data before the description of a new species, and this may involve forensics into geographic origins [43].
To test whether our barcoding strategy is applicable to other taxa, we analyzed two orders of mammals, shrews (Insectivora) and bats (Chiroptera). Both groups contain a large number of species, and species identity can be confusing. As in turtles, the analyses detect potential errors in GenBank sequences, as well as taxonomic uncertainties (Table 1).
CoxI is the most widely used marker for DNA barcoding. Therefore, we also analyze 47,524 mammalian CoxI sequences in GenBank. Not surprisingly, many potential errors occur (Table S1). This result suggests that the paradox of deep intraspecific and shallow interspecific K2P distances can detect potential errors. This paradox is likely to be useful for a variety of popular genes such as 12S and 16S. If we exclude human sequences, primates have the highest error ratio (2.12%). When we do not exclude human sequences, even-toed ungulates have the highest error ratio (1.68%), as Table S2 shows.
In view of the explosive amount of data deposited in GenBank from an increasing number of laboratories, our study shows that erroneous sequences are not rare. In addition to technical errors in sequencing, sample mix-ups, contamination, and incorrect species identification constitute other possible sources of error. Erroneous data may strongly impact critical forensic applications and result in confused taxonomies and phylogenies. Such errors are often hard to detect, and all too frequently there is no confirmation of either taxonomic accuracy or the possibility of contamination. The paradox of deep intraspecific and shallow interspecific K2P differences suggests that further verification of accuracy is necessary. Certainly, not all paradoxes are due to contamination and inaccurate identifications of species. Problematic and outdated taxonomies are also involved. Once reliable data are available for each species, and especially from type localities, it will be possible to easily determine the source of the problematic sequences, be that sequencing errors or invalid taxonomies. The global initiative to DNA barcode all species of amphibians and reptiles, Cold Code [44], seeks to suggest corrections to GenBank. Thus, DNA barcoding is not only valuable for identifying species, but it can also play an important role in detecting potential errors in GenBank.

Data Analysis
The datasets for Cytb and CoxI were treated independently. All datasets were first aligned by MAFFT, a fast multiple sequence alignment program [45]. The alignments were trimmed by deleting the flanking regions of Cytb and CoxI. The trimmed sequences were aligned again by Clustal X 1.8 [46] to obtain more accurate alignments. These alignments were examined by eye and, when required, adjusted to exclude obvious alignment errors. The lengths of the published sequences varied. To obtain the maximum amount of homologous sequence, we assembled a final dataset that sought the greatest taxonomic diversity while retaining the longest sequences, deleting outliers. All the datasets are available upon request.
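The tradeoff between alignment length and taxonomic coverage described above can be illustrated with a small sketch that keeps only the aligned sequences covering a chosen window and trims them to it. The function names, the window coordinates, and the `min_coverage` parameter are hypothetical and are not taken from the study's pipeline.

```python
def covers_window(aligned_seq, start, end, min_coverage=1.0):
    """True if enough sites in [start, end) are unambiguous bases."""
    window = aligned_seq[start:end].upper()
    if not window:
        return False
    called = sum(base in "ACGT" for base in window)
    return called / len(window) >= min_coverage


def trim_dataset(alignment, start, end, min_coverage=1.0):
    """Keep sequences that cover the window and trim them to it.

    alignment: dict mapping a sequence name to its aligned sequence,
    with gaps written as '-'.
    """
    return {
        name: seq[start:end]
        for name, seq in alignment.items()
        if covers_window(seq, start, end, min_coverage)
    }
```

Widening the window or raising `min_coverage` discards more short sequences (and possibly whole species), whereas narrowing the window retains more taxa at the cost of fewer homologous sites.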
For each dataset, a neighbor-joining tree was created from the distances to provide a graphic representation of the patterning of divergences among species [47]. Sequence divergences were estimated using the K2P distance model [28] in MEGA 4 [48]. Sequences that had deep intraspecific or shallow interspecific K2P divergences were recorded as being potential errors. We then further checked their nucleotide sequences and phylogenetic positions by eye.
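The screening step described above can be sketched as a tally of paradoxical pairwise comparisons per accession. This is an illustrative sketch only: the function name, the input layout, and the default cutoffs `intra_max` and `inter_min` are assumptions made for the example, since no single level of divergence defines a species, and any flagged sequence would still need the manual check described above.

```python
from collections import Counter
from itertools import combinations


def flag_suspect_sequences(species_of, distances, intra_max=0.02, inter_min=0.02):
    """Tally, per accession, how many pairwise comparisons look paradoxical.

    species_of: dict mapping accession -> species label
    distances:  dict mapping frozenset({acc1, acc2}) -> pairwise distance
                (for example, K2P)
    intra_max, inter_min: hypothetical cutoffs chosen for illustration
    """
    flags = Counter()
    for a, b in combinations(sorted(species_of), 2):
        d = distances[frozenset((a, b))]
        same_species = species_of[a] == species_of[b]
        deep_intra = same_species and d > intra_max
        shallow_inter = (not same_species) and d < inter_min
        if deep_intra or shallow_inter:
            flags[a] += 1
            flags[b] += 1
    return flags
```

A sequence deposited under the wrong species name tends to accumulate flags from both directions (deep divergence from its nominal conspecifics and shallow divergence from its true species), so sorting the result with `flags.most_common()` puts the most suspicious accessions first.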
Transition saturation was tested by plotting the estimated number of transitions and transversions against genetic divergence using DAMBE [49]. Third codon positions and the first two codon positions were tested separately and combined.

Supporting Information
Figure S1 Neighbor-joining tree using 924 bp Cytb sequences for turtles. (TIF)

Author Contributions
Conceived and designed the experiments: YYS. Performed the experiments: XC YYS. Analyzed the data: XC YYS. Contributed reagents/materials/analysis tools: YYS. Wrote the paper: YYS RWM.