DNA barcoding for identification of fish species in the Taiwan Strait

DNA barcoding based on a fragment of the cytochrome c oxidase subunit I (COI) gene in the mitochondrial genome is widely applied in species identification and biodiversity studies. The aim of this study was to establish a comprehensive barcoding reference database of fishes in the Taiwan Strait and evaluate the applicability of using the COI gene for the identification of fish at the species level. A total of 284 mitochondrial COI barcode sequences were obtained from 85 genera, 38 families and 12 orders of fishes. The mean length of the sequences was 655 base pairs. The average Kimura two-parameter (K2P) distances within species, genera, families and orders were 0.21%, 6.50%, 23.70% and 25.60%, respectively. The mean interspecific distance was 31-fold higher than the mean intraspecific distance. The K2P neighbor-joining trees based on the sequences generally clustered species in accordance with their taxonomic classifications. DNA barcoding identified species with high efficiency in the present study, and we conclude that COI sequencing can be used to identify fish species.


Introduction
More than 30,000 species of fish exist worldwide, accounting for more than half of all vertebrates. Aside from being an important component of biodiversity, fish also possess direct economic value and are important animal protein sources for humans [1,2]. The classification and identification of fish is not only the subject of taxonomy studies but also the key to fishery investigations, the assessment of nature reserves and the identification of food and drug ingredients [3,4]. The identification of fish species relies mainly on morphometric and meristic characters [5]. Fish have a remarkable diversity of morphological characteristics, and most fish go through ontogenetic metamorphosis. Many morphometric characteristics change during the stages of ontogenetic development [6]. Convergent and divergent adaptation also lead to changes in the morphological characteristics of fish species. These changes pose great challenges to morphological taxonomy, in which species identification is based mainly on morphological characteristics, and the classification of many species has therefore been controversial [7]. The limitations inherent in morphology-based identification systems and the declining number of taxonomists call for a molecular approach to identify species [6,8]. With the development of molecular biology, DNA sequence analysis has been used to assist species identification. However, the accuracy of molecular identification relies on having a reliable and complete reference database, as inconsistent genetic marker usage could impede the application of molecular authentication [9]. Different DNA markers have been used in different taxonomic groups. In 2003, Hebert et al. [10] proposed DNA barcoding technology, in which the mitochondrial cytochrome c oxidase subunit I (COI) gene sequence was used as a barcode for species identification, with the expectation of barcoding all species for the purpose of species identification and classification.
It was found that the intraspecific diversity of the COI gene in animals was significantly lower than the interspecific diversity and that using the COI gene as a barcode was effective for classifying and identifying vertebrates and invertebrates; the COI gene has since been widely used in various biological groups [11][12][13][14]. Compared with the traditional morphological classification methods, the advantages of the DNA barcoding technology are mainly as follows:
1) Some species have extremely similar external morphological characteristics; therefore, it is difficult to distinguish them from each other merely by morphological characteristics. DNA barcoding technology can help accurately distinguish such species.
2) Morphological differences can vary considerably during various developmental stages, but individuals at different developmental stages can be identified by the DNA barcoding technology.
3) The DNA barcoding technology can allow for the discovery of cryptic species. Cryptic species are two or more species that are morphologically similar yet genetically distinct. Because of their morphological similarities, cryptic species are identified as the same species in the existing system. DNA barcoding technology can reveal the large molecular evolutionary distance between these species and thereby uncover cryptic species.
The Taiwan Strait belongs to the shallow sea of the subtropical continental shelf and is the passage between the East China Sea and the South China Sea. The sea area of the Taiwan Strait has a complex physical and chemical environment and is influenced by the Kuroshio tributaries, the drifting of the South China Sea monsoon and the coastal flows in Fujian and Zhejiang Provinces, China. The Taiwan Strait has rich fishery resources and is one of the most important fishing grounds off the coast of China. The Taiwan Strait has been drawing much attention due to its unique geographical location and marine environment characteristics, and studies on its ecosystem have gained attention. As human activities have intensified, over-fishing, habitat destruction and climate change have generated significant impacts on the biodiversity and structure of the fish community in the Taiwan Strait. The decline in genetic variation of a population diminishes the ability of fish to adapt to environmental changes and decreases their chances of long-term survival [3]. To promote sustainability, better control and management of fisheries should be implemented. The identification of fish species still stands as one of the most basic but important issues in fisheries management. In the present study, we examined COI diversity within and among 85 fish species, most of which were commercial species, with the goal of testing the utility of DNA barcoding as a tool to identify fish species. The DNA barcode records generated in this study will be available to researchers to monitor and conserve the fish diversity in this region.

Ethics statement
All fish species were caught in the offshore area (not national parks, other protected areas, or private areas, etc.), so no specific permissions were required for these locations/activities. Ethical approval was not required for this study because no endangered or protected fish species were involved. Specimen collection and maintenance were performed in strict accordance with the recommendations of Animal Care Quality Assurance in China.

Sample collection
All fish specimens were captured with a trawl net at nine locations in the Taiwan Strait (Fig 1, Table 1). All specimens were morphologically identified by experts and taxonomists, who mainly followed the identification keys of Liu et al. (2013) [15]. A total of 284 fish samples were chosen for the research, and the mean number of individuals per species was 3. The voucher specimens were fixed with 95% ethanol and deposited in the Marine Biological Sample Museum at the Third Institute of Oceanography, State Oceanic Administration. After morphological examination, muscle tissue samples were dissected from each specimen and stored in 95% ethanol at −20˚C.
Total DNA was extracted from a small piece of ethanol-preserved tissue according to the standard DNA barcoding methods for fish [2]. Approximately 655 bp were amplified from the 5' region of the COI gene employing the primers described in Ward et al. [2]. The amplification reaction was performed in a total volume of 25 μl, including 16.25 μl ultrapure water, 2.25 μl 10× PCR buffer, 1.25 mM MgCl2, each dNTP at 0.2 mM, each primer at 2 mM, 1.25 U Taq DNA polymerase and 1 μl DNA template. The thermal cycling conditions consisted of an initial step of 2 min at 95˚C followed by 35 cycles of denaturing (94˚C, 30 s), annealing (54˚C, 30 s) and extension (72˚C, 1 min), with a final extension at 72˚C for 10 min; the samples were then held at 4˚C. The PCR samples were screened for the presence of PCR products on a 1.0% agarose gel. Sequencing in both directions was performed by Sangon Biotech (Shanghai).

Data analyses
Sequences were assembled and edited using the SeqMan program (DNAStar software) and proofread manually; each base of the assembled sequences was checked before the sequences were submitted to GenBank (Table 1). Next, the sequences were aligned using ClustalW in MEGA 6.0 software, and parameters including the sequence length, GC content, polymorphic loci and parsimony-informative sites were calculated. The distances within species and between species were calculated using the Kimura-2-parameter (K2P) model [16], and a phylogenetic tree was constructed using the neighbor-joining (NJ) method. The clade credibility of the NJ tree was tested by bootstrapping with 1000 replicates to obtain the support values of the clade nodes.
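For readers unfamiliar with the K2P model, the distance between two aligned sequences is d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P is the proportion of sites that differ by a transition and Q the proportion that differ by a transversion. The following Python sketch illustrates that formula only; it is not the MEGA 6.0 implementation used in this study, and the function name and the choice to skip gaps and ambiguous bases (pairwise deletion) are assumptions made for illustration.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance between two aligned DNA sequences.

    Hypothetical helper for illustration; sites containing a gap or an
    ambiguous base in either sequence are skipped (an assumption).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")

    compared = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; every other change is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared
    q = transversions / compared
    # math.log / math.sqrt raise ValueError when the sequences are so
    # divergent that the correction is undefined (saturation).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


if __name__ == "__main__":
    # Toy sequences (invented): one transition in ten sites.
    print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```

Applying such a function to every pair of aligned barcodes yields the pairwise distance matrix from which within-species and between-species distances, and the NJ tree, are derived.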

Results
A total of 284 mitochondrial COI barcode sequences were obtained from 85 genera, 39 families and 11 orders of fishes (GenBank accession numbers and taxonomic data are listed in Table 1). After editing, the consensus length of all barcode sequences was 655 bp, and no stop codons, insertions or deletions were observed in any of the sequences. All analyzed sequences were longer than 600 bp. Nucleotide pair frequency analysis of the entire dataset revealed that 325 of 655 (49.62%) sites were conserved, 330 of 655 (50.38%) sites were variable, 330 of 655 (50.38%) sites were parsimony informative, and no singleton sites were present. The average number of identical pairs (ii) was 519, of which 202, 216 and 101 were found at the first, second and third codon positions, respectively. Transitional pairs (si = 76) were found to be more common than transversional pairs (sv = 60), with a si/sv (R) ratio of 1.27 for the dataset. Both transitional and transversional pairs were most common at the third codon position (si = 62 and sv = 56).
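The site categories reported above can be illustrated with a short Python sketch. It assumes the usual definitions: a conserved site shows a single nucleotide, a parsimony-informative site shows at least two nucleotides that each occur in at least two sequences, and a singleton site is variable but not parsimony informative. The function name and the toy alignment are invented for illustration; the values in this study were obtained with MEGA 6.0.

```python
from collections import Counter


def classify_sites(alignment):
    """Count conserved, parsimony-informative and singleton columns.

    `alignment` is a list of equal-length aligned sequences.
    Gaps and ambiguous bases are ignored (an assumption).
    """
    counts = {"conserved": 0, "parsimony_informative": 0, "singleton": 0}
    for column in zip(*alignment):
        bases = Counter(b for b in column if b in "ACGT")
        if len(bases) <= 1:
            counts["conserved"] += 1
        elif sum(1 for n in bases.values() if n >= 2) >= 2:
            # at least two nucleotides, each present at least twice
            counts["parsimony_informative"] += 1
        else:
            counts["singleton"] += 1
    counts["variable"] = counts["parsimony_informative"] + counts["singleton"]
    return counts


# Toy alignment (invented): column 2 is parsimony informative,
# column 4 is a singleton site, the rest are conserved.
toy = ["ACGTA", "ACGTA", "ATGCA", "ATGAA"]
site_counts = classify_sites(toy)
print(site_counts)

# Transition/transversion ratio from the pair counts reported in the text.
si_sv_ratio = round(76 / 60, 2)
print(si_sv_ratio)
```

In the reported dataset the conserved and variable counts sum to the alignment length (325 + 330 = 655), and the identical, transitional and transversional pairs do likewise (519 + 76 + 60 = 655).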
The overall mean nucleotide base frequencies observed for these sequences were T (28.90%), C (28.40%), A (24.30%) and G (18.40%). The base composition analysis for the COI sequence showed that the average T content was the highest and the average G content was the lowest; the AT content (53.20%) was higher than the GC content (46.80%). The GC contents at the first, second and third codon positions for the 11 sole fish were 56.70%, 43.10% and 40.60%, respectively (Table 2). Of these, the GC content at the first codon position was the highest, which can be attributed to base usage bias among the three codon positions. The usage frequency of only C was similar among the three codon positions, whereas the other three bases had significantly different usage frequencies. At the first codon position, the usage of T (18.00%) was the lowest, and the usages of the other bases were C (25.60%), A (25.60%) and G (31.10%). At the second codon position, the content of T (42.00%) was highest, and the contents of the other bases were C (28.40%), A (15.00%) and G (14.70%). At the third codon position, the base usage was T (27.00%), C (31.10%), A (32.20%) and G (9.50%); the G content was the lowest, exhibiting a clear pattern of anti-G bias.
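Base composition by codon position can be tallied as in the following Python sketch. It assumes that the reading frame begins at the first base of each aligned sequence unless the hypothetical `frame_offset` parameter says otherwise; the function names and toy sequences are invented, and the figures in this study come from MEGA 6.0. The GC content at a position is the sum of the C and G percentages, for example 25.60% + 31.10% = 56.70% at the first codon position.

```python
from collections import Counter


def base_composition_by_codon_position(sequences, frame_offset=0):
    """Return {position: {base: percentage}} for codon positions 1-3.

    `frame_offset` (hypothetical) is the number of leading bases to drop
    so that the first remaining base is a first codon position.
    """
    counts = {1: Counter(), 2: Counter(), 3: Counter()}
    for seq in sequences:
        for i, base in enumerate(seq.upper()[frame_offset:]):
            if base in "ACGT":  # ignore gaps and ambiguity codes
                counts[i % 3 + 1][base] += 1

    result = {}
    for pos, counter in counts.items():
        total = sum(counter.values())
        result[pos] = {
            b: (100.0 * counter[b] / total if total else 0.0) for b in "ACGT"
        }
    return result


def gc_content(composition):
    """GC percentage from a {base: percentage} mapping."""
    return composition["G"] + composition["C"]


# Toy sequences (invented), two codons each.
composition = base_composition_by_codon_position(["ATGGCC", "ATGGCA"])
for position in (1, 2, 3):
    print(position, composition[position], gc_content(composition[position]))
```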
The Kimura-2-parameter model is recommended by the Consortium for the Barcode of Life (CBOL) for calculating genetic distance [16,17]. The same model was used here to calculate the genetic distances within and between species. As shown in Table 3, the K2P distances of the COI sequence within species ranged from 0 to 1.83%, with an average distance of 0.21%; the largest distance of 1.83% was found in Terapon jarbua, and the distances for all remaining species were less than 1%. The genetic distances between species ranged from 0 to 21.70%, with an average of 6.50%, which was 31 times the average genetic distance within species. The genetic distance between genera was 7.70-30.50%, with an average of 23.70%, and the genetic distance between families was 17.60-31.80%, with an average of 25.60%. Only genetic distances within species were less than 2%, and the mean genetic distances between species, between genera and between families were all greater than 5%, much higher than the distances within species. The data also show that the genetic distance (K2P) was larger at higher taxonomic levels, but the increases in genetic distance above the species level became smaller and less pronounced at higher taxonomic levels.
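A hierarchical summary of this kind can be produced by assigning each pairwise distance to the lowest taxonomic level that the two sequences share. The Python sketch below shows one way to do this under stated assumptions: the function name, the data layout and all toy identifiers and distances are invented, and pairs from different families are simply left out.

```python
from itertools import combinations
from statistics import mean


def mean_distance_by_level(records, distances):
    """Average pairwise distance at each taxonomic level.

    records:   {sequence_id: (species, genus, family)}
    distances: {(id_a, id_b): distance} with id_a < id_b (sorted pairs)
    """
    buckets = {"within_species": [], "within_genus": [], "within_family": []}
    for a, b in combinations(sorted(records), 2):
        species_a, genus_a, family_a = records[a]
        species_b, genus_b, family_b = records[b]
        d = distances[(a, b)]
        if species_a == species_b:
            buckets["within_species"].append(d)
        elif genus_a == genus_b:
            buckets["within_genus"].append(d)   # different species, same genus
        elif family_a == family_b:
            buckets["within_family"].append(d)  # different genera, same family
    return {level: mean(values) for level, values in buckets.items() if values}


# Toy data (invented identifiers and distances, for illustration only).
records = {
    "s1": ("G1 a", "G1", "F1"),
    "s2": ("G1 a", "G1", "F1"),
    "s3": ("G1 b", "G1", "F1"),
    "s4": ("G2 c", "G2", "F1"),
}
distances = {
    ("s1", "s2"): 0.002,
    ("s1", "s3"): 0.06, ("s2", "s3"): 0.07,
    ("s1", "s4"): 0.20, ("s2", "s4"): 0.22, ("s3", "s4"): 0.24,
}
summary = mean_distance_by_level(records, distances)
print(summary)

# Fold difference from the averages reported in the text.
fold = round(6.50 / 0.21)
print(fold)
```

With the averages reported here, 6.50% divided by 0.21% is approximately 30.95, which is the 31-fold difference stated in the text.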
The NJ tree, including the 284 sequences from this study with all haplotypes and 71 species from NCBI (Table 1), is provided in Fig 2. Most of the specimens of the same species were clustered together, which reflected the prior taxonomic assignment based on morphology. No taxonomic deviation was detected at the species level, indicating that the majority of the examined species could be authenticated by the barcode approach.

Discussion
DNA barcoding technology allows for species identification by taking advantage of a DNA sequence fragment that is shared by organisms and shows significant interspecies differences. This technology overcomes the over-reliance on the personal abilities and experience of taxonomists in traditional morphological classification and enables the informatization and standardization of species identification. The mitochondrial COI gene, which exhibits high levels of conservation within species and modest levels of genetic variability between different species, is usually utilized as a species barcode, and its high efficiency in species identification has been reported in Japanese marine fishes [6], Indian freshwater fishes [18], Taiwan ray-finned fishes [9] and Mediterranean fishes [19]. In this study, we successfully amplified the COI barcode sequences for 89 marine fish species. The primer pairs used in this study could amplify the target region without any deletions or insertions, indicating that DNA barcoding could be used as a global standard for identifying fish species. All barcode sequences had a consensus length of 655 bp, no stop codons were detected, and the sequences were free of nuclear mitochondrial pseudogenes (NUMTs). Vertebrate NUMTs are typically smaller than 600 bp [2,20]. The presence of NUMTs can cause misestimation of biodiversity. The number of COI genes is greater than that of pseudogenes, so conserved primers should preferentially amplify mitochondrial DNA over NUMTs, and there was no evidence of NUMTs in the fish. A nucleotide pair frequency analysis resulted in 325 conserved sites, 330 variable sites, 330 parsimony-informative sites and no singleton sites. There were more transitional pairs (si = 76) than transversional pairs (sv = 60). The observed nucleotide pair frequencies were similar to those reported in studies of fishes in Turkey.
Both transitional and transversional pairs were highest at the third codon position (62 and 56 for si and sv, respectively), and synonymous mutations mostly occurred at the third position. The amount of variation observed in mitochondrial DNA can lead to demographic changes in fish populations.
The base composition analysis of the COI sequence revealed AT content (53.20%) that was higher than GC content (46.80%), which is a result similar to the results found in Australian [2], Canadian [21] and Cuban fish species [22]. The GC contents in the first, second and third codon positions for the 11 sole fish were 56.70%, 43.10% and 40.60%, respectively. Of these, the GC content of the first codon position was significantly higher than those of the other two positions, which can be attributed to base usage bias at the three codon positions. At the first codon position, the usage of T (18.00%) was the lowest, and the usage of the other bases was C (25.60%), A (25.60%) and G (31.10%). At the second codon position, the usage of T (42.00%) was the highest, and the usage of the other bases was C (28.40%), A (15.00%) and G (14.70%). At the third codon position, the base usage was T (27.00%), C (31.10%) and A (32.20%), each of which was higher than G (9.50%). The third codon position had a clear pattern of anti-G bias, and similar patterns have also been observed in Ophichthyidae and Soleidae [23,24]. In species evolution, the codon positions of mitochondrial genes are subjected to varying degrees of base-mutation selection pressure, and base usage bias may be caused by base-mutation pressure in codon positions.
The efficiency of species identification through DNA barcoding depends on both interspecific divergence and intraspecific divergence. Barcode analysis attempts to identify the boundaries to delineate species, which corresponds to the divergence between the nearest neighbors within a group [10,18]. However, there is still no universal standard threshold defined for interspecies demarcation. The difference between minimum congeneric and maximum conspecific divergence was recently used to define the barcoding gap, and this difference was more efficient than the mean of intra-and interspecific sequence variability [25,26].
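The barcoding gap described above can be computed per species as the smallest distance to any sequence of another species minus the largest distance among conspecific sequences. The Python sketch below is a simplified illustration: it takes the nearest neighbor across all other species rather than only congeners, treats a species with a single specimen as having a maximum intraspecific distance of zero, and uses invented names and toy distances throughout.

```python
def barcoding_gap(species_of, distances):
    """Return {species: (max_intraspecific, min_interspecific, gap)}.

    species_of: {sequence_id: species}
    distances:  {(id_a, id_b): distance} for each unordered pair once
    """
    max_intra = {}
    min_inter = {}
    for (a, b), d in distances.items():
        species_a, species_b = species_of[a], species_of[b]
        if species_a == species_b:
            max_intra[species_a] = max(max_intra.get(species_a, 0.0), d)
        else:
            # the pair counts as a candidate nearest neighbor for both species
            for s in (species_a, species_b):
                min_inter[s] = min(min_inter.get(s, float("inf")), d)
    return {
        s: (max_intra.get(s, 0.0), min_inter[s], min_inter[s] - max_intra.get(s, 0.0))
        for s in min_inter
    }


# Toy data (invented): two species with two specimens each.
species_of = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
distances = {
    ("s1", "s2"): 0.01, ("s3", "s4"): 0.02,
    ("s1", "s3"): 0.10, ("s1", "s4"): 0.12,
    ("s2", "s3"): 0.09, ("s2", "s4"): 0.11,
}
gaps = barcoding_gap(species_of, distances)
for species, (intra, inter, gap) in sorted(gaps.items()):
    print(species, intra, inter, round(gap, 4))
```

A positive gap means every conspecific pair is closer than the nearest heterospecific sequence; a gap of zero or less flags species that the barcode cannot separate from their nearest neighbor.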
In this study, the average intraspecific K2P distance was 0.21%, compared with 6.50% for species within genera. The mean interspecific distance was found to be 31-fold higher than the mean intraspecific distance, which was similar to the 25-fold difference observed in Australian marine fishes [2] and the 26.2-fold difference observed in Canadian mesopelagic and upper bathypelagic marine fishes [27]; this result corresponds to the DNA barcoding principle that interspecific divergence sufficiently exceeds intraspecific divergence. In addition, the difference was greater than the 13.9-fold difference observed in the marine fishes commonly encountered in the Canadian Atlantic [28]. The amount of variation in mitochondrial DNA observed in this study can lead to demographic changes in fish populations. The mean K2P distance increased only slightly at the higher taxonomic ranks of families and orders, with values of 23.70% and 25.60%, respectively. The rate of increase declines in the higher taxonomic categories due to substitutional saturation.
The entire NJ tree derived from the study is shown in Fig 2. Most species were clustered into monophyletic units in the NJ tree, indicating that DNA barcoding has high efficiency in species identification. Morphological misidentification can change the outcome of the NJ tree. We detected deep divergence of 5.99% among individuals of Muraenesox cinereus (MG220575, KX215195, KX215196) at first. Three sequences obtained from M. cinereus formed two different clusters. The sequence MG220575 clustered away from the rest and clustered with Uroconger lepturus with a K2P distance of zero. Without this single sequence, the conspecific divergence of M. cinereus was zero. We checked the preserved specimens and discovered that the single specimen was a larva of U. lepturus, which was originally classified as M. cinereus. Morphological misidentifications of voucher specimens, DNA contamination and incomplete knowledge of the taxonomic literature can contribute to ambiguous barcoding results [29,30]. On the other hand, the reference library of barcodes and species identification requires a large number of specimens, including eggs, larvae and adults, and many morphometric characteristics change during distinct developmental stages. Therefore, occasional instances of misidentification are inevitable, and this example reflects that DNA barcoding can detect cases of morphological misidentification. The combination of morphological and molecular characteristics is a necessary condition for establishing a molecular database. A successful reference barcode library can be used to better characterize and broadly identify species.
Two other cases can be found in the NJ tree: sequences that share a species name but do not cluster cohesively and show deep divergence among them, and sequences that have different species names but form a cohesive cluster. The sequences of Thryssa kammalensis clustered separately in the NJ tree; one sequence showed a genetic divergence of 3.30% from the others and 4.30% from Thryssa dussumieri, but it clustered with Thryssa hamiltonii with a K2P distance of zero. We rechecked the identification history of the preserved samples and found some intermediate morphological characteristics in them. The main factor responsible for this case may be introgressive hybridization. Mitochondrial genes are maternally inherited, and a hybrid would carry only the mitochondrial DNA of the maternal species. When species with a close phylogenetic relationship mate, the subsequent generation can have the morphological characteristics of either parent species. Introgressive hybridization would lead to phylogenetic paraphyly. The species of Takifugu, which formed a cohesive cluster, exhibited very low interspecies divergence, with a value of zero, and could not be discriminated by DNA barcoding. This failure was due to recent and rapid speciation, and the specimens of these species were genetically similar at the DNA barcode region. Erroneous taxonomy, low sister species divergence, introgressive hybridization and the scarcity of specimens are also theoretically associated with the failure of DNA barcoding. A more rapidly evolving DNA fragment, such as the mitochondrial control region, should be used for such specimens, and nuclear genes should also be sequenced to establish whether hybridization has occurred.
Fish diversity in the Taiwan Strait is highly threatened by overexploitation, and some scientists predict that all fisheries will have collapsed by 2048 [9,31]. With the significant decline in biodiversity, species extinction enhances the need for the conservation of marine biodiversity [32]. Our results reveal that DNA barcoding was successful in identifying the vast majority of fish species. Identification supported by DNA barcoding could be used to evaluate fish biodiversity, monitor fish conservation and manage fisheries [14,33]. This technique will provide direction for future studies of fish species that need to be barcoded. Once a fish DNA barcode database has been established, the scientific and practical benefits of fish barcoding are diverse. DNA barcoding can discriminate all fish species and identify the eggs, larvae and carcass fragments of these species. The results will provide more information on fish diversity to the fisheries managers and ecologists who craft the policies for the conservation and sustainable use of fishing resources.