Indigenous species barcode database improves the identification of zooplankton

Incompleteness and inaccuracy of DNA barcode databases are considered an important hindrance to the use of metabarcoding in species-level biodiversity analysis of zooplankton. Species barcoding by Sanger sequencing is inefficient for organisms with small body sizes, such as zooplankton. Here, mitochondrial cytochrome c oxidase I (COI) fragment barcodes from 910 freshwater zooplankton specimens (87 morphospecies) were recovered using a high-throughput sequencing platform, Ion Torrent PGM. Intraspecific divergence of most zooplankton species was < 5%, except for Brachionus leydigii (Rotifer, 14.3%), Trichocerca elongata (Rotifer, 11.5%), Lecane bulla (Rotifer, 15.9%), Synchaeta oblonga (Rotifer, 5.95%) and Schmackeria forbesi (Copepod, 6.5%). Metabarcoding data of 28 environmental samples from Lake Tai were annotated using both an indigenous database and the NCBI Genbank database. The indigenous database improved the taxonomic assignment of zooplankton metabarcoding data. Most zooplankton species (81%) with barcode sequences in the indigenous database were identified by metabarcoding monitoring. Furthermore, the frequency and distribution of zooplankton were consistent between metabarcoding and morphological identification. Overall, the indigenous database improved the taxonomic assignment of zooplankton.


Introduction
Planktonic organisms play vital roles in food webs, biogeochemical cycles and other aquatic ecosystem functions [1]. Furthermore, due to their rapid responses to environmental variation, planktonic organisms have been used as indicators of ecosystem changes [2]. Despite their ecological importance, our understanding of the biodiversity of these organisms is hindered by difficulties in their identification, which is complicated, time-consuming and requires unique expertise [3,4].
The advent of high-throughput sequencing has provided an alternative that overcomes issues associated with morphology-based biomonitoring. In recent years, high-throughput sequencing has resulted in dramatic advances in practical, cost-effective molecular approaches to the analysis of environmental samples. Metabarcoding has several applications [5], such as investigating biodiversity [6], characterizing prey diversity in gut contents [7], and analyzing food-web dynamics [8]. Zooplankton are well suited to metabarcoding analysis because of their wide distribution in water and ease of sampling. Recent applications of metabarcoding have provided useful information on the genetic diversity of freshwater and marine planktonic communities [9,10]. Nevertheless, functional assessment of communities and biodiversity by metabarcoding is constrained by the limited reference barcode databases [11]. In some studies, more than 40% of the obtained operational taxonomic units (OTUs) could not be confidently assigned to a taxonomic group [7,12].
Another problem is that the crude DNA extract obtained from a digested zooplankton specimen [13] is contaminated by gut prey and intracellular endosymbiotic bacteria (e.g., Wolbachia) [14,15]. The single sequence obtained by Sanger sequencing can be the product of co-amplification of contaminating DNA and may not represent the 'true' barcode of the target individual. This DNA contamination leads to a noisy signal and confounds barcode sequence capture [5]. High-throughput sequencing allows millions of DNA fragments to be sequenced in parallel, significantly increasing sample throughput and process efficiency. Additionally, high-throughput sequencing generates multiple sequences for a single sample and provides an opportunity to identify contamination from prey and endosymbiotic bacteria [16]. The use of high-throughput sequencing, therefore, overcomes some of the inherent limitations of Sanger sequencing for barcoding small-bodied organisms [5].
Here we developed a high-throughput sequencing protocol to capture COI barcode sequences from zooplankton specimens by Ion Torrent PGM and created an indigenous barcode database from 910 native zooplankton specimens. We used both the indigenous barcode database and an NCBI public database (consisting of all of the COI sequences in NCBI Genbank) to annotate the zooplankton metabarcoding data of Tai Lake (China). The aims of this study were to 1) develop a local species barcode database using a high-throughput sequencing species barcoding protocol (Fig 1) and 2) evaluate the performance of species annotation of metabarcoding data by the local zooplankton barcode database (S1 Fig).

Materials and methods
Ethics statement: No specific permissions were required for the sampling locations, as the monitoring project was performed by the local government. This field study did not involve any endangered or protected species, and only zooplankton were collected.

NCBI public COI reference database
The NCBI public COI reference database consisted of all the COI sequences downloaded from NCBI Genbank with the keyword "COI". The composition of the NCBI public COI reference database was analyzed in R (version 3.2.3).

Zooplankton sampling
For construction of an indigenous barcode database. Surface water was collected with an organic glass hydrophore at a depth of 5 cm and filtered through a plankton net (46-μm mesh) at different locations in the Lake Tai basin (S2 Fig). Zooplankton samples were fixed with 90% ethanol on site. In the laboratory, zooplankton were washed three times in deionized water, then individually selected and transferred to 96-well plates under a stereoscope. Each well contained a single individual. All organisms were identified to the species level by morphology according to Fauna Sinica [17,18], which is the most authoritative reference for taxonomic identification in China. In a few cases, specimens could only be identified to genus level or higher, such as Mesocyclops species (S1 Table). Zooplankton were classified into three categories by abundance frequency: abundant (frequency > 1/2 samples), moderate (frequency > 1/3 samples) and rare species (frequency < 1/3 samples) (S1 Table).
For metabarcoding analysis. Two samples were collected at each site, one for metabarcoding analysis and one for morphological identification (S2 Fig). The bulk sample was collected by filtering ~30 L of lake water (at 5 cm depth) through a plankton net (46-μm mesh). Water samples were filtered through 5-μm microporous filter paper (Millipore, USA) and stored at −20˚C.
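The three abundance categories can be expressed as a small classification rule. The following is a minimal sketch, not the authors' procedure: the function name and the example counts are hypothetical, and a frequency of exactly 1/3, which the text leaves undefined, is assumed here to count as rare.

```python
from fractions import Fraction


def abundance_category(n_present: int, n_samples: int) -> str:
    """Classify a species by the fraction of samples in which it was observed."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if not 0 <= n_present <= n_samples:
        raise ValueError("n_present must lie between 0 and n_samples")
    # Exact fractions avoid floating-point trouble at the 1/2 and 1/3 cut-offs.
    freq = Fraction(n_present, n_samples)
    if freq > Fraction(1, 2):
        return "abundant"
    if freq > Fraction(1, 3):
        return "moderate"
    # Assumption: exactly 1/3 is treated as rare; the source does not say.
    return "rare"


# Hypothetical counts, for illustration only.
for present, total in [(20, 28), (10, 28), (5, 28)]:
    print(present, total, abundance_category(present, total))
```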

Zooplankton DNA isolation and PCR amplification
For construction of indigenous barcode database. The COI fragments were sequenced by Ion Torrent PGM (Fig 1). DNA was extracted from each zooplankton specimen using the HotShot protocol [19]. The organisms were placed in 0.2-mL tubes and digested in 30 μL of alkaline lysis buffer (NaOH 25 mM, disodium EDTA 0.2 mM, pH 8.0). The digested samples were incubated at 95˚C for 30 min and stored on ice for 3-5 min. A further 30 μL of neutralizing buffer was added to each tube and debris was removed by centrifugation. PCR amplification was performed in a final volume of 50 μL, made up of 1 μL each of 10 μM universal forward (GGWACWGGWTGAACWGTWTAYCCYCC) and reverse (TAAACTTCAGGGTGACCAAARAAYCA) primers [7], 37.8 μL of ultrapure water, 5 μL of 10× High Fidelity PCR buffer, 2 μL of MgSO4 (50 mM), 1 μL of dNTP mix (10 mM), 0.2 μL of Platinum Taq DNA polymerase (Invitrogen, USA), and 2 μL of DNA template.
PCRs were performed in 96-well plates using a SureCycler 8800 thermal cycler (Agilent Technologies, USA). Because of the high level of degeneracy of the primers, a "touchdown" PCR profile was used to minimize non-specific amplification. PCR was conducted for 16 initial cycles as follows: denaturation for 10 s at 95˚C, annealing for 30 s at 62˚C (-1˚C per cycle), and extension for 60 s at 72˚C, followed by 25 cycles at an annealing temperature of 46˚C. The final extension was performed at 72˚C for 10 min. A negative control reaction with no DNA template was included. PCR products were detected on a 2% agarose gel, and the gel fragments were purified using the MinElute gel extraction kit (Qiagen, CA, USA). The gel-purified PCR products were quantified using the Qubit dsDNA HS assay kit (Invitrogen, USA), and the final concentration was adjusted to 10 ng/μL using molecular-grade water.
For metabarcoding analysis. The E.Z.N.A. water DNA kit (Omega, USA) was used to isolate zooplankton DNA trapped on the 5-μm filter paper (Millipore, USA). The samples were homogenized with glass beads using the MoBio Vortex-Genie2 (MoBio Laboratories Inc., CA, USA). The PCR primers and programs used in the indigenous barcode database experiment were also used for the zooplankton metabarcoding analysis.
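To make the degeneracy of the two primers concrete, the sketch below counts how many distinct oligonucleotides each IUPAC-coded primer represents and can enumerate them. It is an illustration only: the helper names are hypothetical, and the primer strings are copied from the text with the stray space in the forward primer removed.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degeneracy(primer: str) -> int:
    """Number of distinct sequences encoded by a degenerate primer."""
    count = 1
    for base in primer.upper():
        count *= len(IUPAC[base])
    return count


def expand(primer: str) -> list:
    """Enumerate every non-degenerate sequence the primer stands for."""
    choices = (IUPAC[base] for base in primer.upper())
    return ["".join(combo) for combo in product(*choices)]


FORWARD = "GGWACWGGWTGAACWGTWTAYCCYCC"
REVERSE = "TAAACTTCAGGGTGACCAAARAAYCA"

print(degeneracy(FORWARD), degeneracy(REVERSE))  # 128 4
```

The forward primer carries five W and two Y positions (2^7 = 128 variants), whereas the reverse primer carries one R and one Y (4 variants).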
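The touchdown profile can be written out as a per-cycle list of annealing temperatures. This is a sketch under one stated assumption: the first cycle anneals at 62˚C and each later touchdown cycle is 1˚C lower, so the 16th cycle anneals at 47˚C before the 25 cycles at 46˚C. The function and parameter names are hypothetical.

```python
def touchdown_schedule(start_anneal=62, step=-1, touchdown_cycles=16,
                       final_anneal=46, final_cycles=25):
    """Return the annealing temperature (in degrees C) for every PCR cycle."""
    # Touchdown phase: assumed to start at start_anneal and change by `step`
    # on each subsequent cycle.
    schedule = [start_anneal + step * i for i in range(touchdown_cycles)]
    # Fixed-temperature phase that follows the touchdown cycles.
    schedule.extend([final_anneal] * final_cycles)
    return schedule


temps = touchdown_schedule()
print(len(temps), temps[0], temps[15], temps[16])  # 41 62 47 46
```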

Ion Torrent PGM sequencing
To ensure a homogeneous number of sequencing reads from each specimen, PCR amplicons adjusted to equal concentrations (10 ng/μL) were combined in an equimolar pool. A total of 100 ng of amplicon was used for end repair and adaptor ligation using the Ion Plus fragment library kit (Life Technologies, USA) according to the manufacturer's protocols. The end-repaired, adaptor-ligated DNA was purified with the Agencourt AMPure XP kit (Beckman Coulter, Germany) to eliminate primer dimers and PCR artifacts < 100 bp. The purified amplicon library was assessed for size distribution and DNA concentration using an Agilent 2100 bioanalyzer (Agilent Technologies, USA). The quantified amplicon libraries were sequenced using the Ion Torrent PGM (Life Technologies, USA).

Bioinformatics analysis
Indigenous barcode database. The Ion Torrent server automatically sorts the sequences into groups based on the library barcode and generates a FASTQ file. The FASTX-Toolkit and Biopython were used to reverse complement the FASTQ file and to convert FASTQ to FASTA [20]. We used the QIIME (Quantitative Insights into Microbial Ecology v1.8.0) platform [21] to filter low-quality reads and to discard reads with more than two mismatches in the primer sequence. Chimeras were identified and removed by UCHIME [22]. The above steps were completed on the Bio-Linux 8 system, which integrates all of the above-mentioned tools [23]. Short reads (< 200 bp) were filtered out using the "Biostrings" package in R within the Bioconductor environment [24]. The high-quality, correctly encoded sequences were clustered into groups by sequence similarity, and BLASTX was used to determine the COI barcode sequence. The representative sequences of each species were submitted to NCBI Genbank under accession nos. KY091149-KY091230.
Metabarcoding analysis. Sequence pre-treatment (de-noising, quality trimming, length trimming and chimera checking) was performed following the method used for the indigenous barcode database. OTUs were clustered following the UPARSE pipeline [25]. For each OTU, a representative sequence was chosen and the Statistical Assignment Package (SAP) was used to assign the representative sequence to a taxonomic group with a 95% cutoff value [26] against the reference databases (NCBI Genbank database and indigenous species database).
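The reverse-complement, FASTQ-to-FASTA conversion and length-filtering steps can be illustrated with the Python standard library alone. This sketch is not the authors' pipeline, which used the FASTX-Toolkit, Biopython and Biostrings: the function names are hypothetical, records are assumed to use the plain four-line FASTQ layout, and reads of exactly 200 bp are assumed to be kept.

```python
import io

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fastq(handle):
    """Yield (name, sequence) pairs from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return
        if not header.strip():
            continue  # tolerate blank lines between or after records
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line, not needed for FASTA output
        yield header.strip()[1:], sequence


def fastq_to_fasta(handle, min_length=200, revcomp=False) -> str:
    """Convert FASTQ to FASTA text, dropping reads shorter than min_length."""
    records = []
    for name, sequence in read_fastq(handle):
        if len(sequence) < min_length:
            continue
        if revcomp:
            sequence = reverse_complement(sequence)
        records.append(f">{name}\n{sequence}")
    return "\n".join(records)


# Toy input with a lowered length threshold, for illustration only.
demo = "@r1\nACGTAC\n+\nIIIIII\n@r2\nACG\n+\nIII\n"
print(fastq_to_fasta(io.StringIO(demo), min_length=5, revcomp=True))
```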

Genetic distances and tree diagram
The Kimura two-parameter (K2P) distance model was used to calculate genetic divergences of zooplankton [27]. All sequences from one species were used to calculate the intraspecific genetic distances. A tree diagram was constructed using the neighbor-joining (NJ) method, which provided a graphical representation of the patterns of COI divergence [28]. The NJ tree was constructed from 87 sequences (one sequence per species) using MEGA 6 software [29].
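For reference, the K2P distance between two aligned sequences is d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the proportions of sites showing transitions and transversions, respectively. The sketch below implements this formula directly. It is not the MEGA implementation, and skipping sites with gaps or ambiguous bases is an assumption made here for simplicity.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: gaps and ambiguous bases are ignored
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine<->pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    term1 = 1 - 2 * p - q
    term2 = 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(term1) - 0.25 * math.log(term2)


# Toy 20-bp alignment with a single transition (A -> G at the first site).
print(round(k2p_distance("ACGT" * 5, "GCGT" + "ACGT" * 4), 4))
```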

COI reference database from NCBI Genbank
There were 2,186,026 COI sequences downloaded from NCBI Genbank (up to 2016-11). These sequences belong to 240,451 taxa (

Species identified by morphological method
In Lake Tai, 76 zooplankton species were identified by morphological identification. All 9 abundant species, 9 of 12 moderate species and 30 of 55 rare species had barcode sequences in the indigenous database (S1 Table). Twenty-four of the 76 species had barcode sequences in the NCBI Genbank. Only 3 of these 24 species (Brachionus calyciflorus, Keratella cochlearis and Brachionus diversicornis) had > 100 COI sequences in the NCBI Genbank (S3 Fig).

Taxonomic assignment of NGS data between NCBI and indigenous
762,609 reads) belong to zooplankton (Fig 4A). Forty-four zooplankton OTUs were assigned to the species level (similarity > 95%, alignment length > 100 bp) by both the indigenous species and NCBI Genbank databases. Twenty-five and 45 OTUs were assigned to the species level using only the NCBI Genbank database and only the indigenous species database, respectively (Fig 4C).
Thirty-nine of the 76 morphological species were detected by metabarcoding (Fig 5). Of the 39 zooplankton species identified, nine were identified by both the indigenous database and the NCBI Genbank database (Fig 4D). The remaining 30 species were identified only by the indigenous database (similarity > 95%).
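The comparison between the two databases amounts to applying the same species-level thresholds to each set of hits and then partitioning the assigned OTUs. The sketch below shows that logic with hypothetical hit tuples. It is a plain threshold filter, not the SAP assignment used in this study, and all names and example values are illustrative.

```python
def species_level_hits(hits, min_similarity=95.0, min_alignment=100):
    """Return {otu: species} for hits exceeding both thresholds.

    Each hit is a tuple (otu_id, species, percent_similarity, alignment_length).
    """
    assigned = {}
    for otu, species, similarity, alignment_length in hits:
        if similarity > min_similarity and alignment_length > min_alignment:
            assigned.setdefault(otu, species)  # keep the first passing hit
    return assigned


def compare_databases(indigenous, ncbi):
    """Partition assigned OTUs into shared and database-specific sets."""
    a, b = set(indigenous), set(ncbi)
    return {"both": a & b, "indigenous_only": a - b, "ncbi_only": b - a}


# Hypothetical hits, for illustration only.
indigenous_hits = [("OTU1", "Species A", 99.1, 310),
                   ("OTU2", "Species B", 96.4, 305),
                   ("OTU3", "Species C", 93.0, 300)]
ncbi_hits = [("OTU1", "Species A", 97.2, 298),
             ("OTU4", "Species D", 98.8, 90)]

result = compare_databases(species_level_hits(indigenous_hits),
                           species_level_hits(ncbi_hits))
print(result)

# Species counts reported in the text: 9 + 9 + 30 species had indigenous
# barcodes, and 39 species were detected by metabarcoding.
print(round(100 * 39 / (9 + 9 + 30)))  # 81
```

The final line appears consistent with the 81% detection rate reported for species that have barcode sequences in the indigenous database.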

Comparison between metabarcoding and morphological monitoring
Morphology data demonstrated that Copepod S. dorrii and Mesocyclops sp., Cladocera B. sp. and Ceriodaphnia cornuta, and Rotifer Keratella quadrata were the dominant zooplankton in Lake Tai. These species also accounted for more reads and were detected more frequently by metabarcoding than other zooplankton (Fig 5A). Cladocera Limnoithona sinensis was not identified by metabarcoding, although it had a high frequency in the morphology data. Copepod S. forbesi and Thermocyclops taihokuensis and Rotifer B. diversicornis showed high detection frequency in the metabarcoding data, but had low detection rates in the morphological data (S4 Fig). The number of species detected by metabarcoding in each sample was positively correlated (R² = 0.42, p = 0.0004) with that detected by morphological identification (Fig 6A). Furthermore, the frequency of species in metabarcoding was also positively correlated (R² = 0.43, p < 0.0001) with that in morphological identification (Fig 6B).
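The R² values above can be read as the coefficient of determination of a simple linear regression, which equals the squared Pearson correlation between the two variables. That reading is an assumption, since the text does not name the model. The sketch below computes it for paired per-sample counts; the function name and the toy data are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation, i.e. R^2 of a simple linear regression."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when either variable is constant")
    return sxy ** 2 / (sxx * syy)


# Toy species counts per sample (morphology vs. metabarcoding), not study data.
morphology = [1, 2, 3]
metabarcoding = [1, 3, 2]
print(r_squared(morphology, metabarcoding))  # 0.25
```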

Discussion
In the present study, we constructed an indigenous COI barcode database of zooplankton from the Tai Lake basin of Eastern China, and then compared the indigenous database with NCBI Genbank for the annotation of zooplankton metabarcoding data. The indigenous database improved the taxonomic assignment of zooplankton metabarcoding data. Furthermore, microscopy and metabarcoding gave similar species identifications for the common species. First, most zooplankton species (81%) that had barcode sequences in the indigenous database were identified by metabarcoding. Second, the number of species observed by metabarcoding was positively correlated with that identified by microscopy. Finally, the distributions of common zooplankton were highly similar between the two methods. These results are not new observations, but confirm that the COI barcode can successfully identify most zooplankton species and that metabarcoding is well suited for biodiversity monitoring of zooplankton. Although metabarcoding monitoring of zooplankton is promising, there is still an opportunity to reduce the divergence between molecular and morphological monitoring by addressing the current limitations of metabarcoding. Some technical biases related to DNA extraction, PCR conditions, primer specificity, library preparation and bioinformatics have been extensively discussed in previous studies [30][31][32]. Below, two limitations are discussed: (1) the incompleteness, inaccuracy and high divergence of zooplankton databases; and (2) the inefficiency of barcode sequence capture by Sanger sequencing.

Incompleteness, inaccuracy and high divergence of zooplankton databases
Metabarcoding-based species identification requires taxonomically complete and geographically comprehensive reference databases of DNA sequences for each species [33,34]. Incompleteness and inaccuracy of databases are commonly believed to be the main hindrance to the use of metabarcoding [35]. Although the number of COI sequences is growing fast, identifying zooplankton by relying only on NCBI Genbank is inefficient. This is due not only to database incompleteness, but also to the high divergence of zooplankton [36,37]. Only 0.85% of the COI sequences in NCBI Genbank belong to zooplankton. Here, 24 of the 76 zooplankton species identified by morphology have records in Genbank, but only nine of them were identified to the species level using NCBI Genbank. The sequences in NCBI Genbank come from all over the world. These sequences show high levels of intraspecific divergence for most zooplankton species, suggesting geographical differences (Fig 1E). Furthermore, indigenous species sequences also show a high level of divergence compared with the sequences from NCBI (Fig 1G). This explains why some species cannot be assigned to the species level using NCBI. It is well known that the COI fragment appears to possess a greater range of phylogenetic signal than any other mitochondrial or nuclear gene [38]. In fact, the evolution of COI is rapid enough to allow the discrimination of not only closely allied species, but also phylogeographic groups within a single species [39,40]. Zooplankton, such as rotifers, often have complex life cycles, high dispersal capacities and rapid local adaptations, which may facilitate interspecific gene flow and intraspecific divergence [41,42]. Previous studies have discussed the high divergence and cryptic species in zooplankton [37,43,44]. For example, up to 15 COI genetic groups were found in one of the common rotifers, B. calyciflorus, among 22 lakes in the Netherlands [45]. This species also had a high intraspecific divergence in China [46].
Another possible reason for the high divergence of zooplankton in the NCBI database is the misidentification of zooplankton, especially for rotifers, whose taxonomy remains unclear [47] and for which there are few expert taxonomists [48]. In addition, the ability to discriminate between species on the basis of morphological characteristics is limited by the high level of phenotypic variation [13]. Different morphological variants have often been described as different species, subspecies, or forms [49]. Overall, the incompleteness, inaccuracy and high divergence of zooplankton reference databases are a challenge for zooplankton metabarcoding. This can be addressed by a barcode database of indigenous species, especially for metabarcoding based on the mitochondrial COI region.

Inefficiency of barcode sequence capture by Sanger sequencing
The high-throughput sequencing platform improves DNA barcode capture from zooplankton. Although an indigenous species database is important for metabarcoding, capturing the barcode sequences of zooplankton by Sanger sequencing is inefficient. We therefore attempted to construct a taxonomic DNA barcode library from a large number of zooplankton specimens using a high-throughput sequencing platform. The results demonstrated the potential of high-throughput sequencing as an effective method to capture barcode sequences of zooplankton.
The shortage of DNA barcode sequences in public databases for small-bodied organisms such as zooplankton may be due to the limitations of the conventional approach to generating barcode sequences, namely PCR amplification followed by Sanger sequencing [50]. The low yield and low quality of genomic DNA from a single zooplankton specimen lead to low-efficiency PCR and low success rates of Sanger sequencing [5,51]. In addition, insufficient amplification due to primer specificity and co-amplification of non-target amplicons also cause barcoding failures [5]. For example, in addition to the target barcode sequence, sequences from Wolbachia were also detected in some specimens of Lepidoptera (insects) [5]. The presence of Wolbachia [52,53], pseudogenes and heteroplasmy in public COI sequence databases could compromise the identification of DNA barcode specimens [14,15].
These problems can be addressed using high-throughput sequencing. First, high-throughput sequencing requires only a small amount of DNA (e.g. 100 pM for Ion Torrent PGM). In addition, high-throughput sequencing can generate multiple sequences for a single specimen. The non-target sequences can be identified by examining sequence similarity and subsequently removed, which improves the efficiency of recovering DNA barcode sequences in a single attempt [5]. Although Sanger sequencing remains the main approach for barcode sequence capture, the low cost and high throughput of high-throughput sequencing platforms will enhance and accelerate the construction of indigenous zooplankton databases [16].

Conclusion
Building indigenous databases significantly improved the analysis of species-level zooplankton biodiversity by metabarcoding. Although NCBI Genbank contains a large number of COI sequences, its contribution to the identification of zooplankton in metabarcoding data is limited. The high-throughput sequencing platform enhanced DNA barcode capture from single zooplankton specimens, and the barcode database of indigenous species significantly improved the taxonomic assignment of metabarcoding data.

Additional information
The raw metabarcoding sequences were submitted to the NCBI Sequence Read Archive (SRR5202370).
Supporting information
S1 Table. Zooplankton identified by morphological method. " p " means the species has a barcode sequence in the indigenous database or NCBI Genbank database. "yes" means the species can be identified by the indigenous database or NCBI Genbank.