DNA Metabarcoding of Amazonian Ichthyoplankton Swarms

Tropical rainforests harbor extraordinary biodiversity. The Amazon basin is thought to hold 30% of all river fish species in the world. Information about the ecology, reproduction, and recruitment of most species is still lacking, thus hampering fisheries management and successful conservation strategies. One of the key understudied issues in the study of population dynamics is recruitment. Fish larval ecology in tropical biomes is still in its infancy owing to identification difficulties. Molecular techniques are very promising tools for the identification of larvae at the species level. However, one of their limits is obtaining individual sequences with large samples of larvae. To facilitate this task, we developed a new method based on the massive parallel sequencing capability of next generation sequencing (NGS) coupled with hybridization capture. We focused on the mitochondrial marker cytochrome oxidase I (COI). The results obtained using the new method were compared with individual larval sequencing. We validated the ability of the method to identify Amazonian catfish larvae at the species level and to estimate the relative abundance of species in batches of larvae. Finally, we applied the method and provided evidence for strong temporal variation in reproductive activity of catfish species in the Ucayalí River in the Peruvian Amazon. This new time and cost effective method enables the acquisition of large datasets, paving the way for a finer understanding of reproductive dynamics and recruitment patterns of tropical fish species, with major implications for fisheries management and conservation.

Tropical rainforests harbor extraordinary biodiversity. The Amazon basin is thought to hold 30% of all river fish species in the world. Information about the ecology, reproduction, and recruitment of most species is still lacking, thus hampering fisheries management and successful conservation strategies. One of the key understudied issues in the study of population dynamics is recruitment. Fish larval ecology in tropical biomes is still in its infancy owing to identification difficulties. Molecular techniques are very promising tools for the identification of larvae at the species level. However, one of their limits is obtaining individual sequences with large samples of larvae. To facilitate this task, we developed a new method based on the massive parallel sequencing capability of next generation sequencing (NGS) coupled with hybridization capture. We focused on the mitochondrial marker cytochrome oxidase I (COI). The results obtained using the new method were compared with individual larval sequencing. We validated the ability of the method to identify Amazonian catfish larvae at the species level and to estimate the relative abundance of species in batches of larvae. Finally, we applied the method and provided evidence for strong temporal variation in reproductive activity of catfish species in the Ucayalí River in the Peruvian Amazon. This new time and cost effective method enables the acquisition of large datasets, paving the way for a finer understanding of reproductive dynamics and recruitment patterns of tropical fish species, with major implications for fisheries management and conservation. PLOS

Introduction
The Amazon basin is thought to hold 30% of all river fish species described to date [1]. This exceptional wealth of fish species synchronize their reproduction to benefit from the highly predictable annual flooding regime of the Amazon basin (flood pulse concept, [2]). Annual flooding increases food availability for fish larvae and juveniles, and facilitates their dispersal [3,4]. Inundated floodplains provide also shelter for both parents and juveniles [2,[4][5]. For migratory fish, the exceptional dimension and complexity of the Amazonian hydrographic network provide a large number of dispersion routes for eggs and larvae [6,7]. Consequently, the study of their recruitment is difficult. These migratory fish species represent over 80% of fishery catches in the Amazonian basin, and most of them belong to the orders Characiformes and Siluriformes (catfish) [8][9][10]. Some of the most important catfish species, in particular the largest species of the genus Brachyplatystoma, are already overexploited [11][12][13][14][15]. Among this genus, Brachyplatystoma rousseauxii undertakes the longest migration known in freshwaters, 5500 kilometres [16]. The adults reproduce in the piedmont of Andean rivers in Bolivia, Colombia, Ecuador and Peru. The larvae then drift downstream to their nursery area in the Amazon estuary where they grow before migrating upriver to join the breeding adults in the headwaters [14][15][16][17][18]. This long-distance migratory life cycle is threatened by human activities, particularly overfishing and the construction of hydroelectric dams, which could drastically limit larval recruitment, hence the renewal of fisheries resources. Understanding the reproductive cycles and recruitment dynamics of commercially important migratory species is thus essential for the sustainable management of fisheries in Amazonia. Regular, standardized plankton net sampling of the larvae of these species is a promising way to achieve these goals. One of the technical limits of this approach is the specific identification of freshwater fish larvae, particularly those of Siluriformes, given the great number of species in this order. Morphological criteria alone in early development stages only enable differentiation of families, and sometimes of genera, but very rarely of species [19,20]. With the development of dedicated databases and the standardization of molecular tools, molecular approaches like DNA barcoding offer a solution to this problem. The technique has already been successfully used to identify adult siluriform species [21][22][23]. More recently this approach was extended to the identification of larvae [24,25] using individual Sanger sequencing. This approach is very reliable, but it can quickly become burdensome when several thousand larvae need to be identified. We therefore developed an alternative strategy using high throughput sequencing of batches of larvae. Our main objectives were to evaluate whether this approach is able to 1) retrieve all the different species in a mixture of larvae, and 2) quantify the relative proportions of species in the batch. PCR based methods can lead to biased amplification of DNA from divergent species and artificially lower the representation in the final counts [26][27][28]. Consequently, we developed a hybridization-based method to capture a commonly used DNA barcode marker: a portion of the cytochrome oxidase 1 (COI) gene. The mitochondrial molecular marker COI (600 bp) was selected for its ability to distinguish fish [29][30][31][32]. Our approach also enables the simultaneous analysis of several batches of larvae, hence further reducing sequencing effort. Moreover, it is very easy to extrapolate from a proof-of-concept on a single marker to several markers, whole mitochondria, or even several thousand nuclear genes [33].

Study area and sampling
In 2013 and 2014, larvae (ichthyoplankton) were sampled in three tributaries of the Amazon River: the Napo, Ucayalí and Marañon in Peru (for a list of samples, see S1 Table). A partner of the French National Institute of Research for Sustainable Development (IRD), the Instituto de Investigaciones de la Amazonia Peruana (IIAP) is rightfully authorized by the Peruvian government to study and to sample the Peruvian Amazonian biodiversity. The DIREPRO office "Dirección Regional de la Producción del Gobierno Regional de Loreto" authorized the export of fish larvae to IRD laboratories in France. Ichthyoplankton were sampled in daylight by towing an ichthyoplankton net behind a boat with an outboard motor maintained in a static position. The protocol used 1.5 m long conical-cylindrical nets with an aperture diameter of 0.6 m, and a mesh size of 1 mm or 0.3 mm depending on the sample, each net containing a collector cup in its end. Either one net was used, or three nets arranged vertically, with a distance of 2 m between each. The nets were towed for 15 min, five times a day, over a period of two days. Larvae were fixed in a 96% ethanol solution and dried just before DNA extraction following the procedure published by Sambrook et al. [34].

Larval sample
We used two samples, one from the Napo River and one from the Marañon River, to validate the method. The two samples of larvae were randomly divided into two subsets. One subset was used for DNA extraction from each individual larva (270 larvae from the Marañon and 102 larvae from the Napo). Part of the extracted DNA was pooled in equimolar quantities to form two samples named Mar-mock-NGS (for the Marañon larvae) and Nap-mock-NGS (for the Napo larvae). In the second subset, total DNA was extracted using all larvae together for the Marañon (250 larvae) and for the Napo (373 larvae) to form two bulk samples, Mar-bulk-NGS and Nap-bulk-NGS for the Marañon and Napo larvae, respectively. Each larva from the first subset was individually Sanger sequenced. We named the results of the Sanger sequencing of Napo and Marañon Nap-S and Mar-S, respectively. The mock and the bulk samples were sequenced using an Illumina platform.

COI amplification and sanger sequencing
A 680 bp fragment of the COI gene was amplified using universal primers Fish F1 and Fish R1 and PCR conditions as previously described by Hubert et al. [29]. PCR was performed with a Phusion1 High-Fidelity PCR Master Mix (ThermoFisher, 1040-2678) using the following program: 3 minutes at 98˚C followed by 30 cycles of 80 seconds at 98˚C, 45 seconds at 55˚C and one minute at 68˚C, with a final extension of 10 minutes at 72˚C. PCR products were sequenced using the BigDye Terminator 3.1 kit (Applied Biosystems). Sequencing was performed on an ABI prism 3130 (Applied Biosystems), at IRD Montpellier, (France) using the Fish F1 primer. Sequences are available on Dryad doi:10.5061/dryad.117tn (see S2 Table).

Library preparation
Preparation of biotinylated PCR probes for capture. COI probes were produced by PCR using the same conditions as above but with 5' biotinylated primers synthesized by Eurogentec SA (Angers France). Amplification was performed on four adults from four different species: Oxydoras niger (oni), Pimelodella sp. (pgra), Pinirampus pirinampu (ppi) and Pseudoplatystoma punctifer (ppu). The four species were selected to cover the panel of siluriform species present in our database, spanning a nucleotide divergence range of 20%. PCR probes were purified and pooled at equimolar concentrations for capture.
Library preparation and capture were performed as described in Mariac et al. [33]. Libraries were tagged using 6-bp oligonucleotides (TAG) to allow for multiplexing, and then hybridized with COI-targeted probes as described in Mariac et al. [33]. Briefly, total DNA was sheared using a Covaris E220 sonicator (Covaris, Woburn, Massachusetts, USA), end-repaired, ligated and nicks were filled in before pre-hybridization PCR was performed. These products were mixed and hybridized with biotin-labeled probes. The streptavidin-biotinylated probes-target complexes were magnetically immobilized during a 3 min. denaturation step at 96˚C to elute the enriched library fraction. A non-enriched library was also produced as control using genomic DNA from a subset of the Napo 2013 sample. This library was used to assess the efficiency of the capture method. The concentrations of the libraries were quantified by qPCR (Kappa Biosystems) after which paired-end sequencing was performed using MiSeq v2 reagents and 2 × 150 bp on an Illumina MiSeq v3 platform at the CIRAD facilities (Montpellier, France).

Data analysis
We built a database with sequences of COI genes from 86 adult siluriform species, referenced in different collections or publications (see S5 Table) and for which the phylogenetic relationships are established and discussed in Carvajal et al. [35]. The 86 sequences of COI used were deposited in GenBank (see S5 Table). This database contains only species of economic importance in Amazonia and consequently does not include all the genera of Siluriformes. The database allowed species identification of the samples by mapping sequences to those in the reference database.
The sequences produced using the Sanger method were examined with Geneious Pro 4.8.5 software [36]. Automatic filtering involved trimming regions with more than a 5% chance of error per base. Sequences were manually checked and any ambiguities corrected. The sequences were aligned with the reference database using MUSCLE (100 iterations, [37]). Finally, species assignment was computed by building a neighbor-joining tree using Tamura-Nei genetic distances.
The sequences obtained with the NGS method were demultiplexed using the script "demultadapt" (available at https://github.com/Maillol/demultadapt), that sorts reads according to their TAG (6 bp, 0 mismatch threshold). Adapters were removed using Cutadapt 1.2.1 software with the following parameters: length overlap = 7 bp, insert size = 35 bp. Reads were trimmed based on their nucleotide mean quality (Q < 30) using the script "filter-mean-quality" (available at https://github.com/SouthGreenPlatform/arcad-hts/blob/master/scripts/ arcad_hts_2_Filter_Fastq_On_Mean_Quality.pl). Sequences were then mapped on the reference database using the BWA mem v.0.7.12 package [38,39]. For each enriched library, we calculated 1) the percentage of useful reads, i.e. the ratio of the number of reads mapped on the database to the total number of reads, and 2) X-fold enrichment, which corresponds to the ratio of the number of reads from the enriched library mapped to the reference library to the number of reads from the non-enriched (genomic) library mapped to the reference library. X-fold measures the effectiveness (the enrichment) of the experiment. The percentage of useful reads measures the specificity of the enrichment.
As we had built up the bulk DNA from species we were able to identify using Sanger sequencing, we knew which species we expected to find using the NGS approach. We were thus able to identify false positives, i.e. species found using the NGS approach that were not present in the bulk DNA. Only properly paired reads with a maximum of one mismatch, no clipping, and no secondary alignment with the reference were used. Different filters were applied on the alignments: only the sequences for which both the R1 and R2 reads were mapped to the same species on the database were kept, and we kept only those mapped with quality > 0 (MAPQ>0), only sequences with strictly less than two mismatches, and an alignment over 100 bp. A given species was identified by listing which reference sequences the filtered reads were mapped to with a minimum coverage of 60%, otherwise we considered that it was not sufficiently precise for species level identification and kept only the genus taxonomic rank. The frequency of each identified species was calculated using the relative number of mapped reads divided by the total number of mapped reads across species. Finally, the composition and frequencies obtained from Sanger sequencing and from NGS sequencing were compared to validate the capture method. Pearson's product-moment correlations were calculated with R 3.2.5 [40].

Diversity study
Here, we describe a case study in which the new sequencing method was assessed over a period of eight months. To estimate the composition of siluriform species in samples collected monthly in the Ucayalí River between March and October 2014 (see S1 Table), we performed bulk DNA extraction, and used the NGS capture method on the eight samples to obtain the sequence data.

Identification of species composition
The COI marker was successfully Sanger sequenced for 270 and 102 larvae from the Marañon and Napo rivers, respectively. A total of 168 sequences corresponded to siluriform species. A total of 158 sequences were very close or identical to the 86 species of our database. Ten sequences presented some divergence from these 86 species. These ten sequences were redundant, and were narrowed down to three different sequences. These three sequences were added to our dataset, as we thought they might either represent three described species not yet sequenced or three new species. They were named Pimelodidae_Pimelodus_spA-C34, Pimelodidae_Hypophthalmus_sp. C95 and Pimelodidae_Hypophthalmus_sp.A289. The final database thus contained 89 sequences (Fig 1). Based on the revised reference dataset, the Sanger method identified 11 different species in the Marañon River sample (Mar-S) and 10 different species in the Napo River sample (Nap-S) ( Table 1).

Enrichment in COI sequences
X-fold enrichment ranged from 71 to 247-fold (see S1 Table), meaning that we found 71 to 247-fold more sequences from the target COI region after enrichment. After capture, the percentages of reads that mapped to the target region ranged from 0.29% to 1.03%.

Comparison of species identification by the NGS vs. the sanger method
The enrichment approach using the mock samples provided results that were largely congruent with those obtained with the Sanger sequencing method. The species compositions of the Marañon and Napo samples inferred from the mock-NGS capture method were identical to those obtained using the Sanger sequencing approach, which is considered as the reference (Table 1). Even species with very low frequency, such as Zungaro zungaro (0.98%) in the Nap-S sample and Brachyplatystoma rousseauxii (1.54%) in the Mar-S sample, were identified with the NGS capture method. We also identified a false positive in the NAP-mock-NGS sample ( Table 1). This species, Pseudoplatystoma punctifer, only represented 0.79% (27 reads) of the total number of reads.
The initial bulk of larvae was divided into two bulks and, as expected, low frequency species varied between the two bulks. Seven species not present in the Sanger reference were identified in the Marañon (4) and Napo (3) bulk-NGS samples. Conversely, five species in the Marañon sample and two in the Napo sample were present in the Sanger sample but were not retrieved in the respective bulk-NGS samples. The frequency of all these species was < 1.5%.
We would like to emphasize here that we successfully identified distant species with our four probes. The species Hypophthalmus sp.C95, which was identified in the Marañon and Napo mock samples using the NGS capture method, was the most genetically distant species [41] in the four probes (see S3 Table). This species presented 87.6% pairwise identity with the closest probe designed from Pseudoplatystoma punctifer, implying that the probes are able to capture very distant targets, even if the species is present at a low frequency (0.98%) in the sample.

Analysis of estimated frequencies
Species frequencies were very well correlated between Sanger sequencing and the bulk of larvae. Strong significant correlations were found between Sanger and NGS frequencies: r = 0.96 (p < 0.001) for the mock-NGS and r = 0.94 (p < 0.001) for bulk-NGS (Fig 2).The samples were characterized by one or two dominant species: in Mar-S, B. vaillantii represented 62% of the species and in Nap-S, B. vaillantii plus Pimelodus blochii represented 79.4% of the species present ( Table 1). The frequencies of the remaining species rapidly dropped below 10%.

Application to temporal variability of the samples
Changes in species richness over the eight month study period (Fig 3) tended to correlate with the hydrological cycle: species diversity was lower during high water and falling water periods, and higher during low water and rising water periods. Each batch was represented by one or two main species, plus a number of species present at lower frequencies. The number of species identified in each sample ranged from 2 to 13.
Detailed examination of species composition (see S4 Table) showed that the species belonged to two groups: those present almost all year round, such as B. vaillantii and B. filamentosum; and those present for only a few months, such as P. pirinampu.

Discussion
The aim of this study was to validate and use a new approach to study the specific diversity of siluriform larvae in river plankton samples. We validated the method and provided evidence for its reliability. Overall, the NGS approach made it possible to identify and quantify the species composition in a swarm of larvae in each sample. The enrichment rate was high, i.e. a 71 to 247-fold increase, and in agreement with observations in the literature [42][43][44]. The percentage of useful reads is still relatively low, but this likely reflects the size of our target sequence relative to the whole genome. Although the rate of useful reads is low, it is nevertheless cost effective and labor saving. Considering only the sequencing cost (library preparation # US$10 per sample can be disregarded), sequencing 300 larvae in each of 30 samples (9000 larvae) at US$5 per larva (Sanger sequencing) is very expensive compared to a single run of MISEQ sequencing at US$1,100 for the same 30 samples. Several strategies have been proposed to increase the percentage of useful reads including two rounds of capture [45]. This would be a useful improvement but it needs to be tested to see if the double capture does not bias the estimation of species frequencies.
Comparing the Sanger method and the NGS method enabled us to validate the qualitative estimation of species composition, although we also found one false positive P. punctifer in the  Table 1. Data on the alignment of the COI sequences for the Napo and Maranon samples. The table summarizes the results and estimated frequencies of each species identified using next generation sequencing (NGS) and the Sanger approach. For the next generation approaches, a "Mock" was prepared with equimolar DNA from each larva and a "Bulk" with DNA extracted from a batch of larvae. The number of reads, the percentage of mismatches, the percentage of base coverage, the average coverage depth, maximum coverage and estimated frequency are given for each species. For the Sanger approach (Nap-S, Mar-S), the number of individuals sequenced and the frequency of the species is given. Species are denoted including the family name "Family_Genus_species". When coverage was considered low (below 60%), the data are underlined and the identification is only given at the genus level (denoted Family_Genus_(species)_spX).

Mar-Sanger
Mar-Bulk-NGS   bulk Nap-NGSeq sample, which we knew from the Sanger method was not present. This species is genetically very closely related to P. tigrinum, which was also found in the sample (97.8% pairwise identity on 685 nucleotides). Consequently, the false positive could be the consequence of mapping to a very similar sequence. Longer reads might improve mapping precision in the future. The two samples from the Marañon and Napo rivers were randomly but equally divided into two halves, one for the Sanger (Mar-S and Nap-S) and equimolar (mock-NGS) analyses and the other for the non-equimolar analyses (bulk-NGS). When we compared the Sanger or mock-NGS with the bulk-NGS, we observed differences in the presence of low frequency species (<1.5% in the Sanger sample). This is not surprising, as species found once or at a low  frequency in a whole sample may be found in only one of the split samples. Overall, predictability of the relative frequency of species was very high. For low frequency species (below 10%) quantification might include some over-and underestimation. This might result from the variability in standardization or the difference in genome size between species, both of which lead to variation in the targeted COI sequences between larvae.
One limit worth mentioning is the need for a reference database to map our NGS sequences. Here we focused on commercially important and hence widely distributed species. Our reference contained 86 species of Siluriformes, which does not represent the whole species diversity of Siluriformes, but we only added three putative species to our database out of 564 Sanger sequenced larvae. In a few cases, we were only able to identify a larva to the genus level. As new species are sequenced, the database will be enriched and the precision of identifications improved.
Using our database, we were able to assess the temporal and spatial dynamics of species reproduction in the Ucayalí River. The results showed that the breeding activity of the Pimelodidae family was spread out over the eight months of the study. However, the results also suggested that more species reproduce during the falling and rising water periods, which is consistent with the results of earlier studies on adults [14,17] and on larvae [25,46]. Together with the work of García-Dávila et al. [25], who used individual Sanger sequencing, this study allows the first assessment of the life cycle of several siluriform species analyzed together. Until now, the life cycle of only one species, B. rousseauxii [14] had been studied in the Peruvian Amazon. Although more samples will improve our understanding of this dynamics, this first study provides very interesting insights into both diversity dynamics and the reproductive activity of some species at specific periods. Our NGS metabarcoding method, which enables the analysis of several thousand larvae in a bulk sample, opens new opportunities for studying the breeding cycles and recruitment of fish species in Amazonia. Indeed, with specially designed sampling strategies, it will be possible to estimate the breeding seasons, the spawning grounds and the recruitment patterns of a large number of commercially important migratory siluriform species. As creating probes is simple and can be used on phylogenetically distant species, as shown by Liu et al. [47], this method could also be extended to other groups, such as Characiformes and Gymnotiformes. It could also be extended from a single gene to a whole mitochondrial genome.

Conclusion
Compared to the classical Sanger method, here we demonstrate the reliability of our new metabarcoding method with enrichment by capture using probes. Our approach led to both specific identification of larvae in swarms and to the quantification of their relative abundance. The NGS approach has the potential to greatly improve our understanding of fish recruitment, and thus help develop adequate fisheries management and conservation strategies for Amazonian fish species. In addition, considering the alarming trend in the planning and construction of hydroelectric dams in the Amazon basin [48][49][50], quantifying the relative contribution of major sub-basins to the recruitment of important fish species might help guide decisions made by stakeholders and preserve river connectivity.