DNA Barcoding Green Microalgae Isolated from Neotropical Inland Waters

This study evaluated the feasibility of using the Ribulose Bisphosphate Carboxylase Large subunit gene (rbcL) and the Internal Transcribed Spacers 1 and 2 of the nuclear rDNA (nuITS1 and nuITS2) as markers for identifying a very diverse, albeit poorly known, group of green microalgae from neotropical inland waters. Fifty-one freshwater green microalgae strains isolated from Brazil, the largest biodiversity reservoir in the neotropics, were submitted to DNA barcoding. Currently available universal primers for ITS1-5.8S-ITS2 region amplification were sufficient to successfully amplify and sequence 47 (92%) of the samples. On the other hand, new sets of primers had to be designed for rbcL, which allowed 96% of the samples to be sequenced. Thirty-five percent of the strains could be unambiguously identified to the species level based on either nuITS1 or nuITS2 sequences using barcode gap calculations. nuITS2 Compensatory Base Change (CBC) and ITS1-5.8S-ITS2 region phylogenetic analysis, together with morphological inspection, confirmed the identification accuracy. In contrast, only 6% of the strains could be assigned to the correct species based solely on rbcL sequences. In conclusion, the data presented here indicate that either nuITS1 or nuITS2 is a useful marker for DNA barcoding of freshwater green microalgae, with an advantage for nuITS2 due to the greater availability of analytical tools and reference barcodes deposited in databases for this marker.

The green algae, Chlorophyta, are an ancient and taxonomically diverse lineage with approximately 8,000 described species [18,19]. It is estimated that at least 5,000 species still remain undescribed, notably in tropical and subtropical areas [19]. Chlorophytes are important producers in aquatic and humid terrestrial ecosystems and are often used as bioindicators in water monitoring and ecological studies [20,21]. In addition, there is a growing interest in using green microalgae for biotechnological applications such as the production of fuels, chemicals, food and animal feed [22,23]. The identification of green microalgae can be a difficult task and often requires careful microscopic examination of live cultured cells by a trained specialist [14,24,25]. Even so, the presence of cryptic species and the phenotypic plasticity found in some species may hamper conclusive morphological species diagnosis [26,27]. DNA barcodes could provide the means to identify green microalgae consistently and rapidly, regardless of life stage [13,28,29].
Targets for potential Chlorophyta DNA barcodes have included chloroplast (rbcL, tufA and Cp23S), mitochondrial (COI) and nuclear genes (18S rDNA, nuITS1 and nuITS2) [13,28-30]. However, none of these markers were considered ideal for use across all lineages tested [13,29,31,32]. Given the complexity and heterogeneity of chlorophytes, the protist working group of the Consortium for the Barcode of Life (CBOL) recommended the use of a two-step barcoding pipeline in which a universal pre-barcode marker should be used first, followed by the use of a group-specific second barcode [29]. A dual marker barcode based on the matK and rbcL genes has been formally proposed for use in DNA barcoding of embryophytes [4]. However, the matK gene is absent in chlorophytes, precluding its use in this group [33]. Despite the unavailability of a universal PCR toolkit for rbcL amplification, this marker is considered a promising barcode for green algae [13]. Indeed, there are currently 4,449 rbcL sequences from chlorophyte species deposited at the Barcode of Life Data Systems (BOLD), a taxonomically curated database [3]. Apart from rbcL, the most promising candidates for green microalgae barcoding are the nuITS1 and nuITS2 markers [13,14,26,28,30,34]. The ITS1-5.8S-ITS2 region from virtually all Viridiplantae can be amplified with a single set of universal primers [35], despite these being markers of high variability [13]. Furthermore, it is possible to analyze not only the nuITS1 and nuITS2 primary sequences, but also their secondary structures [36]. Although there are reports indicating that nuITS1 and nuITS2 might be insufficiently conserved or confounded by introgression or biparental inheritance patterns, a growing body of evidence has shown that simultaneous analysis of nucleotide data and compensatory base changes (CBCs) with secondary structure information can overcome most of the limitations of this potential barcode [14,28,30].
In addition, nuITS1 and nuITS2 have been the molecular markers of choice in several recent taxonomic revisions of freshwater chlorophyte species that were based on integrated morphological, physiological and molecular approaches [14,26,27,34,37-42]. The use of nuITS1- and nuITS2-based phylogenies promoted considerable changes in green microalgae taxonomy, especially in taxa with simple morphology and few ultrastructural characteristics such as coccoid chlorophytes [26,27].
This study aimed to identify neotropical green microalgae specimens isolated from Brazilian inland waters through the use of rbcL, nuITS1 and nuITS2 molecular markers as DNA barcodes. Brazilian continental waters comprise a biodiversity reservoir of enormous global significance and might contain up to 25% of the world's algae species [43]. Novel primers for amplification and sequencing of the rbcL gene from neotropical specimens are presented, as well as comparisons between rbcL, nuITS1 and nuITS2 in terms of marker variability, primer universality, and database accuracy and comprehensiveness.

Isolation and culturing
All sample collections were made under the authorization SISBIO #39146 (09/26/2013) granted by the Instituto Chico Mendes de Conservação da Biodiversidade (ICMBio) of the Brazilian Ministry of the Environment (MMA). The collections made on private land were also authorized by the owner of the land. This study did not involve endangered or protected species. Water samples were collected from the sites shown in S1 Fig. The collection environments included natural freshwater bodies within the Amazon rainforest, the Cerrado savanna and the Pantanal flooded grasslands, as well as anthropogenic wastewater deposits from the sugarcane industry (vinasse), pisciculture ponds and wastewater from swine farming. Sampling areas were delimited as a 1 km radius centered on the geographic coordinates shown in S1 Fig. The collected environmental samples were submitted to an enrichment step through suspension in modified Bold's Basal Medium (BBM) [44] and subsequent culturing at 28°C, with a light intensity of 50 μE m⁻² s⁻¹ and a 16/8 h light/dark regime. After 15 days of culture, the microalgae strains were isolated by two subsequent rounds of subculturing on BBM agar plates supplemented with ampicillin (100 μg/mL), chloramphenicol (25 μg/mL) and amphotericin B (2.5 μg/mL) under the same conditions described above. Individualized macroscopic colonies on agar plates were collected and inoculated into liquid BBM medium to derive axenic cultures. The absence of contaminants was confirmed through microscopic inspection. The isolated strains were deposited in the Collection of Microorganisms and Microalgae Applied to Agroenergy and Biorefineries at Embrapa (Brasília/DF-Brazil).

DNA extraction, amplification and sequencing
Total genomic DNA was isolated from 30 mg of fresh algal biomass using the Cetyl Trimethylammonium Bromide (CTAB) DNA extraction protocol adapted by [45]. The rbcL and ITS1-5.8S-ITS2 DNA regions were submitted to PCR amplification using the primers described in Table 1. The 25 μL PCR reaction mix was composed of 14.5 μL of ultrapure water, 5 μL of GoTaq 5X PCR buffer, 1.5 μL of 25 mM MgCl2, 0.75 μL of 10 mg/mL BSA, 0.5 μL of 10 mM dNTPs, 0.25 μL of GoTaq DNA polymerase (5 U/μL) (Promega, USA), 0.25 μL of each primer (10 μM) and 2.0 μL of DNA template (50-100 ng/μL). The PCR amplification protocol used for both markers was: 96°C for 5 min; 40 cycles of 96°C for 1 min, the primer annealing temperature (see Table 1) for 1 min and 72°C for 1 min; and a final extension at 72°C for 5 min. The PCR products (5 μL) were visualized on agarose gels and selected for direct sequencing. Sequences were determined bi-directionally for at least two different amplicons using the BigDye Terminator v.3.1 Cycle Sequencing Kit on the ABI 3130 automated DNA sequencer (both from Life Technologies, USA), in accordance with the manufacturer's instructions. The forward and reverse sequences were aligned and edited using Geneious 6.1 software [46], generating consensus nucleotide positions with QV20. Sequences were deposited in GenBank under the accession numbers: rbcL sequences (KT307991 to KT308039); ITS1-5.8S-ITS2 sequences (KT308040 to KT308042; KT308046 to KT308076; KT308078 to KT308086; KT445859 to KT445863).
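As a quick arithmetic check of the reaction mix described above, the following Python sketch sums the listed volumes and derives the final concentration of each reagent in the 25 μL reaction. The volumes and stock concentrations are taken from the text; the `master_mix` helper and its 10% pipetting excess are illustrative assumptions, not part of the published protocol.

```python
# Per-reaction components as listed in the text:
# name -> (volume in uL, stock concentration or None, unit or None)
REACTION_UL = 25.0

components = {
    "ultrapure water": (14.5, None, None),
    "GoTaq buffer": (5.0, 5.0, "X"),
    "MgCl2": (1.5, 25.0, "mM"),
    "BSA": (0.75, 10.0, "mg/mL"),
    "dNTPs": (0.5, 10.0, "mM"),
    "GoTaq polymerase": (0.25, 5.0, "U/uL"),
    "forward primer": (0.25, 10.0, "uM"),
    "reverse primer": (0.25, 10.0, "uM"),
    "DNA template": (2.0, None, None),
}


def total_volume(mix):
    """Sum of all component volumes (uL)."""
    return sum(vol for vol, _, _ in mix.values())


def final_concentration(mix, name, reaction_ul=REACTION_UL):
    """Dilution formula C_final = C_stock * V_added / V_total."""
    vol, stock, _unit = mix[name]
    if stock is None:
        raise ValueError(f"{name} has no stock concentration")
    return stock * vol / reaction_ul


def master_mix(mix, n_reactions, excess=0.10, per_sample=("DNA template",)):
    """Hypothetical helper: scale shared reagents for n reactions.

    The 10% excess is an assumed pipetting allowance, not a value from the paper.
    """
    scale = n_reactions * (1 + excess)
    return {
        name: round(vol * scale, 2)
        for name, (vol, _, _) in mix.items()
        if name not in per_sample
    }


if __name__ == "__main__":
    print("total volume (uL):", total_volume(components))
    for reagent in ("GoTaq buffer", "MgCl2", "BSA", "dNTPs", "forward primer"):
        unit = components[reagent][2]
        print(reagent, final_concentration(components, reagent), unit)
```

Under these figures the volumes add up to exactly 25 μL, and the reaction contains 1X buffer, 1.5 mM added MgCl2, 0.3 mg/mL BSA, 0.2 mM dNTPs and 0.1 μM of each primer.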

Molecular data analysis
Sequences were aligned automatically using ClustalW [47] under default parameters in MEGA5 software [48]. The nuITS1, 5.8S and nuITS2 sequences were annotated using ITSx v. 1.0.11 [49]. For similarity searches, the rbcL sequences were submitted to the Barcode of Life Data Systems (BOLD Systems) using the Plant identification tool, while nuITS2 sequences were submitted to the Basic Local Alignment Search Tool (BLASTN) for comparisons against nucleotide sequences deposited at GenBank. The nuITS2 secondary structures were predicted by either direct fold (energy minimization) or homology modelling [50]. Subsequently, in order to locate hemi-compensatory base changes (hemi-CBCs) and compensatory base changes (CBCs), each sequence-structure pair and its top match in the ITS2 Blast tool were aligned and analyzed with 4SALE v. 1.7 [51,52]. The barcode gap was inferred based on uncorrected pairwise (p) distance matrices calculated with MEGA5 software. The taxon samplings used were reference nuITS1, nuITS2 and rbcL sequences derived from recent taxonomic revisions of the Chlorella and Desmodesmus genera [14,53,54] (S1-S3 Tables). The maximum intraspecific distances and minimum interspecific distances obtained were computed.
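For readers unfamiliar with the uncorrected p-distance, the minimal Python sketch below computes it for pre-aligned sequences as the proportion of compared sites that differ. It is an illustration only, not the MEGA5 implementation: the treatment of gaps and ambiguous sites (pairwise deletion) is an assumption, since the text does not state which option was used, and the function names are hypothetical.

```python
from itertools import combinations

# Characters treated as missing data (assumption: pairwise deletion of such sites).
MISSING = set("-?N")


def p_distance(seq_a, seq_b):
    """Uncorrected p-distance: differing sites / compared sites."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in MISSING or b in MISSING:
            continue  # skip sites with a gap or missing data in either sequence
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def distance_matrix(alignment):
    """Pairwise p-distances for a dict of {sequence_id: aligned_sequence}."""
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


if __name__ == "__main__":
    # Toy alignment, not real data.
    toy = {"s1": "ACGTACGTAC", "s2": "ACGTACGTAT", "s3": "ACGT-CGTAC"}
    for pair, dist in distance_matrix(toy).items():
        print(pair, round(dist, 3))
```

Because the distance is a simple proportion, a sequence pair with 96% identity over the compared sites corresponds to a p-distance of roughly 0.04.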

Morphologic Identification
Microscopic morphological identification at the genus level was performed according to Bellinger & Sigee, 2015 [56]. Further identification to the species level was accomplished by comparison with the original species descriptions available at AlgaeBase [57]. In the case of as yet undescribed species, the morphological comparisons were made with the closest strains obtained in the molecular identification step: Desmodesmus sp. MAT2008c [58]; Micractinium sp. CCAP 211/92 [39]; Desmodesmus sp. GM4a [59]. A Carl Zeiss Axio Imager A2 microscope (Zeiss.co, Brazil) equipped with Differential Interference Contrast (DIC) was used for morphological analysis.

Barcode markers primer universality
A total of 51 unialgal strains (named Embrapa|LBA#1 to #51) were isolated from natural water bodies within the Cerrado savanna, the Pantanal wetlands and the Amazon rainforest, as well as anthropogenic wastewater deposits (S1 Fig). Coccoid morphotypes were the most abundant among the isolated strains (51%), followed by monadoid/palmelloid morphotypes (41%) (data not shown).
The ITS1-5.8S-ITS2 region could be successfully sequenced from DNA samples extracted from 47 strains (92.15% sequencing success rate) by using the universal primers described by White and coworkers (1990) [35] (Table 1). Even though all 51 samples could be amplified with this set of primers, the presence of multiple PCR products impaired direct sequencing of four samples. On the other hand, the sequencing success rate obtained using the rbcL gene universal primer sets described by Hall and coworkers (2010) [13] or the sets proposed for embryophytes by the CBOL Plant working group [4] ranged from 0% to 15.69% (Table 1). In order to circumvent this problem, new sets of primers targeting partial amplification of the rbcL gene (Table 1) were designed based on 175 rbcL reference sequences from distinct Chlorophyta taxa mined from BOLD Systems. The newly designed primer pairs Fw_rbcL_192/Rv_rbcL_657 and Fw_rbcL_357/Rv_rbcL-1089 could successfully amplify and sequence 82.35% and 50.98% of the dataset, respectively (Table 1). The combination of the sequencing results from both these rbcL primer pairs allowed the construction of quality consensus sequences (QV20) for 49 samples (96.08% sequencing success rate). A total of 18 distinct 5.8S genotypes, 23 distinct nuITS1 genotypes, 23 distinct nuITS2 genotypes and 26 distinct rbcL genotypes were obtained.

Similarity search based on nuITS1, nuITS2 and rbcL markers
In order to perform the molecular identification of Embrapa|LBA strains, the rbcL sequences obtained were submitted to similarity searches against the dedicated DNA barcoding database, BOLD Systems. The closest matches retrieved for rbcL sequences ranged from 90% to 99% similarity (Table 2). Currently, there are very few nuITS1 and nuITS2 sequences from chlorophytes deposited at taxonomically curated databases such as BOLD; therefore, similarity searches for these markers were performed against GenBank. The closest matches retrieved for nuITS1 sequences ranged from 70% to 100% similarity and those for nuITS2 sequences ranged from 81% to 100% similarity (Table 2). Embrapa|LBA strains retrieved matches from species that belong to the Chlorophyceae and Trebouxiophyceae classes, especially to the orders Chlamydomonadales, Chlorococcales, Sphaeropleales and Chlorellales (Table 2). Ten nuITS1 sequences, 14 nuITS2 sequences and no rbcL sequences retrieved matches with 100% similarity (Table 2).

Barcode gap analysis
Similarity searches constitute only the first step of DNA barcoding, since they provide information about the closest matches present in reference databases, but not necessarily species-level identification. In order to establish a genetic distance threshold for species-level identification that is applicable to chlorophytes, barcode gap analyses were conducted based on reference sequences from two species-dense green microalgae genera, Chlorella and Desmodesmus (S2-S4 Figs; S1-S3 Tables).
Additionally, even though the nuITS2 sequence of Embrapa|LBA#50 presents only 96% identity to its closest GenBank match, species-level identification can also be considered to have been achieved, since the lowest interspecific distance calculated specifically for Chlorella nuITS2 sequences is 0.076 (S3A Fig). On the other hand, rbcL-based identification assigned the Embrapa|LBA #32-34 and #42-44 strains to Chlorella pyrenoidosa, which is not currently a taxonomically accepted name [57]. Therefore, the Embrapa|LBA #32-34 and #42-44 strains were excluded from the subset of strains identified to the species level based on rbcL sequences.
In conclusion, the results presented so far indicate that 18, 18 and 3 Embrapa|LBA strains were identified to the species level based on nuITS1, nuITS2 or rbcL sequences, respectively.
Table 2 legend (partial). The compensatory and hemi-compensatory base changes (CBCs/hemi-CBCs) between the indicated sequence and its closest match in the ITS2 Database are shown. A hyphen (-) is indicated for samples that could not be amplified and/or sequenced, and for the nuITS2 sequences for which secondary structure predictions and CBCs/hemi-CBCs analysis were not possible. doi:10.1371/journal.pone.0149284.t002

Discussion
A dual marker DNA barcode system has been proposed as a potential solution to cope with the great diversity of protists; however, there is no current consensus about which markers should be used [29,32]. Ideally, the two chosen markers should be easily amplified and sequenced using a single set of primers and be sufficiently variable to permit clear species delimitation without loss of the phylogenetic signal [29,32]. Even though tufA has been reported to be a promising barcode for chlorophytes [13,31,32], the number of green algae tufA sequences deposited at GenBank is three times lower than the number of deposits for the protein-coding plastid gene rbcL or the non-coding regions of nuclear rDNA ITS1 and ITS2 (over 6,000 sequences deposited for rbcL and nuITS1 and over 7,000 sequences deposited for nuITS2 up to December 2015). Furthermore, recent taxonomic revisions of green algae have been based mainly on rbcL, nuITS1 or nuITS2 sequences [14,26,27,32,34,37-42,60]. In addition, there are thousands of rbcL sequences from chlorophytes deposited at BOLD Systems, which is the most complete taxonomically curated DNA database available [3]. Therefore, although a formal proposal for Chlorophyta DNA barcodes has not been made, a preference for rbcL, nuITS1 and nuITS2 markers by several research groups involved in green algae taxonomy can be observed.
Brazil holds the largest reservoir of algal genetic resources in the neotropical region [43,61]. In order to evaluate the applicability of nuITS1, nuITS2 and rbcL markers as DNA barcodes for neotropical freshwater chlorophytes, a subset of green microalgae strains was isolated from Brazilian inland water bodies (S1 Fig). This study, however, did not intend to perform an exhaustive sampling of all the Chlorophyta taxa present in the neotropics. Instead, it used specimens from this largely unexplored biodiversity hotspot as a test case. DNA from all 51 Embrapa|LBA strains could be amplified and sequenced for at least one of the markers tested. The higher primer universality obtained for the ITS1-5.8S-ITS2 region compared to the rbcL marker (Table 1) is in agreement with previous studies [13,28,62]. This can be explained by the presence of highly conserved regions flanking the nuITS1 and nuITS2 markers, such as the 18S and 28S rDNA genes, which function as annealing sites for the primers described by White and coworkers (1990) [35]; no such conserved flanking regions are available for the rbcL gene.
The levels of nucleotide diversity observed among the 5.8S, nuITS1, nuITS2 and rbcL sequences were 0.046, 0.537, 0.321 and 0.250, respectively. Indeed, although the diversity of the nuITS1, nuITS2 and rbcL markers may fluctuate depending on the taxa analyzed, these markers rank among the most diverse barcode candidates for chlorophytes [13,28,31]. On the other hand, the 5.8S marker might not present sufficient resolution for species discrimination. Therefore, although other studies used the nuclear rDNA region ITS1-5.8S-ITS2 as a barcode for Chlorophyta [14,34,39], in this study the nuITS1 and nuITS2 regions were used separately to avoid the bias in genetic distance calculations that could be introduced by the simultaneous analysis of DNA regions with distinct evolutionary rates.
It is noteworthy that 53% of the nuITS1 and 42% of the nuITS2 matches retrieved from GenBank lacked the Latin binomial that characterizes the complete species name, compared to 10% of the rbcL matches retrieved from BOLD (Table 2). This might be due to the combination of two factors: i) CBOL's effort to preserve traditional taxonomic nomenclature; ii) The overall tendency in phycology to gradually move away from species identifiers based on Latin binomials pushed by the faster rate of genetic information discovery compared with the traditional taxonomic descriptions [24]. Importantly, species names that are not currently taxonomically accepted were found at both the BOLD and GenBank databases. That is the case, for example, of the strains Embrapa|LBA#32-34 and #42-44, which were assigned as Chlorella pyrenoidosa (Table 2), currently Pseudochlorella pyrenoidosa [26,38], at BOLD systems. Although this finding is not unexpected within GenBank, it is especially relevant in a taxonomically curated database such as BOLD. A possible explanation is that these are, actually, non-validated reference sequences mined directly from GenBank that are currently under taxonomic revision by BOLD collaborators. Indeed, it can be observed that the Acutodesmus obliquus rbcL reference sequence DQ396875.1 retrieved from BOLD (Table 2) is deposited with the old species name, Scenedesmus obliquus, at GenBank (data not shown).
Only a few sequences retrieved matches with 100% identity from GenBank and BOLD (Table 2), suggesting incomplete taxa coverage within the reference databases analyzed. This is corroborated by the fact that there were fewer than 500 rbcL records from the neotropical region (only 21 from Brazil) deposited at BOLD up to July 2015. Thus, it seems that the incongruences observed between species names retrieved from nuITS1, nuITS2 and rbcL similarity searches (Table 2) are mainly due to the incompleteness of the reference databases rather than to real conflicts derived from distinct species identification by each marker. This is important to consider, since the possibility of biased performance when using search algorithms such as BLAST, eventually leading to sample misidentification, is increased when analyzing poorly sampled groups [63].
Barcode gap analyses can provide the means to improve the accuracy of species-level identification [1,17]. A barcode gap is present when the maximum intraspecific distance is lower than the minimum interspecific distance for a certain taxon, thereby revealing a corresponding distance threshold that can be applied to delimit species [17]. However, the same distance threshold may not be applicable to every species and should be determined for each taxon analyzed [32,63,64]. Due to the unavailability of a complete set of reference sequences for most of the taxa listed in Table 2, the analyses were based on sequences from the Chlorella and Desmodesmus genera for nuITS1 and nuITS2, and from the Desmodesmus genus for rbcL. These reliable reference barcode sequences originate from recent revisions of these genera based on integrative taxonomy approaches (S2-S4 Figs; S1-S3 Tables). As expected, the barcode gap analyses based on the nuITS1, nuITS2 and rbcL markers (S2-S4 Figs) indicate that it is not possible to establish a single universal distance threshold that would avoid incorrect identifications and, at the same time, include all specimens in the correct species. However, assuming that incorrect specimen identification is more problematic than simply not assigning a specimen to any species, distance thresholds were inferred for each marker based on the minimum interspecific distances observed (S2-S4 Figs), allowing species-level identification.
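The barcode gap definition and the conservative thresholding rule described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the function names, the toy distances and the strict "less than" comparison are hypothetical, and only the 0.076 threshold (the minimum interspecific nuITS2 distance reported for Chlorella) and the 96% identity of Embrapa|LBA#50 come from the text. Treating 1 minus the reported identity as a p-distance is itself an approximation.

```python
def barcode_gap(distances, species_of):
    """Summarize intra- vs interspecific distances for one taxon.

    distances:  {(id_a, id_b): pairwise distance}
    species_of: {sequence_id: species name}
    """
    intra, inter = [], []
    for (a, b), dist in distances.items():
        if species_of[a] == species_of[b]:
            intra.append(dist)
        else:
            inter.append(dist)
    if not intra or not inter:
        raise ValueError("need both intraspecific and interspecific pairs")
    max_intra, min_inter = max(intra), min(inter)
    return {
        "max_intraspecific": max_intra,
        "min_interspecific": min_inter,
        "gap_present": max_intra < min_inter,
    }


def assign_species(query_distance, closest_match, min_interspecific):
    """Conservative rule: accept the closest match only if the query is closer
    to it than any two distinct reference species are to each other."""
    return closest_match if query_distance < min_interspecific else None


if __name__ == "__main__":
    # Toy reference data, not taken from the study.
    toy_distances = {("a1", "a2"): 0.01, ("a1", "b1"): 0.09, ("a2", "b1"): 0.08}
    toy_species = {"a1": "species A", "a2": "species A", "b1": "species B"}
    print(barcode_gap(toy_distances, toy_species))

    # Embrapa|LBA#50: 96% identity, i.e. a distance of about 0.04, against
    # the 0.076 threshold reported for Chlorella nuITS2.
    print(assign_species(0.04, "closest GenBank match", 0.076))
```

Using the minimum interspecific distance as the cut-off favours leaving a specimen unassigned over assigning it to the wrong species, which matches the priority stated in the text.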
There are several reports suggesting that the presence of compensatory base changes (CBCs) in nuITS2 secondary structures correlates with reproductive isolation [65-67]. Large-scale testing with ~300,000 nuITS2 secondary structures revealed that, if a CBC is present, the two sequences belong to different species with a probability of ~93% [65,67]. Therefore, the detection of CBCs between the nuITS2 sequences of Embrapa|LBA strains and their closest matches at GenBank seems to be a reasonable predictor that species-level identification has not been achieved. Accordingly, the CBC analyses shown in Table 2 corroborate the species-level identification achieved based on barcode gap calculations. Additionally, the morphological (Fig 1) and phylogenetic analyses (Figs 2 and 3) also corroborate the species-level identification based on barcode gap calculations.
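To make the CBC concept concrete, the Python sketch below counts CBCs and hemi-CBCs between two aligned sequences. It is a simplified illustration and not the 4SALE algorithm: it assumes both sequences share a single consensus secondary structure in dot-bracket notation with the same columns as the alignment, and it only scores paired sites where both sequences form a canonical pair (Watson-Crick or G-U wobble). All names and the toy sequences are hypothetical.

```python
# Base pairs considered able to form a helix (Watson-Crick plus G-U wobble).
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def paired_columns(structure):
    """Return (i, j) index pairs for matching brackets in a dot-bracket string."""
    stack, pairs = [], []
    for idx, char in enumerate(structure):
        if char == "(":
            stack.append(idx)
        elif char == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), idx))
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def count_cbcs(seq_a, seq_b, structure):
    """Count (CBCs, hemi-CBCs) between two aligned sequences.

    CBC: both nucleotides of a paired site differ and pairing is retained.
    Hemi-CBC: only one nucleotide differs and pairing is retained.
    """
    if not (len(seq_a) == len(seq_b) == len(structure)):
        raise ValueError("sequences and structure must have equal length")
    a = seq_a.upper().replace("T", "U")
    b = seq_b.upper().replace("T", "U")
    cbcs = hemi_cbcs = 0
    for i, j in paired_columns(structure):
        pair_a, pair_b = (a[i], a[j]), (b[i], b[j])
        if pair_a not in CANONICAL or pair_b not in CANONICAL:
            continue  # gaps or non-pairing bases: not scored in this sketch
        changed = (pair_a[0] != pair_b[0]) + (pair_a[1] != pair_b[1])
        if changed == 2:
            cbcs += 1
        elif changed == 1:
            hemi_cbcs += 1
    return cbcs, hemi_cbcs


if __name__ == "__main__":
    structure = "((..))"
    print(count_cbcs("GCAAGC", "GUAAAC", structure))  # C-G replaced by U-A
    print(count_cbcs("GCAAGC", "GCAAGU", structure))  # G-C replaced by G-U
```

In the first toy comparison one paired site changes from C-G to U-A (a CBC); in the second, one site changes from G-C to G-U (a hemi-CBC).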
The DNA barcoding results presented here, using a subset of neotropical freshwater green microalgae as a test case, suggest that nuITS1 and nuITS2 are the most useful markers, while rbcL presented lower primer universality and species-level identification power. Although both nuITS1 and nuITS2 identified the same 18 strains to the species level based on barcode gap calculations, nuITS2 benefits from a more complete set of reference sequences deposited in databases and from an automated and well-developed pipeline for secondary structure analysis [50]. S5 Fig depicts the tentative DNA barcoding workflow for green microalgae specimens based on the results presented.
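The decision steps summarized in the S5 Fig caption, steps (a) to (d), can be expressed as a small Python function. This is a hedged sketch rather than the authors' software: the `Its2Evidence` fields, the returned labels and the strict "less than" threshold comparison are assumptions introduced for illustration, and a missing CBC result is treated here as "not confirmed".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Its2Evidence:
    """Inputs gathered for one query nuITS2 sequence (hypothetical structure)."""
    closest_match: str               # name of the closest GenBank match
    distance_to_match: float         # genetic distance to that match
    min_interspecific: float         # barcode gap threshold for the taxon
    cbc_count: Optional[int]         # None if no structure could be predicted
    name_accepted: bool              # status checked in e.g. AlgaeBase


def diagnose(evidence):
    """Return (status, reason) following steps (a) to (d) of the roadmap."""
    # (a) similarity must be compatible with the barcode gap threshold
    if evidence.distance_to_match >= evidence.min_interspecific:
        return ("unresolved", "distance not below threshold; try other markers")
    # (b) absence of CBCs is needed to confirm the diagnosis
    if evidence.cbc_count is None:
        return ("unresolved", "CBC analysis not possible; try other methods")
    if evidence.cbc_count > 0:
        return ("unresolved", "CBC detected; likely a different species")
    # (c) the assigned name must be currently accepted
    if not evidence.name_accepted:
        return ("species-level", "name not currently accepted; update it")
    return ("species-level", "identified as " + evidence.closest_match)


if __name__ == "__main__":
    example = Its2Evidence(
        closest_match="closest GenBank match",
        distance_to_match=0.04,
        min_interspecific=0.076,
        cbc_count=0,
        name_accepted=True,
    )
    print(diagnose(example))
```

Ordering the checks this way means that the comparatively cheap similarity search filters queries before the secondary structure analysis, which is the order suggested by the roadmap.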

Conclusions
DNA barcoding can make specimen identification to the species level faster, more reliable and accessible to non-specialists. Defining the appropriate DNA barcodes for Chlorophyta identification and the availability of taxonomically curated DNA databases are pivotal to this task. The results presented here indicate that a DNA barcoding pipeline based on nuITS2 should be useful for green microalgae species identification. It is clear, however, that there is an urgent need for the deposition of more taxonomically accurate reference barcodes in curated databases (e.g., BOLD Systems). Therefore, extensive efforts on integrative taxonomy are crucial, ideally encompassing the use of both DNA markers. These studies are especially relevant for poorly studied taxa such as tropical chlorophytes.
Supporting Information
S1 Fig. Collection sites. Map of Brazilian biomes, including the Amazon tropical rainforest (1), the Caatinga xeric shrublands (2), the Cerrado tropical savanna (3), the Pantanal flooded grassland (4), the Mata Atlântica tropical rainforest (5) and the Pampa subtropical grassland (6). The geographic coordinates of the six distinct locations sampled and the respective strains isolated at each site are shown. The isolated strains were deposited in the Collection of Microorganisms and Microalgae Applied to Agroenergy and Biorefineries at Embrapa (Brasília/DF-Brazil). The Brazilian territory is highlighted in black in the map of the neotropical region (inset). (TIF)
S5 Fig. Roadmap for green microalgae DNA barcoding. nuITS2 should be sequenced first and submitted to similarity searches against GenBank. Similarity values obtained must be compatible with the barcode gap thresholds calculated using reference sequences for the taxon indicated (a). The absence of CBCs between the query nuITS2 sequence and its closest match retrieved from the similarity search is necessary to confirm species diagnosis (b). Finally, the current status of the assigned species name must be checked using a reference database (e.g., AlgaeBase) (c). If nuITS2 is not sufficient for a species diagnosis, other markers/methods should be tried (d).