New Insights into the Diversity of Marine Picoeukaryotes

Over the last decade, culture-independent surveys of marine picoeukaryotic diversity based on 18S ribosomal DNA clone libraries have unveiled numerous sequences of novel high-rank taxa. This newfound diversity has significantly altered our understanding of marine microbial food webs and the evolution of eukaryotes. However, the current picture of marine eukaryotic biodiversity may be significantly skewed by PCR amplification biases, occurrence of rDNA genes in multiple copies within a single cell, and the capacity of DNA to persist as extracellular material. In this study we performed an analysis of the metagenomic dataset from the Global Ocean Survey (GOS) expedition, seeking eukaryotic ribosomal signatures. This PCR-free approach revealed similar phylogenetic patterns to clone library surveys, suggesting that PCR steps do not impose major biases in the exploration of environmental DNA. The different cell size fractions within the GOS dataset, however, displayed a distinct picture. High protistan diversity in the <0.8 µm size fraction, in particular sequences from radiolarians and ciliates (and their absence in the 0.8–3 µm fraction), suggest that most of the DNA in this fraction comes from extracellular material from larger cells. In addition, we compared the phylogenetic patterns from rDNA and reverse transcribed rRNA 18S clone libraries from the same sample harvested in the Mediterranean Sea. The libraries revealed major differences, with taxa such as pelagophytes or picobiliphytes only detected in the 18S rRNA library. MAST (Marine Stramenopiles) appeared as potentially prominent grazers and we observed a significant decrease in the contribution of alveolate and radiolarian sequences, which overwhelmingly dominated rDNA libraries. The rRNA approach appears to be less affected by taxon-specific rDNA copy number and likely better depicts the biogeochemical significance of marine protists.


Introduction
In the last decade, 18S rDNA clone libraries have been considered as the gold standard approach for conducting molecular surveys of marine protist diversity in the environment [1,2]. These investigations, almost exclusively performed on the picoplanktonic size fraction (0.2-3 mm), have unveiled high rank taxa such as the so-called MALV (marine alveolates, [3]), MAST (marine stramenopiles, [4]), and picobiliphytes [5], many of which have become cornerstone taxa for microbial ecologists. Diversity surveys of picoplanktonic protists in different marine regions have generated broadly similar patterns [2,6], with dominance of nonphotosynthetic groups, including tiny parasites [7] and grazers [8]. In contrast, epifluorescence microscopy typically reveals a dominance of photosynthetic or mixotrophic cells over heterotrophic cells (ca 80% vs 20%, respectively) in the oceans [9]. This suggests that 18S rDNA clone libraries may give a significantly biased view of diversity. Several technical limitations inherent to culture-independent explorations of microbial diversity have been highlighted [10,11]. Among these, biases during DNA extraction and PCR amplification steps [12], primer selectivity, multiple rDNA gene copy number [13], and the existence of pseudogenes [14] or extracellular DNA [15], are particularly relevant.
Alternative approaches focused on photosynthetic protists have recently been developed to overcome the apparent bias towards heterotrophic cells. These include the construction of clone libraries from flow cytometry sorted populations [16], studies specifically targeting plastid genes [17], and the use of taxonspecific primers [18]. However, PCR biases, rDNA copy number, and extracellular DNA remain as potentially problematic issues with these approaches. A promising alternative which does not require PCR steps is the metagenomic approach, based on direct cloning and shotgun sequencing of environmental DNA.
This strategy was recently used to study prokaryotic life on a worldwide scale (Sorcerer, Global Ocean Survey expedition, [19]). Studies that compared metagenomic and 16S rDNA PCR-based clone libraries demonstrated that these two approaches were complementary for bacterial community analysis [20,21]. With respect to eukaryotic microbes, phylogenetic information present in metagenomic libraries has thus far received very little attention [22]. Another perspective to investigate microbial diversity is to target directly the 18S rRNA (i.e. the ribosomes themselves) as a proxy for both diversity and metabolic activity of cells [23], and to avoid the problems induced by differences in rDNA copy number and the perturbation from dissolved DNA. This approach has been proven to be effective on prokaryotic communities [11,24,25], but to date has only been applied on protist communities in an oxygen depleted environment [26].
In the present study we performed an in-depth analysis of the metagenomic dataset from the GOS expedition, seeking eukary-otic signatures through the presence of 18S rDNA genes. We also compared the protist diversity assessed by 18S rDNA libraries prepared from both environmental DNA and RNA extracted from the same water sample collected in the Mediterranean Sea. We show that overall the PCR induced biases do not appear to impact significantly diversity surveys. Rather we argue that rDNA copy number and extracellular DNA (partially by-product of the size fractionation) are major issues that introduce biases in current studies of protist diversity. Environmental 18S rRNA clone libraries appear to represent a promising means to minimize these biases and thereby offer new perspectives in the study of the diversity and function of marine protist.

Results
Taxonomic composition in 18S rDNA clone libraries versus the metagenomic dataset Taxonomic affiliation of sequences retrieved from PCR amplified 18S rDNA clones libraries performed on the picoplankton size fraction (0.2 to 3 mm) of samples collected in the photic zone around the globe [2] was compared to that of 18S rDNA sequences found in the ,3 mm size fraction of the GOS metagenomic dataset ( Figure 1A). Despite the large differences in the number of sequences analyzed for both datasets, random sub-sampling of the larger dataset demonstrated that the range of expected averaged distributions on a smaller number of sequences matched closely to the distribution observed ( Figure S1). This shows that looking at a limited number of sequences does not affect the diversity observed at the taxonomic level we considered. The clonal representation of the different taxonomic groups in both datasets was significantly correlated (slope 0.78; R 2 = 0.39; p = 0.0165), indicating that both integrated datasets yielded comparable results.
In the clone libraries, out of the 2175 sequences reviewed by Massana and Pedrós-Alió, alveolates dominated the assemblages with 50.3% of the sequences retrieved (most of which were MALV, marine alveolates: 19.2% MALV-II and 16.7% MALV-I).
The second most represented taxon was the stramenopiles, accounting for 20% of the eukaryotic sequences (of which 10.9% were MAST, marine stramenopiles). Prasinophytes and radiolarians accounted for 12.1% and 4.1% of the sequences, respectively. Cryptophytes, haptophytes and picobiliphytes represented 2.9%, 2.4%, and 1.1% of the sequences. The category ''other'', accounting for 7.2% of the sequences, was mainly composed of cercozoans, choanoflagellates and unassigned alveolates. Out of 116 sequences extracted from the GOS metagenomic dataset, the most represented groups were the alveolates (40.5%, of which 24.1% belonged to MALV-II and 5.2% to MALV-I), radiolarians (18.1%), stramenopiles (16.4%, including 12.9% of MAST), prasinophytes (11.2%), and haptophytes (6.9%). Sequences  [2] versus the metagenomic data retrieved from , 3 mm size fraction of the GOS dataset. B. Detail of the metagenomic GOS dataset obtained from two different size fractions ,0.8 mm and 0.8 to 3 mm. C. Comparison of clone libraries performed on the same sample from the Mediterranean Sea (0.6 to 3 mm size fraction) after DNA extraction (62 sequences) and RNA extraction (111 sequences). Actual numbers of sequences affiliated to each taxonomic group used to prepare these graphs are shown in Table S1. doi:10.1371/journal.pone.0007143.g001 affiliated to the picobiliphytes accounted for 1.7% of total sequences.
The 18S rDNA sequences retrieved from the GOS dataset had an averaged similarity of 96.0% with sequences deposited in the GenBank database. The most dissimilar sequences affiliated with marine alveolates (e.g. 80.6% similarity), whereas some were identical to GenBank entries (mostly prasinophytes and the haptophyte Chrysochromulina) and many were .99% similar to GenBank sequences. Closest matches for most GOS sequences corresponded to environmental clones obtained from PCR-based studies (Tables S4 and S5) Taxonomic analysis of distinct size fractions among the metagenomic dataset Of the 116 18S rDNA sequences identified in the metagenomic dataset from the GOS expedition, 69 derived from the ,0.8 mm fraction and 47 from the 0.8-3 mm size fraction. Clearly, both size fractions were capturing a distinct fraction of picoeukaryotic diversity ( Figure 1B), and the percentage of taxonomic groups observed in the two size fractions did not correlate at all (slope 0.18; R 2 = 0.03; p = 0.5523). Considering the smaller size fraction (,0.8 mm), radiolarians contributed 30.4% and stramenopiles 7.2% of the sequences (with 5.8% MAST). The overall contribution of alveolates was 41.9% of the sequences, including 18.8% of MALV-II, 10.1% of dinoflagellates and 7.2% of ciliates. Prasinophytes contributed 15.9% and haptophytes 2.9%. No picobiliphyte sequences were detected. In the larger size fraction (0.8-3 mm) the overall contribution of alveolates remained similar, but there was an increase of MALV-II (31.9% of sequences) and a decrease of dinoflagellates (2.1%) and ciliates (not detected). The contribution of stramenopiles increased drastically to 29.8% (of which 23.4% were MAST) while not a single radiolarian sequence was identified. Prasinophytes decreased to 4.3%, whereas the contributions of haptophytes, chrysophytes, and picobiliphytes increased to 12.8%, 4.3%, and 4.3%, respectively.
18S rDNA clones libraries prepared from DNA and RNA extracts 18S rDNA environmental clone libraries were constructed from DNA and RNA extracts (followed by a reverse transcription) obtained from the same seawater sample ( Figure 1C). Considering the limited number of clones sequenced and previous knowledge for marine samples, the libraries were explored in numbers far from saturation. Nevertheless, obvious patterns could be distinguished and the distribution of diversity observed for the 18S rRNA library is well outside the range of expected values for 18S rDNA libraries. Again, there was no correlation among the clonal percentage of taxonomic groups in the two libraries (slope -0.02; R 2 = 0.00; p = 0.9539). Among the 62 sequences from the DNA based library, 43.5% affiliated to alveolates, 38.7% to radiolarians, and 11.3% to stramenopiles. Most alveolate sequences affiliated with MALV-I (21.0%) or MALV-II (17.7%). Most of the stramenopiles belonged to MAST (i.e. 8.1% of the sequences). Chrysophytes, haptophytes, prasinophytes, and cryptophytes were detected but with a low clonal representation. In the rRNA based library, the diversity observed for the 111 sequences analyzed was drastically different. The contribution of alveolates decreased to 9.9% and the contribution of stramenopiles increased to 64.8% including 45.0% MAST. The contribution of sequences affiliated to haptophytes and prasinophytes increased to 7.2% and 4.5%, respectively. In contrast, the contribution of radiolarians sharply decreased down to 2.7%. The pelagophytes and picobiliphytes, which were not detected in the DNA survey, contributed 8.1% and 4.5% of sequences in the RNA survey, respectively. Also only detected in the RNA-based library, dictyochophytes made up half of the ''other stramenopiles'' category and Telonemia the major fraction of the ''other eukaryotes'' (data not shown).
In each library, Operational Taxonomic Units (OTUs) were defined using a 99% identity threshold (Table 1). Of the 62 and 111 sequences from the DNA and RNA based libraries, 34 and 52 OTUs were identified, respectively. Only 2 OTUs were present in both libraries, one affiliated to MALV-I, and the other to MAST-4. Using a 98% identity threshold, 29 and 46 OTUs were identified for the DNA and RNA based libraries respectively, but only one additional OTU (belonging to chrysophytes) was common to the two libraries. Statistical comparisons performed with LIBSHUFF found a significant difference between the two libraries (p,0.001).

18S rDNA clone libraries and metagenomic surveys give similar diversity patterns
Our analyses of the 18S rDNA sequences retrieved from the metagenomic dataset from the GOS expedition did not reveal substantial differences as compared to the PCR-based environmental clone libraries ( Figure 1A). Both datasets were obtained from a similar size fraction (,3 mm) and correspond to compilations of sequences from various sampling locations and thus represent a reasonable integration of the photic layer in the marine environment. Eukaryotic microbial diversity assessed by means of environmental clone libraries of the 18S rDNA gene has been reported from a variety of ecosystems over the last decade [2,6]. This approach has led to the discovery of eukaryotic taxa such as the MALV and MAST groups that often dominate the community in terms of clonal abundance. Among the technical issues usually invoked to lead to biases in 18S rDNA clone libraries there is the PCR step before the cloning procedure [10,12]. Metagenomic approaches directly clone and shotgun sequence the DNA from a given sample, without prior PCR. The similarity in diversity patterns between the PCR cloning and metagenomic approaches suggests little impact of the PCR step on the outcome of clone libraries in terms of sequence diversity and relative contribution of specific taxa. Our observation is consistent with similar studies on 16S rDNA bacterial diversity that did not find significant differences at high phylogenetic levels between metagenomic and PCR-based libraries [20].

Analysis of GOS size fractions refines our view of actual community composition
Separate analysis of the two size fractions from the GOS dataset revealed clear differences in terms of taxonomic composition ( Figure 1B). As the smallest eukaryotic organism known so far has a cell diameter of 0.8 mm [27], some of the 18S rDNA signatures observed in the ,0.8 mm fraction might indeed derive from very small eukaryotes (like the prasinophytes that appeared mostly in this small fraction, Table S4), but many sequences most likely derive from cell debris or extracellular DNA from larger cells. This is likely the case for radiolarians, dinoflagellates, and ciliates, groups known to contain relatively large nano-and microplanktonic cells, and for which sequences were prominent in the ,0.8 mm fraction and nearly absent from the 0.8-3 mm fraction. Among these groups, the radiolarians were the most intriguing, since these relatively large exoskeleton bearing protists typically represent a significant fraction of 18S rDNA sequences in diversity surveys of marine picoeukaryotes ( Figure 1A). These radiolarian sequences appear highly diverse [28], and most likely derived from larger organisms for which molecular data are not yet available, highlighting the extent of both the unknown diversity in this taxonomic group and filtration artifacts which affect environmental surveys of the smallest size fractions. As suggested in a recent study that investigated the eukaryotic diversity of the ,0.8 mm size fraction in a subset of the GOS dataset (i.e. Sargasso Sea samples) [22], future environmental surveys should target the 0.8-3 mm fraction, which may actually better represent the picoeukaryote diversity.
Several studies have proved the occurrence of extracellular DNA (particulate or dissolved) in water or sediments [29][30][31]. Among this DNA pool, a substantial portion contains high molecular weight molecules and is thought to be derived from eukaryotic organisms [29]. This extracellular DNA is prone to PCR amplification, and genes such as the one coding for the rbcL enzyme have been successfully amplified from particle-free water samples [15]. It is very likely that a fraction of the extracellular DNA is retained onto 0.2 mm filters, through collection of aggregates or molecular adsorption. Consequently, we believe that it is important to consider the interference of extracellular DNA when assessing the diversity of eukaryotic microbes in ecological perspectives.

The RNA approach gives complementary perspectives on marine protist diversity
Diversity assessed by means of libraries prepared after reverse transcription of extracted RNA led to a drastically different view of the community as compared to the classical DNA-based approach ( Figure 1C). It is generally recognised that 18S rDNA diversity surveys are not quantitative with respect to cell abundance [32,33]. Besides PCR biases, the contributions of specific taxa are related to the number of rDNA copies within cells of the taxa. Although rDNA copy number is usually assumed to be correlated with cell size [13,34], for a limited size range (e.g. picoeukaryotes) this number can vary significantly depending on phylogenetic affiliation and is also suspected to be influenced by life strategies of cells (e.g. parasitic, heterotrophic, autotrophic) [2,35]. The effect of taxon-specific rDNA copy number is avoided when analysing extracted RNA. Moreover, extracellular RNA is much less stable than DNA, minimising the problem of amplification from extracellular material. Ribosome content within a single cell is commonly viewed as a proxy of cellular activity status [23,36]. Therefore, 18S rRNA libraries are intentionally skewed to give insights on both diversity and taxon specific activity within protist assemblages [26]. As a flip side effect we might have expected an over representation of the most active taxa. However, both DNAbased and RNA-based libraries contained a high diversity, with comparable ratios of OTUs/number of sequences (Table 1). We found very little overlap in the sequences retrieved in the DNA and RNA libraries. At the 98% identity threshold, only 3 OTUs (ca. 4%) were detected in both libraries, which is rather low compared to the 27% observed in a similar study on anoxic waters [26]. This discrepancy might be explained by a lower sequencing effort done here but also by the selective nature of anoxic waters that might impose stronger constraints on the communities compared with open ocean conditions, implying a lower diversity and therefore a higher overlap between rDNA and rRNA libraries.
The diversity observed by both approaches is clearly not distributed within the same high level taxa, paralleling observations made on prokaryotes or on eukaryotes in an extreme environment [11,[24][25][26]. Some photosynthetic groups such as pelagophytes and picobiliphytes were not detected in the 18S rDNA based library, whereas they contributed notably to the 18S rRNA library ( Figure 1C). The relative contribution of other photosynthetic groups such as the prasinophytes and the haptophytes was also higher in the rRNA library. This might reflect a relatively higher metabolic activity in these photosynthetic taxa at the time of sampling, or may indicate that they have fewer rDNA copies (e.g. Pelagomonas, [13]), so they could be diluted in the environmental DNA surveys by cells with a higher rDNA copy number (e.g. alveolates). Among prasinophytes, cells belonging to the genus Micromonas were identified as being the most active (Table S3), confirming previous studies showing the significance of this genus in coastal ecosystems [37]. Regarding heterotrophic protists, sequences belonging to MAST-3, -4 and -7 appeared as prominent grazers (Table S3), which together with the widespread distribution of these taxa suggest they might actually be the major protistan predators in the oceans [8]. Finally, the most pronounced divergence between both libraries was the contribution of alveolates and radiolarians, which overwhelmingly dominated DNA-based diversity surveys [2]. This perhaps reflects the high 18S rDNA gene diversity and high copy number matching the parasitic life strategy of MALV [7,38] and further supports the putative presence of extracellular radiolarian 18S rDNA in seawater.

Conclusions
Size fractionation, metagenomics, and 18S rRNA libraries bring new perspectives for the understanding of marine picoeukaryotic diversity. In particular, rRNA libraries reduce significantly two of the major biases of rDNA diversity surveys, the rDNA copy number and the occurrence of extracellular DNA, but are in turn skewed towards the active part of the communities. Considering the relative ease of handling ribosomal RNA molecules, extended diversity surveys based on environmental rRNA will undoubtedly provide insights into the ecology of uncultured species. Associated with stronger depth of sequencing (e.g. 454 [39]), this approach will probably help to achieve a nearly exhaustive view of protist diversity and to better appreciate the contribution and function of specific organisms in the microbial food web.

Materials and Methods
Mining the GOS dataset using CAMERA The Global Ocean Survey (GOS) covered a variety of oceanic regions from Nova Scotia to South Africa across the Caribbean, the Panama Channel, the Pacific and the Indian Ocean [19] and data is accessible through the CAMERA database [40]. For the purpose of our analysis, and to compare waters of similar characteristics, only samples from offshore and coastal photic zones were used, whereas samples from environments such as hypersaline lagoons or mangroves were discarded. Seventy two sampling sites, representing a sequencing effort of 14000 Mb, were analyzed for the ,0.8 mm fraction, whereas only 8 sampling sites (850 Mb) were analyzed for the 0.8-3 mm fraction. This demonstrates the primary focus on prokaryotes of the GOS expedition. The fraction ,3 mm recorded in our analysis corresponds to the sum of data retrieved from the two size fractions. We searched for 18S rDNA genes using the eukaryotic specific primers EukA and EukB [41], 528f [42], 336f and 1209f [43] as in silico probes. Sequences were then assigned to specific taxonomic groups after the results of BLAST searches [44]. Chimeras were detected by doing BLAST with different regions of the sequence. Metazoans, marine euryarchaeote group II sequences (obtained with EukA primer), and short (,100 bp) sequences were discarded. We ended up with a total of 116 eukaryotic sequences from this metagenomic survey, with 69 and 47 sequences in ,0.8 mm and 0.8-3 mm size fractions, respectively.

Sampling procedures for the DNA vs RNA clone libraries
Seawater samples were harvested on November, 15 th 2007 in the Mediterranean Sea off Villefranche sur Mer (France). Water was collected with a 12L Niskin bottle deployed successively at 40, 60, 80, 100, 120, and 140 meter depths. After a pre-filtration through a 1000 mm mesh, equal volumes of water from each depth were mixed together in order to obtain an integration of the communities throughout the water column. Then water was gently sieved through 63 mm and 20 mm meshes and filtered through a 3 mm pore size 47 mm diameter polycarbonate filter. For DNA and RNA libraries, around 4 liters of the fraction below 3 mm were filtered onto 0.6 mm pore size 47 mm diameter polycarbonate filters at a rate of 90 ml min 21 . Finally the filters were flash frozen in liquid nitrogen and stored at 280uC until further analysis.

Nucleic acid extractions and clone library construction
For DNA extraction, the filter was cut in small pieces with a sterile razor blade and placed in a 2 ml microcentrifuge tube. Liquid nitrogen was added to the tube and the frozen sample material was disrupted manually with a disposable pellet pestle (Fisher Bioblock), repeating this step four more times. This disruption procedure was followed by DNA extraction with a DNeasy Plant Mini kit (Qiagen) following the manufacturer's recommendations. DNA extracts were stored at 280uC until analysis. For RNA extraction, filters were immersed in RLT buffer (from a Quiagen RNeasy kit) mixed with an equal amount of 0.1 and 0.5 mm glass beads and subsequently vortexed. Then the RNeasy kit instructions for Plants and Fungi were followed.
Quantification of extracted nucleic acids was performed with the Qubit Quantitation platform (Invitrogen). Prior to reverse transcription, a DNase digestion step was performed with DNaseI (Roche Diagnostic) and efficient digestion was controlled by gel electrophoresis. Reverse transcription was performed on pure RNA using the SuperScript II kit (Invitrogen) according to the manufacturer's instructions. The eukaryotic 18S specific EUKB primer [41] was used for the reverse transcription.
Both 18S rDNA genes and 18S ribosomal cDNA were PCR amplified using the same set of primers, 528f [42] and EUKB [41]. Approximately 10 ng of DNA were used as a template in a 50 ml PCR mixture containing 200 mM of each dNTP, 1.5 mM MgCl 2 , 0.5 mM of each primer and 1.25 units of Taq DNA polymerase (Promega) with the PCR buffer supplied with the enzyme. Reactions were carried out in a thermocycler with the following cycle: an initial denaturing step at 94uC for 3 min, 35 cycles of denaturing at 94uC for 45 s, annealing at 55uC for 1 min and extension at 72uC for 3 min, and a final extension step at 72uC for 10 min. In order to check the quality of the RNA extraction, we used the RNA extract digested by DNase as a PCR template. Negative results confirmed the lack of remnant DNA after digestion which could have interfered with the results obtained for the cDNA libraries. PCR products were used for clone library construction. In both cases, three separate fresh PCR products (50 ml) were pooled and cleaned with the Qiagen PCR Purification kit and cloned using the TOPO-TA cloning kit (Invitrogen). Putative positive clones were checked by PCR amplification using the same primer set. PCR reactions showing the right insert size were purified and sequenced with the 528f primer on an ABI Prism 3100 sequencer (Applied Biosystems) at the Station Biologique de Roscoff sequencing facility.
Taxonomic affiliation of the 18S rDNA sequences obtained in this study (between 800 and 950 bp length) and putative chimeras were identified by using BLAST as explained before (data shown in Tables S2 and S3). Among the 113 cDNA clones sequenced 2 were chimeras leaving 111 sequences for further analysis. Sixty seven rDNA clones were sequenced, 2 chimeras were identified, and 3 metazoan sequences (Appendicularia and copepods) were discarded, leaving 62 sequences for further analysis. Operational Taxonomic Units (OTU) at 99% identity threshold were identified and compared among libraries using the DOTUR and SONS programs [45,46]. Statistical comparisons of the two libraries were performed with the webLIBSHUFF tool [47]. Sequences have been deposited in GenBank under accession numbers GQ344621 to GQ344796.

Statistical analysis
Considering the small number of sequences retrieved from our analysis, we wanted to make sure that comparisons between datasets were meaningful. Using R software we calculated the expected distribution of sequences from small size samples compared to a larger reference dataset. The random sub-sampling procedure of 62 and 47 sequences was replicated 1000 times from the Massana and Pedrós-Alió (2008) dataset (2175 sequences) and the GOS dataset (116 sequences), respectively. Standard deviations were calculated for each taxonomic group considered and comparisons between observed and expected datasets were plotted ( Figure S1).
Correlations were performed with the statistical package JMP 5.0.1a to evaluate the degree of divergence between paired datasets and estimate the impact of PCR approaches ( Figure 1A), size fractionation ( Figure 1B), and 18S rDNA versus 18S rRNA clones libraries (Figure 1C), on environmental diversity surveys.