Probing the fabric of the rare biosphere

The relatively recent development of high-throughput sequencing (HTS) techniques has revealed a wealth of novel sequences found in very low abundance: the rare biosphere. Today, most studies of diversity of microorganisms are carried out almost exclusively with HTS techniques. However, culturing seems indispensable for diversity studies especially if the aim is exploring the rare biosphere. We have carried out a deep (1 million sequences per sample) pyrosequencing analysis of two marine bacterial samples and isolated a culture collection from one of them. We have shown that the collectors curves of the pyrosequencing data were close to reaching an asymptote. We have estimated the total richness of the samples and the sequencing effort necessary to obtain the total estimated richness experimentally. Comparing the pyrosequencing data and the isolates sequences we have found that isolation retrieves some of the rarest taxa and that the composition of rare taxa follows an annual succession. We have shown that increasing the number of tags sequenced would slowly uncover the isolated bacteria. However, even if the whole bacterial diversity was found by increasing the sampling depth, culturing would still be essential for the study of marine bacterial communities, especially if the target is the rare biosphere.

Abstract: The relatively recent development of high-throughput sequencing (HTS) 26 techniques has revealed a wealth of novel sequences found in very low abundance: the rare 27 biosphere. Today, most studies of diversity of microorganisms are carried out almost 28 exclusively with HTS techniques. However, culturing seems indispensable for diversity 29 studies especially if the aim is exploring the rare biosphere. We have carried out a deep (1 30 million sequences per sample) pyrosequencing analysis of two marine bacterial samples and 31 isolated a culture collection from one of them. We have shown that the collectors curves of 32 the pyrosequencing data were close to reaching an asymptote. We have estimated the total 33 richness of the samples and the sequencing effort necessary to obtain the total estimated 34 richness experimentally. Comparing the pyrosequencing data and the isolates sequences we 35 have found that isolation retrieves some of the rarest taxa and that the composition of rare 36 taxa follows an annual succession. We have shown that increasing the number of tags 37 sequenced would slowly uncover the isolated bacteria. However, even if the whole bacterial 38 diversity was found by increasing the sampling depth, culturing would still be essential for 39

Introduction 42
The question of how many species of living beings are there on Earth has intrigued 43 ecologists and evolutionary scientists for decades (May, 1988;Erwin, 1991). One of the most 44 recent estimates considered a number of 8.7 million species, but excluded bacteria and 45 archaea from the estimate due to our ignorance of the microorganisms (Mora et al., 2011). 46 The International Census of Marine Microbes attempted to map the diversity of microbes in 47 the oceans with novel high throughput sequencing techniques (Amaral-Zettler et al., 2010) 48 but a global estimate was not attempted. Some estimates for marine bacterial species range 49 from 10 4 to 10 6 based on different assumptions (Curtis et al., 2002;Hagström et al., 2002). 50 Such a range of values, spanning several orders of magnitude, shows that we are far from a 51 reasonable estimate. 52 53 Traditionally, bacteria were isolated in pure culture and then characterized biochemically 54 and genetically until a new species could be formally described. It was realized that the 55 bacteria able to grow in culture media were a small fraction of the bacterial cells that could be 56 directly counted on a filter, a discrepancy named the "great plate count anomaly" (Staley & 57 Konopka, 1985). Different studies estimated that only about 1% of the cells in natural waters 58 window of 50 bp and a minimum average quality score of 28. Additionally, denoising was 138 used to reduce the amount of erroneous sequences (Quince et al., 2011). The final number of 139 tags was reduced after this processing to 500 262 for the surface sample and to 574 960 for 140 the bottom sample (Table 1). The sequences were then clustered into Operational Taxonomic 141 Units (OTUs) based on the relatedness of the sequences (97% similarity) with UCLUST. 142 Afterwards, a representative sequence from each OTU was selected. To identify potential 143 chimera sequences, the dataset was subjected to the ChimeraSlayer implemented in Mothur 144 (Schloss et al., 2011). Then, taxonomy assignment was made with QIIME by searching the 145

Isolation of bacterial cultures 162
Isolates were obtained on board by plating 100 µl of undiluted and 10x diluted sea-water 163 from the surface sample, in triplicates, onto modified Zobell agar plates (i.e. 5 g peptone, 1 g 164 yeast extract and 15 g agar in 1 l of 0.2 µm filtered 75% sea water). Agar plates were 165 incubated at in situ temperature (~20 °C), in the dark, for 14 days. 326 bacterial colonies 166 were selected and the cultures were subsequently purified by re-isolation three times in a 167 month. Next, isolates were grown at 20 ºC on the same liquid medium and stored at -80 ºC 168 with 25% (v/v) glycerol. 200 µl of these cultures were place in 96 well plates, diluted 1:4 and 169 heated (95ºC, 10 min) to cause cell lysis, so available DNA could be used as a template in 170 Polimerase Chain Reactions (PCR). PCR, using Taq polymerase (Boehringer-Mannheim), of 171 the Internal Transcribed Spacer (ITS) were done to select as many different species as 172 possible from the 326 isolates. ITS length is species specific and therefore allows to 173 differentiate the isolates (Fisher & Triplett 1999;Scheinert et al. 1996). ITS amplification 174 was done using primers ITS-F (5'-GTCGTAACAAGGTAGCCGTA) and ITS-R 175 (5'-GCCAAGGCATCCACC) and the following thermal conditions: 94ºC for 2 min, then 32 176 cycles of 94ºC for 15 sec, 55ºC for 30 sec, 72ºC for 3 min, followed by one cycle of 72ºC for 177 4 min and 4ºC on hold. According to their different ITS patterns, 148 isolates were chosen 178 out of 326, including some replicates, and their 16S rRNA gene were then amplified using 179 bacterial 16S rRNA gene primers 27F (5'-AGAGTTTGATCMTGGCTCAG) and 1492R (5'-180 GTTTACCTTGTTACGACTT). The thermal conditions were as follows: 94ºC for 5 min, 181 then 30 cycles of 94ºC for 1 min, 55ºC for 1 min, 72ºC for 2 min, followed by one cycle of 182 72ºC for 10 min and 4ºC on hold. Nearly the full-length 16S rRNA gene (aprox. 1 300 bp) 183 was sequenced in GENOSCREEN (Lille Cedex, France). Taxonomical assignment was done 184 by BLAST searches in the National Center for Biotechnology Information (NCBI) website. 185 The 16S rRNA sequences have been deposited in EMBL with accession numbers LN845965 186 to LN846112. 187

188
In addition to the SUMMER culture collection obtained for the present study, we also 189 used three previously isolated collections for comparison. The isolation procedures were the 190 same as above, but the samples were collected at the Blanes Bay Microbial Observatory 191 (BBMO). Station D sampled in the present work is about 100 km offshore from the BBMO. Estimates of total richness were calculated using two packages of the free software R (R Core 201 Team, 2013). The "vegan" package (Oksanen et al., 2013) was used for non-parametric 202 (S Chao , Abundance-based Coverage Estimator [ACE]) estimations. The "drc" package (Ritz & 203 Streibig, 2005) was used to estimate richness by fitting the species accumulation curves to a 204 mathematical function. We fitted several mathematical functions to our data (Flather, 1996). 205 The best fits were given by Michaelis-Menten, Rational, and Weibull Cumulative functions. 206 The three parameter Weibull Cumulative function was the best of all according to the 207 coefficient of determination (R 2 ), the residual sum of squares (RSS), and Akaike's 208 information criterion (AIC) (Supplementary Table 1 where S is the total number of OTUs (richness). Diversity and evenness of each sample were 227 calculated using the "vegan" package (Oksanen et al., 2013). 228 Rank-abundance plots of the isolated cultures and the 454 pyrosequencing data were done 229 using the "BiodiversityR" package (Kindt & Coe, 2005) and the accumulation curves using

Comparison of 454-pyrosequencing tags and isolates 233
Comparison between isolates and 454 tag-sequences was done running BLASTn locally. 234 Thus, the isolate sequences were searched for in the 454 tag-sequence datasets and vice 235 versa, and only the reciprocal matches between these two searches were considered. The 236 output was filtered using R (R Core Team, 2013) according to a 99% of identical nucleotide 237 matches, 75%-100% of coverage of the isolate sequence and a bit-score higher than 100. In 238 all the cases the e-value was lower than 0.0001.

Pyrosequencing dataset 257
Richness (S), computed as the total number of OTUs, was higher in the bottom (4 460) 258 than in the surface (1 400) sample (Table 1). In both samples only ~17% of the OTUs were 259 singletons (an OTU represented by a single sequence) (Table 1). Evenness (J') and diversity 260 (H' and D) were also higher in the bottom than in the surface sample (Table 1) Table 1). None of the estimates were significantly different from each 264 other (z-tests on the differences, all P > 0.05). The mathematical function that best fitted the 265 species accumulation curves was the Weibull Cumulative function (99% model efficiency or 266 R 2 =0.99) (Figure 1). Fitting the species accumulation curve allowed both calculation of the 267 total richness of the samples by extrapolation (as explained above), and estimation of the 268 sequencing effort necessary to obtain the total estimated richness experimentally (Table 1). 269 Our estimates indicated that the sequencing effort necessary to retrieve 99% and 99.9% of the 270 total estimated richness should be 4 to 8 times higher that the one applied in this study (Table  271 1). These numbers are approximately twice higher that the estimations obtained from the 272 Erythrobacter citreus, isolated 37 times, while 17 species were isolated only once. A rank 285 abundance plot of the 38 species can be seen in Fig. 3. The isolates belonged to the phyla 286 Actinobacteria (4 isolates), Bacteroidetes (4 isolates) and Firmicutes (2 isolates) and to the 287 Proteobacteria classes Alpha-proteobacteria (18 isolates) and Gamma-proteobacteria (10 288 isolates). The names of all the isolates are shown in Table 2 and supplementary Table 2. 289 290

Comparison of isolates and sequences 291
Only 14 (37%) of the 38 different species were found in the 454 tag-sequence datasets: 292 one Actinobacteria, two Bacteroidetes, two Firmicutes, four Alpha-proteobacteria and five 293 Gamma-proteobacteria isolates ( Figure 3, Table 2). Surprisingly, the number of cultures 294 found in the 454 tag-sequence dataset was higher in the sample collected at 2 000 m (37%) 295 than in the surface sample (24%), even though the latter was the sample used for isolation of 296 the bacterial cultures ( Figure 3, Table 2). Nine species were found in the sequences from both 297 samples (maroon in Figure 3), 5 were found in the bottom sample only (green in Figure 3) 298 and 24 were not found in either sample (empty symbols in Figure 3). Practically all the 454 299 tag-sequences that matched the sequences from the isolates belonged to rare OTUs (<1% of 300 the total tags). Only the OTU matching the isolate Alteromonas macleodii str. 'Balearic Sea 301 AD45' (Gamma-Proteobacteria) was somewhat abundant (1.3%) in the bottom sample 302 (Table 2). Further, all the matching sequences made a larger percentage of the assemblage at 303 the bottom sample than at the surface sample. 304

305
In order to examine the rate of appearance of cultured species in the sequence database, 306 we calculated a kind of "rarefaction curves", by randomly resampling either the OTUs or the 307 454-pyrosequecing tags in 5% increases, and checking how many cultured species had been 308 found in each case (Figure 4). These curves showed that the rate of appearance of the isolates 309 in the 454 tags datasets as the sequencing depth increased was slower for tags than for OTUs 310 but in neither case was an asymptote reached ( Figure 4). The rate of increase of retrieved 311 isolates versus OTUs was constant, but the rates were different at the two depths: one new 312 isolate was retrieved every 50 OTUs in the surface sample and every 117 OTUs in the bottom 313 sample ( Figures 4A-B). The increase in retrieved isolates with respect to the number of tags 314 showed a nonlinear response, implying that the sampling effort had to increase as the number 315 of isolates increased (Figures 4C-D). 316 317

Comparisons across different times and locations 318
Three additional culture collections, isolated in the same way as the collection described 319 above (see material and methods), were available for comparisons with 454-pyrosequencing 320 data (A in Table 3). In addition to the 454-pyrosequecing dataset obtained in the present 321 study, 454 sequences from the MODIVUS cruise in September 2007 were also available (see 322 material and methods). These different datasets offered the possibility to compare how 323 effective was the retrieval of isolated species in 454 sequence datasets at different time and 324 space scales (B and C in Table 3). The number of cultures from each collection (columns 2 325 and 4 in Table 3) that could be compared to the two 454 datasets (B and C in Table 3) was 326 different because the 16S rRNA fragments sequenced were different (see material and 327 methods). 328 329 Column 3 (Table 3) shows that the percent of cultures retrieved was much higher from the 330 simultaneously taken culture collection (37%) than form those taken years before (~ 10%). 331 Likewise, column 5 (Table 3) Table 3) or if only the sequences from BBMO are used for the 341 comparison (column 6 in Table 3). For the three BBMO collections the percent increased 342 when a larger area was considered. The same can be observed in columns 5, 7 and 8 ( Table  343 3) for the SUMMER culture collection: when only the sequences from the open sea station 344 were included, the percentages recovered were 24 (for the surface sequences) and 34 (for the 345 bottom sequences). When the whole MODIVUS dataset was included, 74% of the cultures 346 were retrieved. Surprisingly, this percentage is larger than that of the SUMMER 454 dataset. 347 This occurred despite the fact that the SUMMER data set had many more sequences 348 (1.7x10 6 ) than the MODIVUS dataset (3x10 5 ). 349 350 Discussion 351

Estimates of richness. 352
A large proportion of the microbial world is invisible to traditional cultivation approaches 353 but it can be accessed using molecular tools (DeLong, 1997). The first cloning and Sanger 354 sequencing approaches revealed a wealth of new taxa in the oceans (Giovannoni et al., 1990). 355 The number of clones examined however, was still too limited and there was a discrepancy an extraordinary depth (on the order of 10 7 Illumina tags) and we also found a good fit 396 (R 2 =0.99). The fit was better when singletons were included in the analysis (the AIC value 397 was lower, supplementary Fig. 1). Therefore, the Weibull cumulative function will be a very 398 useful tool to characterize the species accumulation curves and to obtain robust richness 399 estimates, at least for well-sampled bacterial communities. 400

401
The high abundance of the most abundant OTU in the surface sample (Figure 2) may have 402 caused less OTUs to be uncovered, forcing the richness to appear lower at the surface sample 403 than at the bottom sample. However, higher richness as well as higher diversity at the bottom 404 than at the surface have been reported before for the sampling area (Pommier et al., 2010). 405 Also, the high number of sequences examined and the richness estimations (Table 1)  10% were "unclassifiable (reads too short)" (see their Figure 1C). Thus, only about 40% of 446 the OTUs could be classified. Again, this very low figure is intriguing. The short read length 447 cannot be the reason, because in our earlier studies, we used exactly the same methodology 448 and read length as for the seasonal samples from the English Channel or the world ocean 449 samples from ICoMM and yet, 98.9% of our OTUs could be assigned a taxonomy at least at 450 the Phylum level and most of the time at the Class level. Obviously, different criteria for 451 assignment of identity can be responsible for this, but the 10% of sequences labeled as 452 "unclassifiable (reads too short)" should probably have been discarded. 453

454
We used the data from their deep sequenced L4 sample to try the Weibull fit and found a 455 good fit, both with and without the singletons. In both cases the number of OTUs was one 456 order of magnitude higher than our estimates (40 000 and 100 000 OTUs without and with 457 singletons respectively). It is difficult to integrate these very different estimates of richness. approaches we have found that isolation retrieves some of the rarest taxa since only 24 to 483 37% of the isolates were found in the 454-pyrosequencing data and, moreover, they were 484 found in extremely low abundance (Figure 3, Table 2). 485

486
In principle, the low percentage of coincidence might have been due to potential PCR bias 487 and differential DNA amplification of the sequencing techniques (Berry et al., 2011;Pinto & 488 Raskin, 2012). However, when tested in silico, the primers used for pyrosequencing covered 489 the whole diversity captured by the primers used for Sanger sequencing of the isolates and 490 therefore, a PCR bias affecting the diversity found using both methods was unlikely.  (Table 1) should harbor the isolates not found with the sequencing depth applied in this 503 study. Increasing the number of tags sequenced would slowly uncover the isolated bacteria in 504 a logarithmic way (Figure 4). Even if the whole bacterial diversity were found by increasing 505 the sampling depth, culturing would still be essential for the study of marine bacterial 506 communities, especially if the target is the rare biosphere (Donachie et al., 2007). 507 508 Taking advantage of other datasets previously collected from the same area of our 509 sampling, we were able to compare 454-pyrosequecing data and cultured isolates from 510 different times of the year and stations along a horizontal transect and a vertical profile 511 (Table 3). The percentage of cultured isolates sequences found in the 454-pyrosequecing 512 datasets decreased as the time between the collection of samples for isolation and 513 pyrosequencing increased; i.e., cultures isolated from the same sample as the pyrosequencing 514 data were found more easily in the pyrosequencing dataset than the cultures collected further 515 away in time or space (Table 3). The consistency of this pattern within the two datasets, Similarly, Lekunberri et al. (2014) found that sequences from cultures obtained from one 520 season of the year tended to cluster together with environmental sequences obtained at the 521 same time of the year. These two findings suggest that the composition of rare taxa also 522 follows an annual succession, supporting the conclusion that the environment determines the 523 marine bacterial composition, not only of the abundant taxa but also of the rare members of In conclusion, by obtaining 10 6 tags per sample and fitting a Weibull cumulative function 530 we have been able to obtain a robust estimate of the richness of the bacterial assemblages in 531 two samples at the surface and deep Mediterranean Sea. This richness is of the order of 3 000 532 to 5 000 taxa. The comparison with cultures shows that many of the isolates are found deep 533 within the rare biosphere, and we have determined the sequencing effort necessary to retrieve 534 them. HTS and culturing appear as two complementary strategies to probe the fabric of the 535 rare biosphere. 536 537

Acknowledgements 538
We thank the crews and scientists in cruises Modivus and SUMMER, both on the RV 539 García del Cid, supported by the Spanish MICINN grants CTM2005-04795/MAR and 540 CTM2008-03309/MAR respectively. 541 We thank F. M. Cornejo-Castillo for his advice on the method for isolates' differentiation. 542 E. Sa and V. Balagué help with bacterial culturing and PCR work is highly appreciated. 543 B. G. C. was supported by a Juan de la Cierva contract from the Spanish "Ministerio de 544 Ciencia e Innovación". Research was funded by the Spanish "Plan Nacional de Investigación 545 pyrosequencing dataset, and the white circles indicate the cultures that were not found in any 707 of the 454-pyrosequencing datasets. A list of the isolated bacterial species can be found in 708 Table 2 and supplementary Table 2. 709 Table 1. Summary of location and depth (m) of samples, total sequences before (Raw Tags)  Cumulative three parameter function (S WeibullFit ) which model efficiency was 99% for the 734 surface and the bottom samples fitting (see Fig. 1