Detection of Large Numbers of Novel Sequences in the Metatranscriptomes of Complex Marine Microbial Communities

Background Sequencing the expressed genetic information of an ecosystem (metatranscriptome) can provide information about the response of organisms to varying environmental conditions. Until recently, metatranscriptomics has been limited to microarray technology and random cloning methodologies. The application of high-throughput sequencing technology is now enabling access to both known and previously unknown transcripts in natural communities. Methodology/Principal Findings We present a study of a complex marine metatranscriptome obtained from random whole-community mRNA using the GS-FLX Pyrosequencing technology. Eight samples, four DNA and four mRNA, were processed from two time points in a controlled coastal ocean mesocosm study (Bergen, Norway) involving an induced phytoplankton bloom producing a total of 323,161,989 base pairs. Our study confirms the finding of the first published metatranscriptomic studies of marine and soil environments that metatranscriptomics targets highly expressed sequences which are frequently novel. Our alternative methodology increases the range of experimental options available for conducting such studies and is characterized by an exceptional enrichment of mRNA (99.92%) versus ribosomal RNA. Analysis of corresponding metagenomes confirms much higher levels of assembly in the metatranscriptomic samples and a far higher yield of large gene families with >100 members, ∼91% of which were novel. Conclusions/Significance This study provides further evidence that metatranscriptomic studies of natural microbial communities are not only feasible, but when paired with metagenomic data sets, offer an unprecedented opportunity to explore both structure and function of microbial communities – if we can overcome the challenges of elucidating the functions of so many never-seen-before gene families.


Introduction
DNA sequence based metagenomics has become a standard tool for the analysis of natural microbial communities in marine environments [1,2,3]. It involves the sequencing of random community DNA from environmental samples and subsequent determination of taxonomic and protein-encoding gene diversity. However, questions of how natural bacterial assemblages respond to perturbations in environmental conditions, are better answered by analysis of community mRNA than genomic DNA [4].
Historically, metatranscriptomic studies have involved either the use of microarrays [5] or mRNA-derived cDNA clone libraries [6]. These approaches have produced significant insight into the metatranscriptome of different communities but have limitations when exploring the diversity of natural communities. Firstly, a microarray only gives information about those sequences for which it was designed and it is usual to screen for gene sequences that are already known (e.g. from a gene-library or metagenomic sources). Secondly, although transcript cloning avoids this problem through the random amplification and sequestering of environmental mRNA fragments, it introduces other biases; e.g. any cloned transcripts that encode toxic products or titrates host DNAbinding factors will skew the relative abundance of sequences.
More recently, the first metatranscriptomic studies using highthroughput sequencing technology (pyrosequencing) have been published [7,8,9]. Two studies of soil communities have sequenced total RNA for the purpose of exploring both community structure, through the analysis of ribosomal RNA (rRNA), and community function, through the study of mRNA [7,8]. The first study of a marine microbial community metatranscriptome focused on mRNA analysis and achieved an enrichment of ,50% mRNA [9].
Ideally, if the study of mRNA is the prime purpose of a metatranscriptomic study, further enrichment is desirable. Here we present a study of a complex marine microbial metatranscriptome enriched to 99.92% mRNA [10]. Metatranscriptomes were generated from 4 samples taken at two time points in a replicated mesocosm (11,000 liters) study involving an induced phytoplankton bloom [10]. Pyrosequencing technology (GS-FLX pyrosequencer) was used to generate four metatranscriptomes and four corresponding metagenomes with an average sequence length of 215 bp from the middle and end time point in the phytoplankton blooms. This experiment provided an opportunity to obtain replicate samples to explore the use of this approach for detecting changes in the expression of genes over time. The primary focus of the mesocosm experiment was to study the response of marine microbes to the increase in ocean acidification that is resulting from dissolution of anthropogenic CO 2 [10] and these results will be described in detail elsewhere. The immediate purpose of this study was to 1) demonstrate the feasibility of obtaining highly enriched samples of mRNA (.90%) from these communities, 2) determine whether differences in expression could be identified between time points of this controlled experiment [10], and finally 3) to determine what proportion of the most highly expressed genes using such a methodology might be novel.

Sampling, cDNA synthesis and sequencing
Water samples were obtained from a replicated mesocosm study (two treatments, each in triplicate) established in coastal waters of a fjord close to Bergen, Norway (60.27uN: 5.22uE). Each mesocosm contained 11,000 L of coastal water and two of the six mesocosms were sampled for this study. To induce the phytoplankton bloom, nitrate and phosphate were added. Water samples were taken at the peak and immediately following the collapse of the phytoplankton bloom from both a high CO 2 and control mesocosm.
The nucleic acid extraction methodologies are briefly outlined in Gilbert et al. [10] but are fully described here. To isolate DNA and RNA, 15 L of water from each sample was filtered through a 140 mm diameter, 1.6 mm GF/A filter (Whatman), to reduce eukaryotic cell abundance and maximize the proportion of prokaryotic cells. This filtration took only 3 minutes and the filtrate was applied directly to a 0.22 mm Sterivex filter (Millipore) to allow rapid filtration of samples (,15 minutes per sample) to limited mRNA degradation. Following filtration, each Sterivex was pumped dry, frozen in liquid nitrogen and stored at 280uC until extraction. Total nucleic acid extraction was performed on each Sterivex using the method of Neufeld et al. [12]. Throughout the protocol nuclease-free plastic consumables and DEPC-treated water and reagents were used to limited degradation of mRNA. Following extraction, total nucleic acids were eluted in 200 ml of nuclease-free water.
For metagenomic analysis, 100 ml of the total nucleic acid extraction was purified for DNA by treatment with RiboShredder TM RNase (Epicenter) following manufacturer's instructions. Purified metagenomic DNA was quantified by nano-litre spectrophotometry, diluted with nuclease-free water to 500 ng ml 21 and then stored at 280uC until pyrosequencing.
For metatranscriptomic analysis, 100 ml of the total RNA was purified using the RNA MinElute TM clean-up kit (Qiagen); bmercaptoethanol was added to the RLT buffer. Approximate RNA concentration was determined by nano-litre spectrophotometry and checked for rRNA integrity using an Agilent bioanalyser (RNA nano6000 chip). Average RNA concentration was 2.4 mg ml 21 . The integrity of rRNA was demonstrated by highly defined, discrete rRNA peaks, with the 23S rRNA peak being 1.5-2 times higher than the 16S rRNA peak. Fully intact rRNA is essential for subtractive hybridization because degraded rRNA molecules will not be fully subtracted from the total RNA pool.
DNA contamination was removed from total RNA samples by treating with the Turbo DNA-free enzyme (Ambion). 75 mg of purified total RNA was applied to the subtractive hybridization method (Microbe Express Kit, Ambion) to remove rRNA from the mRNA. Purified mRNA was eluted in 25 ml of TE buffer (10 mM Tris-HCl pH 8.0, 1 mM EDTA) and was further purified with the MEGAclear TM kit (Ambion) to remove small RNAs and small contaminants. Purified mRNA was eluted in 10 ml of nuclease free water and stored at 280uC until further analysis. 0.5 ml of the purified mRNA was then checked using the Agilent bioanalyser for removal of genomic DNA and ribosomal RNAs. The mRNA concentration was estimated using the Agilent bioanalyser software to average 450 ng ml 21 .
mRNA was estimated to be approximately 8% of total RNA isolated. 9.5 ml of the purified mRNA was then applied to a reverse transcription reaction using the SuperScriptH III enzyme (Invitrogen) with random hexamer primers (Promega). The cDNA was treated with RiboShredder TM RNase Blend (Epicentre) to remove trace RNA contaminants. To improve the yield of cDNA, 1 ml of each sample was subjected to random amplification using the GenomiPHI TM V2 method (GE Healthcare) yielding approximately 4 mg of cDNA. GenomiPHI technology produces branched DNA molecules that are recalcitrant to the pyrosequencing methodology. Therefore amplified samples were treated with S1 nuclease using the method of Zhang et al. [13]. DNA and cDNA were nebulized to produce an average size of 500 bp, then cleaned with AMPure beads (Agencourt) and sequenced using the 454 Corporation's GS-FLX instrument at the NERC-funded Advanced Genomics Facility at the University of Liverpool (http://www.liv.ac.uk/agf/). Extraneous sequences resulting from .1 template molecule per picotitre well were removed from the datasets ( Table 1) as they include exact duplicates and failed sequences that are replete with uncharacterized nucleotides. Metatranscriptomic and metagenomic data sets were deposited in NCBIs Gene Expression Omnibus (GEO, http://www.ncbi. nlm.nih.gov/geo/) and are accessible through GEO Series accession number GSE10119. All data is also deposited with the Short Reads Archive (NCBI) under accession number SRA000266. These datasets are also available with richer annotates in ISATAB format [14] compliant with the ''Minimum Information about a (Meta) Genome Sequence'' (MIGS) specification [15].
Clustering of DNA and mRNA and prediction of partial ORFs (pORFs) Clustering analysis was performed on the raw reads and translated peptide sequences (see below) using the CD-HIT package [16]. The reads from all eight samples were clustered together with CD-HIT-EST program. Sequences were clustered if the identity was $95% (++ or +2 strand) and the length of the alignment was $40 bp and $80% length of the shorter sequence. The clustering results show the internal structure of the combined dataset including number of non-redundant sequences, distribution of clustering, number of singletons, etc. The same analysis was applied to each individual sample by counting only the sequences from that sample. Results are shown in rows 5-11 in Table 1.  Size, clustering and annotation data were generated by CAMERA. rRNA and subsystem hits were generated by SEED. ORFs (including pORFs) were then predicted. Since genes cannot be reliably predicted from such short reads, we applied the methodology used in the Global Ocean Survey (GOS) study [2,3], calling pORFs from all six reading frames. As the current study had overall shorter reads than the GOS study (average of 215 bp instead of 822 bp) pORFs had to contain at least 30 amino acids. In total 3,026,200 pORFs were detected. The approach of six reading frame translation can result in many non coding (shadow) pORFs, or spurious pORFs. However, this is less likely with short sequence data because the translations from non-coding frames are usually too short (due to random occurrence of stop codons) to rank as pORFs using our selected cut-off threshold.
The protein gene coding density, according to the most recent NCBI RefSeq database for microbial organisms (ftp://ftp.ncbi.nih. gov/refseq/release/release-statistics/RefSeq-release27.01062008. stats.txt), is about 0.25 million amino acid per 1 million base pairs (bp). This study, which obtained 162 million amino acids from 323 million bp of sequence, shows only 50% of these pORFs were spurious. Clustering of pORFs can further help to exclude spurious pORFs which are more likely to remain singletons.
The pORFs were clustered with two-step CD-HIT runs. At the first step, pORFs were clustered at 95% identity over 80% of sequence length in order to identify non-redundant sequences. The non-redundant sequences were further clustered at 60% identity, over 80% of sequence coverage, to find clusters of homologous pORFs or protein families (see row d, e, f in Table 1).
Here, we only use the non-redundant sequences to count the size of each cluster so that the large clusters reported in row g in Table 1 contain diverse sequences. The same clustering techniques were also applied to the data from the metatranscriptomic study of Frias-Lopez and colleagues [9] (Table S1).

Dividing pORFs into 'predicted, 'spurious' and 'putative'
The clusters of pORFs were annotated by comparison to the PFAM database (http://PFAM.sanger.ac.uk/) by Hmmer, TIGRfam database (http://www.tigr.org/TIGRFAMs/) by Hmmer and the COG database (http://www.ncbi.nlm.nih.gov/COG/) by RPS-BLAST (reversed PSI-BLAST). All analyses were annotated with an expect value cut-off of 0.001. Hmmer analysis was performed in fragmental mode, and each hit also had to pass the TC score. The pORFs with significant matches to these reference databases were confirmed as genes, while the pORFs that overlapped with them from a different reading frame (the shadow pORFs) were deemed spurious pORFs. From this final analysis of the 3,026,200 pORFs, 494,253 could be confirmed as predicted proteins, 459,150 excluded as spurious pORFs, and the remainder (2,072,797) marked as ''putative proteins''. The combined predicted and putative proteins were used for subsequent analysis.

PCR detection of dominant orphan gene clusters in environmental DNA and mRNA
To validate the presence of highly expressed orphaned sequence clusters in the environment we randomly selected 27 of the most highly expressed nucleotide clusters. It was necessary to establish the presence of these sequences in both original DNA samples and cDNA samples to show they were not artefacts of cDNA amplification by GenomiPHI. To further cluster the 609 dominant nucleotide clusters (.100 sequences per cluster) for the purpose of designing PCR primers, all clusters were re-clustered at 95% identity over at least 40 base pairs. This allowed sequences with small 59 or 39 overlaps to be clustered together and increased the probability that the sequences assayed represented different transcripts. This reduced the number of dominant clusters from 609 to 85 (Table S2) and resulted in a significant increase in the number of clusters with more than 5,000 reads each. The maximum number of sequences in the largest cluster was 31,642 and 15 clusters now contained more than 5,000 reads. This provided a smaller pool of sequences for analysis and reduced the likelihood of amplifying similar sequences.
Primers were designed to screen 27 of these potential transcripts using the batch Primer3 online interface (http://probes.pw.usda. gov/cgi-bin/batchprimer3/batchprimer3.cgi), with the following conservative rules. First, we targeted the 'representative' sequence of each cluster (as opposed to the consensus sequence) to maximize the length of the query DNA sequence and avoid use of chimeric sequence that could have resulting from false assembly of the original 609 clusters. Second, we iteratively explored a range of parameters to find a rule that allows us to automatically create primers (no manual inspection required) for all 85 loci using a single set of optimality criteria that were as stringent as possible. In the end, we took the default parameters of the interface and optimized the following parameters: annealing temp (55uC), overall length of product (100 bp), primer size (20 bp), G+C content (50%) and minimum ''maximum self-complementary''. Exact optimality criteria used for the selection of each batch of primers is available from the authors.
Individual transcript sequences were amplified by PCR from the environmental DNA used for the metagenomic analyses (coextracted with the mRNA used for the metatranscriptomic approach), purified mRNA prior to RT-PCR and cDNA prior to GenomiPHI amplification. Each of the 54 primers (http:// nebc.nox.ac.uk/nebcfs/public/Joint/metatranscript_primers.xls) were diluted to a working concentration of 10 pmol ml 21 . Approximately 10 ng of environmental DNA, cDNA or mRNA was added to a 25 ml PCR reaction with final concentrations of 16PCR buffer (Promega), 2.5 mM MgCl 2 , 0.2 mM deoxynucleoside triphosphates (Invitrogen), 0.4 pmol of each primer, and 1 unit of Taq DNA polymerase (Promega). Negative controls used were Escherichia coli K12 genomic DNA and sterile water. Reactions were cycled with a PTC 1000 thermal cycler (MJ Research) using the following conditions; 94uC for 2 minutes, 30 cycles of 94uC for 1 minute, 55uC for 1 minute, 72uC for 2 minutes, and a final extension of 72uC for 10 minutes. Products were visualised by agarose gel electrophoresis (1.8%).

Results and Discussion
We demonstrate the feasibility of conducting metatranscriptomic studies on RNA samples highly enriched for mRNA from natural microbial communities. This is the first time such a high level of enrichment has been achieved in a metatranscriptomic study ( Table 2). Eight samples, four DNA and four mRNA, were processed producing a total of 323,161,989 bp (117.4 Mbp of mRNA and 205.7 Mbp of DNA). This exceeds previously published metatranscriptomic studies because of the inclusion of replicated samples. By further contrast, this is equivalent to 5.1% of the total bp sequenced, and 19% of the number of reads of the recent Global Ocean Survey (GOS) sequencing effort [2]. Here we present an analysis of these data that confirms the high level of enrichment for mRNA and the high levels of assembly of mRNA sequences compared to the DNA of the metagenomes; we also speculate on the potential coverage of the natural metatranscriptome sampled and discuss potential biases introduced by this methodology and provide evidence against the large-scale generation of mosaics and artefacts by the use of GenomiPHI amplification. We then discuss the proportion of these mRNAs that match protein databases, discuss the most abundant 'known' clusters, and compare the metatranscriptome with the metagenome and with the first published study of a marine metatranscriptome [9].
Determining the proportion of ribosomal RNA remaining in mRNA metatranscriptomic samples Both DNA and mRNA sequences were analyzed using the publicly-available SEED MG-RAST (Metagenome Rapid Annotation using Subsystem Technology, http://metagenomics.theseed.org [17,18]), which compares inputted sequences against a database of metabolic systems from selected organisms. Taxonomic information for the metagenomes within SEED was obtained by comparison against three 16S rDNA databases (the Ribosomal Database Project II (RDP), Greengenes, and the European Ribosomal Database). Although rRNA comprises approximately 80-90% of total RNA in a typical bacterium [19], it averaged only 0.08% of the total number of sequences in the four combined cDNA libraries ( Table 1). The purification was far more efficient than would be predicted for the methodology and capture-probe range of the Microbe Express kit (Ambion) used for the subtractive hybridisation of rRNA. This could be because the 16S rRNA probes used in the subtractive hybridisation technique may hybridise to a more significant proportion of the community than previously considered. While this might lead to a more substantial removal, it cannot explain the near-complete removal seen in this study. A second more likely possibility is that the multiple displacement amplification approach (GenomiPHI) used to amplify the available mRNA, inefficiently amplified rRNA due to its inherent secondary structure that could have inhibited the reaction (GE Healthcare technical services communication). Both of these options should be further tested.

Comparisons of homology between datasets
To determine the similarity of each dataset to each other, total nucleic acids between each database and total partial ORFs (pORFs) between each database were compared to provide an indication of the number of homologous sequences shared between each pair of datasets (Table S3). This demonstrated that each DNA dataset shared approximately 10% to 25% of the nucleic acid sequences and 20% to 33% of the pORF sequences. This suggests that the majority of sequences within each group were unique (singletons) to each dataset; a similar result was seen in the Global Ocean Survey when the metagenomes of different regions were compared [2]. The comparison between mRNA datasets showed a clear delineation between mid-bloom and postbloom, with mid-bloom mRNA sharing ,50% of their nucleic acid transcripts and post-bloom sharing .95% of their nucleic acid transcripts. This result was consistent when the datasets were compared between time points, with ,50% of mid-bloom transcripts being homologous with ,90% of post-bloom transcripts (Table S3). We postulate below that this difference could be due to an over-abundance of viral transcripts in the post-bloom environment causing the metatranscriptomes to become more homogenous.
It was expected that the metatranscriptome from the postbloom environment would be more similar to the metagenome from the post-bloom environment than the mid-bloom samples. The comparison clearly demonstrates this ( Table S3). The midbloom metagenomes also had greater homology to the mid-bloom metatranscriptomes than the post-bloom metatranscriptomes.
Clustering of DNA and mRNA sequences confirms higher levels of assembly of mRNAs and differences between time points To determine the possible level of assembly of sequence clusters, total DNA and mRNA sequences were analyzed using a metagenomic sequence analysis pipeline developed at CAMERA (Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis [20]) (access to this pipeline can be arranged by contacting the corresponding author). The number of unique sequences was calculated by clustering un-assembled sequence reads as described in the Materials and Methods [16].
As shown in Table 1 an average of 79% of the DNA-derived metagenome sequences from both mid-and post-bloom samples were unique (singletons). This confirms the low level of coverage of the genomes in this sample and the high diversity. In contrast, the mRNA-derived sequences showed much higher levels of clustering, with an average of 45% of the sequences from the mid-bloom (time point 1) and only 9.5% of the post-bloom sequences (time point 2) being unique ( Table 1) (calculated by dividing the total number of unique sequence clusters by the total number of Table 2. Comparison of methods described by current manuscript with the three most recent methods for analysing microbial metatranscriptomes. Leininger et al [7]; Urich et al [8] Frias-Lopez et al. [9] Gilbert et al [10] (and this study) sequence reads). Strikingly, only five of the 609 largest nucleotidelevel mRNA clusters (those with $100 sequences) had an observed match in any of the DNA metagenomes. This low level of homology between mRNA sequences and the community DNA metagenome was previously noted [9] and is expected given the sparse sequence coverage for the much larger metagenome. Alternatively, this could be an overestimate of the lack of homology as it is possible for any of the RNA clusters to actually come from the same transcript. For example, an mRNA cluster from the first 25% of a given gene (average gene length ,950 bp [21]) would appear to have no match even if the DNA library captured the other 75% of the gene. Unfortunately, there is no way to resolve this issue given the small size of sequences currently generated with pyrosequencing methodology.
To directly compare the level of assembly (diversity) in the eight samples, individual rarefaction was used to normalize the number of clusters to an equivalent sampling effort (i.e. that of the smallest sample) ( Table 1). The number of clusters in all four DNA samples was surprisingly uniform and was double the number of clusters from the mid-bloom and nine times from the post-bloom mRNA samples. These results show that the metatranscriptome is smaller than the metagenome, assembles better, and that the expression of genes is different for mid-bloom and post-bloom communities. Of these an average of 0.23% of mid-bloom and 2.3% of post-bloom transcript clusters included more than 100 sequences. In other words, the transcription profile became more homogenous in the post-bloom situation.
Based on this clustering, we compared the total number of clusters found to the total number expected within a given water sample. To generate a rough estimate of potential metatranscriptome coverage, we used the approach of Poretsky et al. [6] to estimate that each water sample contained ca. 80,000 unique transcripts. This estimate is based on the observed number of dominant taxa and bacterial abundance (data not shown). This is the same order of magnitude as the number of unique mRNA sequences identified ( Table 1) suggesting that this study may have achieved a reasonable coverage of the community metatranscriptome (in comparison, the metagenomes were vastly under-sampled). This is clearly an upper estimate, and given that the top 609 nucleotide clusters could be collapsed with less conservative clustering criteria into 85 larger clusters (see Materials and Methods), the actual number of transcripts could be 7-fold or more lower. Indeed the real value could be even lower, since one functional transcript may be coded for by more than one cluster ( Table 3). We have previously shown this for another gene, phnA, that encodes phosphonoacetate hydrolase; the phnA from one organism had twelve hits within the metagenomic data, which were spread out over the gene [10]. Using the clustering methodology outlined here, this method would have identified this one gene as belonging to six different clusters due to overlap between the 12 sequences.

Evidence against potential biases in detecting naturally occurring mRNA clusters introduced by GenomiPHI
There are two key biases that might be introduced using the methodology presented here. Firstly, the time required to concentrate the community by filtration is longer than the halflife of mRNA, but this is true of most methods used to analyze the metatranscriptome of aquatic samples [6]. Recent studies however, have used smaller volumes (e.g. ,1 L [9]), and the current methodology would still be effective using these smaller volumes. However, in the current study this methodology was run concomitantly with other analyses that required a significant amount of DNA, e.g. fosmid library production [10].
Secondly, amplification of cDNA using GenomiPHI could introduce artefactual sequences (although evidence of such a bias for transcriptome amplification does not exist [22]). Such artefacts could include mosaic or artefactual sequences that could explain the large number of orphan transcripts found in this study. We therefore performed four types of subsequent analyses to attempt to validate these clusters.
Firstly, to generate empirical evidence of the presence of these clusters in the original water samples and to test for chimeras, a PCR analysis was performed that targeted 27 of the most highly expressed orphan clusters. PCR reactions were performed on 1) the original environmental DNA preparations, 2) unamplified cDNA and 3) mRNA (this was a negative control, since it is DNAfree). Amplification products were detected for all 27 selected target sequences in at least one of the environmental DNA samples ( Table S2). None of the sequences could be detected in any of the 4 mRNA samples (negative controls) confirming an absence of contamination of DNA. All 27 transcripts were found in all four cDNA samples. For the mid-bloom time points, 12 and 11 of the transcripts respectively were identified in the high CO 2 and control environmental DNA samples (Table S2). Some, but not all, of these transcripts were of lower abundance when normalized to sequencing effort (data not shown).
Secondly, this is the first published metatranscriptomic study to include biological replicates ( Table 2) making it possible to compare observed transcripts generated from independent samples using the same methodology. Of the four metatranscriptomes analyzed, transcript clusters showed similar abundance in both peak bloom samples and both post bloom samples ( Table 1,  Table S2). Since all four metatranscriptomes were generated using the same mRNA enrichment methodology, this level of observed similarity of abundant transcripts would not be expected by chance and provides strong evidence that the difference seen between time points in both the treatment and control samples are due to biological differences in the composition of the community within the bloom (Table S2).
Thirdly, we compared the functional profiles of the metagenomes and metatranscriptomes (Fig. 1). All eight data sets were annotated using similarity matching against SEED subsystems [17]. While this approach only validates transcripts with observable homology to genes in known subsystems, it still shows that the metatranscriptome functional profile does not significantly differ from that of the metagenome (one-way Anosim R = 0.271, p = .0.05). For this analysis, the number of sequences with significant identity to each metabolic gene in a functional category in the SEED subsystem database were normalised to the sequencing effort for each sample (Fig. 1) and sequences which could not be annotated in this way were not included. Fourthly, we compared the level of assembly and novelty between our four mRNA and DNA samples and that of the only previously published metatranscriptomic study of a marine microbial community [9]. All samples were translated in all six reading frames into contiguous peptides of at least 30 amino acids without a stop codon, spurious pORFs were removed, leaving 2,567,050 predicted or putative pORFs (See Materials and Methods, Table 1). These pORFs were clustered to assess the diversity of function from each sample, and were compared against known databases to provide basic annotation of known proteins and potential identification of novel pORFs. As shown in Table 1, the majority of the highly clustered ($10 non-redundant sequences per cluster) transcripts (,94% mid-bloom and ,87% post-bloom) were novel clusters that may represent uncharacterized proteins.
The mRNA samples from the current study yielded 1,2 orders of magnitude more novel protein clusters than their corresponding DNA samples when normalized to size ( Table 1). Surprisingly, this high level of diversity was actually exceeded by the previously published Frias-Lopez study [9]. Clustering of that data according to the same criteria showed that ,98% of metatranscriptomic sequences were unique (Table S1). To directly compare the annotation of pORFs for this study with that of the Frias-Lopez study [9], we applied the same clustering techniques for the translated proteins to their raw data (Table S1). A total of 1,826 pORF clusters containing .10 non-redundant sequences were found (DNA, rRNA and mRNA) and 865 pORF clusters remained after all rRNA clusters were removed. If the values for novel protein clusters are normalised to sequencing effort, we see that the Frias-Lopez study identified 1.86 the number of novel protein sequences per sequencing effort when compared to the current study. This phenomenon can be partially explained by the differences in read lengths between the studies ( Table 2), as longer read lengths are more likely to be positively annotated than shorter read lengths [23,24].

Differences between sequence abundances in metatranscriptomes and metagenomes
The benefits of applying both metatranscriptomic and metagenomic analysis to the same biological samples include the potential to detect differential expression of mRNAs (function) between communities under different environmental conditions, while the metagenome (DNA) can also provide a frame-of-reference for the total potential of the community metatranscriptome. Using the proportion of DNA and mRNA sequences that had homology to known proteins, we were able to make phylum-level taxonomic assignments using annotations from the SEED databases [18] (Fig. 2). Comparison of the 4 DNA and 4 mRNA samples shows them to be significantly different in taxonomic composition (by one-way Anosim analysis, R = 0.385, p,0.03). Despite this, comparisons of all subsets of the data failed to reveal any significant differences (perhaps due to small sample sizes -data not shown). This suggests that changes seem in mid-and post-bloom time points are due more to changes in particular genes within taxa, than large-scale changes in the abundances of phyla-level taxonomic groups. Some potential qualitative changes can be seen within these patterns that may contribute to this significant difference between DNA and mRNAs including an increased number of transcripts from the Bacteroidetes phylum (an important group in macromolecule degradation) during the mid-bloom sample compared to its proportion of the same sample of DNA (Bacteroidetes was only the 4 th most abundant metagenomic group but had the 3 rd highest transcriptional activity) (Fig. 2).

The most abundant 'known' transcripts found in the metatranscriptome and metagenome
The most abundant 'known' pORF clusters included a large number of housekeeping genes. All pORF clusters with .10 non redundant sequences were annotated by comparison to the PFAM, TIGRfam and COG databases ( Table 3). Whilst the PFAM annotations yielded significant numbers of viral proteins, viral sequences were absent from both the TIGRfam and COG annotations (viral annotations discussed in next section). Among the most abundant sequences with annotations are stress-induced chaperonin proteins (Table 3), which are potentially expressed in response to the low pH or high CO 2 concentration stress found in the ocean acidification mesocosm samples. This is corroborated by the distribution of sequences with ,60% (1.46increase) of the chaperonin transcript sequences coming from the high CO 2 mesocosms. This is mirrored by the metagenomic data in which ,55% (1.26increase) of the chaperonin gene sequences are found in the high CO 2 environment. However, it is possible that these proteins are induced when the bacteria are being filtered, and hence this could be an artefact caused by the sampling procedure; using smaller starting volumes should alleviate this. Neither of these proteins were identified as being abundant in the dominant pORF clusters (.10 non-redundant sequences) from study by Frias-Lopez et al. [9] which utilised only 1 L sampling volumes and may have reduced stress on the bacteria by reducing the filtration time.
A range of proteins considered to be ubiquitous in cellular processes also ranked among the most abundant sequences that could be annotated. These included ribonucleotide reductase proteins (COG0209, TIGR02505, TIGR02506, PF02867), which were matched in all 3 reference databases ( Table 3), as were proteins involved in ABC transporters, ATPase activity and AMPbinding (COG5265, TIGR00630, PF00004, PF00005, PF00006, PF00501).
Abundant viral sequences in the post-bloom time point samples may contribute to the large number of orphan transcript clusters found At the end of any phytoplankton bloom, a substantial increase is expected in the number of expressed viral transcripts. This was observed in our post-bloom mRNA samples, in which transcripts with viral homologues were on average 24.5 times more abundant than viral DNA sequences (Fig. 2). While free viruses particles would pass through the 0.22 mm filters used, and would therefore have low sequence abundance in the post-bloom samples, infected cells would be expected to have overwhelming viral gene expression during lytic growth (Fig. 2). The large increase in viral transcription occurred immediately after a substantial increase in bacterial abundance following the phytoplankton bloom (data not shown).
The high expected abundance of viruses in the post-bloom environment suggests that many of the unknown predicted proteins maybe of viral origin [3]. This is supported by the annotation of the dominant pORF clusters (with .10 nonredundant sequences, Table 3) and Fig. 2. The most abundant sequence that could be annotated was PF02407, Putative Viral Replication Protein, and the second most abundant was PF00910, RNA helicase, which is thought to be involved in viral infection ( Table 3). These sequences comprise 7.7% and 5% respectively of the dominant clusters that can be annotated by comparison to the PFAM database (12.7% in total). Furthermore, these two proteins are more abundant in the postbloom environment, with ,86% of these transcripts being found only in the post-bloom samples. Interestingly, only a single homologue for PF02407 was found in the post-bloom metagenomes.
This not only confirms the results seen in Fig. 2, but also underscores a clear case of the biological significance of observed differences in the ratios of transcripts and their DNA sequences.

Further validation of metatranscriptomes and metagenomes by direct comparison to an oligotrophic ocean metatranscriptome
Both the validity and nature of mRNA transcripts from this experiment were explored by direct comparison with the Frias-Lopez data set ( Table 2) [9]. There was some overlap of sequences, including both house-keeping genes and a few of the most highly expressed novel orphan clusters. But the analysis also highlights the extensive diversity between these samples which, while both taken from the marine environment, came from two distinct marine habitats ( Table 2).
Specifically, comparisons were generated using BLASTN ( Table 4) for 3 versions of the two data sets: 1) total sequences, 2) representative ntDNA sequences from each nucleotide cluster and 3) representative sequences from each pORF cluster. Both mRNA and DNA sequences were compared separately. Values for the Frias-Lopez cDNA following removal of the rRNA sequences were also used for comparison. The most abundant clusters of the current study were also compared to the Frias-Lopez full mRNA and DNA datasets ( Table 5).
About 10% of the sequences in the two metagenomes are shared, but the shared proportion of DNA pORFs is higher (15%) for the Frias-Lopez study and considerably lower (3.7%) for the current study. Interestingly, this trend is confirmed by the DNA-mRNA comparisons in which the total proportion of DNA matches is always far lower than the proportion of mRNA ( Table 4). At the highest level of assembly, comparisons of pORF clusters reveal that 9% and 7.5% of the mRNAs are shared with the relevant metagenome. Smaller proportions of the pORF mRNAs of each study (5.0% and 0.7%) showed similarities to each other suggesting that different subsets of the ''potential metatranscriptome'' of the two communities are expressed in the two habitats. Table 4. BLASTN comparison of total nucleic acids, representative sequences from nucleic acid clusters and representative sequences from pORF clusters from this study and the Frias-Lopez study [9].
Gilbert et al [10] and current study Shared sequences are expected to include housekeeping genes, as suggested by the homology between many of the largest identifiable clusters found in both studies (Table 4 and Table 6). Similarities among the mostly highly expressed abundant clusters are still very rare ( Table 5). This could be a result of niche-specific genes (or post-bloom specific genes in this study) and/or the heavy viral load associated with the collapsing algal bloom conditions for the current study. This viral load hypothesis is potentially confirmed by an observed anomaly seen in Table 4. The relative percentage of total nucleic acid comparisons is higher than the comparison between nucleic acid clusters (nuc-clusters) for each analysis except when comparing our mRNA nucleotide clusters (mRNA nuc-clusters). The relative percentage increases from 0.53% to 1.45%, and is seen again when comparing against the Frias-Lopez mRNA following removal of the rRNA, whereby the values are 0.43% increasing to 1.2% (Table 4). We hypothesise that this anomaly is caused by the majority of the Frias-Lopez mRNA homolog's being singletons in our mRNA data, hence on clustering, their contribution to the percentage calculation is more significant. This highlights the fact that the abundant sequences in our mRNA data are not abundantly expressed in the Frias-Lopez data, which is to be expected if they are viral sequences.
The number of protein sequences that can be annotated through comparison to PFAM, TIGRfam or COG was approximately 2.5% prior to removal of rRNA sequences and 4.3% following removal. This is far lower than the 36.5% from our study which could be annotated (8.5 fold more) ( Tables 2 and Table  S1). When comparing the annotation of the top 10 most abundant pORF clusters (.10 non-redundant sequences) found in the Frias-Lopez studies ( Table 6) 6 (by PFAM), 7 (by TIGRfam) and 5 (by COG) clusters are found in both studies as abundant clusters (.100 sequences per pORF cluster). For example, the 1 st and 2 nd most abundant PFAM annotation for the Frias-Lopez study (PF00004) are the 4 th and 10 th most abundant PFAM annotation for the current study ( Table 3 & 6).

Summary
The ability to assess natural metatranscriptomes of complex microbial communities under different environmental conditions represents a significant advance in our ability to link community structure with function and DNA genotypes (sequences) with corresponding phenotypes. The approach presented here expands the available methodologies for assaying metatranscriptomes with .99% enrichment from total RNA (by removal of ribosomal RNA) and demonstrates that changes in expression of transcripts can be observed between time points. The outputs of this study include a large number of novel, highly expressed sequence clusters and confirmation that the majority of these clusters are orphaned and therefore further prove the utility of this approach for use in discovering novel genetic capacity [9]. The computational analyses produced in this study also demonstrates the critical importance of access to public portals, namely CAMERA [20] and SEED [17,18], for the processing of such vast quantities of complex data.

Supporting Information
Table S1 Comparison of DNA and mRNA from samples collected by Frias-Lopez et al [9]. Table S2 Information about the 85 most abundant nucleotide clusters. Including size, number of sequences in cluster, distribution of abundance of mRNA and DNA sequences within each cluster and the presence or absence of those clusters for which PCR amplification from environmental DNA was performed. T1B1 refers to high CO2 from the mid-bloom; T1B6 refers to present day CO2 from the mid-bloom; T2B1 refers to high CO2 from the post-bloom; T2B6 refers to present day CO2 from the post-bloom. Found at: doi:10.1371/journal.pone.0003042.s002 (0.27 MB RTF)

Table S3
Number of (A) nucleotide sequence and (C) partial ORF sequence homologues found between the eight datasets from the current study. Percentage of (B) nucleotide sequence and (D) partial ORF sequence homologues found between the eight datasets from the current study. T1B1 = Mid-Bloom, High CO2. T1B6 = Mid-Bloom, Present Day. T2B1 = Post-Bloom, High CO2. T2B6 = Post-Bloom, Present Day. pORF percentages are based on total pORFs, denoted d in Table 2. Found at: doi:10.1371/journal.pone.0003042.s003 (0.10 MB RTF) Table 5. BLASTN comparison of the reference sequences of the abundant nucleic acid clusters (.10 and .100 sequences per cluster) from the current study to the total combined mRNA and DNA sequences from the Frias-Lopez et al [9] study.  Table 6. Top 10 most abundant annotatable transcripts from the Frias-Lopez et al. [9]. (B). If a homologue was identified in the current study (A) that too is included. Numbers in columns A and B refer to the number of sequences which were assigned this particular protein annotation. Only pORF clusters with .10 non-redundant sequences were included in this analysis. doi:10.1371/journal.pone.0003042.t006