De Novo Assembly of a Transcriptome for Calanus finmarchicus (Crustacea, Copepoda) – The Dominant Zooplankter of the North Atlantic Ocean

Assessing the impact of global warming on the food web of the North Atlantic will require difficult-to-obtain physiological data on a key copepod crustacean, Calanus finmarchicus. The de novo transcriptome presented here represents a new resource for acquiring such data. It was produced from multiplexed gene libraries using RNA collected from six developmental stages: embryo, early nauplius (NI-II), late nauplius (NV-VI), early copepodite (CI-II), late copepodite (CV) and adult (CVI) female. Over 400,000,000 paired-end reads (100 base-pairs long) were sequenced on an Illumina instrument, and assembled into 206,041 contigs using Trinity software. Coverage was estimated to be at least 65%. A reference transcriptome comprising 96,090 unique components (“comps”) was annotated using Blast2GO. 40% of the comps had significant blast hits. 11% of the comps were successfully annotated with gene ontology (GO) terms. Expression of many comps was found to be near zero in one or more developmental stages suggesting that 35 to 48% of the transcriptome is “silent” at any given life stage. Transcripts involved in lipid biosynthesis pathways, critical for the C. finmarchicus life cycle, were identified and their expression pattern during development was examined. Relative expression of three transcripts suggests wax ester biosynthesis in late copepodites, but triacylglyceride biosynthesis in adult females. Two of these transcripts may be involved in the preparatory phase of diapause. A key environmental challenge for C. finmarchicus is the seasonal exposure to the dinoflagellate Alexandrium fundyense with high concentrations of saxitoxins, neurotoxins that block voltage-gated sodium channels. Multiple contigs encoding putative voltage-gated sodium channels were identified. They appeared to be the result of both alternate splicing and gene duplication. This is the first report of multiple NaV1 genes in a protostome. These data provide new insights into the transcriptome and physiology of this environmentally important zooplankter.


Introduction
The recent crash of the cod fisheries in the North Sea has thrown a spotlight on a key prey of larval cod: the calanoid copepod Calanus finmarchicus [1]. The population of C. finmarchicus has been severely impacted in this part of its range, presumably because current environmental conditions are preventing it from completing its life cycle [2]. Individual-based population modeling has been an important approach to discover how physical, chemical and biological factors influence C. finmarchicus population dynamics [3]. These models have been hampered by a lack of adequate physiological information, leading to repeated calls for a better understanding of the physiological ecology of this planktonic organism [3][4][5][6]. However, because this species inhabits a 3dimensional space that spans thousands of kilometers horizontally and a kilometer in depth across the open ocean, the resultant inaccessibility limits studies on the physiological ecology across habitats, seasons, climate (decadal oscillations, global climate change) and other environmental factors. A complex life history (including facultative diapause) and small size, as well as handling stress and time delays associated with collections, further limit C. finmarchicus and other key plankton as subjects for physiological studies. Thus, alternative approaches need to be developed for assessing physiological state in this species [7]. The application of gene expression pattern analysis offers one promising solution. Timely gene expression profiles encompassing a broad range of physiological processes might be obtained by combining RNA-Seq technologies with in situ preservation techniques [8]. Recent studies using subtractive hybridization, microarray and quantitative realtime polymerase chain reaction (qPCR) have demonstrated the value of gene expression studies in identifying differences in physiological state for individuals collected from different depths or differing in morphotype [9][10][11]. Global gene expression can identify biological, cellular and molecular processes that are regulated developmentally, seasonally and/or environmentally, and may thus provide key data for individual-based models.
Next-generation sequencing (e.g. 454 and Illumina platforms) has opened opportunities for developing molecular resources for non-model species that are of biological and economic interest, but which lack reference genomes [12]. Crustaceans, including copepods, are among the important invertebrates for which genomic resources are still limited [7]; a single crustacean genome is currently available publicly, e.g. that for the highly-derived cladoceran, Daphnia pulex [13]. One barrier to crustacean sequencing projects has been that many of the potential target species, including C. finmarchicus, have large genomes (C-values . 5 pg; www.genomesize.com). Of course, this also raises the question of how these large genomes and transcriptomes differ from the much smaller one of D. pulex (C-value ,0.4 pg; www. genomesize.com). Next-generation sequencing and the development of software programs to assemble the resultant short sequence reads make it possible to obtain transcriptomes for organisms with large genomes, as the transcriptomes are, in general, much smaller in terms of their nucleotide content [12], and they are more closely linked to physiological state. Thus, deep sequencing using RNA-Seq technology followed by de novo assembly has become an alternative to genome sequencing in the development of resources for protein discovery, developmental and physiological studies, and phylogenetic and evolutionary analyses for these organisms. With assembly programs like Trinity [14,15], this approach can be used to identify rare transcripts and splice variants, and to formulate new hypotheses with respect to isoforms originating from single or multiple genes at a level not practical with conventional approaches in non-model organisms.
In the present study, we used Illumina sequencing technology and de novo assembly to generate a transcriptome for C. finmarchicus. Six multiplexed gene libraries derived from RNA from individuals at different life stages were sequenced in a single lane in order to include genes differentially expressed over the course of development. Depth and quality of the assembled transcripts were determined using a combination of global and targeted annotation. Since the build-up of lipid stores is a critical component of the C. finmarchicus life cycle, we used the resultant de novo assembly for targeted gene discovery and expression analysis focused on transcripts involved in lipid biosynthesis pathways. One result of this approach was the identification of transcripts with developmental expression patterns that are consistent with their involvement in the preparatory phase of diapause. Another key environmental challenge for some C. finmarchicus populations is the seasonal exposure to a toxic dinoflagellate (Alexandrium fundyense) with high concentrations of saxitoxins, neurotoxins that block voltage-gated sodium channels [16]. Efforts to sequence the voltage-gated sodium channel using traditional PCR and cloning techniques have been unsuccessful in this species (M.C. Chapline and A.E. Christie, personal communication). We identified and characterized multiple sequences encoding putative voltage-gated sodium channels by mining the de novo transcriptome. These transcripts appeared to be the result of both alternative splicing and gene duplication in C. finmarchicus. In summary, the data presented in our study represent a powerful new resource for protein discovery and stage-specific gene expression analysis in C. finmarchicus, which will provide important insights needed to understand the physiological ecology of this ecologically critical North Atlantic zooplankter.

Sample Preparation and Sequencing
Development in C. finmarchicus consists of an embryonic stage that occurs within the egg followed by six naupliar (NI-NVI) and six copepodite (CI-CVI) stages. For the transcriptome described here, total RNA was obtained from six developmental samples of whole individuals (Table 1): embryo (egg), early nauplii (stages NI and NII), late nauplii (stages NV and NVI), early copepodites (stages CI and CII), pre-adults (stage CV) and adult females (stage CVI). Adult females and pre-adults (CV stage copepodites) were collected in June and July of 2011 from coastal waters near Mount Desert Rock, Gulf of Maine, NW Atlantic Ocean (Lat: 44u 29N; Long: 68u39W) as described previously [17]. Adult females (10 individuals) and sub-adults (6 lipid-rich stage CV individuals) were isolated from the 14-July-2011 field collection upon return to the laboratory, rinsed in filtered seawater, placed on a sieve to remove the seawater and transferred into RNA extraction buffer. Samples for the other developmental stages (embryos, nauplii and early copepodites) were obtained from laboratory-reared individuals. After collection, individual animals were transferred into 3.5 and 10 L containers of filtered natural seawater and held in an incubator maintained at 8-9uC and 12:12 L:D. Cultures were fed three times per week on live Rhodomonas baltica and algal paste (Reed Mariculture Shellfish diet). Adult females and males were isolated from these holding containers and placed into brood chambers, fed on R. baltica ad libitum, and checked for eggs daily. Eggs were separated from the brood chambers and either prepared for RNA extraction (400 eggs), or transferred to small culturing jars (250 to 500 ml) and returned to the incubator. The jars were checked daily and nauplii and copepodites were staged. After nauplii reached the feeding stage (NIII) R. baltica was added. As cultures reached the target stages, individuals were harvested, transferred through several washes of filtered seawater, and then placed into RNA extraction buffer. The number of individuals in each sample was: 180 early nauplii, 50 late nauplii and 40 early copepodites. Total RNA was extracted using a QIAGEN RNeasy Plus Mini Kit (catalog # 74134) with Qiashredder (catalog # 79654) following the instructions of the manufacturer with a final elution volume of 30 ml. For sequencing, a single RNA sample was generated for each developmental stage (Table 1). Sample concentration and quality were checked using an Agilent model 2100 Bioanalyzer or an Agilent RNA 6000 Bioanalyzer Nano (Agilent Technologies). Total RNA samples were shipped on dry ice to the University of Georgia Genomics Facility (dna.uga.edu) for library preparation and sequencing. Double-stranded cDNA libraries were prepared from total RNA extracted using the TruSeq RNA sample preparation kit (Illumina catalog # RS-122-2001) following manufacturer's instructions. Briefly, RNA samples were first purified with two oligo-dT selection (poly(A) enrichment using oligodT beds), and then fragmented and reverse transcribed into double-stranded complementary cDNA. The resulting six cDNA libraries were prepared with a 350 bp insert and primed using random hexamers. Each sample was tagged with an indexed adapter prior to shipping to Alpha Hudson Institute for Biotechnology (www.hudsonalpha.org) for sequencing. The samples were paired-end sequenced (100 bp) in a single lane using an Illumina HiSeq 2000 instrument.

De novo Sequence Assembly and Mapping of Reads
Prior to assembly, raw reads were assessed for quality using FASTQC (v0.10.0) software. Over-represented sequences were checked using blastn and were found to be C. finmarchicus ribosomal RNA. These sequences, which were less than 10% of the raw read data, were removed. The random primer sequences (first 9 bases) were trimmed prior to assembly using FASTX Toolkit (version 0.013; http://hannonlab.cshl.edu/fastx_toolkit/) software. The resulting reads were then de novo assembled using Trinity 2012-03-17-IU_ZIH_TUNED software (http:// trinityrnaseq.sourceforge.net/) [14,15] on the National Center for Genome Analysis Support's (NCGAS; Indiana University, Bloomington, IN, USA) Mason Linux cluster; each node of this computer system is composed of four Intel Xeon L7555 8-core processors running at 1.87 GHz with 512 GB of memory. For the assembly, reads from all six developmental stage samples were combined and the minimum sequence length in the assembly was set to 300 bp. Trinity comprises three separate software programs (''Inchworm'', ''Chrysalis'' and ''Butterfly''), which process the data sequentially [14,15]. The final output from Trinity is a large number of assembled FASTA sequences that are each identified by a unique Chrysalis component number (comp), followed by a ''c'' identifier, which is a Butterfly disconnected sub-graph designation, and a ''seq'' designation which is a Butterfly reconstructed sequence [15]. For simplicity, we refer to the individual assembled sequences as ''contigs'', and the clustered components as ''comps''. For assembly, the initial parameters of Trinity were set as follows: -seqType fq -bfly_opts ''-edgethr = 0.05'' -kmer_method jellyfish -CPU 32-max_memory 20G -min_contig_length 300 -bflyHeapSpaceMax 8G -bflyGCThreads 4. The resulting de novo assembly was used to generate two transcriptomes, the ''complete'' assembly and the ''reference'' transcriptome. The complete assembly consisted of all contigs.
The reference transcriptome included only unique comps; in situations in which a given numbered unique comp contained multiple contigs, the longest contig was selected for the reference. It should be noted that this procedure can result in the omission of occasional distinct genes grouped by the software under a single comp number. The reference transcriptome was annotated using Blast2GO (see below). Both transcriptomes were used to map reads from each of the stage-specific samples, as well as in a combined sample, using Bowtie (version, 2.0.6; default settings of two mismatches) software [18]. Before Bowtie mapping, reads were again quality filtered using FASTX Toolkit software, with a Phred quality score of 20 used as a limit. Low quality reads (fewer than 8%) were removed from the dataset.
Additional assemblies using the same settings were performed on the individual stage-specific samples, as well as subsets of all reads starting at 6 million reads to assess assembly statistics as a function of developmental stage and sequencing depth. Subsets were acquired by successively extracting every other read from a fastq file using a custom written Perl script (fastqDivide.pl, available at http://github.com/LenzLab/RNA-seq-scripts). Different ''sequencing samples'' were generated by further dividing subsets and/or recombining mutually exclusive subsets.

Assembly Validation and Annotation
To determine the extent of coverage of the Trinity assembly, and to assess its similarity to those of other species, functional annotation was undertaken using Blast2GO (version 2.6.4) [19] for the reference transcriptome. For this analysis, the blastx algorithm was used to search against the NCBI non-redundant (nr) and SwissProt protein databases, which were downloaded (February 2013) onto a local Beowulf Linux computer cluster; a maximum Evalue for annotation of 10 23 was employed in both searches. Gene ontology (GO) annotations for biological and molecular processes and cellular component were assigned using Blast2GO, here with a maximum E-value of 10 26 required for annotation.
To verify the coverage of our annotation results, we compared the proportion of sequences annotated in selected GO term groups to the proportion in these categories reported for the Drosophila melanogaster genome from a pre-computed GO annotation (http:// www.b2gfar.org/showspecies?species = 7227). Percentages of GO for biological process, molecular function and cellular component at ontology level 2 were calculated by dividing the total number of sequences annotated to a given GO term by the total number of annotated compounds (x100). A large percentage of genes were not expressed in any particular stage. Thus, for each developmental stage, we did a functional analysis of the ''silent'' transcripts. We determined the relative percentages of GO terms for biological process, molecular function and cellular component that were not expressed (#2 mapped reads).
Targeted gene discovery was focused on two groups of transcripts encoding for: 1) putative proteins involved in lipid biosynthesis; and 2) voltage-gated sodium channels. This analysis was used to gain further insight into the completeness and quality of the assembly, as well as expression patterns during development and the biological significance of compounds with multiple sequences. In addition to searching the annotated reference transcriptome, the complete transcriptome assembly was downloaded to a TimeLogic DeCypher server at the Mount Desert Island Biological Laboratory (Salsbury Cove, Maine) and searched using the Tera-BLASTP algorithm for sequences that were putative homologs of a known protein query (typically ones from the fruit fly D. melanogaster). The C. finmarchicus nucleotide sequences identified in this manner were fully translated and then aligned with and checked manually for homology to the query protein. In addition, each deduced Calanus protein was used as the query in reciprocal BLAST analyses against 1) the annotated proteins in FlyBase and 2) the non-redundant proteins curated at NCBI to identify the most similar protein in each database as a second measure of annotation. Conserved regions were identified by aligning C. finmarchicus predicted proteins with the D. melanogaster sequence showing functional domains to confirm that each predicted protein possessed the correct structural hallmarks. Finally, in an attempt to assess the correctness of assembled nucleotide sequences, each was used as a query in a blastn search of the extant C. finmarchicus ESTs (,12,000 in total) [10] curated at NCBI for transcripts encoding identical or highly similar sequences. This targeted transcript discovery workflow was modified from one described in detail in several recent publications [17,20,21].
Sequence data and the de novo assembly have been submitted to the National Center of Biotechnology Information (NCBI; www.ncbi.nlm.nih.gov) under bioproject PRJNA236528.

Sequencing Results and Assembly
Illumina sequencing of the six C. finmarchicus developmental stage libraries (embryos, early nauplii, late nauplii, early copepodites, pre-adults and adult females) yielded over 400 million paired-end 100 bp reads, with an average of 69 million reads per developmental sample (Table 1). This species has a Cvalue (amount of DNA contained within a haploid nucleus) of 6.48 pg [22], which translates into an estimated genome size of more than 6,000 Mb (conversion factor 1 pg = 978 Mb) [23]. Assuming that only 7 to 10% of the Calanus genome is transcribed, the Illumina reads represent a sequencing coverage of approximately 60 to 90-fold for the combined samples. The number of base pairs used in the assembly exceeded 30 billion (number of reads multiplied by 91 bp [the 100 bp read trimmed of the 9 bp random primer sequence]), which generated a de novo assembly with a total length of 205 million base pairs (Table 2). Thus, the ratio of the number of base pairs in the assembled transcriptome to the total number of base pairs was approximately 150. These estimates suggest that the coverage obtained for the C. finmarchicus transcriptome is as deep or deeper than those obtained in other crustacean de novo transcriptomics studies [12,[24][25][26].
Assembly of the Illumina reads by Trinity generated 206,041 contigs with an average length of 997 bp ( Table 2). Half of these (N50) were at least 1,418 bp long and the longest contig was 23,068 bp long (Table 2). It contained 96,090 unique comps, of which 73,925 (77%) consisted of single contigs. The remaining comps consisted of multiple contigs and ranged from 2 to over 1,500 sequences ( Figure 1). Mapping of the Illumina-generated reads against the complete, 206,041-sequence assembly yielded an overall alignment of 89% (Table 3; the missing reads presumably belonging to sequences below the 300 bp cut-off). However, given the redundancy found in the multiple contigs represented within some comps, a large percentage (44%) of reads mapped more than once (Table 3). Thus, the longest contig for each comp with multiple sequences, plus all singletons, were selected to produce a reference transcriptome of unique comps (96,090 sequences). When this sub-set was used as reference in the mapping step, the alignment rate decreased to 75% and the number of reads mapped .1 time decreased to 0.7%; Table 3). An analysis of the frequency distribution of number of reads showed that 75% of the predicted transcripts had 10 to 1000 reads mapped to them ( Figure 2). Very few of the reference sequences had fewer than five (log 10 [reads+1] #0.75) or more than 10 5 reads mapped to them ( Figure 2).
In order to obtain a measure of completeness of the assembly from the full set of reads, a series of de novo Trinity assemblies was generated using an increasing number of reads, from 6 million to the complete, 400,000,000+-reads dataset ( Figure 3). The total number of contigs assembled increased steeply from 38,000 to 100,000 between 6 and 50 million reads (1.5 to 12.5% of total available reads; Figure 3). After this initial increase, the rate of increase declined ( Figure 3). The number of unique comps in the assemblies also increased with number of reads (Table S1). In contrast, average sequence lengths were nearly constant, fluctuating between 900 and 1000 bp in the assemblies generated from 25 million reads and above ( Figure 3; Table S1). The assembly statistics (average length, N25, N50, N75) obtained for the smaller data sets were similar over a similar range in number of reads (Table S1). These results suggest that good assemblies could be obtained from as few as 50 million reads, which is not surprising given that Trinity software is designed to generate good assemblies even when coverage is low [14]. However, the number of assembled contigs continued to increase with additional reads, suggesting that even at 400 million reads, rare transcripts were still missing from the de novo assembly. A 2-exponential fit to the data predicted an asymptote at ,300,000 sequences, suggesting that the current assembly had ca. 65% of the total number of expected contigs. Independent estimates of completeness of the transcriptome were obtained through targeted protein discovery [17,20,21]. Searches for circadian proteins and the enzymes involved in amine biosynthesis identified putative transcripts for all expected proteins (100% coverage) [20,21]. In contrast, searches for neuropeptide preprohormones and receptors yielded incomplete sets of predicted transcripts (52 to 60% of expected) [17]. Neuropeptide-encoding sequences are rare in whole organism transcriptomes since they are typically restricted to the nervous system and are expressed in a limited number of cells within this organ, including in C. finmarchicus [27][28][29].
De novo assemblies completed for the individual developmental stage samples are summarized in Table 4. The number of contigs obtained for each individual sample was lower than those generated by sub-samples of reads randomly selected from the combined samples (isolated points below curve in Figure 3, Table 4). The number of unique comps was also lower and ranged between 37,692 and 50,216 with 73 to 78% of these being singletons. This proportion of singletons was similar to the assembly of all reads combined. Average sequence lengths were longer than expected compared to the assembly statistics obtained for a similar number of randomly selected reads (isolated points above the curve in Figure 3). In addition, the longest contigs exceeded 20,000 bp in all stage-specific assemblies except for that derived from embryos ( Table 4).

Annotation of the Reference Transcriptome: BLAST Results and Gene Ontology (GO)
The reference transcriptome, comprising the 96,090 sequences representing unique comps, was annotated using Blast2GO. The assembled sequences were searched against the non-redundant (nr) and SwissProt protein databases using the blastx algorithm with an E-value cutoff set at 10 23 . Searching against the nr database resulted in 38,289 comps (,40%) having significant blast hits (Table 5). A large percentage of the comps with no blast hits were short, i.e. in the 300-400 bp range (23,403 out of 55,306 sequences). Many of these short sequences probably represent partial transcripts, which may have contributed to the ''no blast   Reads used in the assembly (see Table 2) were filtered for quality using FASTX Toolkit, and low quality reads (8%) were removed prior to mapping. doi:10.1371/journal.pone.0088589.t003 hit'' result. Blastx results using SwissProt as the reference database, which is manually annotated and reviewed, yielded understandably fewer significant hits, comprising 28,616 comps (Table 5). Further analysis for gene ontology using the SwissProt database led to GO and GOSlim annotations of nearly identical numbers of comps, 10,334 and 10,344, respectively ( Figure S1). We obtained fewer GO and GOSlim annotations using the nr database as reference (Table 5). Nearly 30% of blastx results against the nr database had top hits with high E-values (.10 210 ), while fewer than 25% had E-values below 10 250 (Figure 4). This is consistent with relative paucity of genomic resources for crustaceans [25]. In contrast, blastx homology results of a recent de novo transcriptome of an insect, the western tarnished plant bug (Lygus hesperus), returned 55% of top hits with E-values below 10 250 [30]. Another aspect of the automated annotation is that the blastx algorithm is limited to nucleotide sequences shorter than 8,000 bp. The automated BLAST2GO annotation was not able to process any of the very long comps. Thus, we translated these comps into predicted proteins using an online translation tool (web.expasy.org/translate/?). These translated sequences were manually entered into blastp online and searched against nr protein sequences (http://blast.ncbi.nlm.nih.gov). This led to putative identifications of an additional 130 sequences, which represented expected long transcripts encoding large proteins, such as kettin/titin, supervillin, beta spectrin, midasin, and cytoplasmic dynein 2 heavy chain. Kettin/titin and supervillin are both actin-binding proteins with kettin/titin involved in muscle function. Beta spectrin is a cytoskeletal protein involved in membrane integrity and neuronal function. Midasin is a nuclear chaperone involved in the assembly/disassembly of macromolecules in the nucleus. Cytoplasmic dynein 2 heavy chain is motor protein involved in converting chemical energy (ATP) into mechanical energy (movement). These identifications were added to the reference transcriptome, bringing the total number of comps with significant blast hits to 38,419.
The under-representation of crustaceans with respect to genomic resources was also evident from the taxonomic distribution of top hits in the blastx results for the nr database. Nine taxonomic groups were represented in the 29 top-hit species ( Figure 5). Arthropods accounted for only 60% of the species (18 out of 29): four non-malacostracan crustaceans, 13 insects and one chelicerate. The branchiopod crustacean Daphnia pulex and the insect Tribolium castaneum had the largest and second largest number of top hits (2,905 and 1,927 out of 35,164 hits), respectively. Three parasitic copepod crustaceans (Lepeophtheirus salmonis, Caligus rogercresseyi and Caligus clemensi) were among the tophit species, with 2,672 combined blast hits with E-values #10 23 . Interestingly, the 29 top-hit species did not include a single Drosophila species or any decapod crustaceans. The fourth top-hit species was the lancelet Branchiostoma floridae (a cephalochordate), a result that was similar to that found for the amphipod crustacean Parhyale hawaiensis by Zeng et al. [25].
In order to assess the representation of biological and molecular processes and cellular components among the assembled compounds, the distribution of GO terms in the C. finmarchicus transcriptome was compared to that from the genome of Drosophila melanogaster (http://www.b2gfar.org/showspecies?species = 7227). Broad representation was found in C. finmarchicus (Figure 6). At gene ontology level 2, the relative distribution of GO terms was similar to that for the genome of D. melanogaster, with the largest proportions of GO annotations for biological process (BP) indicated for involvement in response to stimulus, metabolic, cellular and developmental processes and biological regulation, while binding and catalytic activity were the most common GO annotations for molecular function (MF). Highest percentages in the cellular components (CC) domain were seen in cell and  organelle categories. Gene ontology analysis using multi-level pie charts also identified several GO terms that might be indicative of contamination by other organisms. Specifically, we found comps annotated as plastids (GO:0009536), thylacoids (GO:0009579), viral reproduction (GO:0016032) and symbiosis encompassing mutualism through parasitism (GO:0004419). The percentage of reads that mapped to these sequences was between 0.01 and 0.25%. We also searched the annotated sequences for contamination by Rhodomonas baltica (the algal food used) and found 14   comps with Rhodomonas sp. as top hit species. Mapping of reads against these comps indicated very low contamination and the percentage of mapped reads ranged between 10 25 and 10 26% . The overall level of contamination in this transcriptome was low. Previous targeted searches by Christie and colleagues of this de novo transcriptome have been performed to identify transcripts of interest involved in neurochemical signaling and circadian rhythms [17,20,21]. These searches have been able to confirm the results of the automated annotation, but also led to the identification and more complete annotation of a number of transcripts. In particular, neuropeptides are rarely annotated via automated means and require a targeted workflow for identification [17]. Interestingly, one of the conclusions of these studies has been that the neurochemical signaling systems of Calanus finmarchicus appear to be more similar to those of insects than to higher crustaceans, which is supported by our current result of a large number of insects among the top hit species. These results are consistent with the pancrustacean grouping [31][32][33]. However, the position of the copepods within the Pancrustacea is still in question, and this transcriptome may well contribute to the on-going discussions on the phylogenetic placement of copepods.

Stage-Specific Expression: Global Patterns and Target Transcripts
For an analysis of stage-specific expression patterns, reads from each developmental stage were mapped against the reference transcriptome using Bowtie software. The total number of high quality reads used in the mapping step ranged from 50 million to 66 million reads per stage. Alignment statistics for the individual stages were similar to each other, and to those for all reads combined, suggesting that the Trinity assembly produced similar coverage for all developmental stages (embryo, early nauplius, late nauplius, early copepodite, late copepodite [CV] and adult female). Overall, alignment rate ranged between 73 and 77%, and fewer than 1% of the reads mapped more than once (Table  S2).
Frequency histograms of the number of mapped reads per comp for the six developmental stages were characterized by a large number of comps with very low counts and a broad shoulder with expressions of 10 to1000 mapped reads per transcript (Figure 7). These distributions are in contrast to the frequency histogram of all samples combined (Figure 2). The two lowest bins, representing 0 to 2 mapped reads, have the highest numbers of counts (Figure 7). The percentage of transcripts in these bins ranged between 35 and 48%. Mixed embryos had the largest percentage of sequences with zero reads (34%), while the early copepodites had the lowest (25%). The stage-specific global gene expression data thus suggest that at least 1/3 of the transcriptome was not significantly expressed (compounds with #2 mapped reads = ''silent'') during any particular life stage.
In order to gain some insight into the function of the silent transcripts, we analyzed their gene ontology. Of the 10,344 GOannotated comps, the number of silent transcripts in any one stage ranged from 1,777 to 2,933. For each developmental stage, the relative abundance of silent transcripts by function at gene ontology level 2 is shown in Figure 8. These transcripts represented a wide range of biological processes, molecular function and cellular component. In many categories, the proportion of transcripts annotated was similar across stages with percentages differing by less than 5%. The exception to this was the biological process localization (GO:0051179), which was under-represented in nauplii and over-represented in embryos and adult females. The percent of membrane proteins (CC, GO: 0016020) was also over-represented in embryos and adult females, and a similar pattern was observed for catalytic activity (MF, GO:0003824). At this level of organization, this analysis only provided limited insights into the function of the unexpressed transcripts.
For a more detailed analysis of stage-specific expression patterns, we examined transcripts involved in lipid biosynthesis. Although it is well known that fatty acids are critical for normal development, the regulation of lipid metabolism is still poorly understood in eukaryotes, and is an active area of research [34]. In C. finmarchicus, lipid stores play an important role in the life history, including growth, reproduction, resistance to starvation and timing of pre-adult diapause [35,36]. Although lipid stores have been quantified in this species [37], less is known about the specific pathways and their regulation in different life stages and/or under environmental fluctuations. The annotated reference transcriptome was searched for compounds putatively encoding proteins that are involved in the establishment of lipid stores using as a guide the Kyoto Encyclopedia of Genes and Genomes (KEGG; http://www.genome.jp/kegg/pathway.html) list of lipid biosynthesis proteins found in D. melanogaster. Thirty-seven compounds with significant expression (# of mapped reads) were so identified ( Table 6). Among these putative identifications were seventeen acyltransferases, eleven elongases and three desaturases. Searches of the extant C. finmarchicus EST database using the identified compounds as the query sequences provided strong support ($ 95% identity at the nucleotide level in regions of overlap) for 10 transcripts (footnotes in Table 6). However, many of these ESTs had not been annotated, so this is the first extensive list of putative transcripts involved in lipid biosynthesis for this species.
Pronounced stage-specific expression was observed in three lipid metabolism transcripts, one acyltransferase, one elongase and one desaturase (Figure 9). These three transcripts were highly expressed in either one (elongase, desaturase) or two (acyltransferase) developmental stages. Expression levels differed by more than 100-fold between the highest and lowest stages. The most highly expressed acyltransferase (comp27780) transcript was identified as diacylglycerol o-acyltransferase 1, which is involved in the last step of triglyceride ( = triacylglycerol, TAG) synthesis in eukaryotes [38]. This transcript had partial overlap with three C. finmarchicus ESTs (Accession Nos. FK670374, FG632581, FK867624). The highest expression levels of this transcript were observed in adult females (.100 RPKM) and in embryos (, 60 RPKM), with moderate/low levels seen in late copepodites (stage CV, , 9 RPKM; Table 7). Expression in all other stages was below 1 RPKM ( Figure 9A). This expression pattern is consistent with lipid storage associated with egg production. Three other acyltransferases (comp66316, comp70420, comp73859) showed peak expression in embryos and adult females (20 to 40 RPKM, Table 7). Other expression patterns with developmental stages among the acyltransferases included similar expression across all samples, or broad expression peaks across three to four stages ( Table 7).
The other two transcripts showed even more extreme patterns with expression levels in late copepodites (stage CV) at 200 and 700 RPKM, respectively ( Figure 9B,C). One of these transcripts, comp35445 was annotated as an elongase. Lipid-rich stage CV copepodites collected in the Gulf of Maine in July are presumably preparing to enter diapause [9,37,39]. This requires the synthesis of wax esters, formed from fatty acids largely derived from the diet, and esterified with fatty alcohols synthesized de novo, the latter step involving chain elongation [40]. This elongase had support from two non-overlapping ESTs (ELOV4: Accession No. ES387246, and unidentified: ES387223), which had been identified in a previous study as up-regulated in shallow (actively preparing for diapause) vs. deep (presumably diapausing) pre-adult C. finmarchicus [9]. Our results are consistent with the pattern of up-regulation described by Tarrant using subtractive hybridization [9]. In addition, our data suggest that this elongase is highly expressed in the pre-adult stage, but not in the other developmental stages examined here. Thus, this elongase may be involved in the regulation of diapause, as it catalyzes the synthesis of very long chain fatty acids, which, given the timing of its expression, would likely contribute to an increase in total lipid storage in sub-adult individuals preparing to enter diapause. A similar up-regulation of  elongase expression, presumably also in preparation for diapause, has been noted in several insect species [41][42][43]. The expression pattern seen for the comp35445 elongase was not representative of the other elongases (Table 7). Although a common feature among the elongases was low expression in embryos, expression patterns varied among comps. Peak expression in the late copepodite was observed in two other elongases (comp25729 and comp71231), but overall expression was lower and expression differences were not as extreme across developmental stages (Table 7). Five elongases showed high expression in multiple developmental stages, in particular in the nauplii (early and late) and early copepodites (Table 7). These expression patterns suggest that the elongases differ from each other in function, and each may play a very specific role during development. Recent studies on model organisms have focused on stage-specificity in expression of transcripts and proteins involved in lipid metabolism, and have started to reveal that elongases can have very specific roles during development [34].
The putative delta-9 desaturase (comp5606) had very high expression in the late copepodites (CV). Comp5606, which encodes a full-length protein, had support from two EST sequences encoding partial proteins (Accession Nos. EH666911 and EL965872). In contrast, the two other putative desaturases (comp546275 and comp718898) had much lower expression and their peak expression was observed in the early copepodites with RPKM values of less than 10 ( Table 7). Delta-9 desaturase is the rate-limiting enzyme in the formation of monounsaturated fatty acids from saturated fatty acids, and thus is a critical component in the production of phospholipids, triglycerides and cholesteryl esters, the synthesis of which is accomplished using its products. It is involved in the de novo biosynthesis of the fatty alcohol component of wax esters [40]. As with the putative elongase encoded by comp35445, the timing of expression of this delta-9 desaturase is consistent with it contributing to an increase in total lipid storage in CV individuals preparing to enter diapause, a function previously shown for homologous transcripts/proteins in diapausing insects [41][42][43].
These expression results of the delta-9 desaturase, elongase, and the acyltransferase are interesting and they present a marked contrast of wax ester vs. triacylglyceride biosynthesis between the late copepodite (CV) and adult. Furthermore, given the number of sequences identified with similar functional annotations, the transcripts involved in these processes may be very specific, and potentially are good biomarkers for physiological processes.

Significance of Multiplicity of Sequences in Complete Transcriptome
23% of the comps (22,165 out of 96,090) contained clusters of multiple contigs. Potential sources of sequence variation that might combine multiple contigs into single ''comps'' include polymorphisms within a population, presence of two or more genes of recent origin from a common source, alternative splicing and differences in the targeting signals, leading to differences in the untranslated regions (UTR) [14]. Thus, these types of sequence variants potentially have a biological source that must be considered in transcriptome analyses. However, some variability in sequences may also arise from assembly artifacts. In order to examine how different factors contributed to sequence multiplicity within compounds in our assembly, we examined transcripts putatively encoding the alpha subunit of voltage-gated sodium channels (Na V ). The principal unit of the Na V is a large protein with four repeat domains (R1-R4) of six trans-membrane segments each (S1-S6). These are highly conserved and have   been well studied in model systems [44][45][46]. In C. finmarchicus, these channels are potential targets for saxitoxin originating from seasonal blooms of the dinoflagellate Alexandrium fundyensis.
Multiplicity of function in the voltage-gated sodium channel has been achieved in various taxa both through mechanisms producing splice variants and through gene duplication. The former has been described for D. melanogaster, while the latter has been found in vertebrate model organisms [44,45]. However, it is not clear how many different voltage-gated sodium channels are present in crustaceans. Currently, there are few annotated Na V sequences available for crustaceans through NCBI; a search conducted on 6/24/2013 identified just two full-length sequences (Cancer borealis and Daphnia pulex) and three partial ones (Lepeophtheirus salmonis, Hyalella azteca, Scylla paramamosain). In D. melanogaster, two genes (Na V 1 and Na V 2) encoding voltagegated sodium channels have been identified in the genome [46]. The Na V 1 protein is also known as paralytic or para, and it has been shown to include multiple isoforms, many resulting from alternative splicing [47,48].
The D. melanogaster Na V 1 channel, para isoform A protein sequence (Accession No. P35500) was used as a query to identify similar sequences in the Calanus transcriptome. Five comps were identified as highly similar with E-values below 10 2100 (Na V 1-I to Na V 1-V; Table 8). Each comp was represented by multiple contigs ranging in number from two to 20. One comp (comp44060) included two Butterfly disconnected subgraph groups (''c'' groups) with two contigs each, while the rest were assigned to a single such group (Table 8). A reciprocal blast against the nr arthropod database using the blastp algorithm indicated that all contigs but two were most similar to insect para sodium channels with top hits having E-values of 0.0 ( Table 9). The two contigs (comp44060_c0_seq1 and comp44060_c0 _seq2) that did not identify a para sodium channel as a top hit were removed from all further analysis. A D. melanogaster Na V 2 channel protein sequence (Accession No. Q9W0Y8) was used to query the complete transcriptome for putative Na V 2 transcripts. A single comp with three contigs was identified (Table 8). A reverse blast against all nr arthropod proteins identified the sequence as most similar the sodium channel 60E (an alternative designation for an Na V 2) from an ant (E-value = 0.0; Table 9). Seven additional comps were identified as voltage-gated sodium channels by searching the ''top hits'' in the Blast2GO annotations (Table 8). These were shorter in length, had higher E-values, and all except one consisted of single contigs. These comps appeared to comprise contigs encoding portions of the Na V 1 channel (Table 8, 9). One additional short sequence (670 bp) putatively encoding a Na V 2 channel was not included in this list, since there were questions regarding its correct identification given a high E-value (1.4610 26 ). Only a single Na V 1 sodium channel gene has been identified in the D. melanogaster genome [46], and indeed we can only identify one such in the NCBI databases for Daphnia (accession EFX81393). Thus, the question arises as to whether the different comps identified as Na V 1 represented transcripts derived from a single gene or whether it was more likely that they resulted from multiple genes. To assess this, the longest sequence from each comp was translated and aligned using the online program MAFFT. These alignments confirmed significant differences among the Na V 1 sequences. One comp (Na V 1-III [comp682803]) appeared to code a full-length sodium channel protein and included all four conserved subunit repeats of the 6 membranespanning segments ( Figure 10). Two (Na V 1-IV [comp299307] and Na V 1-V [comp233807]) seemed to be partial N-terminal sequences, while the other two (Na V 1-I [comp222993] and Na V -II [comp44060_c3]) represented partial C-terminal sequences ( Figure 10). The two pairs of overlapping comps differed substantially in their amino acid sequences both from each other and from homologous regions of Na V 1-III (comp682803). Identity across each of the four conserved repeat domains of the comps was modest, ranging from 57% to 87%, suggesting that they represented transcripts from different genes. However, the N-and C-terminal partial sequences seem likely to represent two parts of single transcripts, Na V 1-I and IV being the most Drosophila-like (76-83% identity in conserved domains) and Na V 1-II and V being less so (64-68% identity). Thus, the five longest compounds assembled by Trinity represent transcripts from putative para sodium channels (Na V 1) that appeared to have been derived from three different genes. Alignment of the remaining seven Na V 1 compounds obtained from the Blast2GO annotations indicated that these sequences encoded different portions of the protein, with minimal overlap with each other ( Table 8). All of these sequences differed from Na V 1-I to V in corresponding regions, even in conserved transmembrane regions, albeit portions were up to 90% identical. Thus, these short sequences may represent partial assemblies of a fourth Na V 1 channel. The occurrence of genes split among different comps is not surprising given 96,090 comps in an organism expected to have closer to 20,000 genes. It presumably reflects the incompleteness as well as the statistical limitations of such assemblies.
All Na V 1 sequences adhered to the DEKA pattern (Asp, Glu, Lys, Ala) in the 4-amino-acid ring forming the selectivity filter that characterizes Na V 1 channels [49,50], albeit the total of 40 amino acids of the 4 pore-lining regions (p-loops) differed from the homologous regions of D. melanogaster by 2, 4 or 8 residues respectively for putative genes Na V 1-I&IV, II&V and III. The differences with respect to the Daphnia Na V 1 (accession EFX81321) were greater (7, 8 and 11, respectively). If confirmed, this represents, to our knowledge, the first evidence for multiplicity within the Na V 1 family in a protostome.
In contrast to the para channel, a single comp encoding a putative full-length Na V 2 channel was identified (Table 8, 9). This had a selectivity filter pattern of DEEA, diagnostic for Na V 2 of Drosophila and other protostomes [49]. The aggregate p-loops differed by 3 residues from Drosophila, compared with 5 in Daphnia Na V 2 (accession EFX89321).
Another question concerns the multiplicity of sequences within each comp. Alternative splicing and intra-sequence deletions have been documented in D. melanogaster, which may have as many as 19 functional isoforms of the Na V 1 gene with each producing pharmacologically distinct sodium channels [51]. Trinity is designed to collect such variants into clusters [14]. In order to obtain some insights into the diversity of Na V channels in C. finmarchicus, we aligned the multiple contigs within each comp and analyzed their patterns.
Occurrence of splice variants would be one source of extensive sequence differences clustered or restricted to specific parts of the molecule. We searched for such clustering by dividing the nucleotide sequences within one comp into ''segments,'' with the boundaries defined by transitions in either direction between matched and unmatched in comparisons across sequences (e.g. sequence divergence, convergence or termination points). Withincomp differences tended to be grouped and specific to a given region or a few regions. Predicted protein sequences of the five transcripts coding for Na V 1-III include two sequence variants (''substitutions'') each in segments 2 and 4 suggesting alternative splicing ( Figure 10). However, the most extensive example of clustered-variation was found within the Na V 1-I compound, where two and only two variants were found in each of segments 6, 8 and 9, and these occurred in 7 of the 9 possible combinations ( Figure 10).
Intra-sequence insertion/deletion patterns, including those occurring at the ends also contributed to observed sequence variation among contigs within one comp. Deletions typically produce variation in length unless compensatory insertions occur elsewhere in the remaining sequence. This source of variation was observed in Na V 1-II and Na V 1-IV, and it was particularly prominent in Na V 1-I with segments 1, 3 and 4 absent in various combinations with the sequence variants in segments 6, 8 and 9 ( Figure 10). Premature termination of the expected amino acid sequence was observed in Na V 1-I, where mid-sequence ''stop'' codons produced predicted partial proteins for contigs 12-18 (excluding 14). This truncated the sequence at segment S6 of the 3 rd repeat. Similar premature terminations were found in Na V 1-V, in the Na V 1 fragment comp4062624 and in the Na V 2 fragment comp4340636. This type of termination, leading to truncated channel proteins, has been reported in several taxa and may be presumed to serve an as-yet undetermined biological function [51].
There was some evidence for assembly errors in at least one set of contigs. Specifically, we found that in one case two parts of valid sequences appeared to have been joined to form a ''chimera''. The translated Na V 1-II (comp44060_c3), when compared to Drosophila Na V 1, consisted of an N-terminal piece that blasted as complexin, and a non-contiguous C-terminal portion that showed good alignment with the Na V 1 D. melanogaster sequence ( Figure 10). Between these two regions was a stretch ,450 aa in length only sparsely assigned matches by MAFFT (indicated by a thin line in Figure 10).
The picture emerging from these studies, awaiting confirmation from conventional molecular approaches, is that Calanus finmarchicus possesses at least four voltage-gated sodium channel genes, one of which is a protostome ''Na V 2'' channel and the other three of which represent members of a family of Na V 's similar to, though less numerous than, the vertebrate Na V 1 family. Splice variants are present in at least two members of this latter family, and one of these in particular is expanded into a large family of splice variants, similar to the situation found in D. melanogaster.
Relative expression of Na V transcripts among developmental stages showed a similar pattern for both Na V 1 and Na V 2 sequences: low expression in the embryo, adult females and late copepodites (CV), and high expression in early and late nauplii and early copepodites ( Figure 11). Expression patterns in Na V 1-II and Na V 1-V were similar, with a peak in expression in the early copepodite, supporting the hypothesis that these two contigs represent partial sequences of a single gene. Since mapping of reads was limited to a single contig for each comp, it underestimated the total number of reads mappable to a given Na V 1 gene. This might be particularly true for Na V 1-I given the large number of splice variants predicted in this comp. Relative expression of the Na V transcripts in the stages with the highest expression ranged from 3 to 17 RPKM. These expression levels are low in comparison to many of the lipid biosynthesis transcripts.
The inclusion of RNA from the developmental stages in this transcriptome study may have been critical for predicting the Na V proteins in C. finmarchicus. If the samples had been limited to adults and late copepodites, we might not have obtained enough coverage to assemble the transcripts.
Using Calanus sequences as blast probes to search the Daphnia pulex genome identified four genes with good E-values (,10 282 ). Only one (EFX89321) was annotated as a voltage-gated sodium channel and turned out to be a Na V 2. A second, unidentified gene, EFX81393 referred to as Dappudraft 50150, matched both Calfi Na V 1-I and D. melanogaster SCNA_DROME sequences with Thus, the current state of knowledge suggests that Calanus probably possesses three and possibly four distinct genes within the Na V 1 family, each occurring in multiple isoforms, as contrasted to Daphnia and insects with a single such gene. The multiple sequences associated with the different comps is similar to the types of isoforms identified in D. melanogaster, and these alternatively spliced variants, if confirmed, may contribute to functional diversity as in the fruit fly. However, a cautionary note also emerged from this analysis: the alignment programs are designed to optimize the number of matches, but they may do this by splitting sequences. While permitting sequence splits as part of the alignment process is biologically valid (e.g. in alternative splice variants) [14], it may also lead to assembly artifacts. Thus, the alignments and segmentations deduced from this approach must be taken with some reservation, and need to be confirmed by other methods.

Summary and Conclusions
Illumina sequencing of six multiplexed samples comprising different stages of C. finmarchicus provided a transcriptome for this non-model species and new insights into stage-specific expression patterns. The de novo transcriptome developed significantly expands genomic resources for this important species, which heretofore have been limited to fewer than 12,000 EST sequences [10]. This de novo transcriptome provides an opportunity for additional interpretation of previous studies based on qPCR, subtractive hybridization and microarray technologies [9][10][11][52][53][54]. It adds to these studies by providing additional sequence data on the target transcripts, including predictions for full-length proteins, potential information on alternative splicing and even the presence of additional genes with similar function. Conversely, the EST database has been an invaluable tool for vetting the Trinity assembly, since these sequences are single pass reads on a Sanger sequencer.
Two transcriptomes were obtained from the de novo assembly. The complete assembly, composed of 206,041 contigs, includes many comps with multiple sequences. Using the voltage-gated sodium channel as an example, we determined that the multiple sequences within a comp represented different forms of presumably single genes. This analysis led to the hypothesis that multiplicity of function in the Na V 1 channel in C. finmarchicus is achieved through both alternative splicing and gene duplication. This diversity of transcripts for voltagegated sodium channels helps explain why these proteins have been so difficult to identify via molecular cloning. The hypothesized presence of multiple genes in the Na V 1 family is the first reported for any member of the protostomes. This complete assembly, which represents coverage of ca. 65% of the transcriptome or better, provides an excellent tool for gene and protein discovery.
The reference transcriptome was composed of a single representative sequence obtained from each unique comp, and included 96,090 sequences. The reference transcriptome was annotated using global annotation tools, and it provided a useful tool for gene expression studies. In contrast to the complete transcriptome, mapping statistics using this reference resulted in a very low percentage of reads that mapped more than once (, 1%). Mapping of the individual stages against the reference indicated that as much as 35 to 48% of the transcripts were not expressed in any particular developmental stage. Targeted analysis of expression pattern focused on transcripts involved in lipid biosynthesis suggests that transcripts with similar annotations may exhibit very different expression patterns during development. This is not surprising given the importance of lipids in developmental processes. Furthermore, two transcripts were highly expressed in the late copepodite (CV) stage. These lipid-rich individuals collected in July were most likely in a Table 9. Reciprocal blastp analyses of putative Calanus finmarchicus voltage-gated sodium channels identified in Table 8 against all NCBI curated non-redundant arthropod proteins (searches completed prior to August 1, 2013 pre-diapause stage, and these transcripts may be important in the regulation of diapause.   were trimmed by 9 bp prior to mapping. Forward and reverse reads were mapped separately to obtain technical replication. (DOCX) Figure 11. Relative expression of putative Na V 1 transcripts for different developmental stages. Yellow: Na V 1-I (comp222993); orange: Na V 1-II (comp44060_c3); red: Na V 1-III (comp682803); blue: Na V 1-IV (comp299307); green: Na V 1-V (comp428211). doi:10.1371/journal.pone.0088589.g011