Whole transcriptome analysis using next-generation sequencing of sterile-cultured Eisenia andrei for immune system research.

Recently, earthworms have become a useful model for research into the immune system, and it is expected that results obtained using this model will shed light on the sophisticated vertebrate immune system and the evolution of the immune response, and additionally help identify new biomolecules with therapeutic applications. However, for earthworms to be used as a genetic model of the invertebrate immune system, basic molecular and genetic resources, such as an expressed sequence tag (EST) database, must be developed for this organism. Next-generation sequencing technologies have generated EST libraries by RNA-seq in many model species. In this study, we used Illumina RNA-sequence technology to perform a comprehensive transcriptome analysis using an RNA sample pooled from sterile-cultured Eisenia andrei. All clean reads were assembled de novo into 41,423 unigenes using the Trinity program. Using this transcriptome data, we performed BLAST analysis against the GenBank non-redundant (NR) database and obtained a total of 12,285 significant BLAST hits. Furthermore, gene ontology (GO) analysis assigned 78 unigenes to 24 immune class GO terms. In addition, we detected a unigene with high similarity to beta-1,3-glucuronyltransferase 1 (GlcAT-P), which mediates a glucuronyl transfer reaction during the biosynthesis of the carbohydrate epitope HNK-1 (human natural killer-1, also known as CD57), a marker of NK cells. The identified transcripts will be used to facilitate future research into the immune system using E. andrei.


Introduction
Earthworms are well known as a key organism in the study of environmental toxicology [1][2][3]. Because they are easy to maintain in a laboratory setting [4,5], a large number of studies have been conducted using earthworms. Two earthworm species, Eisenia fetida and Eisenia andrei, have been used in European pesticide marketing authorization experiments since the early 1980s [6]. In these studies, several bioactive proteins were identified in the coelomic fluid of earthworms [7][8][9]. These proteins exhibit a variety of antibacterial, hemolytic, cytotoxic, hemagglutinating, and proteolytic activities [10]. Some reports have suggested that coelomocytes in the coelomic fluid play a role in immune defense [11][12][13]. For example, it was demonstrated that some coelomocytes possess cytotoxic activity similar to natural killer (NK) cells by co-culture with K592 cells, a human NK-sensitive cell line [14]. In that study, when the coelomocytes were co-cultured with K592 cells, they extended numerous small pseudopodia that bound to and killed K592 cells. In addition, cross-reactivity of coelomocytes with monoclonal antibodies against several human CD markers of immune-related cells was also reported [15]. Thus, earthworms are ideal models for conducting immune system research. Knowledge of the apparently less complex invertebrate immune system can contribute to our understanding of the more sophisticated vertebrate immune system, the evolution of the immune response and potentially lead to the identification of new biomolecules with therapeutic use [10,16]. However, as earthworms have only recently gained attention as a genetic model for immune system research, many basic molecular and genetic resources, such as expressed sequence tag (EST) databases, have yet to be established. EST information is an important resource for genetic and genomic studies, such as gene identification, verification of gene prediction, and gene sequence determination [17]. Traditional EST libraries are usually generated through the construction of an EST library and sequencing, which is time-consuming and expensive. Recently, with the development of the next-generation sequencing technologies, EST libraries have been successfully constructed from total RNAs in many model species [18][19][20]. Gui [21] analyzed the whole transcriptome of the Indo-Pacific humpback dolphin leukocyte, which resulted in the identification of putative genes involved in aquatic adaptation and the immune response. In addition, Xu [22] performed de novo assembly of the transcriptome of Setaria viridis, an emerging model species for genetic studies of C4 photosynthesis, and identified 60,751 transcripts. Their study is expected to provide the basis for future genetic studies of C4 photosynthesis. Thus, to support immune system research in earthworms, we used Illumina RNA-seq technology to perform a comprehensive transcriptome analysis using RNA samples pooled from sterile-cultured E. andrei. This report presents a description of the identified genes expressed in E. andrei and their functional annotation. All clean reads were assembled de novo using the Trinity program into 41,423 unigenes. Using this transcriptome data, we performed BLAST analysis against the GenBank nonredundant (NR) database and obtained a total of 12,285 significant BLAST hits. Furthermore, gene ontology (GO) analysis assigned 78 unigenes to 24 immune class GO terms. These identified transcripts can be used to facilitate future immune system research using E. andrei.

E. andrei materials, RNA isolation, and Illumina sequencing
The egg capsules (cocoons) of E. andrei were purchased from KIRYU Int. Co. (Miyazaki, Japan). After rinsing with PBS containing 50 μg/ml ampicillin, the cocoons were seeded onto 100-mm disposable petri dishes (Falcon), each coated with 100 ml of 1.0% agar supplemented with 0.5 mM NaCl, 0.05 mM KCl, 0.4 mM CaCl 2 , 0.2 mM NaHCO 3 and 50 μg/ml ampicillin. The cultures were kept at 20°C. After the eggs hatched out, the worms were fed powdered skimmed milk every 2 weeks, and grew to a length of 3-5 cm in approximately 3 monthcultures. The adult worms were immediately frozen in liquid nitrogen and stored at −80°C until use. The frozen whole worms were ground in liquid nitrogen and total RNA was prepared using the Fastpure RNA Kit (Takara Bio Inc., Otsu, Japan) according to the manufacturer's instructions. RNA integrity was assessed by agarose gel electrophoresis and the concentration was quantified using a Nanodrop ND-1000 spectrophotometer (Thermo Fisher Scientific, Wilmington, DE, USA). The qualified RNA samples were analyzed by the Illumina Sequencing Services of Hokkaido System Science Co., Ltd. The cDNA library for sequencing was constructed as described previously [21]. Sequencing was performed on an Illumina HiSeq 2000 [Q30 = 97.55%].  Data pre-processing, filtering, and de novo assembly For transcriptome assembly, raw reads were filtered and adapter sequences, non-coding RNA (e.g., rRNA), low-quality read sequences with ambiguous bases 'N', and raw reads with average lengths of less than 20 bases were removed. The Trinity program [23] was utilized for de novo transcriptome assembly, combining read sequences of a certain length of overlap to form longer fragments without N gaps, called contigs. These contigs were further processed for read alignment and abundance estimation with Bowtie [24] and RSEM [25], and these sequences were defined as unigenes. Calculation of unigene expression was performed using the Fragments Per kilo base of exon per Million mapped fragments (FPKM) method, which was able to exclude sequencing discrepancies in the calculation of gene expression and the influence of different gene lengths. The number of unigenes was 41,423 with a threshold of more than FPKM = 2. All raw read sequences are available at the DDBJ Sequence Read Archive [26] with the accession number: DRA002587.

Functional annotation of unigenes and simple sequence repeats (SSR) identification
A homology search against the NCBI non-redundant (NR) protein database (http://www.ncbi. nlm.nih.gov; formatted on April 7, 2014) based on the BLASTx program [27] was performed for all unigenes using a cutoff E-value < 1E −6 , and the best aligning results were selected to annotate the unigenes. To further annotate the unigenes, the Blast2GO program v. 2.7.1 [28] was used to obtain gene ontology (GO) [29] terms and the Kyoto Encyclopedia of Genes and Genomes (KEGG) [30] information according to the NR results. The BGI WEGO program [31] was used to perform GO functional classification of all unigenes to visualize the distribution of gene functions. The identification of immune-related genes was performed as described by Pereiro et al. [32]. Furthermore, to detect more genes belonging to the relevant immune pathways in the transcriptome sequences, KEGG reference pathways identified in a previous report were utilized [21]. Microsatellite identification tool (MISA) (http://pgrc.ipk-gatersleben.de/misa/) with default parameters was used to detect microsatellites in the unigenes.
Prediction of the tertiary structure of E. andrei GlcAT-P The tertiary structure of E. andrei GlcAT-P was predicted by homology modeling with SWISS-MODEL server [33]. In the homology modeling, the GlcAT-P structure from human (Protein Data Bank (PDB) entry 1v83) [34] was used as a template. Structural Figs. were drawn using PyMOL [35].

Results and Discussion
Illumina sequencing and sequence assembly In total, Illumina sequencing yielded 47,555,164 reads comprising 4,803 M bases from the mRNA pool isolated from sterile-cultured E. andrei. After the removal of adaptor sequences, ambiguous reads and low-quality reads, we obtained 47,070,010 clean reads. The Q30 percentage (sequencing error rate, 0.01%) was 97.55%. As shown in Table 1, 47,070,010 clean reads were assembled de novo using the Trinity program. The clean reads were further assembled into 151,929 contigs with an average length of 968 bp and an N50 of 1,855 bp. The length and GC content distribution for all contigs are shown in Fig. 1A and 1B. In total, 72,539 contigs were longer than 500 bp and the GC of 17,320 contigs was greater than 50%. From the contigs, 41,423 unigenes were obtained with an average length of 1,633 bp and an N50 of 1,567 bp ( Table 1). The length and GC distributions of all assembled unigenes are shown in Fig. 1C and 1D. In total, 7,219 unigenes were less than 500 bp and 5,488 unigenes were longer than 3,000 bp. The GC content of 4,722 unigenes was greater than 50%.

Functional annotation
To validate and annotate the assembled unigenes, all assembled unigenes were searched against the GenBank non-redundant (NR) database using the BLASTx program (E-value < 1E -6 ). The results showed that 12,285 unigene sequences had BLAST hits to annotated proteins in NR. Analysis of the distributions of E-values indicated that 82.7% of the aligned sequences showed significant homology to the entries in the NR database (E-value < 1E -6 ) ( Fig. 2A). Further analysis of the similarity distributions indicated that 73.3% of matched sequences had alignment identities greater than 80% (Fig. 2B). A large number of the hits matched the sequences of Capitella teleta (24.8%) and Helobdella robusta (18.1%); other hits were identified within the reference protein databases of Crassostrea gigas (7.3%), Lottia gigantea (5.7%), Aplysia californica (5.4%), Saccoglossus kowalevskii (4.8%), and Strongylocentrotus purpuratus (4.7%) (Fig. 2C). There were also many unigenes with no BLAST hits; these might represent additional genes that were not represented in the annotated protein databases or sequences that were too short to produce hits. GO (gene ontology) is an international system for the classification of standardized gene functions and is used to annotate and analyze gene functions and gene products in any organism. GO contains three main independent ontologies: biological process, molecular function, and cellular component. To predict their possible functions, we used the Blast2GO program to analyze the GO annotations of the assembled unigenes, and then used WEGO software to perform GO functional classifications. Based on NR annotation, 12,285 unigenes were assigned to 99 GO terms belonging to the three main GO ontologies (Fig. 3). Further analysis of the 99 GO terms showed that the dominant terms were "cell", "cell part", "binding", "catalytic", "cellular process", and "metabolic process". Within the biological process group, the great majority of unigenes were related to the terms "cellular process" and "metabolic process". Within the cellular component, most unigenes were assigned to "cells" and "cell parts", followed by "binding" and "catalytic activity".
KEGG, a pathway-based categorization of orthologous genes, provides useful information for predicting the functional profiles of genes. Therefore, to better understand the function of the E. andrei transcriptome, the unigenes were assigned against the KEGG pathway. In total, 1,322 unigenes were grouped into 100 KEGG pathways (S1 Table). The top three ranking pathways were purine metabolism (155 unigenes), thiamine metabolism (53 unigenes), and glycolysis/gluconeogenesis (49 unigenes) (Fig. 4). Interestingly, 47 unigenes mapped to the T cell receptor signaling pathway, which is involved in immune response (Fig. 4).

SSR markers identification
Microsatellites, also known as SSRs, are classes of repetitive DNA sequences ubiquitous in eukaryotic genomes [36]. SSRs are ideal markers for use in paternity determination, population genetics studies, genetic diversity assessment and genetic map development [37]. The results of microsatellite analysis are shown in Table 2 and S2 Table. From the unigene sequences, 14,717 SSRs (10,764 with simple repeats and 722 with compound formation) were identified in 12,539 unique sequences, of which, 1,775 sequences contained more than one SSR. All SSRs were further classified by the number of repeat units. The results showed that the number of most potential SSRs were composed of three repeat units (54.1%), followed by two (31.2%), four (7.7%), and four (6.8%) repeat units. The SSRs of E. andrei could provide potential genetic markers for studies of population genetics, comparative genomics, as well as gene-based association studies aimed at understanding the genetic control of adaptive traits.

Identification of sequences related to the immune response
A keyword list and GO immune-related terms were used to search for genes putatively involved in the immune system of E. andrei (S3 Table). The results of this search identified a Transcriptome Analysis of E. andrei number of immune-related genes that were involved in well-recognized immune pathways, such as toll-like receptor (TLR) and interferon-gamma-mediated signaling. The toll receptor, as the signal transducer of the Toll pathway, plays a crucial role in the innate immune response. In this study, we identified a gene encoding TLR6 in the transcriptome datasets, as well as other genes belonging to the TLR signaling pathway, such as myeloid differentiation primary response protein (MyD) 88 and mitogen-activated protein kinases (MAPKs) [Fig. 5]. Members of the JAK (Janus kinase) family are intracellular, non-receptor tyrosine kinases that transduce cytokine-mediated signals via the JAK-STAT pathway [38]. Many studies have shown that signal transducers and activators of transcription (STAT) proteins are involved in the development and function of the immune system and play a role in maintaining immune tolerance and tumor surveillance [39]. In our study, a unigene was identified with high similarity to mammalian JAK 2 but not to other members of the STAT family (STAT1, STAT2, STAT3, STAT4, STAT5A/5B, and STAT6). Identification of additional JAK-STAT pathway-related genes will be useful for learning more about the complexities of immune responses in E. andrei. In addition, signaling and interaction molecules, such as cytokines and cytokine receptors, were also identified in the transcriptome, as were proteases, protease inhibitors, and stress proteins such as heat shock proteins.

Identification of markers of NK cell
NK cells are a type of cytotoxic lymphocyte that plays an important role in the innate immune system. The development of NK cells was a critical event in the evolution of the immune system [40]. Despite their importance, the evolutionary origin of NK cells has remained unknown [41]. Previous reports have suggested that earthworms have NK-like cells [12][13][14]. Therefore, these animals may occupy a key position in the evolution of the NK cell-dependent immune system. Thus, we next searched our unigene database for markers of NK cells, such as CD16 and CD56, and NK-related genes, using the BLASTx program (E-value < 1E -6 ). As shown in Fig. 6A, we detected a unigene with high similarity to beta-1,3-glucuronyltransferase 1 (GlcAT-P), which mediates a glucuronyl transfer reaction during biosynthesis of the carbohydrate epitope HNK-1 (human natural killer-1, also known as CD57, an NK cell marker). The similarity was much higher than that of other unigenes to human NK cell markers. Thus, we further investigated this unigene. We found that the unigene has 44% amino acid sequence identity and 55% amino acid sequence similarity with human GlcAT-P. The structure of human GlcAT-P and the structural basis for acceptor substrate recognition of human GlcAT-P in the biosynthesis of HNK-1 have been reported [30]. However, because newly synthesized polypeptide chains must undergo folding into tertiary conformations and post-translational modifications before becoming a functional protein, a high level of identity in primary amino acid sequence between species does not automatically confer identical tertiary structure. Thus, as a preliminary analysis of the tertiary structure, we estimated the structure of earthworm GlcAT-P using homology modeling with the human GlcAT-P, and obtained the tertiary structure of amino acid residues 101-355 of E. andrei GlcAT-P, including the uridine diphosphate (UDP) binding region which plays an important role in enzyme activity of GlcAT-P. As shown in Fig. 6B, S1 and S2 Movies, a high level of conservation of the UDP binding region was GlcAT-P. The tertiary structure of E. andrei GlcAT-P was modeled using the structure of human GlcAT-P as a template. In each GlcAT-P structure, UDP binding regions and amino acid residues in the active site are shown as a ball and stick models, respectively. The position of the UDP binding region in E. andrei GlcAT-P is based on the human GlcAT-P structure. doi:10.1371/journal.pone.0118587.g006 predicted between these two species in the tertiary structures of GlcAT-P. This suggests that the carbohydrate epitope HNK-1 is synthesized in E. andrei and that an anti-CD57 antibody could be used to identify NK-like cells in E. andrei. Epigeic species of earthworms, including E. andrei, live in top layer of the soil, which is rich in decaying organic matter and characterized by a wide variety of microbiota. These environmental contaminants and microbiota, including viruses, bacteria, and protozoans, could interfere with the abundance of epigeic earthworms by inducing a high mortality rate, decreasing reproductive success or synergistically increasing the virulence of other diseases. Therefore, to be successful as an epigeic earthworm, E. andrei needs to be resistant to various microorganisms present in the top layer of the soil. However, the mechanisms responsible for immune responses to these environmental contaminants and microbiota in E. andrei are not fully understood. Innate immunity is the first line of host defense against pathogens. Many immune cells, such as monocytes, macrophages, leukocytes (PMN), and NK cells, are involved in the detection and removal of microbial pathogens [42,43]. Compared to other model organisms such as mice and fruit flies, few immune-related genes have been identified in E. andrei. Our results revealed a large number of innate immune-related genes, covering almost all known innate immune pathways, including pathogen recognition, modulation, and signaling. These findings will facilitate our comprehensive understanding of the mechanisms involved in the immune response to environmental contaminants and microbiota in E. andrei.
In conclusion, we characterized the transcriptome of E. andrei and identified many SSRs and abundant specific gene families involved in the immune response. To our knowledge, this is the first report to analyze the whole transcriptome of this epigeic species of earthworm. The dataset generated in this study will provide a substantial transcriptome-level resource for the study of earthworm biology and help further our understanding of the molecular mechanisms of various pathways, including those of the immune response.
Supporting Information S1 Movie. Tertiary structure of human GlcAT-P. The human GlcAT-P structure was derived from Protein Data Bank (PDB) entry 1v83. (MPG) S2 Movie. Tertiary structure of E. andrei GlcAT-P. The tertiary structure of E. andrei GlcAT-P was predicted by homology modeling with SWISS-MODEL server. In the homology modeling, the GlcAT-P structure from human was used as a template. (MPG) S1