De Novo Assembly and Characterization of Early Embryonic Transcriptome of the Horseshoe Crab Tachypleus tridentatus

The horseshoe crab Tachypleus tridentatus is a unique marine species and a potential model for marine invertebrate. Limited genomic and transcriptional data are currently available to understand the molecular mechanisms underlying the embryonic development of T. tridentatus. Here, we reported for the first time the de novo transcriptome assembly for T. tridentatus at embryonic developmental stage using Illumina RNA-seq platform. Approximate 38 million reads were obtained and further assembled into 133,212 unigenes. Sequence homology analysis against public databases revealed that 33,796 unigenes could be annotated with gene descriptions. Of the annotated unigenes, we identified a number of key components of several conserved metazoan signaling pathways (Hedgehog, Wnt, TGF-beta and Notch pathways) and other important regulatory genes involved in embryonic development. Targeted searching of Pax family genes which play critical roles in the formation of tissue and organ during embryonic development identified a complete set of Pax family genes. Moreover, the full length T. tridentatus Pax1/9a (TtPax1/9a) and Pax1/9b (TtPax1/9b) cDNA sequences were determined based on the transcriptome, demonstrating the immediate application of our database. Using quantitative real time PCR, we analyzed the expression patterns of TtPax1/9a and TtPax1/9b in different tissues of horseshoe crab. Taking advantage of Drosophila model, we further found that TtPax1/9b, but not TtPax1/9a, can partly rescue the Drosophila homolog Poxm dysfunction-caused lethality at the larval stage. Our study provides the embryonic transcriptome of T. tridentatus which could be immediately used for gene discovery and characterization, functional genomics studies in T. tridentatus. This transcriptome database will also facilitate the investigations of molecular mechanisms underlying embryonic development of T. tridentatus and other marine arthropods as well.


Introduction
Emergence of new model organisms plays increasingly critical roles in embryogenesis and evolutionary developmental study. Until now, the most well-known model organism in arthropods is the fruit fly Drosophila melanogaster (belonging to Class Insecta), which is widely used in studies of genetics and embryogenesis [1,2]. The genome of water flea Daphnia pulex (belonging to Class Crustacea) was also published recently, which facilitates the study of cellular response to environmental challenges [3]. Nevertheless, due to lack of genomic information, understanding of developmental and molecular mechanisms of most arthropods is still lagged far behind that of the vertebrates, which consequentially hinders the appearance of new model organisms within arthropods. Fortunately, high throughput de novo transcriptome assembly has proven to be a valuable technology to obtain sequence information and expression level of large-scale target genes involved in a particular biological process without any prior knowledge of reference genome [4][5][6][7]. In fact, this technology has been applied to analyze transcriptomes from a variety of species in metazoan [8][9][10][11][12].
Horseshoe crabs (Family Limulidae, Order Xiphosura, Class Merostomata), which are an extremely ancient marine group, have emerged as a valuable laboratory animal model in developmental study of marine invertebrate for decades [13][14][15][16]. They inhabit the areas around the shallow coastal seas and breed on intertidal shores. Even though horseshoe crabs show some common features of crustaceans (crab-like shell and claws), they are more closely related to arachnids, such as spiders and scorpions [17,18]. To date, only four extant horseshoe crab species have been discovered in two distinctly separate regions of the world, viz. Tachypleus tridentatus, Tachypleus gigas, Carcinoscorpius rotundicauda (mainly found in Southeast Asia) and Limulus polyphemus (only found in western Atlantic coast of North America). T. tridentatus was once very common along the southeast coast of Mainland China. Unfortunately, its population has been declining for several decades due to excessive hunting by humans and environmental pollution. Fossil evidence revealed that the earliest horseshoe crab lived during the late Ordovician period, around 445 million years ago [19]. More strikingly, the horseshoe crab remains unchanged in overall morphology for over 200 million years, and therefore is considered as a "living fossil" [20]. Besides their importance for the evolutionary studies and preservation of ecological diversity, horseshoe crabs also serve as a multiple-use animal resource. For example, the Limulus Amebocyte Lysate test, which is widely applied in the detection and quantification of bacterial endotoxins, is based on an aqueous extract of amebocytes from horseshoe crab [21]. Moreover, it is believed that horseshoe crabs are essential for the maintenance of the ecology of estuarine and coastal communities [22].
It is reported that the horseshoe crabs take about 10-15 years to reach sexual maturity from fertilized eggs and more than ten molts occur during this period [23], which makes it difficult to record the whole growth process of horseshoe crab in natural habitat without interruption. Growth observation of horseshoe crabs from larvae to adulthood in laboratory also turned out to be a failure [24]. On the other hand, over the recent decades, the horseshoe crab has emerged as an experimental model for studying marine invertebrate embryology, structure-function relationship of the visual system and nervous system [15,25]. Morphological changes during early embryonic development of two horseshoe crab species, including T. tridentatus and L. polyphemus, have been studied [14,23]. According to Sekiguchi's classification [23], the embryonic development of horseshoe crab could be roughly divided into the 21 stages, mainly including cleavage, blastula, gastrula, appearance of germ disc, formation of appendages and finally hatch-out to the first instar "trilobite larvae". During hatch-out stage, the embryo grows remarkably and several organs further develop. For instance, the central eye becomes discernable as a brownish spot. The appendages of the prosoma are further expanded [23]. All these changes enable the horseshoe crab to survive in the new challenging circumstance without the protection of chorion.
Although the study of horseshoe crab morphological changes during embryonic development has been described, the detailed molecular mechanism underlying this process remains unknown, which is largely due to the lack of genomic information. In the present study, we employed high-throughput Illumina Solexa sequencing and gene annotation to characterize the transcriptome of T. tridentatus embryo at the hatch-out stage. We reported for the first time a comprehensive analysis of large-scale gene expression profile during T. tridentatus embryonic development, which could be immediately used for further gene discovery and functional genomics study. Our data indicated that the major signaling pathways and key regulatory factors involved in embryonic development were highly conserved between T. tridentatus and other metazoans, especially D. melanogaster. Molecular cloning and functional study of T. tridentatus Pax1/9a and Pax1/9b were also investigated in our study. Therefore, the transcriptome analysis reported here have important applications to the understanding of molecular mechanisms underlying T. tridentatus embryonic development.

Horseshoe crab maintenance and breeding
Adult T. tridentatus were maintained in a 3 m×1 m×2 m tank containing natural seawater (temperature 25°C, salinity 30 ppt) and fed with bivalves. Fertilized eggs were obtained by natural spawning and cultured in the laboratory with standard procedures [26]. The staging of embryos was according to Sekiguchi's developmental tables [23]. The embryos at Stage 21 (the hatch-out stage) were collected for high throughput transcriptome sequencing.

RNA extraction and quality determination
Total RNA of T. tridentatus embryos was isolated by TRIzol (Invitrogen, Carlsbad, CA, USA) according to the standard protocol. The RNA samples were treated with DNase I (TaKaRa, Japan) for 4 h. RNA was quantified by measuring the absorbance at 260 nm using a NanoDrop spectrophotometer (Thermo Fisher Scientific Inc., San Jose, CA, USA). The purity of RNA was assessed by the ratio of the absorbance at 260 and 280 nm. The integrity of the RNA samples was examined with an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA).
Cloning of full length T. tridentatus Pax1/9 cDNAs To obtain the full length cDNAs of Pax1/9 genes, the BLAST search of human Pax1 and Pax9 genes were performed against the T. tridentatus transcriptome database, which resulted in two sequences with high homology. The 3'and 5' ends were obtained by rapid amplification of cDNA ends (RACE) approaches using 3'-Full RACE Core Set with PrimeScript™ RTase and 5'-Full RACE Kit with TAP (TaKaRa, Japan) following the manufacturer's instructions. Primers for 3'-RACE and 5'-RACE were listed in S1 Table. The PCR products were ligated into pMD-19T vector (TaKaARa, Japan) and transformed into the competent E. coli TOP10 cells. Positive clones with the expected-size inserts were determined by colony PCR and DNA sequencing.
cDNA library preparation, Illumina sequencing and sequence assembly The cDNA library was prepared using the TruSeq TM RNA Sample Preparation Kit (Illumina, San Diego, CA, USA) according to the manufacturer's instructions. Poly(A)-containing mRNA was purified by Oligo(dT) magnetic beads from 10 μg total RNA sample and fragmented using divalent cations. The cleaved RNA fragments were used for the first strand cDNA synthesis using reverse transcriptase and random primers, followed by second strand cDNA synthesis using DNA polymerase I and RNase H. After second strain cDNA synthesis, fragments were treated with end repair, A-base tailing and adapter ligation consecutively. The sample was further treated by gel size fractionation and PCR amplification to create final cDNA library. The cDNA library was sequenced on the Illumina Cluster Station and Illumina Genome Analyzer system according to the manufacturer's instructions. The Trinity method was used for de novo assembly of Illumina reads of T. tridentatus embryos [27]. Briefly, the trinity using de Bruijn graph algorithm was run on the paired-end sequences with the fixed default k-mer size of 25, minimum contig length of 200 and paired fragment length of 500.

Functional annotation
All possible coding sequences were predicted by GetORF model of EMBOSS (http://emboss. sourceforge.net/apps/cvs/emboss/apps/getorf.html) with default parameters. The longest ORF was considered as the candidate coding sequence (CDS). The assembled unigenes were annotated based on sequence similarity by sequential BLAST searches against National Center for Biotechnology Information (NCBI) non-redundant protein database (NR) and nucleotide sequences database (NT), the Swiss-Prot protein database, the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database, the Cluster of Orthologous Groups (COG) database and the Translated EMBL Nucleotide Sequence Database (TrEMBL). The Blast2GO software was used for blasting and assigning associated gene ontology (GO) terms describing biological processes, molecular functions and cellular components.

Phylogenetic analysis
Related sequences from homo sapiens and Drosophila melanogaster were included in the phylogenetic analysis (S2 Table). The DNA binding domains of Pax family proteins was aligned manually and Mega 3 [28] was used to generate the phylogenetic trees. The neighbor joining method with 1000 bootstrap replications was used to calculate each tree.

Results and Discussion
Sequencing and transcriptome assembly In this study, 10 μg of total RNA isolated from the T. tridentatus embryos at hatch-out stage were used to prepare cDNA library for subsequent transcriptome sequencing using an Illumina HiSeq2000 sequencer. RNA was pooled from multiple T. tridentatus embryos to prepare one sample for sequencing. The summary of Illumima sequencing and annotation was shown in Table 1. Sequencing of cDNA library generated a total of 37,902,792 paired end reads. These data are available from the NCBI Short Read Archive under accession number SRR946952. Transcriptome assembly was completed using RNAseq de novo assembler Trinity (k-mer length = 25). 4,864,778 contigs were identified with the nucleotide composition of A, T, C and G being 30.98%, 29.67%, 20.22% and 19.12% respectively, which gives rise to an overall GC content of 39.34% in the whole transcriptome. The contigs were further assembled into 176,231 transcripts falling into 133,212 unigenes (> 200 bp) with a mean unigene length of 746 bp and an N50 of 568 bp ( Table 1). The length distribution of transcripts and unigenes were shown in S1 Fig. Unigene annotation was achieved by BLASTx and BLASTn searches against NR, NT, Swissprot, and TrEMBL with an e-value less than 1×10 −5 . In total, there were 33,796 unigenes showing hits in one or more databases (Table 1), among which the NR database gave rise to the most annotations (29,918 hits). However, a large proportion of the sequences did not show significant blast hit. This was probably due to the lack of characterization of genes from closely related species, as the length distribution and the coverage of the annotated and unannotated unigenes were in a similar pattern (S2 Fig). The assembled sequences generated in our study are available from the authors upon request.

Functional annotation
In order to predict the functions of these unigenes, all the sequences were analyzed according to gene ontology (GO) database and clusters of orthologous groups (COGs). 15,768 genes were successfully annotated with 95,305 GO terms and were separated into three categories: biological process, cellular component, and molecular function, which were further divided into 15, 12 and 21 functional groups, respectively (Fig 1). Among these GO terms, 49,724 unigene sequences were assigned to biological process, 27,948 to molecular function, and 17,363 to cellular component. Interestingly, the top 5 groups with most unigenes involved in the biological process were: cellular process, metabolic process, biological regulation, developmental process, and multicellular organismal process. While the cell part and the cell functional groups were dominant in the cellular component category, and the binding and the catalytic activity were dominant in the molecular function category. The integral to membrane (GO: 0016021) contained 1572 unigenes, which belonged to the cellular component category. It had the largest number of unigenes among all the groups. In addition, the ATP binding (GO: 0005524) containing 1478 unigenes and the oxidationreduction process (GO: 0055114) containing 619 unigenes were the most represented GO terms for the molecular function and biological process categories, respectively. A high percentage of unigenes involved in the following functional groups should also be noted: cellular process (GO: 0009987) with 561unigenes, protein phosphorylation (GO: 0006468) with 596 unigenes and regulation of transcription, DNA-dependent (GO: 0006355) with 556 unigenes. We were interested in the embryonic development process of T. tridentatus, and thus investigated the unigenes with GO terms containing "embryo" or "development". As a result, a total of 183 GO terms were found with unigene numbers ranging from 1 to 241. The top 20 terms with the most number of unigenes were listed in Table 2. This demonstrated that our transcriptome database contained a variety of unigenes related to the embryonic development, which provided an abundant resource for further investigation of regulatory mechanisms of T. tridentatus embryonic development. Meanwhile, COG annotation was used for phylogenetic classification of the putative proteins from the T. tridentatus transcriptome, in which a total of 6,920 proteins were assigned to 25 clusters (Fig 2). Among the COG library, the cluster of "General function prediction only" consisted of the largest number of unigenes. The other five largest categories were: "replication, recombination and repair" (14.47%), "signal transduction mechanisms" (13.42%), "transcription" (13.12%), "translation, ribosomal structure and biogenesis" (8.44%), and "posttranslational modification, protein turnover, chaperones" (8.24%), suggesting the specific responses and molecular mechanisms of T. tridentatus early development to some extent.
To investigate the biological pathways which are possibly involved in early embryonic development process of T. tridentatus, we further mapped the unigenes to the reference canonical pathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. A total of 10,999 unigenes were assigned to 298 KEGG pathways. The pathways most represented by numbers of unigene sequences were "transcription factors" (865 unigenes), "chromosome" (802 unigenes), "ubiquitin system" (798 unigenes), "protein kinases" (784 unigenes), and "pathways in cancer" (518 unigenes).  network of signaling pathways and regulatory genes plays key roles during this process [31]. Analyzing these pathways and factors involved in the early embryonic development of T. tridentatus is of great importance for understanding of the developmental mechanisms of this ancient marine creature and may provide insights into the evolutional study of conserved signaling pathways and regulatory genes. In all the annotated unigenes of T. tridentatus transcriptome, we identified a number of key components of four major signaling pathways (including Hedgehog, Wnt, transforming growth factor-β and Notch pathways) conserved in metazoan. As shown in Fig 3, 98 possible homologues were identified from 126 key genes involved in these signaling pathways, representing 77.78% coverage of the total genes. The percentage of genes found in each pathway is 78.85% (Wnt), 77.27% (TGF-β), 77.78% (Hedgehog) and 87.5% (Notch) respectively. In case of D. melanogaster-specific pathways, the percentage was much higher, about 91.57% (76/83) homologues could be found in T. tridentatus transcriptome, indicating the higher similarity of T. tridentatus with D. melanogaster than other animal models in the aspect of signaling transduction pathways. It should also be noted that the sequence length of most gene fragments (> 200 bp) is sufficient for functional studies of these genes by modern molecular technology, such as real-time PCR quantification, in situ hybridization and antibody preparation. Moreover, with the sequence information of these gene fragments, it would be more efficient and reliable to obtain the full length of desired genes comparing with common degenerate PCR method. This is particularly helpful in case of gene discovery and functional study in the "non-model" organism horseshoe crab. Regulatory genes are also key players in the network governing embryonic development [32]. D. melanogaster is one of the best understood models of embryonic development, especially pattern formation process, and thus we analyzed the regulatory genes potentially involved in the T. tridentatus axis specification and patterning by using D. melanogaster as a reference model. The genetic control of axis specification and patterning in D. melanogaster requires a cascade of gene regulation events before the onset of blastoderm stage. The major genes involved in this developmental process of D. melanogaster can grouped into four categories, viz. maternal effect genes, gap genes, pair rule genes and segment polarity genes [33,34]. This developmental process starts with the diffusion of the maternal effect genes which are responsible for setting the anterior-posterior polarity of the embryo. The gap genes, which are among the first genes transcribed in the embryo, participate in the establishment of the subdomain of body plan under the control of maternal gradients. The pair rule genes subsequently divide the embryo into periodic units, whereas the segment polarity genes activated by pair rule genes further establish the periodicity of the embryo by dividing it into 14 segment-wide units [35,36]. Key regulatory genes of each category used in our study were selected according to S.F. Gilbert's and R.M. Twyman's classification [37,38]. Among all the 53 genes listed in Table 3, we identified a total of 40 genes (75.47%) which had homologs in T. tridentatus, suggesting that functions of these genes conserved in T. tridentatus. It should be pointed out that the percentage of gene homologs identified in the maternal effect genes category was much lower (11/18, 61.11%) comparing to that of the other categories: gap genes (12/14, 85.71%), pair rule genes (7/8, 87.5%) and segment polarity genes (13/16, 81.25%). This is possibly  because the mRNA products of the maternal gene homologs have been consumed at the hatchout stage, the last stage of T. tridentatus embryonic development. In addition, we also failed to identify some well-studied gene homologs (including knirps, tailless, even-skipped, invected, fused and costal 2), which may be due to the extremely low expression of these gene homologs in our sample. Anyway, we identified the majority of the candidate regulatory genes that are potentially involved in the axis specification and patterning of T. tridentatus embryo. Further study is required to characterize the function of these putative genes.

Case study: Detailed analysis of Pax family genes
In addition to transcriptome analysis of conserved signaling pathway components and regulatory genes, we also tried to identify genes known to be important for embryogenesis based on our transcriptome data. Here, we focused specifically on the Pax family genes. Pax genes, which are grouped into 4 subfamilies (Pax1/9, Pax2/5/8, Pax3/7 and Pax4/6), encode a group of transcription factors that have been conserved through millions of years of evolution and play roles in early development [39][40][41]. According to the homology within the highly conserved paired domain, the putative horseshoe crab Pax genes were identified. A complete set of Pax family gene homologs was found in T. tridentatus. Phylogenetic tree was constructed by aligning the conserved paired domain with neighbor joining method using Pseudomonas transposase as outgroup (Fig 4). Genes belonging to Pax4/6 subfamily branched out at the base of the tree. Genes from Pax4/6 subfamily were divided into two clades. One included Pax4/6 genes from Homo sapiens (HsPax4 and HsPax6), D. melanogaster (Dmtoy and Dmey) and T. tridentatus (TtPax4/6a and TtPax4/6b). The second clade was composed of two sister groups with one containing the genes Dmeyg and Dmtoe from D. melanogaster, and the other consisting of TtPax4/6c and TtPaX4/6d from T. tridentatus. Interestingly, all the Dmeyg, Dmtoe, TtPax4/6a and TtPaX4/6b had a truncated paired domain. It is possible that the truncated paired domain already existed in the common ancestor of horseshoe crab and Drosophila. After split into two lineages, specific gene duplications occurred independently in horseshoe crab and Drosophila. The second clade includes Pax2/5/8 subfamily genes. All members of TtPax2/5/8 genes were gathered closely, indicating that relatively recent duplications appear to have occurred in horseshoe crabs. The third clade was composed of Pax1/9 subfamily genes. TtPax1/9a branched out at the base against all other genes in this clade, whereas TtPax1/9b gathered with the vertebrate Pax1 and Pax9 genes. The last clade was constituted of Pax3/7 subfamily. The horseshoe crab TtPax3/7a was clustered with Drosophila Dmgsb and Dmgsbn genes.
While the majority of horseshoe crab genes of the Pax family show obvious homology to their respective subfamily, the placement of TtPax3/7b in the phylogenetic tree is ambiguous. It occupied the most basal position of the tree without obvious orthologs. The homeodomain  Fig). A multiple alignment of the paired domain sequences of Pax1/9 was performed in T. tridentatus and other representative metazoan species. As shown in Fig 5, the Pax1 and Pax9 showed high similarity among all the species examined. Interestingly, although both T. tridentatus and D. melanogaster belong to the Arthropoda, TtPax1/9a and TtPax1/9b showed higher protein identity to zebrafish (Danio renio), human (Homo sapiens) and mouse (Mus musculus) which belong to Chordata, comparing to D. melanogaster. Whereas, polypeptide identities of Pax paired domain between T. tridentatus and the Coelenterata species, including Acropora millepora, Chrysaora quinquecirrha, Hydra littoralis and Nematostella vectensis, were relatively low. In summary, our data demonstrated the immediate application of the transcriptome data for gene discovery in the "non-model" organism T. tridentatus.

Tissue distribution of Pax1/9a and Pax1/9b
Expression patterns of TtPax1/9a and TtPax1/9b genes in different tissues of horseshoe crab were analyzed by quantitative RT-PCR. The highest expression level of TtPax1/9a was found in the heart (Fig 6). TtPax1/9a was also abundantly expressed in the muscle, liver and intestine. On the other hand, the expression level of TtPax1/9a in the stomach, yellow connective tissue and gill was significantly lower. Expression pattern of TtPax1/9b is similar to that of TtPax1/ 9a, albeit at much lower levels.

TtPax1/9b partly rescues Poxm dysfunction-caused larval lethality of Drosophila
Due to lack of genetic manipulation tools for the gene function study in T. tridentatus, we further explore the function of TtPax1/9a and TtPax1/9b genes by taking advantage of Drosophila model. Drosophila Pax gene Pox meso (Poxm) plays a crucial role in the early development.
Our previous study showed that the loss-of-function mutant in poxm, Poxm R361 , causes the Drosophila death [29]. According to our protein sequence alignment data, TtPax1/9a and TtPax1/9b genes are homologs of the Drosophila poxm gene. Therefore in this experiment, we investigated whether the TtPax1/9a and TtPax1/9b genes rescue Poxm R361 -caused embryonic lethality of Drosophila. The TtPax1/9a and TtPax1/9b genes were expressed in the Drosophila under the control of the 8.4 kb poxm upstream region using GAL4/UAS system [29]. The Poxm R361 mutant alone (Fig 7A) and the Poxm R361 mutant with TtPax1/9a gene expression (Fig 7B) resulted in the Drosophila death at the larval stage. Surprisingly, The Poxm R361 mutant in the presence of TtPax1/9b survived at the pupal stage (Fig 7C). These results imply that the TtPax1/9b can at least partly rescue Poxm dysfunction-caused larval lethality of Drosophila.

Conclusions
In this study, we, for the first time, performed de novo transcriptome sequencing of T. tridentatus embryos in the absence of a reference genome using Illumnia Solexa platform. Of 133,212 unigenes obtained, 33,796 were annotated by BLAST with NR, NT, Swiss-Prot, GO, COG and KEGG databases. We further identified a number of candidate genes potentially involved in embryonic development of T. tridentatus, shedding light on future study on the characterization and function of genes of interest and detailed molecular mechanisms underlying embryonic development of T. tridentatus. Moreover, the cloning and phylogenetic analysis of Pax family genes were performed, demonstrating that the transcriptome sequencing is a fast and reliable technology for high-throughput gene discovery in "non-model" organisms and for evolutionary developmental biology study as well. Rescue study further indicated that TtPax1/9b gene is functionally relative to Drosophila Poxm gene.