De novo assembly of a transcriptome from the eggs and early embryos of Astropecten aranciacus

Starfish have been instrumental in many fields of biological and ecological research. Oocytes of Astropecten aranciacus, a common species native to the Mediterranean Sea and the East Atlantic, have long been used as an experimental model to study meiotic maturation, fertilization, intracellular Ca2+ signaling, and cell cycle controls. However, investigation of the underlying molecular mechanisms has often been hampered by the overall lack of DNA or protein sequences for the species. In this study, we have assembled a transcriptome for this species from the oocytes, eggs, zygotes, and early embryos, which are known to have the highest RNA sequence complexity. Annotation of the transcriptome identified over 32,000 transcripts including the ones that encode 13 distinct cyclins and as many cyclin-dependent kinases (CDK), as well as the expected components of intracellular Ca2+ signaling toolkit. Although the mRNAs of cyclin and CDK families did not undergo significant abundance changes through the stages from oocyte to early embryo, as judged by real-time PCR, the transcript encoding Mos, a negative regulator of mitotic cell cycle, was drastically reduced during the period of rapid cleavages. Molecular phylogenetic analysis using the homologous amino acid sequences of cytochrome oxidase subunit I from A. aranciacus and 30 other starfish species indicated that Paxillosida, to which A. aranciacus belongs, is not likely to be the most basal order in Asteroidea. Taken together, the first transcriptome we assembled in this species is expected to enable us to perform comparative studies and to design gene-specific molecular tools with which to tackle long-standing biological questions.


Introduction
Starfish (class Asteroidea) represent one of the most successful life forms in the animal kingdom. After their ancestors emerged circa 500 million years ago, as many as 1,890 diverse species of starfish are thriving today on the seabed all over the world [1][2][3]. As general predators living on bivalves, microalgae and corals, starfish have profound impacts on their ecological PLOS  communities, and often inflict serious harms on commercially valuable marine resources such as mussels, clams, and crabs [4][5][6].
In addition to their significant role as a 'keystone species' in ecology, starfish have also served as an important animal model in various fields of biological research. Starfish have long been noted for their remarkable capability of regenerating their body parts [7], and their embryonic development has been well described [8]. As starfish belong to Echinodermata, which is a phylum of deuterostome along with Chordata, findings in starfish would provide insights into biology of vertebrate animals.
Starfish have been particularly useful for cell biology, as their gonads provide a population of synchronized cells that are plenty and easy to access. Starfish oocytes are electrically excitable and highly apt for intracellular recording owing to their large size (200-350 μm in diameter) and resilient cell surface [9][10][11]. Optically transparent, these cells are also suitable for microinjection of fluorescent dyes that visualize changes of organelles, cytoskeleton, chromosomes, and free calcium ion levels during meiotic maturation, egg activation, and cytokinesis [12][13][14][15][16][17][18][19][20][21][22]. Thus, starfish oocytes have served as an experimental model system highly useful in addressing fundamental questions in cell biology on a single cell level.
For example, starfish oocytes provide an invaluable opportunity to study meiotic cell cycle controls. In breeding seasons, ovaries of female starfish are full of oocytes arrested at the first prophase of meiosis, marked by the presence of the tetraploid (4n) nucleus termed 'germinal vesicle (GV).' At spawning, the oocytes are unblocked from the cell cycle arrest to resume meiotic progression in response to the maturation hormone. With 1-methyladenine (1-MA) being identified as the oocyte maturation hormone virtually in all species of starfish [23], meiotic resumption can be easily induced by adding the hormone to the oocytes in seawater. As the oocytes reenter the cell cycle in response to 1-MA, a series of cytological changes take place, which include accelerated reorganization of the actin cytoskeleton, prompt increase of the intracellular Ca 2+ level, changes of the phosphorylation state of proteins, increase of protein synthesis, relocation of the cortical granules, and shift of the membrane potential [24][25][26][27][28][29][30]. The hallmark of meiotic maturation is the germinal vesicle breakdown (GVBD), which represents dissolution of the nuclear envelope, and the consequent intermixing of nucleoplasm and cytoplasm is essential for a normal Ca 2+ response at fertilization [31]. The major down-stream effector of the hormone that plays the central role in orchestrating the G2-metaphase transition is the maturation-promoting factor (MPF), which was identified in starfish oocytes as a heterodimer of CDK1 (cdc2) and cyclin B [32]. In the somatic and germ cells of starfish, however, not much has been known about the molecular repertoire of cyclin and CDK families and their combinatorial diversity.
Meiotic progression and the following intracellular changes in the eggs are finely coordinated with fertilization events, but there are subtle differences depending on the species. Whereas sea urchin eggs are spawned and fertilized after the completion of meiosis, the natural timing of fertilization for starfish is known to be between GVBD and the extrusion of the first polar body [33,34]. This disparity is quite remarkable when considering that both starfish and sea urchin belong to the same phylum and that fertilization is such a crucial event for survival of the species. As aforementioned, meiosis in the oocyte involves a series of cytological changes, yet fertilization triggers another series of comparable but distinct physicochemical changes in the eggs that are collectively called 'egg activation' [35][36][37]. The changes following gametes fusion may also vary considerably from species to species. Virtually in all animal species, however, fertilization is accompanied by Ca 2+ fluxes and waves in the egg. However, it bears emphasis that the precise mechanisms by which the Ca 2+ signal is produced and utilized are not completely understood [38][39][40][41][42].
Although the generation of intracellular Ca 2+ waves in fertilized eggs is a universal phenomenon, the molecular components of the Ca 2+ signaling machinery may vary markedly depending on the animal species. An instant example is found in phospholipase C (PLC), the enzyme that synthesizes inositol 1,4,5-trisphosphate (InsP 3 ), the major intracellular Ca 2+ -releasing second messenger. PLC exists as multiple isozymes such as ß, ɣ, δ, and ε, which transduce different signaling cues owing to the unique combination of various functional domains [43]. In mammals, a sperm-specific isozyme PLC-z was discovered and suggested as the trigger of Ca 2+ waves in fertilized eggs [44]. However, PLC-z does not exist in the genome of sea urchin (Strongylocentrotus purpuratus), suggesting that the triggering mechanism of the Ca 2+ waves in the fertilized eggs of echinoderm has to be quite different from that of the mammalian eggs [45]. Besides InsP 3 , generation and propagation of the intracellular Ca 2+ waves in echinoderm eggs at fertilization have been recapitulated with the concerted action of additional Ca 2+ -mobilizing second messengers such as cyclic-ADP ribose (cADPr) and nicotinic acid adenine dinucleotide phosphate (NAADP) [31,[46][47][48]. Interestingly, eggs from different species of starfish respond to these Ca 2+ -mobilizing second messengers with selective sensitivity. Whereas NAADP evokes an intense Ca 2+ response in the eggs of Asterina pectinifera (also called Patiria pectinifera), it produces only a modest Ca 2+ increase in the eggs of A. aranciacus. Conversely, in the case of cADPr, the opposite is true for the two species of starfish [16,49,50]. In addition, the abrupt increases of Ca 2+ in oocytes at the onset of meiotic resumption and during GVBD are also remarkably different in the two species of starfish [22]. Furthermore, it has been demonstrated that eggs of A. aranciacus release Ca 2+ from internal stores in response to an actin drug latrunculin A, which binds to actin monomers and inhibits their polymerization to change the cytoskeletal structure. This intriguing intracellular Ca 2+ increase that resembles the Ca 2+ signals in fertilized eggs and recurs for hours has been observed neither in the GV-stage oocytes nor in the eggs of other species of starfish [51]. These evident differences in the pattern or mechanism of intracellular Ca 2+ signaling in the eggs of different animal species may arise from species-specific differences in the molecular makeup of the Ca 2+ signaling toolkits or in their layout inside the cells [52][53][54]. Despite the essential attributes apt for studying various aspects of biology, scientific research using starfish oocytes has often been hampered largely because of the lack of comprehensive DNA and protein sequence information. This is particularly true for A. aranciacus, the Mediterranean species of starfish that we and other laboratories in Europe have used as an animal model for many decades. In the NCBI database, for example, merely 158 nucleic acid sequences have been deposited for this species to date. In contrast, two other starfish species being utilized as model organisms for developmental biology have accumulated a substantial amount of sequence information for DNA and protein. A. pectinifera, a Northern Pacific species easily found in Japan, China, Russia, and Korea, has over 118,000 nucleic acid sequences in the NCBI database amassed from cDNA libraries of unfertilized eggs and the embryos at gastrula stage. On the other hand, the genome of Patiria miniata, a species living on the other side of the Pacific near Alaska and California, has been sequenced and assembled (https:// www.ncbi.nlm.nih.gov/genome/?term=Patiria+miniata). The results of ongoing annotation for P. miniata genome and the genetic sequence information of some other echinoderm species are available at the echinoderm genomic database EchinoBase (www.echinobase.org). Likewise, genomes of two starfish species living in the Indo-Pacific near Australia and New Zealand have been recently sequenced and assembled: Acanthaster planci (https://www.ncbi. nlm.nih.gov/assembly/GCA_001949145.1) and Patiriella regularis (https://www.ncbi.nlm.nih. gov/assembly/GCA_900067625.1). Hence, compared with those biologically and ecologically important species in other regions of the world, the genome or transcriptome of the starfish species being used as biological model organisms in European seas and ocean has been far less explored.
In an attempt to crack this barrier, we have performed deep sequencing of RNA from the GV-stage oocytes, mature eggs, zygotes and early embryos of A. aranciacus in this study. Oocytes contain stockpiles of mRNAs and ribosomes required for early embryonic development, and have been known as the cell type manifesting the highest RNA sequence complexity [55,56]. The species we chose, A. aranciacus, belongs to the order Paxillosida in the taxonomy of Asteroidea. This group of starfish was considered a primitive order as judged by their inability to evert the cardiac stomach and the lack of the brachiolaria stage in their larval development. However, their fossil records are recovered later in the geological time table in comparison with the other starfish species, while classifications based on DNA sequences of a few genes have placed them as an order either primitive or derived in the phylogeny [57][58][59]. Hence, acquisition of comprehensive genetic information would not only facilitate experimental approaches to the biology of starfish oocytes, but also assist in resolving the controversial issues regarding ordinal classification of starfish.

Materials and methods
Animals, RNA preparation, and RNA-seq During the breeding season, starfish (A. aranciacus) were captured in the sea near Gaeta, Italy, under the authorization of the Ministero delle Infrastrutture e dei Trasporti Capitaneria di Gaeta. The animals were not endangered or protected species, and captured in locations that were not privately-owned or protected in any way. Animals were kept at 16˚C, and all experimental procedures were in compliance with the guidelines of the European Union Directive 86/609. Oocytes at the germinal vesicle (GV) stage were isolated as was previously described [21]. To obtain mature eggs, the oocytes were exposed to 1-MA (10 μM) for 70 min in filtered seawater. The eggs were then fertilized with spermatozoa from one male animal and allowed to undergo cleavage for 4 h and 20 min. Functional integrity of the cells was monitored by light microscopy, and most fertilized eggs appeared viable and reached the 8-cell stage on the average. At each stage, i.e. GV-stage oocytes, mature eggs, zygotes (20 min post-fertilization), and early embryos (4 h and 20 min post-fertilization), aliquots of cells were taken and rinsed in filtered seawater for RNA extraction. RNA was prepared from each sample by use the RNAquous-Micro kit (Ambion) and treated with DNAse following the manufacturer's instruction. The RNA samples were quantified by spectrophotometry (NanoDrop ND1000, Thermo Scientific), and its quality was assessed on RNA LabChip (Aligent bioanalyzer 2100, Aligent Technologies). The RNA samples derived from eggs of four females fertilized with sperm of the same male displayed high quality, as their λ 260nm /λ 280nm absorption ratio did not deviate from the range of 1.95-2.05, and their RIN (RNA Integrity Number) from the range of 8.8-10 in all cases. For RNA-seq, total RNA (> 6 μg for each sample) were sent to EMBL-Gen-eCore (Heidelberg, Germany) for the procedures of cDNA synthesis, fragmentation, library construction, and massive parallel sequencing by use of the Illumina system. RNA-seq for biological quadruple samples were all processed separately except for the GV stage oocytes, which were pooled together.

Data filtering and de novo assembly of the transcriptome
Raw sequencing data were subjected to adapter trimming and quality filtering by use of the preprocessing tool Trimmomatic [60] with the following parameters: ILLUMINACLIP:illumi-na_adapters.fa:2:40:15 LEADING:3 TRAILING:3 SLIDINGWINDOW:3:20 MINLEN:25. More than 80% of the reads were retained on average. The filtered reads were normalized using the script normalize_by_kmer_coverage.pl (parameters:-seqType fq-JM 240G -max_cov 30), and assembled with the Trinity software version r2013_08_14 (parameters:-seqType fq-JM 240G -inchworm_cpu 24-bflyHeapSpaceInit 24G -bflyHeapSpaceMax 240G -bflyCalculateCPU-CPU 24-jaccard_clip-min_kmer_cov 2) [61]. The assembled contigs were further clustered by use of the software CD-HIT-EST [62]. To obtain quantitative information, we mapped the raw reads to the filtered assembled transcripts using Bowtie (parameters: -t -q -p 24-chunkmbs 10240-maxins 500 -S). To remove sequences not having enough representative reads, only those transcripts with the 'count per million (cpm)' values greater than 1 were retained for further analyses. After this series of filterings, the complete transcriptome contained 64,388 assembled sequences, which we will refer to as 'transcripts' for the sake of discussion. The completeness of the transcriptome was assessed by use of the software CEGMA [63].

Functional annotation and differential expression analyses
To annotate the transcripts, we used the Annocript pipeline [64] which runs different BLAST programs to align sequences against the UniProt database: i) BLASTP against the UniRef90 and SwissProt databases (parameters: word_size = 4 evalue = 10 −5 num_descriptions = 5 num_alignments = 5 threshold = 18), ii) RPSTBLASTN to align the transcripts against the Conserved Domains Database (CDD) (parameters: evalue = 10 −5 , num_descriptions = 20, num_alignments = 20). The results produced comprehensive lists of multiple sequences annotated on the basis of alignment models for ancient domains and full-length proteins derived from Pfam, SMART, COG, PRK, and TIGRFAM [65]. In addition, we used TBLASTN to obtain alignments against other non-coding RNAs (rRNA, tRNA, snRNA, snoRNA, miRNA) exploiting the database Rfam integrated with a customized set of NCBI ribosomal RNAs (parameters: evalue = 10 −5 , num_descriptions = 1, num_alignments = 1. To characterize the mRNA population, transcripts were further annotated using the Gene Ontology (GO) database [66], the Enzyme Commission IDs [67] and UniPathways [68]. The R package edgeR [69] was used to perform differential expression analysis of the genes for each pair of sequential stages of development, i.e. GV-stage oocytes versus mature egg, mature eggs versus zygotes, and zygotes versus early embryos. Transcripts were considered differentially expressed if the 'False Discovery Rate (FDR)' was less or equal to 0.05 and the fold change greater than two. When comparing expression levels of different transcripts in the sample, we also utilized RPKM (reads per kilobase of transcripts per million mapped reads) to compensate for the differences in transcript length, by use of the rpkm() function included in the edgeR package.

Validation of transcripts by RT-PCR and DNA sequencing
RNA was extracted from GV-stage oocytes, mature eggs, zygotes, and early embryos derived from eggs of four females fertilized with sperm of the same male, and converted to cDNA by use of SuperScript VILO MasterMix (Invitrogen) with random hexamers following the manufacturer's instruction. The RNA samples showed no sign of contamination from genomic DNA, as judged by the total elimination of RT-PCR signals by RNAse treatment prior to cDNA synthesis [70] (S1 Fig). Stocks of cDNA obtained from 2.5 μg RNA in 20 μl were diluted 40-fold and used for PCR to evaluate the primers elected from 32 different transcripts in the transcriptome (S1 Table). Initially, the target regions of the transcripts were amplified by a PCR thermal cycler EP S (Eppendorf) in 96-well plates, following the standard procedure for Taq DNA polymerase: 30 cycles at 95˚C for 1 min, 52˚C for 1 min, and 72˚C for 1 min. The resulting amplicons were resolved by electrophoresis (2% agarose gel) and purified by use of QIAquick gel extraction kit (QIAGEN). The eluted DNA (15 nM) was subjected to direct DNA sequencing (BigDye Terminator Cycle Sequencing technology, Applied Biosystems) using the same primers (4.5 μM) for PCR. For real-time qPCR, reaction mixtures were prepared in 384-well plates by use of Beckman Coulter's Biomek FX Laboratory Automation Workstation equipped with the ORCA robotic arm (Beckman Coulter, Fullerton, CA). Each reaction well contained diluted cDNA, a pair of primers (0.25 μM, each), and Fast SYBR Green Master Mix with ROX (Applied Biosystems). Reactions were run in a ViiA 7 Real-Time PCR System (Applied Biosystems). PCR for each developmental stage was performed in biological quadruplicate, and each of them in technical triplicate. The reaction cycle was repeated 40 times as follows: i) denaturation at 95˚C for 20 s, ii) annealing and elongation at 60˚C for 1 min. At the end of the PCR cycles, melting-curve analysis was performed between 95˚C and 60˚C using the machine-installed program to assess specificity of the primer pairs. The PCR data were analyzed using the ViiA 7 Software and Microsoft Excel 2010. Reference genes were selected by empirical data. Merging the transcripts annotation table and their raw counts, only transcripts with more than 5X coverage in all replicates were considered. Counts per million (CPM) were computed for all transcripts, and coefficient of variation (CV) was calculated from the formula RS/RM, where RS is the standard deviation and RM the mean of the CPM values. The transcripts annotated in SwissProt or Conserved Domain Databases with highest Query-and Hit-Coverage while scoring lowest CV were elected as reference genes: Histone deacetylase 1, Actin-related protein 2/3 complex subunit 5 (ARP 2/3), Protein SEC13 homolog (SEC 13), U5 small nuclear RNP helicase (snRNP helicase), Mediator of RNA polymerase II subunit 12-like (RNA pol II modulator), Pre-mRNA splicing factor 8 (splicing factor 5), and Elongation factor 1-alpha (EF1A). In the given experimental conditions of qPCR, the thresholds cycles (CT) for the reference genes were arrived as follows: snRNP helicase, CT = about 20; SEC 13, 21; RNA pol II modulator, 23; ARP 2/3, 24; EEA1, 18; Splicing factor 5, 19. Except for Histone deacetylase 1, each reference gene produced comparable and fair results with tightly uniform expression levels throughout the developmental stages. To minimize operational errors in overcrowded experiments, we used Splicing factor 5 as the representative reference gene, as its CT was close to the median of the CT values of all target genes that ranged from 16 to 24. Thus, the relative abundance of each target transcript in a given RNA sample was estimated by comparing the values of the threshold cycles (CT) for the target genes with that of the reference gene (splicing factor 5), which showed nearly constant CT values in all samples. The fold changes of a given transcript during the developmental transition were calculated using the 2 -ΔΔC T method [71].

Molecular phylogenetic analysis
The transcript of the mitochondrial gene encoding cytochrome c oxidase subunit I in A. aranciacus (comp182578) was analyzed with Virtual Ribosome [72] by use of the mitochondrial codons of echinoderm. After determining the coding region, homologous DNA sequences in other species of starfish were obtained from GenBank. The region commonly available in the majority of the species was defined (see Results), and the deduced amino acid sequences from various species were subjected to multiple alignment by use of MAFFT version 7 (http://mafft. cbrc.jp/alignment/server/) with default parameters. The alignment data were converted into a phylogenetic tree by the neighbor-joining algorithm [73] with the bootstrap resampling number set at 100. The resulting tree was visualized with Phylo.io 1.0 [74].

Statistical analysis
The average and variation of the data were reported as 'mean ± standard deviation (SD)' in all cases in this manuscript. Oneway ANOVA and t-tests were performed by use of Prism 3.0 (GraphPad Software), and P<0.05 was considered as statistically significant. For ANOVA results showing P<0.05, statistical significance of the difference between the comprising groups was assessed by post hoc tests.

Acquisition of the reference transcriptome
Sequencing of the cDNA from the GV-stage oocytes, eggs, zygotes, and early embryos of A. aranciacus led to collection of 4.8 x 10 8 paired-end reads made of 104 bp. After filtering with Trimmomatic [60], 372,263,524 reads were assembled using the software Trinity, which led to 68,218 contigs with a N50 value of 1,755 and the average contig length of 1,368 bp. The transcriptome was clustered by use of CD-Hit and filtered further on the basis of the criterion that the transcripts should be expressed in at least 2 samples displaying more than 1 count per million (cpm). Thus, we finally obtained 64,388 transcripts after assembly and filtering, which were defined as our 'reference transcriptome' of A. aranciacus. Then, completeness analysis of the assembly was performed exploiting CEGMA (Core Eukaryotic Gene-Mapping Approach, v2.5), which estimates the presence of the "core" genes using a representative set of 458 Core Eukaryotic Genes (CEGs) that are highly conserved among eukaryotic species. The results of CEGMA showed that 94.4% of the core genes were present within the A. aranciacus transcripts over 70% of the protein length (complete alignment), a clear indication of fairly good completeness of the transcriptome. Fastq files of the assembled reads for each sample are now available in ArrayExpress database under the accession number E-MTAB-5679 (https://www.ebi. ac.uk/arrayexpress/experiments/E-MTAB-5679).
Comparison of the A. aranciacus reference transcriptome with the existing genetic sequence databases The transcriptome was annotated with Annocript software that performs alignments against the UniProt databases (see Methods and materials). In this way, we have annotated 32,783 transcripts that exhibited at least one hit in the databases of SwissProt, UniRef proteins, Conserved Domains Database (CDD), or Rfam. The minimum and maximum lengths of transcripts were 201 and 34,228 nucleotides, respectively, and the mean of their CG contents was 41.95%. Annocript also enabled us to identify 3,133 putative long non-coding RNAs (lncRNA), whose definition was specified previously [64]. Due to the lack of the genomic sequence, these lncRNAs were not characterized further in this study. Among the annotated transcripts, a great majority encoded proteins, as a total of 32,406 transcripts corresponded to at least one protein in the UniRef90 database with the E-value lower than 5e10 -5 , and to 12,018 proteins in the SwissProt database. The transcriptome matched 4,891 protein domains in the CDD database, and 16 non-coding RNAs in Rfam. In view of taxonomy, the identified animal species that scored the most numerous closest hits to the A. aranciacus transcriptome in the UniRef90 database [75] to date was S. purpuratus, an echinoderm which exhibited 11,221 closest hits (34.2% of the cases), exceeding by far the second and third closest species: Branchiostoma floridae (2,046 hits) and Nematostella vectensis (529 hits) ( Table 1). This reflects the fact that the animal species closer to A. aranciacus, e.g. starfish, have not been much presented in the annotated genetic sequence database of UniProt. It is also noteworthy that a great portion (37%) of the A. aranciacus transcriptome matched the so-called 'unknown' sequences in terms of the taxonomy of UniRef, for which unambiguous assignment of taxonomic origin was not possible mainly because the sequences contained more than one open reading frame.

Comparison of the stage-specific mRNA populations on the basis of expression profiles
In an attempt to examine any shift in the mRNA population from eggs to early embryos, we have executed differential expression analyses of the transcripts at the different stages and annotated them on the basis of gene ontology (GO). When the count-based expression data from replicate samples (four animals) were analyzed by edgeR, it was estimated that the mRNA population hardly underwent abundance changes during the egg to zygote transition except for the 12 down-regulated transcripts (thus, 0.0186% of total transcripts). On the other hand, during the zygote to embryo transition (4h and 20 min post-fertilization), considerable changes were observed, as 2,137 out of 64,388 transcripts (3.3% of the total) were downregulated by more than 2-fold and 1,534 transcripts were upregulated (2.3%). When classified with respect to GO, the mRNA populations at the four stages covered a wide range of classes in terms of 'biological processes' (BP), 'molecular functions' (MF), and 'cellular component' (CC) (Fig 1). Apparently, the percentage of each functional GO class in the mRNA population underwent only marginal changes throughout meiotic maturation, fertilization, and cleavages. However, certain GO classes were conspicuously frequent in the cells of a specific developmental stage. For example, the transcripts classified as 'DNA integration' or 'RNA-dependent DNA replication' in the GO domain 'biological processes' (BP) were relatively more frequent in the GV-stage oocytes (Fig 1B, histogram BP). The proteins encoded by this class of transcripts may play a role in DNA replication or in retrotransposition process, respectively [76].
In line with this, when the expression profiles were compared in terms of MF, transcripts related to 'RNA binding' and 'RNA-directed DNA polymerase activity' were more frequently presented in the GV-stage oocytes in comparison with the rest (Fig 1B, histogram MF). In contrast, no marked difference was observed among the four different stages on the basis of the classification by CC (Fig 1B, histogram CC). As the great majority of the A. aranciacus transcripts encoded metabolic enzymes subserving specific physiological needs, we have compared the RNA population of the four stages on the basis of 'pathway' annotation [68]. As summarized in Fig 2, the transcriptome covered transcripts encoding enzymes of a wide variety of metabolic pathways, whose frequency of occurrence varied more than two orders of magnitude. For the enlisted 45 pathways, the mRNA populations of the four stages did not exhibit marked difference from one another. However, the GV-stage oocytes manifested more variety in the subset of transcripts encoding the enzymes involved in the '(S)-malate-from-fumarate' pathway in comparison with the other stages ( Fig 2C). This pathway refers to biochemical reactions involved in the Krebs cycle which regenerates NADH, the cell's main reducing agent that donates electrons or produces ATP [77].

Identification of the putative calcium signaling toolkit proteins
During the course of annotating the A. aranciacus transcriptome with BLAST, we were able to identify the transcripts encoding putative enzymes and ion channels that may play a role in the Starfish (Astropecten aranciacus) transcriptome generation, propagation, or attenuation of intracellular Ca 2+ waves. These are the enzymes that synthesize Ca 2+ -releasing second messengers such as InsP 3 (phospholipase C), cyclic ADP-ribose and nicotinic acid adenine dinucleotide phosphate (ADP-ribosyl cyclase), or the ion channels and pumps that transport calcium ions across the membrane ( Table 2). Out of the 38 transcripts tabulated in this category, 25 assembled sequences scored a hit coverage values greater than 85%, and many of them reached 95-100%. Even for the transcripts displaying moderate hit coverage values, the conserved regions were distributed throughout their entire length. For example, ryanodine receptor of A. aranciacus (5,161 AA) was most similar to that of sea urchin (Hemicentrotus pulcherrimus, 5,317 AA) [78] although the blast resulted in a coverage of 70%. Nonetheless, the global alignment between the two amino acid sequences using Clustal Omega indicated that the sequence similarity was maintained from the very start to the end with occasional short gaps. Hence, the great majority of the assembled sequences seemed to contain the entire coding region of the proteins of our interest.

Identification of cyclins and cyclin-dependent kinases
The cells of four developmental stages used in this study are of special interest for the study of cell cycle control. Starfish oocytes are arrested at the first prophase of meiotic cell cycle, whereas mature eggs have been just released from this block. Zygotes and early embryonic cells are to undergo rapid cycles of mitosis. As the key factors regulating the progression of cell cycle are cyclins and cyclin-dependent kinases, the cells of these stages are likely to be enriched with the mRNAs encoding these proteins. Indeed, annotation of the transcriptome led to identification of at least 13 distinct cyclins and as many cyclin-dependent kinases ( Table 3). The Starfish (Astropecten aranciacus) transcriptome tertiary structure of cyclin is characterized by the presence of two compact clusters of α-helices, so-called cyclin boxes, and the absence of ß sheets [79]. As expected, all putative cyclins identified in Table 3 contained the hallmark cyclin boxes and no ß sheet, as judged by Phyre2 analysis [80] (see https://figshare.com/s/de9aa8df2b0ef0e62dc1). It is noteworthy, however, that the putative cyclin K (comp200893) contained not only the two cyclin box folds near the N-terminus, but also an unusually long stretch of polypeptide chain that bears no sequence homology to the known protein domains in GenBank. Nonetheless, it was not unprecedented. Its homologous protein in S. purpuratus (XP_795740.3) was also intriguingly long for cyclin (816 AA) and had similar secondary structures. The unique polypeptide chains at the C-terminal side of these two putative cyclin K molecules were not similar to each other in their amino acid sequences. As to the cyclin-dependent kinase (CDK) family, all the putative enzymes enlisted in Table 3, except CDK-14, commonly contained all the characteristic domains such as cyclin interface, active site, ATP-binding sites, polypeptide substrate-binding sites, and an activation loop.

Quantitation of the transcripts encoding the proteins of cyclin and CDK families
As a means to assess quantitative validity of the transcriptome and to gain insights into the relevance of each member of cyclin and CDK families, RNA samples from the cells of four stages were subjected to real-time RT-PCR analyses by use of corresponding gene-specific primers (S1 Table), and the results were compared with the CPM (counts per million) values of each member of the cyclin and CDK families. Trinity assembly and edgeR analyses predicted that the transcripts of 13 distinct cyclin family members would greatly vary in their abundance. For example, it was estimated that cyclin B and A would be highly abundant in mature eggs, scoring 1.8x10 4 CPM (thus, 1.8% of the total library) and 6.1x10 3 CPM (0.6%), respectively. They were followed by cyclin B3, E, and D2, while cyclins L1, G2, H and F represented the rarest Starfish (Astropecten aranciacus) transcriptome members of the cyclin family ( Fig 3A, green bars). Thus, the CPM of cyclin members varied by four orders of magnitude. In large part, the CPM profile of cyclins is in agreement with the results of qPCR. When relative expression was presented as a ratio of each cyclin to the internal control transcript (splicing factor 5), cyclin A and B comprised the most abundant cyclins, followed by cyclins D2, E, and B3. Again, cyclin F was the least abundant, and cyclin H and L1 joined cyclin F as the rarest cyclin transcripts. Thus, the relative expression levels among cyclins varied at least by three orders of magnitude in qPCR (Fig 3B). However, there were also subtle disagreements between the two assessments. For example, cyclin B was three times as abundant as cyclin A in CPM, but the results of qPCR showed that cyclin A was more abundant than cyclin B by 60-85% throughout the stages, although the difference was statistically significant only in the zygotes (77%, P<0.05) and early embryos (85%, P<0.01). Likewise, when cyclins were ranked in the order of descending abundance in the two assessments, it appeared that cyclin D2 and G2 were slightly underestimated in CPM, while cyclin J was overestimated. These discrepancies might be attributable to the fact that the transcripts being compared varied in length, which was not formally reflected in CPM. Indeed, when transcript  Starfish (Astropecten aranciacus) transcriptome abundance was assessed in RPKM (Fig 3A, brown bars), cyclin C and G2 rose up by two steps and one, respectively, to assimilate to the qPCR data, but the rank order of more abundant transcripts (i.e. cyclins B to I) remained unchanged in comparison with CPM. The moderate discrepancy between the transcriptome and qPCR data cannot be ascribed to the potential differences in the primer efficiency among different genes, either, as the estimated primer efficiencies for all cyclins were virtually invariable (1.98 ± 0.08, n = 13) and identical to that of the internal control (1.95) [81]. It is also noteworthy that none of the cyclin transcripts underwent significant alterations of their relative abundance during meiotic progression, egg activation and cleavages (Fig 3B), suggesting that the noted periodic appearance of cyclins in starfish eggs and early embryos [82,83] during this period is likely to be regulated at the translational and post-translational levels. Unlike the cyclin family, transcripts of the CDK family were not much diverse in their abundance. For example, the difference between the most frequently encountered one (CDK1, 149 CPM) and the rarest (CDK10, 5 CPM) were merely 30-fold in mature eggs (Fig 4A, green bars). Reflecting nearly uniform distribution of their abundance, which was much lower than that of cyclin transcripts, the rank orders of CDK transcripts were often reversed in the qPCR analysis ( Fig 4B). In particular, the results of qPCR indicated that CDK5 transcript was among the most abundant members of the CDK family together with CDK2 and CDK12 throughout these stages (Fig 4B). On the other hand, the transcript CDK1, encoding a component of MPF, was merely close to the median in qPCR and significantly lower than CDK5 in all these stages (P<0.05), although CDK1 was estimated to be the most abundant in the CPM analysis, and the second highest in RPKM (Fig 4A, brown bars). Nonetheless, in agreement with the transcriptome data (Fig 4A), transcripts encoding each member of the CDK family exhibited only a modest variability in qPCR analysis, as the most (CDK5) and the least (CDK7) abundant ones differed hardly beyond one order of magnitude (Fig 4B). The apparent increase (by 70%) of CDK16 during the zygote to early emryo transition in qPCR (Fig 4B) was not statistically significant (P = 0.3684), although the transcriptome data in RPKM predicted a 3-fold increase (P<0.001). Taken together, the results indicated that the assembled transcriptome was quantitative in general terms, but demonstrated compromised accuracy as judged by qPCR. The qPCR data corroborated the quantitative difference between two transcripts only when their abundance varied by 2-3 orders of magnitude. Thus, despite the general agreement between transcriptome and qPCR, it could be said that the transcriptome may not reliably predict more subtle differences in transcript abundance within one order of magnitude.

Validation through selected genes
For further quantitative assessment of the mRNA populations, we have selected five genes that manifested significant CPM changes during the course of egg activation and cleavages. As shown in Fig 5A, the transcript of histone H2A was over 10 times as numerous as the transcript of the control gene (splicing factor 5) in the count-based data of the transcriptome. Interestingly, the transcript count of H2A significantly increased (P<0.05) after fertilization (Fig 5A, H2A, asterisk), which was reminiscent of the transcriptional activation previously reported in the zygotes of sea urchin [84]. However, validation with qPCR agreed only to that H2A mRNA is over 10 times as abundant as the same control gene, but its modest increase during egg activation was not demonstrable by qPCR (Fig 5B, H2A). Similarly, the progressive decrease of the transcript encoding Y-Box transcription factor was also observed in qPCR, but the changes were not statistically significant. On the other hand, the quantitative data from Trinity assembly and qPCR did not match each other in the case of catenin ß and 40S ribosomal protein S2 (Ribo S2). Nonetheless, we were able to verify that the negative  regulator of mitotic cell cycle, Mos, is significantly down-regulated as the zygote proceeds to cleavages (Fig 5B, Mos), which also makes sense physiologically. Hence, taken together with the qPCR data of cyclins and CDKs, we came to the conclusion that the assembled transcriptome was quantitative to a certain extent, but not precise enough to predict modest up-or down-regulation of individual genes with reliable accuracy. For this reason, we decided not to pursue further in silico quantitation of individual genes by use of the paradigms analyzing 'enrichment'.

Molecular phylogenetic analysis
We noted that the 8 most abundant transcripts in our transcriptome are all derived from the mitochondrial genome and encode enzymes regulating cell redox and bioenergetic reactions, e.g. cytochrome c oxidase subunit 1 (COI), 2 and 3; cytochrome b; NADH-ubiquinone oxidoreductase chain 1,4, and 5; ATP synthase subunit a. The DNA sequence of COI gene has been particularly useful for phylogenetic analyses [85][86][87][88]. To illustrate the utility of the transcriptome, we took this opportunity to compare COI, which was the most abundant mRNA in the transcriptome, with that of other starfish species, and built a phylogenetic tree on the basis of the strictly homologous region (Fig 6). When translated with the mitochondrial codons specific for echinoderms, a transcript from A. aranciacus (comp182578) encoded a protein whose deduced amino acid sequence (514 amino acid residues) matched the orthologues in most other starfish species with 99-90% identity. Unfortunately, the DNA sequence data for COI in starfish are scanty, and only partial sequences of the homologous gene (often not overlapping) were available for many species. Thus, we restricted our analysis to the region encoding the homologous peptides of uniform length near the N-terminus that are commonly available in most starfish species (211 amino acid residues ending with invariable sites, QHL). The collected amino acid sequences covered 7 orders of Asteroidea: Brisingida (2 species), Forcipulatida (6), Notomyotida (3), Paxillosida (6), Spinulosida (1), Valvatida (11) and Velatida (2). No species in the order Peripoda provided the homologous sequence in the corresponding region, and it was not included in our analysis. As referring points, instead, we included homologs from three outgroup species belonging to different phyla or class, i.e. Ciona intestinalis (phylum Chordata), Octopus bimaculoides (phylum Mollusca) and S. purpuratus (phylum Echinodermata, class Echinoidea), which shared with A. aranciacus about 67.5, 80.5 and 91.5% of sequence similarity at the amino acid level, respectively, within this homologous region. As expected, the gene region-specific phylogenetic tree constructed after multiple sequence alignment showed the two remote species Octopus bimaculoides and Ciona intestinalis first separated out with long branches (Fig 6). Then, at the next node, sea urchin S. purpuratus diverged from all the starfish species (N1, Fig 6). After that, two Velatida species Diplopteraster multipes and Pteraster tesselatus branched off from the rest of starfish (N2, Fig 6). At the following node N3, a Valvatida species Patiriella vivipara branched off from the rest. Up to this point, the phylogenetic relationships are supported by high or moderate confidence values of the bootstrap analysis (100 for N1, 66 for N2, and 83 for N3), but the following nodes exhibited relatively weak bootstrap values (<50) in the majority of cases, making it difficult to draw definitive conclusions on phylogenetic relationships. Nonetheless, they are not randomly mixed across the taxonomic boundaries, but often formed exclusive clades for the orders. For example, all the examined species of Forcipulatida comprised a monophyletic group (Fig 6). Although cautions must be taken in interpreting the contructed phylogenetic tree, the results of our analysis suggested that Velatida, and not Paxilosida, might be the most basal order of starfish.

Discussion
A. aranciacus is a common starfish native to the Mediterranean Sea and the East Atlantic. Being encountered relatively easily, its systematic classification dates back to C. Linnaeus in 1758, and it has also been used as an excellent experimental animal model for biological researches ever since. To facilitate future studies of ours and others on reproduction, cell cycle, and intracellular Ca 2+ signaling using starfish oocytes, eggs, and early embryos, we have assembled its first transcriptome, focusing on this relatively short developmental time frame. Our major goal in this study was to retrieve sequence data of the mRNAs encoding the key proteins playing their roles in the aforementioned biological processes of our interest. This approach was encouraged by the fact that oocytes and early embryos comprise mRNA populations of Starfish (Astropecten aranciacus) transcriptome the richest variety [55]. The transcriptome assembled by RNA-seq methodology provided a collection of sequence data comprising 9.8x10 7 nucleotides (= 68,218 contigs x 1,368 nucleotides on the average). Considering the species difference and some redundancy of the contigs, this number compares fairly well with the early estimation of the sequence complexity (the length of single copy nucleic acid sequence represented in the RNA population) of the sea urchin eggs estimated by the hybridization kinetics: 3.7x10 7 nucleotides [55]. Annotating the transcriptome, we were able to identify 50.9% of these transcripts in the databases of SwissProt, UniRef, CCD, and Rfam. Most of them (98.9%) manifested E-values lower than 5e -5 , an encouraging outcome on the identity of the annotated transcripts. The assembled transcriptome also exhibited acceptable results in the completeness analysis (CEGMA), as was reflected on the high coverage values ranging 85-100% in most transcripts encoding the putative enzymes and ion channels involved in intracellular Ca 2+ signaling ( Table 2) or the families of cyclin and CDKs ( Table 3). The nucleic acid sequences of the transcripts encoding the key components of the specific cell signaling pathways are expected to enable us and others to develop molecular probes for the starfish cells and to perform comparative studies with other animal species. A comprehensive data set as such is only an example of the utility of our transcriptome.
In the list of the transcripts encoding Ca 2+ signaling toolkit proteins (Table 2), we have identified 5 subtypes of the enzyme synthesizing InsP 3 (PLC). The z isoform of PLC, which is present in mammalian sperm [43] as well as in the genomes of birds and bony fishes, was not found in the A. aranciacus transcriptome. To the best of our knowledge, PLC-z gene is also absent in the genomes of S. purpuratus and other echinoderm species mentioned in Introduction. Thus, its absence in the A. aranciacus transcriptome is not likely to be due to the negligible contribution of sperm's RNA, if any, to the transcriptome of the zygote. Similar to sea urchin eggs [89], starfish eggs and early embryos contained transcripts encoding ion channels, pumps, and enzymes that cover nearly all known aspects of intracellular Ca 2+ signaling: i) three ADP-ribosyl cyclases (cADPr and NAADP synthesis), ii) one InsP 3 receptor; iii) one ryanodine receptor; iv) 3 two-pore channels (putative NAADP receptor); v) one SERCA, one PMCA, three Na + /Ca 2+ exchangers, and three Na + /K + /Ca 2+ exchangers (Ca 2+ removal); vi) 7 voltage-gated Ca 2+ channel subunits; vii) one Ca 2+ channel flower homologue; viii) 6 TRP (transient receptor potential) channels. In addition, the transcriptome contained STIM-1 and orai-2, the putative components of a store-operated calcium channel. It would be of interest if the transcripts are also translated to functional proteins in the eggs and early oocytes, for it will exemplify operation of Ca 2+ release-activated Ca 2+ currents in excitable cells [90].
Cyclins are periodically synthesized and degraded at certain phases of cell cycle and thereby regulate enzyme activities of cyclin-dependent kinases that choreograph many aspects of mitosis and meiosis. For starfish, nucleic acid sequence of cyclin mRNA has been known only for cyclin A, B, B3, and E, but our study with A. aranciacus has revealed sequences of at least 9 other distinct cyclins being expressed in the eggs and early embryos of starfish (Table 3). Likewise, our study retrieved nucleic acid sequences of transcripts encoding 13 different putative CDKs. In theory, this might create enormously diverse cyclin/CDK pairings, but the functional combination of cyclin and CDK that regulate cell cycle is known to be restricted to a few specific pairings [91,92]. Thus, certain cyclin/CDK pairs carry out specialized roles in cell cycle control, while other members of cyclin and CDK families may not be directly involved in it [93]. Notably, all the CDKs of A. aranciacus being expressed by the early cleavage stage (Table 3) have their homologs in S. purpuratus, except for CDK6. Instead of it, S. purpuratus has CDK4, which was not present in the list of A. aranciacus. However, CDK4 is similar to CDK6 in that both molecules specifically pair with cyclin D, and that both CDKs may not be needed for cell cycle progression [93][94][95]. As cyclin D/CDK4 sustains normal development of sea urchin embryos beyond blastula stage [96], it would be interesting to see if CDK6 plays a similar role in the embryos of A. aranciacus or additional CDKs are expressed at a later developmental stage that was not addressed in our study.
Quantification of the relative gene expression levels by real-time qPCR confirmed that cyclin A and B are the two most abundant members of cyclin family in the oocytes, eggs, zygotes, and early embryos of A. aranciacus (Fig 3B). Both cyclin A and B can bind to CDK1 to regulate its enzyme activity [97], and studies in starfish have shown that cyclin B/CDK1 is almost exclusively accountable for the MPF activity during the meiotic progression of the oocytes (maturation). Instead, cyclin A/CDK1 activity is negligible during meiotic maturation, but becomes predominant at mitosis during early cleavages [83]. It is because cyclin A protein is nearly undetectable until the first cleavage, while cyclin B is newly synthesized in the maturing oocytes and degraded at first meiotic anaphase [83,98]. Hence, our finding that mRNA levels of the cyclin and CDK families remain unchanged during the period of meiotic cycle and the first few cleavages (Fig 3B) indicate that oscillation of cyclins is not regulated at the transcriptional level.
In this context, it should be remembered that the mRNAs stockpiled in the eggs and early embryos are mostly metabolic provisions for future use during development, and do not necessarily reflect the current composition of the protein population of the cells. Translation activity remains at the basal level in starfish oocytes, but the overall activity of protein synthesis is evidently enhanced by the time of GVBD when expression of some selected proteins such as cyclin B is overtly enhanced [24,56,98,99]. After fertilization, the rate of de novo protein synthesis is accelerated with a considerable time lag for most animal species [24,[100][101][102][103][104][105], as mRNAs get unmasked, polyadenylated, or engaged with active polysomes [106][107][108]. Other post-transcriptional controls such as polyadenylation or partial degradation of the transcripts during meiotic maturation and fertilization may also contribute to translatability of the mRNAs [109,110]. Hence, quantitative characterization of the transcriptome alone does not provide sufficient information on biochemical activities of the eggs and early embryos. In this regard, qualitative and spatial controls of mRNA should also be carefully examined, and be complemented by other approaches such as proteomics in future studies [89,111].
During the periods of meiotic maturation and fertilization, the maternal mRNA population might change as a result of RNA synthesis and degradation. In sea urchin, de novo transcription seems to take place in the pronucleus of mature eggs, and presumably also in the male pronuclei of the zygotes [112,113]. In the clam oocytes and early embryos, on the other hand, the competent mRNA population was virtually unchanged during this transition period, as judged by the results of the cell-free translation system [103]. In our study, despite the great diversity of the mRNA population in the oocytes, eggs and early embryos, the profiles of the mRNA population apparently did not undergo a marked shift except for the frequency changes of the transcripts belonging to few GO classes (Fig 1) or a metabolic pathway (Fig 2). Nonetheless, the analyses of the assembled transcriptome predicted that a number of transcripts may undergo considerable changes in their abundance during the period of maternal to zygote transition. However, validation of the transcriptome data with real-time qPCR indicated that the estimations by the two methods agreed with each other only on a large scale, and that the differences smaller than 10-fold were often difficult to resolve (Fig 5). This discrepancy may stem from the facts that: i) digital gene expression levels by RNA-seq displays wider dynamic range than qPCR does [114], ii) the RNA samples for RNA-seq and qPCR were taken from different breeding seasons for a more rigorous verification (biological replicates rather than technical replicates), iii) sequencing depth may have not been sufficient to minimize the variability of in silico analysis [115], or iv) efficiency of reverse transcription might have been substantially variable among transcripts of different genes. The possibilities of variable contributions from genomic DNA contamination or from the batch effect of assembled sequences were all ruled out by the control experiments using RNAse A (S1 Fig) and cluster analysis (S2 Fig), respectively. At any rate, we have confirmed that the level of the transcript encoding a serine/threonine-protein kinase, Mos, is significantly reduced after a few rounds of cleavage (Fig 5). In view of the fact that Mos is a negative regulator of cell cycle [116,117], its significant reduction in transcript number during the rapid rounds of mitosis makes a perfect sense.
The phylogeny of starfish has been highly controversial, especially for the placement of Paxillosida to which A. aranciacus belongs [59]. The ordinal classification of starfish has mainly relied on anatomical and embryological characters (e.g. morphology of pedicellarie and spines, eversion of cardiac stomach, and developmental passage through brachiolaria stage, etc.) because DNA sequence information is largely lacking for most species [57,118]. The genome of mitochondrial DNA has been recognized as a useful source of phylogenetic analysis for the cases in which phylogenetic distance is relatively small [119]. Here, we considered 6 Paxillosidan species falling into three families (i.e. Luidiidae, Astropectinidae, and Pseudarchasteridae) along with 25 other starfish species. Although the length of homologous region is relatively short and the taxa sampling is sparse due to the limited availability of DNA sequences, our analysis with the deduced amino acid sequences revealed approximate interrelationship of orders. Whereas previous molecular trees often suggested Paxillosida as the most basal order in the phylogeny of starfish [120,121], our results rather suggested that two Velatidan species were basal to all the rest starfish species (Fig 6). This discrepancy may in part arise from the fact that the aforementioned studies did not include the two Velatidan species in their analyses. Our data placing Paxillosida as a derived order are in line with more recent studies utilizing 13 protein-coding genes in the mitochondrial genome [122].
In summary, we have described the utility of the first transcriptome that was assembled from A. aranciacus. Despite the inevitable limitations stemming from the lack of genomic sequences and the relatively short developmental time frame, the annotated transcriptome is likely to provide useful information in designing molecular approaches to a number of biological research topics that utilize starfish. early embryos (EE), were pooled together and converted to cDNA. As a negative control, aliquots of RNA were pretreated with DNAse-free RNAse prior to cDNA synthesis, as described previously [70]. While cyclin A primers (S1 Table)  reads) read counts were clustered using the cmdscale function in R. It is evident that the biological replicates of each stage were converged in a restricted area, but were appreciably separated from the converged replicates of other stages. Reflecting the little changes of gene expression between Mature eggs (Mat) and Zygotes (Zyg, fertilized eggs), the converged area defined by M1-M4 was relatively close to that of F1-F4. The data from the GV-stage were not included in this cluster analysis because they were processed as pooled samples. (TIF)