Transcriptome Analysis of the Portunus trituberculatus: De Novo Assembly, Growth-Related Gene Identification and Marker Discovery

Background The swimming crab, Portunus trituberculatus, is an important farmed species in China, has been attracting extensive studies, which require more and more genome background knowledge. To date, the sequencing of its whole genome is unavailable and transcriptomic information is also scarce for this species. In the present study, we performed de novo transcriptome sequencing to produce a comprehensive transcript dataset for major tissues of Portunus trituberculatus by the Illumina paired-end sequencing technology. Results Total RNA was isolated from eyestalk, gill, heart, hepatopancreas and muscle. Equal quantities of RNA from each tissue were pooled to construct a cDNA library. Using the Illumina paired-end sequencing technology, we generated a total of 120,137 transcripts with an average length of 1037 bp. Further assembly analysis showed that all contigs contributed to 87,100 unigenes, of these, 16,029 unigenes (18.40% of the total) can be matched in the GenBank non-redundant database. Potential genes and their functions were predicted by GO, KEGG pathway mapping and COG analysis. Based on our sequence analysis and published literature, many putative genes with fundamental roles in growth and muscle development, including actin, myosin, tropomyosin, troponin and other potentially important candidate genes were identified for the first time in this specie. Furthermore, 22,673 SSRs and 66,191 high-confidence SNPs were identified in this EST dataset. Conclusion The transcriptome provides an invaluable new data for a functional genomics resource and future biological research in Portunus trituberculatus. The data will also instruct future functional studies to manipulate or select for genes influencing growth that should find practical applications in aquaculture breeding programs. The molecular markers identified in this study will provide a material basis for future genetic linkage and quantitative trait loci analyses, and will be essential for accelerating aquaculture breeding programs with this species.


Introduction
Portunus trituberculatus (Crustacea: Decapoda: Brachyura), commonly known as the swimming crab, is widely distributed in the coastal waters of Korea, Japan, China, and southeast Asia [1]. This species inhabits estuaries and coastal waters, which belong to typical euryhaline crab species. In China, it is a major edible crab species and one of the most important fishery resources [2] and the production has now reached 92,907 tons2011 [3]. At present, Commercial crab farming largely depended on wild seed stock and the commercial characteristics (growth rate, flesh quality and disease resistance) of the cultured stocks have also declined after many years of culturing [4], and wild populations of Portunus trituberculatus have dramatically declined for the last decades due to over-exploitation and the deterioration of environmental conditions in China [5]. To improve the germplasm of swimming crab, a selective breeding for fast growing Portunus trituberculatus has been carried out since 2004 at the Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences (Qingdao, China). The selected fast-growing population of the crab had been examined and approved by China National Aquaculture Variety Approval Committee as a new variety for aquaculture and named ''HuangXuan No. 1'' (Authorization number: GS-01-002-2012) in 2012, of which growth rate increased by 20.12% compared to wild seed stock.
However, the molecular mechanisms involved in growth are poorly understood. Due to a lack of genetic and genomic information, growth-determining genes have not been identified in the crab, even genes related to growth have rarely been reported. Therefore, more genome-wide or transcriptome-wide datasets should be generated as a basis for functional genomics approaches aimed at improving the aquaculture performance of this species. Despite the aquacultural and biological importance of Portunus trituberculatus, previous studies have mostly focused on the isolation of microsatellites [6][7][8], the investigation of genetic diversity [9], Sanger-based sequencing of expressed sequence tags (ESTs) [10,11], the characterization of single functional genes [12][13][14] and sequencing of the mitochondrial genome of Portunus trituberculatus [15][16][17].
When no genome sequence is available, transcriptome sequencing is an effective way to obtain large numbers of molecular makers and identify transcripts involved in specific biological processes [18]. Massively parallel sequencing of RNA (RNA-Seq) has offered the opportunity to characterize the transcriptome with unprecedented sensitivity and depth. It has already revolutionized the way we study the transcriptome. The latest paired-end sequencing of RNA-Seq techniques have further improved the efficiency of DNA sequencing and expanded short read lengths, permitting a deeper understanding of the transcriptome [19]. Because it is not restricted by the unavailability of a genome reference sequence, this approach has been applied in decoding the genomes of several non-model organisms, providing valuable information in the understanding of gene function, cell responses and evolution [20][21][22]. Significant progress has aslo been made in understanding the transcript of various marine crustacea by RNAseq over the last two years, such as Litopenaeus vannamei, Fenneropenaeus chinensis, Eriocheir sinensis and Macrobrachium nipponense [23][24][25][26][27], which is essential to better understand a species' biology and to devise strategies to improve productivity in culture. However, such investigations in Portunus trituberculatus have not been reported.
In this study, we present the first Portunus trituberculatus transcriptome using massively parallel mRNA sequencing. We perform Illumina sequencing of the eyestalk, gill, heart, hepatopancreas, and muscle tissues to characterize the Portunus trituberculatus transcriptome. The transcriptome provides an invaluable new data for a functional genomics resource and future biological research in Portunus trituberculatus. According to our sequence analysis, many genes involved in growth were identified. In addition, a variety of markers potentially useful for genomic population studies including simple sequence repeats (SSRs) located within coding regions and single nucleotide polymorphisms (SNPs) detected amongst deep coverage sequence regions reads are also reported.

Ethics Statement
The crabs used in the present study were marine-cultured animals, and all of the experiments were conducted according to the regulations of the local and central government.

Sample Preparation
The swimming crabs, Portunus trituberculatus at 100 days age (20.62,64.19 g in body weight), were obtained from a local farm in Qingdao, China. All the samples were acclimated in the laboratory (33 ppt, 18uC) for one week before the experiment treatment. The crabs (nine males and nine females) were all anaesthetized on ice and dissected to collect samples, including the eyestalk, gill, heart, hepatopancreas, and muscle. All of the samples were immediately in RNAlater (Ambion) at 4 uC overnight and then at 220 uC until RNA extraction within 2 weeks.

RNA Isolation, cDNA Library Construction and Illumina Deep Sequencing
Total RNA was isolated from each sample by trizol (Invitrogen, CA,USA). RNA degradation and contamination was monitored on 1% agarose gels. RNA purity was checked using the NanoPhotometer spectrophotometer (IMPLEN, CA, USA). RNA concentration was measured using Qubit RNA Assay Kit in Qubit 2.0 Flurometer (Life Technologies, CA, USA). RNA integrity was assessed using the RNA Nano 6000 Assay Kit of the Bioanalyzer 2100 system (Agilent Technologies, CA, USA). A total amount of 5 ug RNA per sample was used as input material for the RNA sample preparations and all samples had RIN values above 8. Then, all samples were pooled in equal amounts to generate one mixed sample. The pooling samples were then used to prepare one separate Illumina sequencing libraries.
cDNA libraries were generated using Illumina TruSeq RNA Sample Preparation Kit (Illumia, San Diego, USA) following manufacturer's recommendations. Briefly, mRNA was purified from total RNA using poly-T oligo-attached magnetic beads. Fragmentation was carried out using divalent cations under elevated temperature in Illumina proprietary fragmentation buffer. First strand cDNA was synthesized using random oligonucleotides and SuperScript II. Second strand cDNA synthesis was subsequently performed using DNA Polymerase I and RNase H. Remaining overhangs were converted into blunt ends via exonuclease/polymerase activities and enzymes were removed. After adenylation of 39 ends of DNA fragments, Illumina PE adapter oligonucleotides were ligated to prepare for hybridization. In order to select cDNA fragments of preferentially 200 bp in length, the library fragments were purified with AMPure XP system (Beckman Coulter, Beverly, USA). DNA fragments with ligated adaptor molecules on both ends were selectively enriched using Illumina PCR Primer Cocktail in a 10 cycle PCR reaction. Products were purified (AMPure XP system) and quantified using the Agilent high sensitivity DNA assay on the Agilent Bioanalyzer 2100 system. In the final step, the library preparations were sequenced on an Illumina Hiseq 2000 platform and 100 bp paired-end reads were generated.

Availability of supporting data
The data sets of Illumina sequencing are being submitted to the NCBI Short Read Archive (SRA) database.

Bioinformatic Analysis
Quality control. Raw data (raw reads) of fastq format were firstly processed through our self-written perl scripts. In this step, clean data(clean reads) were obtained by removing reads containing adapter, reads containing ploy-N and low quality reads from raw data. At the same time, Q20, Q30, GC-content and sequence duplication level of the clean data were calculated. All the downstream analyses were based on clean data with high quality.
Transcriptome assembly. Reads were assembled using Trinity [28], followed by TIGR Gene Indices clustering tools (TGICL) [29], with default parameters. The longest assembled sequences were referred to as contigs. The reads were then mapped back to contigs with paired-end reads to detect contigs from the same transcript and the distances between these contigs. Finally, sequences were obtained that lacked Ns and could not be extended on either end [30]. Such sequences were defined as unigenes.
Transcriptome annotation. The unigenes were predicted to mapping to protein-coding sequences by GetORF of EMBOSS [31]. The predicted protein-coding sequences were compared with the NCBI non-redundant (Nr) protein database and UniProtKB database using BLASTx with E values less than 1.0610 25 (E values less than 1.0610 25 were considered as significant) [32,33]. Based on Nr annotation, we used BLAST2GO program (http:// www.BLAST2go.org/) to get GO annotation of unigenes [34]. GO functional classification for all unigenes was performed using WEGO software (http://wego.genomics.org.cn/cgi-bin/wego/ index.pl) [35]. KEGG metabolic pathway annotation and COG classification of unigenes were determined by BLASTx searching against KEGG database and COG database, respectively [36,37].
Makers detection. SSR of the transcriptome were identified using MISA (http://pgrc.ipk-gatersleben.de/misa/misa.html), and primer for each SSR was designed using Primer 3 (http:// primer3.sourceforge.net/releases.php). SNP were detected according to align clean reads to the reference transcriptome using SOAP2, then duplicated reads and multi-mapped reads were filtered from the alignment results in order to eliminate the PCR interference and ambiguous mapping. SOAPsnp was used to call SNP based on the sorted alignment results. SNPs qualified for the following standards were selected as the final SNP sets: quality score is not lower than 20, and distance between two SNPs are greater than 5.

Real-time PCR Assays
14 annotated unigenes that may relate to growth were selected to be analyzed using real-time PCR, and their specific primers were listed in Table S1. The crab were siblings generated from a single pair of broodstock. The large and small sizes of crab were selected at 100 days age, respectively, from the .90 and ,10 percentile regions of the growth distribution curve. Eyestalks were collected from nine healthy crabs of small size group (SG, 20.665.4 g in average body weight) and large size group (LG, 64.266.1 g in average body weight), respectively. All the samples were acclimated in the laboratory (33 ppt, 18uC) for one week before the experiment treatment. Total RNA was isolated according to the manufacturer's instructions of TRIZOL LS reagent (Invitrogen, Carlsbad, CA, USA). Then, RNA samples of three individuals were pooled within each group in equal amounts to generate three mixed sample, respectively. (three biological replicates of each group). RpL8 gene was selected as an internal control for qPCR analysis and the primers reference Xu's literature [38]. First strand cDNA was synthesized from 1 mg of RNA using M-MuLV reverse transcriptase (Qiagen). The qPCR reaction mixture (20 uL) consisted of 26 Power SYBR Green PCR Master mix, 0.9 M each of the forward and reverse primers, and 1 mL of template cDNA. PCR amplification was performed under the following conditions: 50uC for 2 min and 95uC for 30 s, followed by 40 cycles of 95uC for 15 s and 62uC for 1 min, and a final extension at 72uC for 5 min.

SSR validation and Polymorphism evaluation
Genomic DNA of crabs was extracted from muscle tissue using genomic DNA extraction kit (BioTeke, Beijing, China) following the protocols. Electrophoresis through a 1.5% agarose gel was used to check DNA integrity. The SSR markers were initially tested for amplification using a pool DNA sample of 10 crabs. PCR amplifications were carried out using Master-cycler gradient thermal cycler (Eppendorf) in a final volume of 10 ul. Each reaction tube contains 1.0 ml of 106PCR buffer, 0.8 ml of dNTP (2.5 mM), 0.4 ml of each primer (10 umol), and 0.5 ml of genomic DNA (20 ng/ul), 0.05 ml of rTaq DNA polymerase (5 U/ ul,Takara), 6.85 uL of ddH 2 O. The PCR reaction program was: DNA denaturation at 94uC for 5 min; 35 cycles of 94uC for 30 s, 50-60uC for 30 s, 72uC for 30 s.; and 72uC for 7 min as a final extension. The primers that were not successful for amplification or produced multiple bands were reanalyzed using the touchdown PCR method with 1uC increments. The optimized SSR primers were used to amplify DNA from 30 wild individuals of P. trituberculatus collected from Jiaozhou, Shandong province, China for polymorphism evaluation. Amplification products were resolved via 8% denaturing polyacrylamide gel, and visualized by silver-staining. A 10-bp DNA ladder (Invitrogen Inc.) was used as a reference marker for allele size determination.
The number of alleles (Na), polymorphism information content (PIC), expected and observed heterozygosities (He and Ho, respectively) were calculated with the software CERVUS 3.0 [39].

SNP validation
To validate the putative SNPs identified in transcripts, the same cDNA samples as for the transcriptome profiling (pool of eighteen wild Portunus trituberculatus) were used. Twenty transcripts containing 56 potential SNPs and sufficient flanking regions were randomly selected for primer design. PCR products were sequenced directly in both directions with forward and reverse primers using Sanger technology on the ABI3730 platform (Applied Biosystems). Sequencing chromatograms were visually analyzed with Chromas2.32 (Technelysium Pty. Ltd.), and SNPs were identified as overlapping nucleotide peaks.

Illumina Draft Reads and Sequence Assembly
In order to achieve a comprehensive Portunus trituberculatus transcriptome, total RNA was extracted from a variety of tissues, including the eyestalk, gill, heart, hepatopancreas, and muscle. Equal quantities of RNA were mixed together to construct a cDNA library and perform Illumina sequencing. This pooling strategy was widely used in some similar studies [23,24,40,41]. The schematic of Illumina deep sequencing and analysis are shown in Figure 1. The overall Illumina read pairs and clean bases for all samples are 65,846,872 and 12.86G, respectively ( Table 1). Files containing these data were deposited in the Short Read Archive of the National Center for Biotechnology Information (NCBI) with accession numbers of SRR1168416 and SRR1168417.
After assembly analysis based on all Illumina reads, we identified 120,137 transcripts. The average length of all transcripts was 1,037 bp, with the smallest sequence of 201 bp and the largest one of 33,865 bp. The sequence length distribution of transcripts is indicated in Figure 2 and Table 1. The average length of our assembled contigs was longer than that previously reported for Litopenaeus vannamei (average of 396 bp), Eriocheir sinensis (average of 385 bp) and Fenneropenaeus chinensis (average of 676 bp) [24,25,42]. Long sequences of good quality could enable us to obtain more information about genes. Therefore, this transcriptome dataset provides a useful resource for future analyses of genes related to economic traits. To the best of our knowledge, this is the first comprehensive study of the transcriptome in Portunus trituberculatus.
To assess the abundance and coverage of the transcriptome data, we matched the assembled unigenes against the known EST library on Genbank. The 13,985 ESTs downloaded from NCBI were clustered and assembled, and 2,612 assembled EST-unigenes with mean length of 783 bp were generated. Comparisons between transcriptome unigenes and EST-unigenes were performed using BLASTn algorithm. All of the EST-unigenes can be matched in the transcriptome unigenes library, whereas only 3.0% of the transcriptome unigene sequences can be found in the ESTs library. It suggests the transcriptome data provide abundant information besides the now available ESTs sequences, and will vastly expand the number of genes identified in this species.

COG, GO and KEGG Classification
The assembled unigene sequences were subjected to BLAST searching against GO, COG and KEGG databases, and the summary statistics of BLAST assignment was shown inTable 1.
COG is a database where orthologous gene products were classified. Every protein in COG is assumed to be evolved from an ancestor protein, and the whole database is built on coding proteins with complete genome as well as system evolutionrelationships of bacteria, algae and eukaryotes [50]. Phylogenetic classifications of the predicted CDSs of unigenes were analyzed by searching against COG database to predict and classify possible functions of the unigenes ( Figure S1). Possible functions of 14,263 unigenes were classified and subdivided into 26 COG categories, among which the cluster for'General function prediction only' represents the largest group (2652, 16.38% of the matched unigenes), followed by 'Signal Transduction' (2,279,14.08%) and 'Posttranslational modification, protein turnover, chaperones' (1,483, 9.16%).
The Gene Ontology (GO) project provides structured, controlled vocabularies and classifications that cover several domains of molecular and cellular biology and are freely available for community use in the annotation of genes, gene products and sequences. Many model organism databases and genome annotation groups use the GO and contribute their annotation sets to the GO resource [51]. Among 87,100 assembled unigenes, 26,732 were successfully annotated by GO assignments, belonging to one or more of the three categories: biological process, cellular component, and molecular function. Among the annotated unigenes, 18,517 are involved in various biological process categories, cellular process (13,977 unigenes; 16.05%), metabolic process (12,657; 14.53%), biological regulation (4,824; 5.54%) and regulation of biological process(4,612; 5.30%) comprised the largest proportion. Further, 15,405 unigenes are involved in cellular component categories, among which, cell part (13,174; 15.12%), cell (13,174; 15.12%), organelle (5,712; 6.56%) and macromolecular complex (3,748; 4.30%) comprised the largest proportion. In addition, 21,988 unigenes are involved in molecular function categories, the top four categories were involved in binding (14,155; 16.25%), catalytic activity (10,779; 12.38%), transporter activity (2,467; 2.83%) and structural molecule activity (1,776; 2.04%) ( Figure S2). In summary, these terms account for a large fraction of the overall assignments in Portunus trituberculatus transcriptomic dataset. Understandably, genes encoding these functions may be more conserved across different species and are thus easier to annotate in the database.
The KEGG pathway database records networks of molecular interactions in the cells and variants of them specific to particular organisms. Pathway-based analysis helps us to further learn biological functions of genes [36,52]. To systematically analyze their inner cell metabolic pathways and complicated biological behaviors, we classified the unigenes into biological pathways by mapping the annotated CDS sequences to the reference canonical pathways in the KEGG database ( Figure S3). 7,588 unigenes were consequently assigned to 31 KEGG pathways, among which 858 members assigned to 'Translation', followed by 'Signal

Genes of interest related to growth
The transcriptome of Portunus trituberculatus was primarily examined to identify a wide range of candidate genes that might be functionally associated with growth. Traditionally, such gene discovery in non-model organisms has required degenerate PCR, which is labor-intensive and prone to failure [53]. The annotated transcriptome reported here allows researchers to identify genes of interest more easily than that of using degenerate PCR, especially in crustaceans whose genome information is relatively poor. According to our sequence analysis, many genes involved in growth were identified ( Table 2) via three principal search strategies [54]: 1) associations between genes and growth reported in crustaceans, 2) growth-related genes involved with moulting, 3) muscle development and degradation genes involved in moulting.
In this study, a total of 21 categories of growth-related genes were found, amongst these, ten were found in our transcriptome data which have been identified previously to have roles in growth in crustaceans. These genes include 1. 5-Hydroxytryptamine receptor, Alpha-amylase, Cathepsin L, Cyclophilin, Fatty acidbinding protein, Fibrillarin, Glyceradehyde-3-phosphate dehydrogenase, Growth hormone and insulin-like growth factor, Myostatin and growth differentiation factor 8/11, Signal transducer and activator of transcription, Secreted protein acidic and rich in cysteine(SPARC), and Translin-associated factor X (TRAX). Previous studies have shown that SNP in 5-Hydroxytryptamine receptor [55], Cathepsin L [56], Myostatin and growth differentiation factor 8/11 genes [57] show significant associations with growth traits in crustaceans. Besides, Cyclophilin, Fibrillarin and Secreted protein acidic and rich in cysteine(SPARC) gene expression showed a negative correlation with body weight in shrimp by Pearson's correlation analysis [58].
In crustaceans, periodic shedding of the exoskeleton is one of the most important physiological processes essential for crustacean growth and postembryonic development including moulting and regeneration [59]. Although the functions of many of the hormones and genes involved in this process are still not well defined, a number of studies have indicated that moulting and reproduction in crustaceans is regulated by the eyestalk derived CHH gene family which is one of the major groups of peptide hormones produced in the XO-SG [60,61]. In addition, MIH is responsible for maintaining animals in the intermoult stage which is an important regulator of steroidogenesis in the YO [62]. In this research, a few genes which belong to crustacean hyperglycemic hormone neuropeptides (CHH) family were identified including MIH, CHH and ecdysteroids. Correlations between SNPs in the CHH and MIH gene with individual growth performance show that the CHH and MIH gene has high potential to impact body weight variation in crustaceans [63] and should, therefore, be considered as a primary gene of interest in growth studies.
Crustacean muscle growth is not continuous and is strongly influenced by the moulting cycle [59]. During the moult, muscles regenerate, and energy reserves including glycogen and lipids are accumulated in the hemolymph and the midgut for the next moult [64,65]. Overall muscle protein synthesis is very important for growth, reproduction and other metabolic activities in crustaceans. Recent studies of invertebrates have highlighted the importance of muscle specific genes and proteins in crustaceans [43]. In this research, six genes related to muscle build-up or degradation during the moulting event were identified including Actin, Myosin, Alpha skeletal muscle, Calponin/calponin transgelin, Tropomyo- sin and Muscle lim protein. Among which, both actin and myosin proteins showed a high number of transcripts. It has been reported that actins are expressed in abundance as they are critical to formation of muscle filaments [66]. Different actin isoforms have been identified in various crustaceans [67], and are likely to be involved in playing important roles in cytoskeletal structure, cell division and mobility, and muscle contraction [68,69]. As evidence for a role for actin in muscle build-up during the moult cycle in crustaceans, Cesar and Yang (2007) reported that muscle structural a-actin and cytoskeletal b-actin increased during the intermoult and premoult stages, a phase where high muscle growth occurred in the abdominal muscle of L. vannamei [70]. However, in a recent SNP association analysis study, four synonymous polymorphisms were identified in an actin fragment but SNP allele distributions were not related significantly to individual growth performance in the two studied groups of giant freshwater prawn M.rosenbergii [63], and further studies will be required to investigate. Myosins are a major component of the contractile apparatus and consist of two heavy (MHC) and four associated light chains (MLC) [71]. Previous study showed, myosin gene expression levels could provide a good molecular marker of individual growth potential in the Atlanticpink shrimp Farfantepenaeus paulensis that identified MHC as a possible growth candidate gene [72]. A high number of actin and myosin protein transcripts observed here may regulate muscle development and function in Portunus trituberculatus, and similar results have been found in Macrobrachium rosenbergii [43], however, further studies are needed to confirm these observations. The current study identified a number of putative genes that are potentially involved with growth in Portunus trituberculatus. However, further studies are needed to understand the molecular functions of these putative genes with growth performance.

Real-time RT-PCR confirmation of growth-related genes
To futher confirm the growth-related genes obtained from the transcriptome data, 14 candidate genes were selected to be analyzed using real-time PCR. Since the XO-SG complex in the crab eyestalk produces a variety of neuropeptides/neurohormones [60,61], it was selected, in this study, as the target tissue for differential gene expression analysis between large and small crab.
All of the 14 selected genes revealed significant differences in gene expression between small size group (SG) and large size group (LG) (Figure 4), which consistent with the fact that body weight is a complex trait regulated by the coordinate action of several genes. Most of gene (12) were significantly down-regulated in LG including cyclophilin A (comp59112_c0), fibrillarin (comp41046_c0) and SPARC (comp49996_c0). In P. monodon, Tangprasittipap et al. (2010) reported the index of relative cyclophilin, SPARC and fibrillarin-like expression was negatively correlated with body weight (p,0.05) [58], which were similar to our results, suggests that these genes may have some effect on individual growth performance, and warrants further study in crustacean species.
Myostatin (comp623701_c0) and Myosin heavy chain (comp31357_c1) were up-regulated in LG. In vertebrates, Myostatin (MSTN), principally controls growth of muscle cells as a negative regulator of muscle development [73], however, MSTN show positive regulation of growth in the crab. Similar result was found in P. monodon, reduced levels of MSTN transcripts resulted in a dramatic slowing of growth rate compared with control groups  [74], which suggests the genes may regulate growth positively in crustaceans. Myosins are a major component of the contractile apparatus and consist of two heavy and four associated light chains [71]. In south-western Atlantic pink shrimp Farfantepenaeus paulensis, higher MHC expression was observed in a high weight shrimp group [72], a result was similar to our research, which suggested MHC as a possible growth candidate gene in crustaceans.
The result here show that seeking the growth-related genes via high-throughput transcriptome sequencing and bioinformatics analysis is an effective way. Further studies should be carried out to elucidate the function of these genes in growth.

Putative molecular markers
SSR characterization and Polymorphism evaluation. SSRs, or microsatellites, are polymorphic loci present in genomic DNA. They consist of repeated core sequences of 2,6 base pairs in length. Among the various molecular markers, SSRs have been proven to be an efficient tool for constructing genetic linkage, performing QTL analysis and evaluating the level of genetic variation in a species because of the high variability, abundance, neutrality and co-dominance of microsatellite DNA [75]. We obtained a total 22,673 SSRs in the transcriptomic dataset. Of these, 67.03% were di-nucleotide repeats, followed by 30.06% tri-nucleotide repeats and 2.92% tetra/penta/hexa-nucleotiderepeats ( Figure 5). Generally believed that SSR of animals are mainly di-nucleotide repeats [76,77], and our findings support this conclusion. Among the di-nucleotide repeats motifs, (GT/TG)n, (GA/AG)n and (CA/AC)n were the three predominant types with frequencies of 36.46%, 27.03% and 22.80%, respectively. In the 20 types of tri-nucleotide repeats, (GGT/GTG/TGG)n, (GGA/ GAG/AGG)n, and (CCA/CAC/ACC)n were the most common types with a combined frequency of 39.84%. The results of this study is differ from other studies which indicate that microsatellite repeats types have species-specific in crustacea [26,43]. Primer for each SSR was designed using Primer 3 (http://primer3.  sourceforge.net/releases.php), and 14.5% (3,858) SSRs can be designed priemers successfully (the data do not show). To date, only a few microsatellites have been available for Portunus trituberculatus from NCBI. Thus, the development of SSRs for this species is highly desirable.
Fifteen SSRs were randomly selected for primer synthesis and validation, among which, 9 were successful in PCR amplification using genomic DNA from Portunus trituberculatus. The remaining 6pair primers failed to generate PCR products, even when the annealing temperature was reduced by 8uC. Of the 9 primers, 1 primers were monomorphic, the other 8 primers were polymorphic (Figure 6), the proportion of polymorphic primers was 53.3%. From the 8 polymorphic loci, the number of alleles per locus ranged from 2 to 8 alleles. A total of 34 alleles were identified, with an average of 4.25 alleles per locus. Across 8 loci, the polymorphic information content (PIC) ranged from 0.15 to 0.78 (Table 3), with an average of 0.58, suggesting that the developed EST-SSRs were highly polymorphic. The results obtained in this study indicated that these SSRs developed from EST in the swimming crab will be a useful tool for the genetic research such as population variation, parentage analysis, stock enhancement evaluation, and the establishment of effective conservation strategy of Portunus trituberculatus.
SNP characterization and validation. SNPs were identified from alignments of multiple sequences used for contig assembly. By excluding those that had mutation frequency of bases lower than 1%, we obtained a total of 66,191 SNPs, of which 23,734 were putative transitions (Ts) and 42,457 were putative transversions (Tv), giving a mean Ts: Tv ratio of 1:1.79 across the transcriptome of Portunus trituberculatus (Figure 7) which can help identify genes affected by selection [78]. Further analysis found the Ts: Tv ratio were species-specific in crustacean species, in Oriental River Prawn (Macrobrachium nipponense) and Giant Freshwater Prawn (Macrobrachium rosenbergii), the Ts: Tv ratios were 1.99:1.00 and 1.32:1.00, respectively. The AT/TA, AG/GA and CT/TC SNP types were the most common. In contrast, GC/CG types were the smallest SNP types because of the differences in the base structure and the number of hydrogen bonds between different bases [26]. To verify the potential SNPs, a subset of 20 transcripts containing 56 SNPs were selected randomly. A pooled cDNA sample of eighteen wild Portunus trituberculatus was amplified by PCR. Subsequently, PCR products were sequenced bidirectionally with forward or reverse primers (Table S3), among which, sequencing of four transcripts failed. Of the 43 SNPs predicted to reside in the amplified sequences, 21 (48.8%) showed polymorphisms in the sample and were validated ( Table S3). The rate of polymorphic SNPs was probably an underestimate because only eighteen individuals were used. In addition, the way of SNP validation in this study has some limitations, SNPs with smaller variation frequency were difficult to identify via overlapping nucleotide peaks. Consequently, using more Portunus trituberculatus samples, the polymorphic rate of potential EST-SNPs should be higher than that found in our validation.
SSRs and SNPs detected in this study (Table S4 and Table  S5) are likely to be highly transferable to other closely related species, as has been the case in other crustacean species [79,80]. These potential markers identified within the ESTs will be valuable for studying the evolution and molecular ecology, genome mapping, and QTL analysis of Portunus trituberculatus.

Conclusion
Here we report the first comprehensive transcript dataset of the Portunus trituberculatus, a non-model species for which little molecular knowledge currently exists. The 120,137 transcripts identified and assembled will enable gene discovery in Portunus trituberculatus, and with the significant number of putative growthrelated genes identified should facilitate genomics approaches to improving the growth performance of domesticated stocks used for aquaculture. In addition, the large number of SNPs and SSRs detected provide targets for identifying polymorphisms across Portunus trituberculatus populations useful for parentage assignment and for managing in breeding in cultured populations.