In-Depth Transcriptome Analysis of the Red Swamp Crayfish Procambarus clarkii

The red swamp crayfish Procambarus clarkii is a highly adaptable, tolerant, and fecund freshwater crayfish that inhabits a wide range of aquatic environments. It is an important crustacean model organism that is used in many research fields, including animal behavior, environmental stress and toxicity, and studies of viral infection. Despite its widespread use, knowledge of the crayfish genome is very limited and insufficient for meaningful research. This is the use of next-generation sequencing techniques to analyze the crayfish transcriptome. A total of 324.97 million raw reads of 100 base pairs were generated, and a total of 88,463 transcripts were assembled de novo using Trinity software, producing 55,278 non-redundant transcripts. Comparison of digital gene expression between four different tissues revealed differentially expressed genes, in which more overexpressed genes were found in the hepatopancreas than in other tissues, and more underexpressed genes were found in the testis and the ovary than in other tissues. Gene ontology (GO) and KEGG enrichment analysis of differentially expressed genes revealed that metabolite- and immune-related pathway genes were enriched in the hepatopancreas, and DNA replication-related pathway genes were enriched in the ovary and the testis, which is consistent with the important role of the hepatopancreas in metabolism, immunity, and the stress response, and with that of the ovary and the testis in reproduction. It was also found that 14 vitellogenin transcripts were highly expressed specifically in the hepatopancreas, and 6 transcripts were highly expressed specifically in the ovary, but no vitellogenin transcripts were highly expressed in both the hepatopancreas and the ovary. These results provide new insight into the role of vitellogenin in crustaceans. In addition, 243,764 SNP sites and 43,205 microsatellite sequences were identified in the sequencing data. We believe that our results provide an important genome resource for the crayfish.


Introduction
The red swamp crayfish Procambarus clarkii is a freshwater crayfish species that is native to parts of Mexico and the United States [1], but is also commonly found outside its natural range in Asia, Africa, Europe, and elsewhere in the Americas, where it is often considered to be an invasive pest [2]. P. clarkii was introduced to China from Japan in the 1930s [3]. Crayfish farming began in in the 18th century in Louisiana in the USA, where the species was cultivated in rice fields. Crayfish have been farmed extensively in China since the 1990s, and China is now the world's leading crayfish producer [4].
Despite great interest in this organism, knowledge of the crayfish genome is very limited, and gene discovery has been performed on a relatively small scale. Only 330 expressed sequences (EST) and 547 nucleotide sequences have been deposited in GenBank (accessed on Jul 29, 2014) for the crayfish, which is fewer than the close relative Pacifastacus leniusculus (1063 EST and 1100 nucleotide sequences) and far less fewer than other economically important crustaceans, such as the freshwater prawn Macrobrachium nipponense, the giant freshwater prawn Macrobrachium rosenbergii, the pacific white shrimp Litopenaeus vannamei, and others. Furthermore, only a very few genetic markers have been discovered for P. clarkii [3,[24][25][26].
The traditional methodology to explore expressed sequence tags (ESTs) involves construction of a cDNA library followed by Sanger sequencing, which is time-consuming and inefficient. Normally, the numbers of ESTs generated using this method is no more than ten thousand [27]. In recent years, next-generation sequencing technologies from companies such as 454 Life Sciences, Illumina, and Applied Biosystems (SOLiD sequencing) have been widely used to explore genomic information in model and non-model organisms. In comparison to traditional Sanger sequencing technology, next-generation sequencing technologies are superior in many aspects, and in general they are able to provide enormous amounts of sequence data with a greater breadth and depth of information, in shorter times and at a significantly lower cost [28][29][30]. The expressed sequences generated using next-generation sequencing technologies are often on the order of thousands or hundreds of thousands of sequences, which are ten-fold or onehundred-fold greater than the number identified by traditional technologies.
In this study, hi-seq sequencing technology was used to sequence the transcriptomes of 4 major organs in the crayfish: hepatopancreas, muscle, ovary, and testis. This data was used to generate expressed sequence data, simple sequence repeat markers, and SNP markers that represent a resource for trait mapping, as well as differential organ gene expression profiles, to better understand the functions of the studied organs in the crayfish. We believe that the data obtained from this study represent an import resource for crayfish research into gene function, molecular events associated with breeding, and other areas.

Ethics statement
This study was approved by the Animal Care and Use Committee of the Center for Applied Aquatic Genomics at Chinese Academy of Fishery Sciences.

Animal collection
P. clarkii weighing approximately 10-20 g were collected from a crayfish farm in Xuyi, Jiangsu Province, China. Collected crayfish were cultured in water tanks with adequate aeration at 20uC and a natural photoperiod, and were fed with a commercial crayfish diet once per day. Four tissue types (hepatopancreas, muscle, ovary, and testis) were collected, and each group of tissues contained samples from approximately ten crayfish. The tissue samples were frozen immediately in liquid nitrogen, and stored at 280uC.

RNA isolation and Illumina sequencing
Total RNA from various tissues was isolated using the RNeasy Plus Mini Kit (Qiagen, Valencia, CA, USA) according to the manufacturer's protocol, and treated with RNase-free DNase I (Qiagen) to remove genomic DNA. RNA integrity was evaluated by 1.5% agarose gel electrophoresis. RNA concentrations were measured and purity was determined using a NanoDrop ND-1000 spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA).
RNA-seq library preparation and sequencing was carried out by the Genomic Analysis Lab of The Institute of Genetics and

De novo assembly and transcriptome analysis
Raw reads, which were generated by the Illumina/Solexa sequencer, were first trimmed by removing adapter sequences. Low quality reads (quality scores less than 20) were trimmed and short length reads (,10 bp) were removed [45][46]. The resulting high-quality reads were used in subsequent assembly. The Crayfish transcriptome was de novo assembled using Trinity software (vision 2013.02.25) with the default parameters [47]. In brief, three steps were performed. First, data was processed by Inchworm, in which the high-quality reads were combined to form longer fragments called contigs. Second, data was processed by Chrysalis, in which sequences were obtained by connecting contigs in such a manner that they could not be extended on either end, which resulted in de Bruijn graphs. Finally, de Bruijn graphs were further treated by Butterfly to obtain transcripts.

Transcriptome annotation and gene ontology analysis
All transcripts were compared with the NCBI non-redundant (nr) protein database, GO database, COG database, and KEGG database for functional annotation using BLAST software with an e-value cutoff of 1e-5 [48]. Functional annotation was performed with gene ontology (GO) terms (www.geneontology.org) that were analyzed using Blast2go software (http://www.blast2go.com/ b2ghome) [49]. The COG and KEGG pathway annotations were performed using Blastall software against the COG and KEGG databases [48].

Differentially expressed genes
To obtain expression levels for every transcript in different tissues, cleaned reads were first mapped to all transcripts using Bowtie software [50][51], then the FPKM (fragments per kilobase of exon per million fragments mapped) value of every transcript was obtained using RSEM (RNASeq by expectation maximization, http://deweylab.biostat.wisc.edu/rsem/) software. Differentially expressed genes were identified using edgeR (empirical analysis of digital gene expression data in R, http://www.bioconductor.org/ packages/release/bioc/html/edgeR.html) software [52][53]. For this analysis step, the filtering threshold was set as an FDR (false discovery rate) ,0.5.

RT-PCR amplification of transcripts
To validate the assembly of the crayfish transcriptome, 20 selected transcripts were used for expression analysis by RT-PCR. Total RNA was prepared from the four tissues (hepatopancreas, muscle, ovary, and testis) of crayfish using Trizol reagent in accordance with the manufacturer's instructions. Total RNA was treated with RQ1 RNase-free DNase (Promega, Madison, WI, USA) to avoid genomic DNA contamination, and then reverse transcribed using M-MLV reverse transcriptase (Promega, USA) according to the manufacturer's instructions. The synthesized cDNA was used as a template for PCR. The following PCR program was used: denaturation at 94uC for 3 min, 35 amplification cycles of 94uC for 30 s, 58uC for 30 s, and 72uC for 30 s, and a final extension at 72uC for 10 min. The PCR products were determined by 1.5% agarose gel electrophoresis using DNA markers. The expression of the 18S RNA gene of Procambrus clarkii (accession number: EU920952.1) was selected as reference gene, using the primer pair Pc18S-F (59-ATCACGTCTCTGACCGCAAG-39) and Pc18S-R (59-GACACTTGAAAGATGCGGCG-39).

SNP and microsatellite sequence identification
The raw reads were exported in FASTQ format to allow them to be imported into software for SNP calling. SAMtools (http:// samtools.sourceforge.net/) and VarScan (http://varscan.sourceforge. net/) software were applied to align reads to the reference transcriptome and to detect SNPs [54][55]. For this analysis step, the filtering threshold was set as a quality score no less than 20.
Msatcommander software was used to identify microsatellites from assembled contigs, as well as for primer design [56]. The mononucleotide repeats were ignored by modifying the configuration file. The repeat thresholds for di-, tri-, tetra-, penta-, and hexa-nucleotide motifs were set as 8, 5, 5, 5, and 5, respectively. Only microsatellite sequences with flanking sequences longer than 50 bp on both sides were collected for future marker development.

Illumina sequencing from crayfish tissues
Illumina-based RNA-sequencing (RNA-Seq) was performed with samples of four tissue types from crayfish. A total of 324.97 million paired-end reads were generated with a read length of 100 bp, of which 102.46 million reads were from the hepatopancreas, 83.51 million reads were from muscle, 84.94 million reads were from the ovary, and 36.06 million reads were from the testis. All raw sequence data were deposited in the NCBI Sequence Read Archive (SRA) under accession code SRP044128. After trimming of low quality reads and short reads, a total of 306.73 million highquality sequences (94.73%) were obtained (Table S1 in File S1), and these sequences were used for further analysis.
De novo assembly of the transcriptome At present, P. clarkii has no reference genome sequence, therefore a de novo assembly strategy was utilized, in which the crayfish transcriptome was de novo assembled by Trinity software (version 2013.02.25) using the default parameters. De novo assembly of 306.73 million high-quality sequences generated a total of 88,463 transcripts that ranged from 351 to 34,708 bp in length, with an average length of 1655.49 bp ( Table 1). The length distribution of transcripts is shown in Figure 1. Most of the transcripts (23.71%) were 401-600 bp in length, 11.51% ranged from 601-800 bp, and 10.90% ranged from 1-400 bp. These 88,463 transcripts yielded a total of 50,219 non-redundant transcripts because of alternative splicing; therefore, it was possible to match two or more transcripts to one gene. Our sequence data provided a large number of transcripts as compared to publicly available data from the Genbank database, which represent a convenient source of information for future full length cDNA cloning and gene function research in P. clarkii. Prior to this study, only 330 EST and 547 nucleotide sequences were listed in the GenBank database. Our study supplied 50,219 non-redundant transcripts, which has significantly enriched our knowledge of the P. clarkii genome and will facilitate further study of the functions of P. clarkii genes.

Functional annotation
Protein coding sequences of transcripts were predicted using a tool supplied by Trinity software (http://trinityrnaseq.sourceforge. net/analysis/extract_proteins_from_trinity_transcripts.html). Of the 88,463 transcripts, 42,905 were found to contain open reading frames (ORFs), with an average protein coding length of 552.78 bp and a mean nucleotide length of 2551.34 bp. These isosequences likely represent genes that play essential roles in P. clarkii biological processes.
88,463 transcripts were compared with the NCBI nonredundant (nr) protein database, GO database, COG database, and KEGG database for functional annotation using BlastX with an e-value cutoff of 1e-5 (Table S2 in File S1). A total of 31,763 transcripts (35.91% of all transcripts) had significant hits in at least one of these databases, which corresponded to 11,222 genes (21.63% of all genes). A total of 30,779 transcripts (96.90% of all annotated transcripts) had significant hits in the nr protein database, which corresponded to 10,862 genes (96.79% of all annotated genes). The gene names of top BLAST hits were assigned to each transcript with significant hits, and 3110 transcripts from P. clarkii were best matched with genes from Daphnia pulex, 2675 transcripts were best matched with genes from Tribolium castaneum, and 1837 transcripts were best matched with genes from Pediculus humanus ( Figure S1). Daphnia pulex is a primitive water flea, Tribolium castaneum is a type of beetle, and Pediculus humanus is a louse species that infests humans. Thus, the genes from P. clarkii were most similar to those known from crustaceans and insects, and the distribution of significant BLAST hits over different organisms reflects the phylogenetic relationship between P. clarkii and other species.
GO analysis was conducted on annotated transcripts using blast2go software. A total of 15,457 transcripts, corresponding to 5890 genes, were assigned at least one GO term for biological processes, molecular functions, and cellular components, and the output of the GO annotations was plotted ( Figure 2). Terms from the molecular function term group made up the majority of significant terms (12,842 transcripts, 88.08%), followed by the biological process group (10,241 transcripts, 66.25%) and the cellular component group (7,406 transcripts, 47.91%). For biological processes, genes involved in cellular processes (GO: 0009987, 8,148 transcripts), metabolic processes (GO: 0008152, 6,622 transcripts), and single-organism process (GO: 0044699, 5,623 transcripts) were highly represented. For molecular functions, binding (GO: 0005488, 8,313 transcripts) and catalytic activity (GO: 0003824, 6,781 transcripts) were the most represented GO terms. For cellular components, cells (GO: 0005623, 5,778 transcripts), cell part (GO: 0044464, 5,778 transcripts), and organelles (GO: 0043226, 3,632 transcripts) were the most represented terms. There were 9 identified terms that contained fewer than 10 transcripts.
Transcripts were also compared with the COG database, and 6,034 transcripts (2,386 genes) were matched to database entries. These transcripts were classified into 25 functional categories (Figure 3), among which the largest group (2031 transcripts) was signal transduction mechanisms, followed by general function prediction only (1718 transcripts), transcription (914 transcripts), posttranslational modification, protein turnover, and chaperones (782 transcripts). Genes matched to the nuclear structure category (17 transcripts) represented the smallest group.
In addition, a KEGG pathway analysis was performed on all assembled transcripts as an alternative approach for functional categorization and annotation. A total of 14,596 transcripts, corresponding to 5414 genes, were categorized into functional groups, in which the metabolism group was the most well represented, with 7821 transcripts, followed by the human disease group (7597 transcripts), organismal systems group (7179 transcripts), environmental information processing group (3460 transcripts), cellular processes group (3420 transcripts), and genetic information processing group (2481 transcripts) ( Table S3 in File S1). Each functional group was made up of genes from different KEGG pathways. In addition, the number of transcripts in each KEGG pathway was counted, and the most abundant 20 KEGG pathways are shown in Figure 4. In brief, 2016 transcripts were categorized into metabolic pathways, followed by pathways for biosynthesis of secondary metabolites (552 transcripts), cancer (435 transcripts), focal adhesion (431 transcripts), and endocytosis (419 transcripts).

Differential analysis of gene expression profiles between tissues
The expression levels of whole transcripts in the hepatopancreas, the testis, the ovary, and muscle were evaluated (Table S4 and S5 in File S1). Transcriptomic analysis of these tissues showed that more genes were overexpressed in the hepatopancreas as compared with genes expressed in the other three tissues, while more genes were underexpressed in the testis and the ovary compared with genes expressed in the hepatopancreas and muscle ( Figure 5). Interestingly, although more genes were overexpressed in the hepatopancreas, fewer genes were expressed in the hepatopancreas (32999 genes) than in the muscle (37873 genes), the ovary (45083 genes), and the testis (39503 genes). This result may be due to the crucial role of the hepatopancreas in growth, which resulted in genes for metabolism being actively and highly expressed, while the ovary and the testis are reproductive organs, and thus more functional molecules are needed in reserve, but are not highly expressed.
Enriched pathways in the hepatopancreas. Metabolism is the basic physiological process that sustains living organisms, and it includes multiple reactions, such as the synthesis of digestive enzymes, secretion, digestion, nutrient absorption, excretion, lipid and glycogen storage, and mobilization [31]. In crustaceans, the hepatopancreas is the major metabolic organ. KEGG pathway enrichment analysis showed that regulation of amino acids, carbohydrates, lipids, and glycan metabolism were significantly enriched in the hepatopancreas compared to the other three tissues ( , and only 7 transcripts (3 genes) were found to be significantly underexpressed. It was also found that the expression levels of genes in the fatty acid metabolism pathway in the hepatopancreas were significantly higher than those in the ovary and the testis, indicating that active  fatty acid metabolism takes place in the hepatopancreas of P. clarkii (Table S6 in File S1). Enriched pathways in xenobiotic metabolism, heavy metal and oxidative stress, and the innate immune system were also found in the hepatopancreas. The crayfish is well-known for its ability to survive in polluted environments, including water polluted by heavy metals, pesticides, and other chemicals, and also to tolerate hypoxia. The hepatopancreas is the primary site for accumulation  and detoxification of xenobiotic pollutants in lysosomes. The R cells in the hepatopancreas perform biotransformation using enzymes, such as cytochrome p450, to sequester and detoxify xenobiotic pollutants [31]. Compared to the other tissues, several pathways were significantly enriched in the hepatopancreas: ''lysosome'', ''peroxisome'', ''metabolism of xenobiotics by cytochrome p450'', and ''drug metabolism -cytochrome P450'' ( Table 2). Peroxisomes are essential organelles that play key roles     Figure S2, Table S7 in File S1). Only 11 transcripts were expressed at significantly lower levels in the hepatopancreas than in muscle tissue. Methyl farnesoate and ecdysteroids are important hormones in crustaceans. Methyl farnesoate, which is synthesized in the mandibular organ (MO), is an insect juvenile hormone homologue that is believed to act as a juvenile hormone in crustaceans [58]. Juvenile hormones are involved in many biological processes, including development and reproduction. The major function of ecdysteroids is to control molting, but they are also involved in reproduction [59]. Here, genes in the insect hormone biosynthesis pathway were identified from the P. clarkii transcriptome. Among these genes, genes in the juvenile hormone synthesis pathway were significantly overexpressed in the hepatopancreas compared to the other three tissues (Table S8 in File S1). In particular, seven of nine transcripts that encoded a CYP15A1 (cytochrome P450, family 15, subfamily A, polypeptide 1) homologue were highly expressed in the hepatopancreas, in which the expression levels of these transcripts were more than 100-fold greater than in the other tissues. These results indicate that the hepatopancreas is an important site for genes that are responsible for the synthesis of juvenile hormone. In contrast, genes in the ecdysteroid synthesis pathways did not show the same trend, and the differences in expression of these genes were not significant between the examined tissues.
Enriched pathways in the ovary and the testis. The ovary and the testis are the major reproductive organs, in which the processes of oogenesis, sperm genesis, DNA replication, and meiosis occur frequently. As expected, KEGG pathway analysis showed that the pathways of ''DNA replication'', ''cell cycle'', ''mismatch repair'', ''homologous recombination'', were significantly enriched in the ovary compared to muscle and the hepatopancreas. The pathways of ''DNA replication'', ''pyrimidine metabolism'', ''meiosis-yeast'', and ''Nucleotide excision repair'' were significantly enriched in the testis compared to muscle and the hepatopancreas. For example, 51 transcripts were identified in the DNA replication pathway (ko03030), of which 32 transcripts (26 genes) were expressed at significantly greater levels in the ovary than in the hepatopancreas, and of which only 2 genes were expressed at significantly lower levels in the ovary than in the hepatopancreas. Analysis showed that 26 transcripts (22 genes) were expressed at significantly higher levels in the testis than in the hepatopancreas, and only 6 transcripts (4 genes) were expressed at significantly lower levels in the testis than in the hepatopancreas (Table S9 in File S1). In oviparous animals, vitellin is the major yolk protein that provides nutrition during embryonic development. The precursor of vitellin is vitellogenin (Vg). It is believed that extraovarian Vg is synthesized in the hepatopancreas and secreted in the hemolymph, where it is sequestered into developing oocytes by the Vg receptor (VgR) through receptor-mediated endocytosis [60]. It has been reported that multiple genes encode vitellogenin in various crustaceans, such as the shrimp Metapenaeus ensis, the freshwater water flea Daphnia magna, and the banana shrimp Penaeus merguiensis [61][62][63]. Here 29 transcripts (20 genes) were determined to encode vitellogenin, of which 14 transcripts were highly expressed specifically in the hepatopancreas, 6 transcripts were highly expressed specifically in the ovary, and no transcript was found to be highly expressed in the testis or in muscle. Indeed, vitellogenin was extremely difficult to detect in muscle (Table S10 in File S1), suggesting that vitellogenin is synthesized in the hepatopancreas and the ovary of P. clarkii. This result is consistent with previous reports in the Chinese mitten-handed crab Eriocheir sinensis, the tiger shrimp Penaeus monodon, the blue crab Callinectes sapidus, the freshwater crayfish Cherax quadricarinatus, the freshwater prawn Macrobrachium rosenbergii, the green mud crab Scylla paramamosain, and other species of shrimp and crab [64][65][66][67][68]. Interestingly, no transcript among the 20 transcripts determined to encode vitellogenin was found to be highly expressed in both the hepatopancreas and the ovary. Thus, expression of the identified vitellogenin transcripts was tissuespecific, including 14 transcripts that were hepatopancreas-specific and 6 transcripts that were ovary-specific. It has been reported that MeVg1, one of two vitellogenin genes in the shrimp Metapenaeus ensis, is expressed only in the ovary and the hepatopancreas, while the other vitellogenin gene, MeVg2, is expressed exclusively in the hepatopancreas [62,69]. These results provide new insight into the expression of vitellogenin genes in the hepatopancreas and the ovary, and provide the basis for future studies on the manner in which vitellogenin genes collaboratively perform their specific functions at different developmental stages in the ovary.
The vitellogenin receptor is located in the cell membrane of oocytes and mediates vitellogenin absorption by oocytes through receptor-mediated endocytosis (RME) [70]. Unlike vitellogenin in P. clarkii, which was highly expressed in the hepatopancreas and in the ovary, the vitellogenin receptor gene was highly expressed in the ovary only, which is consistent with its ovary-specific expression pattern in the shrimp Penaeus monodon and in the freshwater prawn Macrobrachium rosenbergii (Table S10 in File S1) [60,64].
RT-PCR Assays. To validate the assembled transcripts and their expression profiles in the 4 collected tissue types, 20 transcript sequences were selected for RT-PCR (reverse transcription polymerase chain reaction) amplification. Their putative gene names, primer sequences, and expected PCR product sizes are shown in Table S11 in File S1. All 20 primer pairs gave amplification products of the expected sizes ( Figure 6). For G09 and G13, in addition to the expected PCR products, larger PCR unexpected products were also found in the ovary. Analysis of the FPKM levels of these 20 selected transcript sequences showed that 8 sequences (G01-G08) were specifically expressed in the hepatopancreas, 3 sequences (G09-G11) were specifically expressed in muscle, 3 sequences (G12-G14) were specifically expressed in the ovary, 1 sequence (G15) was specifically expressed in the testis, and the other 5 sequences (G16-G20) were highly expressed in 3 or 4 tissue types (Table S11 in File S1). RT-PCR analysis showed that, with the exception of sequence G11, which was indicated by FPKM analysis to be specifically expressed in muscle, but was indicated by RT-PCR to be highly expressed in both muscle and testis, the expression modes of the other 19 sequences in the 4 tissue types were consistent with their FPKM levels ( Figure 6). The evaluation and validation of the assembled Figure 6. RT-PCR amplification and agarose gel (1.5%) electrophoresis of 20 transcripts. G01-G20: names of transcripts, represented transcript_ID given in Table S10  transcripts verified the high accuracy of Illumina paired-end sequencing and de novo assembly, and thus indicated that our study could be useful for further research into gene function.

SNP identification
Single-nucleotide polymorphisms (SNPs) are the most common type of variation in the genome. SNPs were identified by alignments of multiple sequences used for contig assembly. After excluding those that had a base mutation frequency of less than 1%, a total of 243,764 SNPs were obtained (Figure 7). The proportions of transition substitutions were 34.44% for C:GRT:A and 31.74% for T:ARC:G, compared with smaller proportions of transversion for C:GRA:T (8.49%), C:GRG:C (6.42%), T:ARA:T (11.05%) and T:ARG:C (7.86%). The total transition:transversion ratio was 1.96:1. Differences in base structure and the numbers of hydrogen bonds between different bases resulted in a large proportion of transition type SNPs and a small proportion of transversion type SNPs. The ovary had the most SNPs (94023 SNPs), followed by the testis (62601 SNPs), hepatopancreas (54855 SNPs), and muscle (32285 SNPs). Statistics for identified SNPs in the crayfish transcriptome are shown in Figure 7.

Microsatellite sequence identification
Microsatellite sequences, or simple sequence repeats (SSRs), are polymorphic loci present in genomic DNA that consist of repeated  core sequences of 2-6 base pairs in length [71]. A total of 27,451 SSRs were initially identified from 29,534 transcripts, including 36.92% trinucleotide repeats, 24.14% di-nucleotide repeats, and 2.48% tetra/penta/hexanucleotide repeats (Figure 8). In addition, a total of 4775 SSRs (27.12%) were found that were more than 15 base pairs in length. Among the tri-nucleotide repeat motifs, (AGC/GCT)n (3,693 SSRs, 23.16%) and (ACC/GGT)n (3368 SSRs, 22.13%) were the most common types, and appeared significantly more than the other types of tri-nucleotide repeat motifs (Figure 8). After removing the microsatellites that lacked sufficient flanking sequences for primer design, 16953 unique sequences with microsatellites possessed sufficient flanking sequences on both sides of the microsatellites to allow the design of primers for genotyping.

Conclusions
This is the report on the transcriptome of P. clarkii using de novo assembly techniques with next-generation sequencing. We identified 50,219 non-redundant transcripts that will provide the basis for future studies on crayfish gene function. We also explored gene expression patterns in four different tissues from P. clarkii, and a number of candidate novel genes were identified that may be involved in important physiological processes and are worthy of further investigation. In addition, a large number of predicted SNPs and SSRs were reported that provide a basis for further genetic analysis and crayfish breeding. Figure S1 The hit species distribution based on BLASTx.

Supporting Information
(TIF) Figure S2 Transcripts identified as differentially expressed between muscle tissue and the hepatopancreas in the peroxisome pathway (ko04146). Genes for which expression levels in muscle were higher than those in the hepatopancreas are shown with a red frame, and genes for which expression levels in muscle were lower than those in the hepatopancreas are shown with a green frame.

(RAR)
File S1 Tables S1-S11. Table S1, Statistics for P. clarkii sequencing data.  Table S6, Transcripts identified in the fatty acid metabolism pathway (ko00071) and differential expression analysis between the hepatopancreas and muscle tissue. Table S7, Transcripts identified in the peroxisome pathway (ko04146) and differential expression analysis between the hepatopancreas and muscle tissue. Table S8, Transcripts identified in the insect hormone biosynthesis pathway (ko00981). Table S9, Transcripts identified in the DNA replication pathway (ko03030) and differential expression analysis between the ovary and the hepatopancreas and between the testis and the hepatopancreas. Table S10, Transcripts encoding vitellogenin and the vitellogenin receptor identified from the P. clarkii transcriptome. Table S11, Information on the 20 transcript sequences selected for RT-PCR. (XLSX)