De novo transcriptome sequencing and analysis of male, pseudo-male and female yellow perch, Perca flavescens

Transcriptome sequencing could facilitate discovery of sex-biased genes, biological pathways and molecular markers, which could help clarify the molecular mechanism of sex determination and sexual dimorphism, and assist with selective breeding in aquaculture. Yellow perch has unique gonad system and sexual dimorphism and is an alternative model to study mechanism of sex determination, sexual dimorphism and sexual selection. In this study, we performed the de novo assembly of yellow perch gonads and muscle transcriptomes by high throughput Illumina sequencing. A total of 212,180 contigs were obtained, ranging from 127 to 64,876 bp, and N50 of 1,066 bp. The assembly RNA-Seq contigs (≥200bp) were then used for subsequent analyses, including annotation, pathway analysis, and microsatellites discovery. No female- and pseudo-male-biased genes were involved in any pathways while male-biased genes were involved in 29 pathways, and neuroactive ligand receptor interaction and enzyme of trypsin (enzyme code, EC: 3.4.21.4) was highly involved. Pyruvate kinase (enzyme code, EC: 2.7.1.40), which plays important roles in cell proliferation, was highly expressed in muscles. In addition, a total of 183,939 SNPs, 11,286 InDels and 41,479 microsatellites were identified. This study is the first report on transcriptome information in Percids, and provides rich resources for conducting further studies on understanding the molecular basis of sex determinations, sexual dimorphism, and sexual selection in fish, and for population studies and marker-assisted selection in Percids.


Introduction
Yellow perch, Perca flavescens, natively distributes from central Canada, east and southeast through the Great Lakes-St. Lawrence and the upper Mississippi basins and on the Atlantic slope from Maine to Georgia [1]. It is a vital recreational and commercial species in Chesapeake Bay and the Great Lakes [2]. Yellow perch females outgrow males starting from 8-11 cm in total length and also have dominant and subordinated females [3]. Yellow perch have an XY sex-determination system [4]. Uniquely, yellow perch female have only one ovary and spawn a1111111111 a1111111111 a1111111111 a1111111111 a1111111111

Transcriptome sequencing and assembly
From Illumina sequencing results, a total of 572,074,124 paired-end reads were generated. The raw transcriptome sequences in the present study have been submitted to the National Center for Biotechnology Information (NCBI) Short Read Archive (SRA) site, the accession number is SRP074479. The low-quality sequences and ambiguous nucleotides were trimmed, and then the remaining high-quality reads, a total of 548,699,676 (95.9%), were obtained for transcriptome assembly and analysis (S1 Table). A total of 212,180 contigs were assembled, ranging from 127 to 64,876 bp, and the average length was 695 bp, with the N50 value of 1,066 bp ( Table 1). The contigs less than 200 bp were removed and finally, a total of 211,976 contigs (length distribution see Fig 1) were for blasting and annotating, and 30.98% of all contigs were less than 301bp ( 300bp). The workflow of transcriptome data analysis was shown in S1 Fig.

Mapping reads to contigs
The reference sequence is the assembly sequence containing 211,976 contigs which are not less than 200 bp. Of the clean reads from six libraries, for the NF gonad (NFG), 66.49% were reads mapped in pairs, 14.21% were reads not mapped; for NF muscle (NFM), 62.74% were reads mapped in pairs, 18.47% were reads not mapped; for NM gonads (NMG), 76.90% were reads mapped in pairs, 12.95% were reads not mapped; for NM muscle(NMM), 64.85% were reads mapped in pairs, 16.15% were reads not mapped; for PM gonads (PMG), 68.83%, 12.40% were reads not mapped; for PM muscle (PMM), 62.28% were reads mapped in pairs, 18.35% were reads not mapped (S2 Table).
The unigenes were then used to perform gene ontology (GO) annotation by Blast2GO. With a total of 72,801 GO assignments, the GO analysis showed that 16,038 annotated unigenes were assigned at least one GO term. For the biological process, cellular process (GO: 0009987) was the most represented category, and genes involved in single-organism process (GO: 0044699), metabolic process (GO: 0008152), response to stimulus (GO: 0050896), biological regulation (GO: 0065007), localization (GO: 0051179), developmental process (GO: 0032502); and signaling (GO: 0023052) were also highly represented. For molecular function, binding (GO: 0005488) was the most represented category, followed by catalytic activity (GO: 0003284), transporter activity (GO: 0005212) and molecular transducer activity (GO:  (Fig 3). Of all GO categories, the biological process ontology was the most prevalent, followed by molecular functions ontology and cellular component ontology. Within biological process ontology, a total of 307 unigenes were identified related to growth (GO: 0040007), 2,969 unigenes were annotated to developmental process, 93 unigenes were annotated to reproduction. The contigs assembled with previously identified gene matches were carried for functional pathways analysis based on the KEGG (Kyoto Encyclopedia of Genes and Genomes) database. The results showed that 14,277 unigenes were categorized into different functional groups ( Table 2). The largest functional group was organismal system (29.13%, 4,159), including immune system (983), endocrine system (1,039) and nervous system (787). Unigenes grouped into metabolism (4,046, 28.34%), included global and overview maps (1,547), carbohydrate metabolism (479), lipid metabolism (439), amino acid metabolism (379) and glycan biosynthesis and metabolism (306). Environmental information on processing, cellular process groups, and genetic information processing contained 3,096 (21.69%), 1,689 (11.83%) and 1,287 (9.01%) unigenes, respectively.

Sex and tissue specifically expressed genes and involved pathways
Based on the transcriptome mapping data, 93, 1,440 and 3 contigs were identified as specifically expressed in pseudo-male, male and female respectively; 9,476 and 858 contigs were identified as specifically expressed in gonads and muscles (S3 Table). However, only 4 (1), 55 (29) and 0 (0) contigs were blasted to match genes (mapped to GO terms) for pseudo-male, male and female respectively (S4 Table); 696(396) and 45 (25) contigs were blasted to match genes (mapped to GO terms) for gonads and muscles (some of them listed in S5 Table). Of sex specifically expressed ORFs searching, 19 ORFs were found specifically expressed in male, 6 ORFs specifically expressed in pseudo-male, and no ORF specifically expressed in female (S3 Table; S2 Fig).
The contigs identified respectively as specifically expressed in pseudo-male, male and female were further carried for functional pathway analysis to identify biological pathway that shows sexual dimorphism in yellow perch based on the KEGG database. The number of sexbiased genes was counted in different pathway categories (S6 Table). However, there is no pseudo-male-and female-biased gene involved in any pathway. Male-biased genes were involved in 29 pathways, and the Enzyme and related pathways involved for genes specifically expressed in male showed in S6 Table. Of the functional pathways, trypsin (enzyme code, EC: 3.4.21.4) and neuroactive ligand-receptor interaction (Fig 4) were most specifically expressed in yellow perch males. Similarly, the pathway analysis for tissue specifically expressed genes showed that an enzyme named pyruvate kinase (enzyme code, EC: 2.7.1.40) was involved in most pathways for muscle specifically expressed genes, and two pathways involving most genes for gonad specifically expressed genes were neuroactive ligand-receptor interaction ( Fig  5) and metabolic pathways (S7 Table). Besides, for gonad specifically expressed genes, several pathways associated with gonadal development and sex maintenance were found, including Oocyte meiosis, GnRH signaling pathway, TGF-beta signaling pathway, Oxytocin signaling pathway, and Ovarian steroidogenesis. Additionally, TNF signaling pathway and Apoptosis pathway were found in gonads. Of the enzymes involved, c-Jun N-terminal kinase (enzyme code, EC: 2.7.11.24) and TGF-beta receptor type-1 (enzyme code, EC: 2.7.11.30) were most expressed in gonads (S7 Table).

SNPs, InDels and microsatellites discovery
A total of 183,939 SNPs and 11,286 InDels were identified. These SNPs and InDels were located on 39,362 contigs, in a total length of 71,759,949 bp and the estimated frequency of SNPs was 2.59 per kb. The number of substitution for A/G, C/T, A/T, A/C, G/T and C/G was 52,498, 56,834, 17,553, 20,460, 21,907 and 14,687, respectively and the transition/transversion (Ts/Tv) ratio was about 1.46. Among these SNPs, 111,727 were located in ORF and 25,172 were missense mutation. A total of 41,479 microsatellites were identified from 18,210 unigene sequences ( Table 3). The most frequently repeated motifs were dinucleotide repeats (accounted for 77.14% of all microsatellites identified), followed by tri-(17.42%), tetra-(5.03%), penta-(0.27%), and hexa-nucleotide repeats (0.14%). Based on female, male, and pseudo-male specifically expressed contig sequences, no microsatellite loci, 133 and four repeat motifs were identified from them respectively. Based on tissue specifically expressed contigs, 93 and 1,036 repeat motifs were identified in muscles and gonads respectively.
Forty-eight primer pairs were randomly selected for primer synthesis and validation, and 27 primer pairs were successful in PCR amplification using genomic DNA from 12 individuals of P. flavescens. The remaining 21primer pairs failed to generate PCR products in one or some individuals. Of 27 primer pairs, 7 primer pairs generated multiple bands that included the target bands and a larger band (>1,000bp). The rest of 20 primer pairs only amplified the target bands. The eight microsatellite loci selected from 48 microsatellite loci, were used to analyze the polymorphism assessment in one hatchery population of P. flavescens. Seven microsatellite loci are polymorphic. The number of alleles varied from 0 to 14, and observed heterozygosity ranged between 0.205 (YPM6) and 0.89 (YPM46) ( Table 4).

Transcriptome sequencing, assembly and annotation
Transcriptome sequencing using Illumina platforms technology provides resources for gene expression profiling studies as well as identification of molecular markers, alternative splice variants and RNA editing events [15]. For example, through Illumina platforms, sex differentially expressed genes were identified in S. marmoratus [10], sex differentiation and potential sex determining genes were identified in P. margaritifera [11], a nearly complete source of transcriptomic sequence as well as marker information for sharpsnout seabream (Diplodus puntazzo) [12]. These studies focused on sex differentially expressed genes and markers information. The present study focused on applications of Illumina platforms sequencing to transcriptome analysis of yellow perch and laid the foundation for further sex dimorphism and determination research. The transcriptomes of gonads and muscles from female, male and pseudo-male were sequenced on Illumina platforms and annotated to different biological process ontology and functional pathways.
In the present study, 17.04% of all contigs assembled (36,117) had a significant hit as unigenes, which was lower than that obtained from flounder (22.31%) and sturgeon (22.05%), and similar to crucian carp (17.44%) [7,16,17]. The low percentage of matched sequences might be due to 1) short contigs occupying large percentage, however, the significance of a BLAST comparison depends in part on the length of query sequences [18]; 2) the parameters for blasting setting the significance higher, and only getting the highest significant similar sequences; 3) blast method. In this study, 30.98% of the unigene sequences were not very long ( 300 bp). The parameters for blasting are: e-value cutoff 1e-10, best hit score edge 0.05, best hit overhang 0.25 and max target seqs 1. The standalone blast+ programs were used to blast the query transcriptome sequences against the six reference protein sequences including Zebrafish, Tilapia, Cichlid, Amazon molly, Guppy, and Medaka. The possibility that yellow perch may have more unique genes compared to other species was not excluded. However, the absolute number of contigs matched from yellow perch (36,117) was higher than flounder (21,697), sturgeon (12,189) and crucian carp (22,273) [7,16,17]. Thus, the contigs from trascriptome sequencing and blasting were good enough for further annotation.

Sex specifically expressed genes and ORFs
Four and 55 genes were identified for pseudo-male and male respectively, and the genes identified were studied further for pathway analysis. Sex specifically expressed ORFs (Open Reading Frames) were searched, and 19 ORFs were found specifically expressed in males, 6 ORFs were specifically expressed in pseudo-males. As the literature we searched, there are no references about the potential function of the sex specifically expressed ORFs, and whether the ORFs locate on the sex chromosomes of yellow perch is not known so far. Further experiments would be cloning these genes and investigating their functions since these genes are not reported presently but expressed specifically. However, there was no female-biased gene/ORF identified in this study. About the sample FG, The detected value of RNA RIN was always low in the experiment. However, we found the FG didn't degrade even though the value of RIN is low. When the oocyte becomes competent to undergo fertilization, the oocyte itself produce some molecules including maternal mRNAs, RNAs, proteins, carbohydrates, lipids, hormones and vitamins, which are important for the proper development of the embryo [19]. The 5S rRNA is one important molecule involved considerably in this process. For example, up to 75% of the total RNA from mullet ovaries was 5S rRNA [20]. As we know, the RIN value is lower if the 5S rRNA band is brighter. The cDNA library was constructed based on capturing the mRNA molecules. Thus, this might be the reason of that the number of reads in FG is lower than other samples in this study. The female-specific expressed contigs might more exist in the 5S rRNA bands in the stage of yellow perch we collected. This might be the reason of lacking female-specific contigs in this study. The findings could provide the reference for collecting samples in the future for investigating the sex-specific expressed genes in ovary.
In previous study, females were significantly larger than males starting from at least oneyear-old fish (data not published), however, unfortunately, in this study we didn't find any specifically expressed contigs in female. The specifically expressed contigs in muscle are showed in S5 Table. However, we didn't find the annotated genes specifically related to growth expressed in muscle, such as Igf, Igfr or Myob.

Pathway analysis of sex and tissue specifically expressed genes
Of the functional pathways involved in normal males, pathway of neuroactive ligand receptor interaction had the highest hits, and enzyme of trypsin (enzyme code, EC: 3.4.21.4) was highly expressed. The neuroactive ligand-receptor interaction and metabolic pathways were involved in most gonad specifically expressed genes. The neuroactive ligand-receptor interaction, calcium signaling pathways, cell cycle and DNA replication pathways were testified to be associated with lactation performance in mice, and genes associated with neuroactive ligand-receptor interaction and calcium signaling pathways were significantly upregulated and positively correlated with lactation performance [21]. As we know, fish have no lactation performance as mammals do. However, we believed that these pathways to some extent might be correlated with gonad development of fish. Generally, most one-year-old yellow perch males could reach sexual maturity although some documents reported that age for yellow perch at maturity was generally two years [22]. In the present study, the yellow perch we sampled were slightly older than one year old. The neuroactive ligand-receptor interaction was only found in males and gonads, respectively, which maybe to some extent indicated that the pathway was correlated with sexual maturity and the genes involved in the pathway might regulate the gonad development. Additionally, TNF signaling pathway and apoptosis pathway was only found in males, which to some extent was attributed to the sampled males being sexually mature and the testes sampled were in the degenerating stage. The regular spawning season for yellow perch is generally from March to April [23]. Samples were collected in the middle of May. Further experiments will be carried out to testify these hypotheses and inferences. For pathway analysis of muscle specifically expressed genes, more transcripts were found for an enzyme named pyruvate kinase (enzyme code, EC: 2.7.1.40) being involved in most pathways. Pyruvate kinase is a key glycolytic enzyme, and is known to be the rate-limiting step of glycolysis [24]. The role of pyruvate kinase is complex and it plays a part in the nucleus not only as a pro-proliferative, but also pro-apoptotic stimuli [25]. Generally, pyruvate kinase plays important roles in cell proliferation that only proceeds when metabolism is capable of providing adequate metabolic intermediates to ensure both energy regeneration and the synthesis of cell building blocks in sufficient amounts [26,27]. Pyruvate kinase determines whether glucose is converted to lactate for regeneration of energy or used for the synthesis of cell building blocks [28]. Presently, pyruvate kinase M2 has been extensively studied in cancer research, and it was found that RNA interfering (RNAi) targeting PKM2 significantly inhibited tumor growth [24,29,30]. However, to our knowledge, there is no any document that reported pyruvate kinase research in aquaculture animals. Then what role(s) pyruvate kinase play(s) in the growth of yellow perch and what the gene expression profiles are would be interesting for aquaculture researchers. In the present study, pyruvate kinase was only found in muscles instead of gonads, which could, in part, indicate that the fish sampled were still in the stage of vigorous growth while the gonads of the fish were not in the stage of growing development.
No female-and pseudo-male-biased genes were involved in any pathways while malebiased genes were involved in 29 pathways. The results might indicate that: 1) female and pseudo-male yellow perch had much the same genomic resource including sex-related genes; 2) female and pseudo-male were in much the same special stage and had much the same trascriptomic profile; 3) there exists but cannot be identified based on the present databases. Yellow perch have sexual size dimorphism that females were significantly larger than males starting from at least one-year-old fish [31], since lower growth rate was observed in mature males [32,33], and estimates of female age and length at 50% maturity ranged from 1.0 to 5.8 years and from 15.7 to 20.1 cm, respectively, while male age and length at 50% maturity ranged from 1.2 to 2.1 years and from 9.4 to 12 cm [34]. To some extent, the results from pathway analysis also showed that the sexual size dimorphism among female, male, and pseudo-male did not yet occur significantly in yellow perch when sampling was performed. However, males and females have presented different transcriptome profiles. Besides, based on specifically expressed genes and ORFs identified, the results from pseudo-males presented some difference as those from females, which to some extent indicated that the transcriptome profile of pseudo-male started to present a differentiation from female. Further experiments would be needed to investigate the correlations between sexual size dimorphism and transcriptome profiles including the pathway analysis and gene expression profiles.
Several sexual dimorphic biological pathways were found for gonad specifically expressed genes including oocyte meiosis, GnRH signaling pathway, TGF-beta signaling pathway, oxytocin signaling pathway, and ovarian steroidogenesis. The c-Jun N-terminal kinase (enzyme code, EC: 2.7.11.24) and TGF-beta receptor type-1 (enzyme code, EC: 2.7.11.30) were most expressed in gonads. However, no sex-biased gene was involved in sexual dimorphic biological pathways. That means, of sexual dimorphic biological pathways, there is no difference for female, male, and pseudo-male of the sampled yellow perch. Nevertheless, the differential expression profile of the genes involved in sexual dimorphic biological pathways may be different and needed to be further investigated.

Transcriptome-derived SNPs, InDels and microsatellites discovery
Molecular genetic markers are essential to study genetic diversity, population genetic structure, evolution, and ecology [35,36]. Next generation sequencing provides a good resource for the development of molecular genetic markers because of the enormous amount of sequence data that it generates [37]. Transcriptome sequencing is a next generation sequencing based on mRNA [38]. The RNA-seq sequences can provide an excellent source for mining and development of molecular markers. Molecular genetic markers based on transcriptome sequences are not only useful for studying genetic diversity and population genetic structure, but also for the detection of functional variation and gene associated genetic analysis. Marker-assisted selection has great potential to increase production and identify the sex in aquaculture molecular breeding.
In the present study, a total of 183,939 SNPs and 11,286 InDels were identified. The estimated frequency of SNPs was far lower than 4.28 per kb in Larimichthys crocea population [39] and 3.58 per kb in Larimichthys crocea individual [40]. Such low frequency of SNPs might be because: 1) the nine fish were descended from a single mate-pair and the frequency of SNP in the study more reflected the frequency in individual than population; 2) only the reads properly located on the same contigs were extracted to call SNPs, which would make these contigs more reliable and also threw away a small part of true mutations; 3) the model used in SNP calling was based on diploid, which would filter out relatively low frequency mutations, and the SNP calling procedures were expected to improve the true positive frequency.
Based on 211,976 assembled contigs, 18,210 unigene sequences were identified to contain total 41,479 microsatellites. About 8.6% of the transcriptomic sequences possess microsatellites. This microsatellite frequency was a little higher than Macrobrachium nipponense (8.2%), but slightly lower than Carassius auratus(8.9%) [16,41]. The microsatellite frequency based on transcriptomic sequences sometime depends on the search parameters for mining microsatellites, such as the repeat length threshold and the number of repeat motifs unit. In this study, the mononucleotide repeat motifs were excluded, which is consistent with the studies of Macrobrachium nipponense and Carassius auratus. The software used to identify microsatellites loci also can affect the microsatellite frequency. Some tools can detect imperfect microsatellites (e.g., Sputnik) while others detect perfect microsatellites (e.g., MISA and Mreps). This study and the other studies in Macrobrachium nipponense and Carassius auratus used the tools that can detect perfect microsatellites (MISA for this study and that of Carassius auratus; Mreps software for the study of Macrobrachium nipponense) to discover microsatellites. The di-nucleotide repeats (accounted for 77.14% of all microsatellites identified) were the most frequent repeats, followed by tri-(17.42%), tetra-(5.03%), penta-(0.27%), and hexa-nucleotide repeats (0.14%). The results were consistent with the microsatellite distributions based on RNA-seq reported [16,39]. In this study, microsatellites were also identified from female, male, pseudo-male muscles, gonads, and tissue specifically expressed contigs sequences. However, as we know, it is impossible to find tissue specifically owned microsatellite loci since all the cells in the body have the same DNA [42]. Similarly, it is hard or even impossible to find sex specifically owned microsatellite loci since female and male share nearly identical genomes except the sex chromosome [43]. But here, the analyses could provide some reference for finding and cloning specifically expressed genes easier since the specifically expressed contigs were correlated with specific genes. Through designing primers based on the repeat motifs and PCR amplifications, maybe new specifically expressed gene(s) or marker(s) would be found and cloned. To tentatively validate the microsatellites identified, 48 primer pairs were randomly selected for primer synthesis, and 27 primer pairs were found to be amplified successfully in all individuals tested. Of the eight microsatellite loci selected from 48 microsatellite loci, seven one are polymorphic. The results showed that the microsatellites identified from transcriptome sequencing data could be developed for yellow perch to find useful polymorphic microsatellites. In the next study, we will collect more yellow perch populations to further identify more polymorphic microsatellites from the microsatellites identified.

Conclusions
A total of 211,976 contigs were obtained and 183,939 SNPs, 11,286 InDels and 41,479 microsatellites were identified from assembled contigs. No female-and pseudo-male-biased genes were found being involved in any pathways. Male-biased genes were found being involved in 29 pathways, of which neuroactive ligand receptor interaction was highly involved. Pyruvate kinase was highly expressed in muscle. This study is the first report on transcriptome information in Percids, and provides rich resources for conducting further studies on understanding the molecular basis of sex determinations, sexual dimorphism and sexual selection in fish, and for population studies and marker-assisted selection in Percids. Further experiments would be needed to test the roles of the enzyme pathways involved and the gene expression profiles associated with those pathways.

Ethics statement
All animal care and experimental procedures were approved by the Institutional Animal Care and Use Committee of Ohio State University.

Sample preparation and collection
A pair-mating was conducted in the breeding season of 2013 at the Aquaculture Research Center, the Ohio State University South Centers to generate a full-sib family. The experimental fish were fed twice a day and reared at 24±2˚C in 400 L round tanks in the green house. Onehalf of four-week old fry from the family were subjected to a testosterone treatment to produce a sex-reversed male or all-male population. When the fish reached~10 gram, the reversed fish were PIT tagged and then combined with another half of mixed sex population fish and raised in a single 2-meter round tank. In 2014, three normal males (NM), three normal females (NF) and three pseudo-males (PM) were sampled from the tanks when they were 1+ year-old. Males with a single gonad were identified as pseudo-males because yellow perch females have only one gonad. Gonads and muscles were dissected from those fish which were anaesthetized by MS-222 and stored in RNAlater 1 stabilization solution (Life technologies, USA) over 24 hours at-4˚C, then stored at-80˚C until total RNA isolation.

RNA isolation and cDNA library construction
Total RNA was isolated (gonads and muscles) using TRIzol™ (Invitrogen, Carlsbad, CA) according to the manufacturer's protocols. The concentration was quantified using NanoDrop spectrophotometry (NanoDrop Technologies, Wilmington, DE, USA), and the quality of total RNA was checked by electrophoresis in 1% agarose gel and Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA, USA). The RNA with good integrity was utilized to construct the library. The cDNA library was constructed using a Truseq 1 Stranded mRNA Sample Prep kit (Illumina, USA) according to the manufacturer's protocols. Finally, the six cDNA libraries were sequenced on an Illumina platform using paired-end strategy.

Illumina sequencing and de novo assembly
Sequence reads (2 Ã 100 bp pair-end sequencing) pools were generated on an Illumina platform from the Ohio State University Molecular and Cellular Imaging Center. The raw data were deposited in the NCBI SRA site and can be found under accession number SRP074479. Quality filtering of sequence data was performed using FastQC. Before using the RNA-seq reads for de novo assembly, a Create Sequencing QC Report was run using CLC Genomics Workbench v7.5.1 (following called CLC; QIAGEN Aarhus A/S; www.clcbio.com) to produce the quality reports. Low-quality sequences and the adaptors were trimmed using Trim Sequences tools in CLC, and then the trimmed sequences were checked by running the Create Sequencing QC Report again. The clean reads of all six samples were assembled to one sequence as referenced by de no assembly tool in CLC. The parameters for the de novo assembly were set as: the Word size was set to 20, the Automatic bubble size was set to 50, the Minimum contig length was set as default value, the option of "Perform scaffolding" was not checked. The optimal assembly results were chosen according to an evaluation of the assembly encompassing the total number of contigs, the distribution of contig lengths, the N50 statistic and the average coverage. The assembled transcripts were based on the main isoform of each transcript, and only contigs with lengths of greater than 200 bp were included in the downstream analysis. Mapping each sample RNA-seq reads to the reference sequence using the RNA-Seq Analysis tool implemented in CLC. The parameters for mapping reads back to the contigs were: the option of Map reads back to contigs (slow) were selected (Mismatch cost, 2; Insertion cost, 3; Deletion cost, 3; Length fraction, 0,5; Similarity fraction, 0,8) and the option of Update contigs were chosen.

Annotation
In this study, we checked the redundant contigs using CD-HIT program (http://www. bioinformatics.org/cd-hit/) while the assembly contigs (!200bp) were using for annotation, and just found about 4.7% (211,836/211,976) is redundant. Thus, we decided to analyze the assembly results without removing redundant contigs instead of restarting the annotation. The assembly RNA-Seq contigs (!200bp) (S1 File) were used for a similarity search program against reference protein sequences including zebrafish (Danio rerio), tilapia (Oreochromis niloticus), cichlid (Haplochromis burtoni), amazon molly (Poecilia formosa), guppy (Poecilia reticulata), and medaka (Oryzias latipes). The similarity searches were performed using the BLASTX program [14] with the parameters of e-value cutoff 1e-10, best hit score edge 0.05, best hit overhang 0.25 and max target seqs 1. The combined blast results were used to Gene ontology (GO) annotation analysis using Blast2GO [44]. The level two GO terms associated with transcriptome contigs were retrieved and then the annotation results were categorized as biological process, molecular function, and cellular components. KEGG (Kyoto encyclopedia of Genes and Genomes) pathways were assigned to those assembled contigs using online KEGG Automatic Annotation Server (www.genome.jp/tools/kaas) [45].

Sequence data mapping
The trimming reads from six samples were mapped to the reference sequence assembled from the de novo assembly above, respectively. Sequence data mapping was performed using the tool of RNA-Seq Analysis in CLC. CLC Reference Mapper was run with default settings (mismatch cost = 2, insertion cost = 3, deletion cost = 3, length fraction = 0.8 and similarity fraction = 0.8).

Identification of expressed genes related to sex and tissue
The reads per kilobase of exon per million mapped reads (RPKM) values were calculated from the mapping performance by the tool of RNA-Seq Analysis in CLC. Then the mapping reads were used to construct different groups of comparison using the Set Up Experiment in CLC. The Kal's test was used to identify the differentially expressed genes with p-value <0.05 based on the transformed data with log2. ORFs were searched using TransDecoder (http:// transdecoder.github.io/).

SNPs, InDels and microsatellite discovery
The pair-end reads above from each sample were merged together (S1 File) and SNPs and InDels were called using diploid model. The reads were mapped using BWA-MEM version 0.7.13 [46]. The alignment file was sorted and marked duplication using SortSam and MarDu-plicatesWithMateCigar programs from the software package (http://broadinstittute.github.io/ picard). The RealignerTargetCreator and IndelRealigner programs from the GATK version 3.5 [47] was used to realign the alignments around InDels. The UnifiedGenotyper function in GATK was used to call SNPs and InDels with read mapping quality at least 60 and base quality at least 30. The SNPs and InDels with a depth higher than 20, a quality score greater than 60 and without significantly different (P = 0.95) mapping quality, base quality and read position between reference and alternative base were treated as putative mutations.
The assembly unigene dataset (S1 File) and the sex-specific expressed unigene datasets were used for detecting microsatellite loci by a Perl script known as MicroSAtellite (MISA, http:// pgrc.ipk-gatersleben.de/misa) [48]. Six types of microsatellites were identified with criteria of di-to hexa-nucleotides motifs, and the minimum repeat unit was defined as 6 for di-, 5 repeats for tri-, tetra, penta-, and hexa-nucleotides. The sequences composed of two or more repeat units with motifs separated by >100 bp were considered to be two or more microsatellites. And the sequences detected as microsatellites were collected for future primer designing if the microsatellite sequences have flanking sequences of !50 bp on both sides.
The 48 microsatellite primers were designed by the Primer3 (http://biotools.umassmed. edu/bioapps/primer3_www.cgi) [49] and using the parameters as following: (1) PCR product size ranging from 100 to 300; (2) primer length range of 18 to 27 bases; (3) primer Tm of 53-63˚C, with 60˚C as the optimum Tm; (4) GC content between 20% and 80%, with an optimum of 50%. The primers were synthesized by Eurofins U.S. Company (http://www.eurofins.com). PCR was performed in 10 μL reaction volumes containing 5 μL JumpStart™ REDTaq 1 Ready-Mix™ Reaction Mix (SIGMA), 0.5 μL of each primer (10 μM) and 0.5 μL DNA (100 ng/μL) as template. A polymerase chain reaction was performed as following: 95˚C for 5 min, followed by 35 cycles of denaturation at 95˚C for 30 sec, 57˚C for 45 sec, and elongation at 72˚C for 45 sec, followed by a 10 min extension at 72˚C. The PCR products were detected by 1.2% gel agarose electrophoresis. The eight primers selected from 48 primers were synthesized with labelled by FAM, ROX or HEX for further analyzing polymorphism assessment in one hatchery population of P. flavescens (n = 32). Polymerase chain reaction of 6 uL contained 1.5 pmoL of both nontailed and labelled primers, and 0.1 pmoL of the tailed primer, 3 uL of JumpStart RedMix (Sigma), 25 ng DNA, in the presence of 100μM spermidine. Amplification was conducted in PTC-200 thermal cyclers (MJ Research) using an initial denaturation at 95˚C for 3 min, followed by 35 cycles of 30 s denaturation at 95˚, 30 s annealing at a locus-specific temperature (Table 4), 30 s extension at 72˚, and a final 7-min extension at 72˚. Amplification products DNA fragment analysis using ABI 3130 genetic analyzer.
Supporting information S1