Functional Marker Detection and Analysis on a Comprehensive Transcriptome of Large Yellow Croaker by Next Generation Sequencing

Large yellow croaker (Larimichthys crocea) is an important economic fish in China and Eastern Asia. Because of the exhaustive fishing and overdense aquaculture, the wild population and the mariculture of the species are facing serious challenges on germplasm degeneration and susceptibility to infectious disease agents. However, a comprehensive transcriptome from multi-tissues of the species has not been reported and functional molecular markers have not yet been detected and analyzed. In this work, we applied RNA-seq with the Illumina Hiseq2000 platform for a multi-tissue sample of large yellow croaker and assembled the transcriptome into 88,103 transcripts. Of them, 52,782 transcripts have been successfully annotated by nt/nr, InterPro, GO and KEGG database. Comparing with public fish proteins, we have found that 34,576 protein coding transcripts are shared in large yellow croaker with zebrafish, medaka, pufferfish, and stickleback. For functional markers, we have discovered 1,276 polymorphic SSRs and 261, 000 SNPs. The functional impact analysis of SNPs showed that the majority (~75%) of small variants cause synonymous mutations in proteins, followed by variations in 3' UTR region. The functional enrichment analysis illuminated that transcripts involved in DNA bindings, enzyme activities, and signal pathways prominently exhibit less single-nucleotide variants but genes for the constituent of the muscular tissue, the cytoskeleton, and the immunity system contain more frequent SNP mutations, which may reflect the structural and functional selections of the translated proteins. This is the first work for the high-throughput detection and analysis of functional polymorphic SSR and SNP markers in a comprehensive transcriptome of large yellow croaker. Our study provides valuable transcript sequence and functional marker resources for the quantitative trait locus (QTL) identification and molecular selection of the species in the research community.


Introduction
Large yellow croaker (Larimichthys crocea) is one of the most important commercial marine fish species and is widely aqua-farmed in China [1]. The wild population of the species, mainly distributing in coastal regions of East Asia, underwent a severe collapse in 1970s due to overfishing and marine habitat destruction [2]. Recently, the aquaculture of large yellow croaker has become a crucial source for the fish market. However, because of the loss of genetic diversity, over-dense aquaculture and environment degradation, the wild population and aquaculture industry of large yellow croaker are facing tremendous challenges from germplasm degeneration and poor diseases resistance [3,4]. Therefore, there is a great demand to breed new strains with superior economic traits, as well as better resistances against pathogens. In spit of that several progresses have been achieved for genetic improvements of the species, the underlying genetic bases of economic traits and mechanisms to infection resistance are still not thoroughly understood [5], largely due to a lack in both the amount and the systematical studies of genomic data for the species. Trascriptome sequences and functional markers are highly valuable resources for us to understand molecular genetic mechanisms and to perform molecular aided selection of large yellow croaker.
With the advent and rapid development of Next-Generation-Sequencing (NGS) technologies, RNA-seq has become a reliable and powerful tool to probe gene expressions in living cells [6]. Given its ability to sequence all transcript fragments in one experiment, scientists have recently extended the application of the high-throughput method to organisms in agriculture, including aquaculture animals [7,8]. Nowadays, RNA-seq has been employed in a wide range of aquaculture studies, such as transcriptome survey [9,10], molecular marker development [11][12][13], genetic map construction [14,15] and gene differentially expression analysis [16]. As one of the most important economic marine fish in China, the transcriptome studies for large yellow croaker is still lacking. In 2010, Zhou et.al [17] performed Expressed Sequence Tags (ESTs) sequencing on gonad-specific and gonad-related genes for large yellow croaker and clustered 2,916 unique cDNAs from 3,535 ESTs. This would be an early large-scale investigation of transcripts in large yellow croaker. To obtain more genes, high-throughput RNA-seq was also carried out for the species in 2010 [5] and 2014 [18]. However, since those studies mainly focused on micro-organism infection of the species, the reported sequence data only from spleen tissue are hardly representative of a comprehensive transcriptome of large yellow croaker. Meanwhile, effective and stable markers, such as Simple Sequence Repeats (SSR) and Single-nucleotide polymorphism (SNP), play a crucial role in association studies and molecular aided selections [14,19]. Although SSR have been developed from the limited number of expression sequence tags (ESTs) [20] and small fraction of genome sequences [21] in large yellow croaker, high-throughput polymorphic SSR and SNP marker development and analysis on large yellow croaker transcriptome are still missing. Given the now-widespread application of NGS, there is an outstanding need for systematic detection and investigation of the functional SSR and SNP markers within a comprehensive transcriptome of the species.
Here, we present a study of mRNA sequences from a tissue-mixed sample extracted from multi large yellow croaker at different developmental stages. After assembly and annotation, the transcriptome sequences were compared with public protein sequences of zebrafish, medaka, pufferfish and stickleback to identify shared proteins in large yellow croaker. Transcript expression level and open reading frame (ORF) have also been analyzed to provide systematic information for each transcript. In addition, to provide abundant functional marker resources for the research community, we have detected polymorphic SSRs and SNPs and annotated SNP mutations in protein coding transcripts. This work provides valuable transcriptome sequences and functional polymorphic SSR and SNP markers for the following experimental validation and artificial breeding programs of large yellow croaker.

Ethics Statement
This study was approved by the Animal Care and Use committee of Fisheries College of Jimei University.

Tissue sampling and RNA isolation
Large yellow croaker samples were obtained from the breeding base of Jimei University in Ningde, Fujian, China. To cover expressed transcripts in various developmental stages, the sample was collected from embryos cells, larval, 11 juvenile and 2 adult (one male and one female) fish. Ocular, skin, muscle, gonadal, intestinal, liver, kidney, blood, gall and air bladder tissues from juvenile and adult large yellow croaker were mixed and stored in RNA later for following experiments. Total RNA was extracted from *50 mg composite sample with TRI-ZOL Reagent (Invitrogen, USA) and incubated for 1 h at 37°C with 10 units of DNaseI (Takara, China) to eliminate genomic DNA. The absorbance of 1.91 at 260 nm/280 nm and the RIN of 9.8 were obtained for the purified RNA sample by Nanodrop ND-1000 spectrophotometer (LabTech, USA) and 2100 Bioanalyzer (Agilent Technologies, USA), respectively.

Library preparation and sequencing
The extracted mRNA were fragmented using divalent cations after the purification process. The first-strand cDNA was synthesized using random hexamer primers and SuperscriptTM III (Invitrogen TM, Carlsbad, CA, USA), followed by the second-strand cDNA synthesis using buffer, dNTPs, RnaseH and DNA polymerase I. Short fragments were then purified with a Qia-Quick PCR extraction kit (Qiagen) and resolved with EB buffer for end reparation and poly(A) addition. After connecting to adapters, fragments with suitable size were used as templates for the following PCR amplification. The paired-end library was prepared following the manual of the Paired-End Sample Preparation Kit (Illumina). Finally, the library with a insert length of *150 bp was sequenced by Illumina HiSeqTM 2000 in 100PE mode (Illumina Inc., San Diego, CA, USA). The short reads were deposited in the NCBI Sequence Read Archive (SRA) under Accession number of SRR1509885.

Transcriptome de novo assembly and annotation
The quality of the sequenced reads was assessed by FastQC v1.10.1 [22]. The transcriptome was de novo assembled from short reads using a de Bruijn graph method by Trinity r20130814 [23]. Before transcriptome assembly, all reads were normalized in silico by the Perl script nor-malize_by_kmer_coverage, with the maximum coverage for reads set to 50. The assembled transcripts were then subjected to cd-hit 4.5.4 [24] with a sequence identity threshold of 0.9 to eliminate sequence redundancy. Sequence length statistics of the assembled transcriptome were performed by our own python scripts. The final transcriptome sequences were searched against National Center for Biotechnology Information (NCBI) non-redundant nucleotide sequence (nt) and non-redundant protein (nr) database by local blastn and blastx programs [25] with an E-value threshold of 1 × 10 −5 . After that, the transcripts were then further annotated in InterPro, Gene Ontology (GO), EC (Enzyme Code), and Kyoto Encyclopedia of Genes and Genomes (KEGG) database with Blast2GO [26].
Transcript expression profile analysis and ORF identification RSEM 1.2.9 [27] was employed to estimate the expression level of each transcript. The alignment of the sequenced reads to the assembled transcriptome was performed by bowtie 1.0.1 [28], followed by the core process of RSEM [27], estimating the count of RNA-seq reads to each transcript [29]. We used FPKM (Fragments Per Kilobase of transcript per Million fragments mapped) [6] to measure the normalized expression value. The length of transcripts with FPKM larger than 1 were analyzed and compared with those of the total transcripts.
TransDecoder version rel16JAN2014 (http://transdecoder.sourceforge.net/) was used to identify transcripts with potential protein coding regions. We extracted all transcripts that satisfied the following criteria: 1) having an open reading frame (ORF) larger than 300 bp, and 2) the putative peptide having at least one hit to Pfam domain database [30]. The length of transcripts with predicted ORF were analyzed and compared with those of the total transcripts. Transcripts with potential ORF were uploaded into WebMGA [31] (http://weizhong-lab.ucsd. edu/metagenomic-analysis) and protein classifications were searched against Eukaryotic Orthologous Groups (KOG) database.

Functional markers development and analysis
To develop polymorphic SSR markers, we performed variable number tandem repeat analysis on the transcriptome. First, all tandem repeats were scanned using Tandem Repeats Finder (TRF) [32]. A candidate repeat unit was scored as match (+2), mismatch (-7) and InDel (-7) for the base alignments with an average matching probability of 80%, InDel probability of 10%, alignment score 50. To eliminate non-genomic polyA tails, a number of stretches of (A/T) n at transcript end were excluded in this work. Repeat numbers for each tandem repeat were rounded to the nearest whole number for distribution analysis. LobSTR was used for polymorphic SSR detection because of its good performance in previous reports [33]. In this work, only polymorphic SSR with a quality higher than 30 and covered by more than 10 reads were obtained to ensure the quality of the polymorphic SSR calling.
The assembled transcripts were scanned for SNP by BWA 0.7.6a [34], samtools 0.1.19 [35] and GATK 2.8-1 [36]. The SNP calling was proceeded by a short reads alignment carried out with BWA-MEM algorithm, followed by coordinate sort and duplicate marking by SortSam and MarkDuplicates programs in Picard tools 1.107 (picard.sourceforge.net), respectively. To improve the quality of SNP calling, we applied four steps to reduce the false SNP: 1) a local realignment process was performed after short read mapping; 2) base Quality Score Recalibration (BQSR) [36] was employed to adjust the accuracy of the Phred quality scores; 3) alignments with paired information and with a mapping quality higher than 30 were used for SNP calling with the UnifiedGenotyper function in GATK; 4) SNPs and InDels with a depth higher than 8 and a quality score higher than 60 were obtained for the next functional analysis.

Functional enrichment for high SNP-rate transcripts
GO functional enrichment analysis of the top highest SNP-rate (SNP number/transcript length) transcripts were performed by two-tailed Fish's exact test [37] with Benjamini & Hochberg false discovery rate (FDR) [38] in Blast2GO [26]. The top 800 SNP-rate transcripts were compared with the whole transcriptome to highlights over-and under-represented GO-terms. Statistical significance was qualified by the FDR of 0.05 in the study.

Results and Discussions Transcriptome assembly and assessment
With the Illumina HiSeq2000 platform, we generated 32 million paired reads with a total of over 6.4 billion bases information. The cleaned paired-end reads, after trimming adapter and barcode, were used for the following analysis. Phred quality scores for each base in cleaned reads were assessed by FastQC [22] and more than 95%(90%) of the overall bases were found to have Phred quality larger than 20 (30). The high sequencing quality ensured reliable transcript fragments for the transcriptome assembly. To reduce memory and processor requirements during transcriptome assembly, sequence normalization was performed to reduce redundancy in the cleaned reads. The cleaned and normalized reads were then subjected to de novo transcriptome assembly using Trinity package [23], resulting into 107,406 transcripts. An additional step with cd-hit was followed to remove redundancy in the assembled contigs, reducing the transcriptome to 88,103 transcripts.
The lengths of the assembled sequences in the transcriptome range from 201 to 28,304 bp, with a total and a N50 length as 86.9 Mb and 1,934 bp (Table 1), respectively. All of top 10 longest transcripts (range from 15.2 kb to 28.3 kb) were aligned to known genes from close fish species in the NCBI nt/nr database, such as titin, plectin and MYC binding proteins, which are traditionally recognized as huge proteins in organisms. The validation of those long transcripts demonstrates that the huge contigs in the transcriptome are not from the chimerical assembly of repeated fragments in genes.

Functional annotation
To functionally annotate the transcriptome, we aligned the transcripts against NCBI nt/nr database to search for homologous sequences with BLAST applications. 45,733 (51.9%) and 42,360 (48.1%) transcripts showed significant similarity (E value 1 × 10 −5 ) to nucleotide and protein sequences in NCBI nt/nr database, respectively. As shown in Fig 1, the top hits for a large fraction of transcripts (*26.9%) were from Nile tilapia (Oreochromis niloticus), which can partially be explained by abundant sequence resources of that species and its close evolutionary relationship to large yellow croaker.
In order to further annotate transcripts from gene products, the InterPro, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases were used to identify functions of transcripts. We searched the transcriptome in 16 InterPro member databases, including HMMTigr, HMMPfam, FprinfScan, SuperFamily and SignalPHMM. 25,566 sequences have been annotated with InterPro accession numbers, accounting for *29% of total transcripts (Table 2). With parameters introduced in the Method section, we annotated transcripts to three major GO categories: Cell Component (CC), Molecular Function (MF), and Biological Compare with other fish protein database To assess the assembled genes in our transcriptome with respect to close species, we aligned the transcriptome to the public zebrafish (Danio rerio) protein database v9.75 by blastx with the E-value threshold of 1 × 10 −3 . 38,943 (44.2%) transcripts searched hits. Detailed analysis found that many transcripts shared the same homologous targets, implying alternative splicing variants or assembly fragments from one gene. Omitting transcripts to the same target protein, we identified 34,330 protein-coding genes in large yellow croaker; the most abundant zebrafish protein found in large yellow croaker was titin protein domains (ENSDARP00000099532). By the same method, all sequences were also compared with the medaka (Oryzias latipes) 1.75, pufferfish (Takifugu rubripes) 4.75 and stickleback (Gasterosteus aculeatus) 1.75, respectively. 34,576 transcripts fall into the overlap of four comparisons, showing those genes that are shared those species (S1 Fig). Note that there are still many transcripts do not hit any target protein in the database, implying either non-coding RNA or species-specific genes. Therefore, the evaluation of transcript expression and ORF in sequences is crucial to further identify transcript functions.

Transcripts expression analysis
In transcriptome, gene expression level is considered to be a valuable metrics for assessing the quality of transcript assembly, because transcripts with low expression level are more likely to be from immature primary transcripts or artificial transfrags [40]. Therefore, we estimated overall transcript expression level for each transcript to facilitate functional gene identification in large yellow croaker. To this end, the sequenced reads were mapped back to the transcriptome by Bowtie [28], and the count based method with RSEM [27] was applied to evaluate transcript expression. we found that FPMK of transcripts concentrated around 0.5*2, as shown in S2 Fig. Expressions (FPKM) of top 50 most abundant transcripts were all higher than 2,500, and up to 25,000 for the highest one. Detailed functional analysis found that most of genes (> 40%) were 40S/60S ribosomal proteins, which was consistent with previous studies [41,42]. Besides, many prevalent and crucial proteins were also discovered in the analysis, for examples apolipoprotein, chemotaxin, cytochrome c oxidase, pepsinogen, trypsinogen, lectin, fibrinogen.

Protein coding region identification
Researchers need a completed transcriptome to facilitate further qualitative and quantitative genetic studies. One method to assess the completeness of transcripts is to identify likely protein coding regions based on the already-existing protein sequence knowledge [43]. In this work, ORF sequences of each transcript were identified by searching Pfam database for homologous protein sequences with the TransDecoder package [30]. For single sequence with several possible ORFs, we only retained the longest one. By this method, we identified 34,025 reliable transcripts with significant ORF, with an N50 length of 2,633 bp (Table 1). Notably, we found the mean expression level for those transcripts to be as high as 20, which was significantly higher than the mean expression level (9) for the whole transcriptome. Those transcripts with potential ORFs were further annotated in WebMGA [31] (http://weizhong-lab.ucsd.edu/ metagenomic-analysis) to predict Eukaryotic Orthologous Groups (KOG). More than 23,753 sequences were assigned to 26 KOG categories as depicted in Fig 3. As in the largest cluster, signal transduction mechanisms include 7518 transcripts, which is followed by general function prediction only (3,786) and transcription (2,874). We note that some of the identified bioprocesses are directly related with the nutrition, meat quality and disease resistance of large yellow croaker, such as coenzyme transport and metabolism (169), lipid transport and metabolism(759) and amino acid transport and metabolism (622), secondary metabolites biosynthesis, transport and catabolism (314) and defense mechanisms (216).
To evaluate the representativeness and completeness of the transcripts with ORF comparing to the whole transcriptome, we used the zebrafish protein database as the benchmark for homologous search. As a result, 33,674 protein sequences from the zebrafish protein database were hit. A related question is to what extent the completeness of transcripts is comparable to their homologous counterparts in zebrafish. To answer this question, the ortholog hit ratio [44] was calculated by dividing the aligned length of a transcript by the total length of the best zebrafish protein hit. By the definition, a transcript with a ortholog hit ratio of 1 implies a full length gene. As shown in Fig 4(A), the ortholog hit ratios of transcripts concentrate around 1.0 and is slightly decreased for long genes, implying good full length assembles for both short and long genes. Of 33,674 transcripts, we found that 10,564 and 17,990 sequences had ratios higher than 0.9 and 0.5, respectively. Notably, we identified 2,830 genes with the ratio larger than 1.0 (Fig 4(B)), suggesting possible elongations in those genes.

Functional polymorphic SSR detection and analysis
Because of their stability and prevalent distribution along genomes and transcriptomes, SSRs are still widely used in agriculture and population studies [45][46][47]. SSRs in genes are particularly valuable due to their direct function associations. To develop functional markers, all tandem repeats in the transcriptome were identified using Tandem Repeats Finder (TRF) [32], resulting in the detection of 599,920 tandem repeats in 79,237 transcripts (89.9% of total). Of the total tandem repeats, 494,611 ones were found with two end flanking regions larger than 50 bp for designing PCR primers. CA/TG and CT/AG are the top two most abundant SSR units in the transcriptome of large yellow croaker, followed by two trinucleotide repeats of CTC/GAG and CTG/CAG (S3 Fig). Interestingly, the distribution of dinucleotide repeat units was identical to that in catfish EST analysis [48], however, large yellow croaker and catfish exhibted different trinucleotide unit distributions. This SSR type difference may imply possible marker diversification among species.
To further investigate SSR types in large yellow croaker, we analyzed SSR unit length and repeat number distribution. The overall SSR markers were comprised of 112,349 (19.0%) mononucleotide motifs, 77,056 (13.0%) dinucleotide motifs, 139,707 (23.6%) trinucleotide motifs, 65,392 (11.1%) tetranucleotide motifs, 53,463 (9.0%) pentanucleotide motifs and 65,973 (11.2%) hexanucleotide motifs (S4 Fig). *86% (513,940/599,920) SSRs have a unit length no more than 6 bp. The most frequent SSR type in large yellow croaker is trinucleotides (23.6%); more interestingly, 38.4% of overall SSRs are the multiples of trinucleotides and those SSRs are more prominent than their neighbors (S4 Fig). The SSR type distribution is consistent with previous studies that unit repeat number variations of those SSRs are less likely to cause frameshift errors in coding regions [49]. For repeat number, as shown in Fig 5, a huge number of SSRs (78%) have a repeat number lower than 5 and mononucleotide SSRs dominate in high repeat ones (5).
It is essential to detect SSR polymorphism in silico and to investigate variants distribution within genes, especially for those transcripts associating with important economic traits [45]. Several methods have been proposed for SSR variant detection from EST sequences; however, it is still challenging to detect SSR polymorphism from short sequence reads [50]. Recently,  Functional Marker Analysis in Large Yellow Croaker Transcriptome much effort has been devoted to analyzing repeat variants in genomes of model organisms [49], but studies of non-model organism on transcriptomes have rarely been reported. In this work, lobSTR [33] was employed to call SSR polymorphism in large yellow croaker transcriptome for the first time. With lobSTR pipelines, we identified 1,276 polymorphic SSRs from 872 transcripts. We found that unit lengths for all polymorphic SSRs were less than or equal to 6 bp. Colm et. al have reported that around 1 in 20 human proteins are likely to contain SSR polymorphism within coding regions [49]; the frequency is comparable with our case in large yellow croaker. Notably, the majority (57.8%) of variable SSRs are unit mutations, shown as green dots in Fig 6. Repeat number variations by 1 and 2 units are remarkably predominant among all SSR variants, suggesting polymorphism of SSR in transcriptome is likely to be limited to short condense and elongation of proteins.

Functional SNP/InDel discovery and analysis
Because of their abundance, SNP/InDel offers a useful alternative to SSR in high density marker studies, such as in a fine genetic mapping and quantitative trait locus (QTL) identification [8,14,51,52]. With the rapid cost reduction for NGS, sequence resources are increasing at unprecedented rate for more and more species, enabling much easier and cheaper to develop and to analyze genome-and transcriptome-wide SNP/InDel [11]. Since the sequencing in this work is based on a composite sample of multi individuals, the data provide a valuable resource to investigate the small variant distribution and frequency within the large yellow croaker transcriptome. To this end, we aligned the short sequencing reads to the assembled transcriptome, resulting in alignment for SNP/InDel detection in GATK pipeline [36]. More than 237,000 high-quality SNPs and 24,000 InDels were identified from 30,378 (34%) transcripts, resulting the density of polymorphic sites per 1,000 bp (PS/kb) as 4.28. Of the total detected SNP loci, 666 polymorphic sites multiple alleles because of the composite sample.
In transcripts with ORF, we identified 181,000 (69% of total) SNPs and 17,000 (62% of total) InDels over 18,377 transcripts. The results show that 54% of transcripts with ORF contains SNP/InDel markers, which is significantly higher than that of the total transcriptome (34%). The putative SNPs include 124,079 transitions and 59,636 transversions. The Ts/Tv (transitions / transversions) ratio is estimated as 2.08, which is consistent with previous reports of other aquaculture animals [53]. The detail analysis showed that the transition of G/A and C/ T accounted for 33.6% and 33.8%, respectively, and each transversion in G/C, G/T, A/C and A/ T accounted for 7.9%*8.4% of all SNP types. The percentage of base substitutions is also comparable with previous studies [54], implying the reliability of SNP detection and impact analysis in the present work. We note that missense/silent mutation ratio of 0.45 is lower than that reported as *1 in Nile tilapia [7]. In protein coding sequences, missense mutations likely cause more deleterious effects to silent ones, therefore the lower ratio implies that transcripts with ORF are under rigorous natural selection. Whether the distinct ratio stems from the species difference or the incomplete genome annotation of large yellow croaker needs further investigations.
To analyze how SNPs influence gene product, SnpEff 3.6 [55] was employed, as well as the sequence annotation of transcripts, to assess amino acid changes in untranslated region (URT) and ORF. The majority (*75%) of polymorphic loci are synonymous mutations, followed by changes in 3' UTRs. SNP and InDel mutations leading to codon insertions, frame-shifts, coding deletions are all below 1,000, implying those mutation types are rather rare comparing to synonymous ones. More interestingly, there are more than 1,000 SNP that potentially lead to a novel initiation codons in the 5' UTR (start-gained mutations) and the frequency of startgained SNP is higher than that of stop gained, stop loss and start loss. This suggests that there Functional Marker Analysis in Large Yellow Croaker Transcriptome might be weaker natural selections against start-gained mutations. The recent studies have revealed that many synonymous mutations in genes may also change protein expression levels and structures and thus also be subject to natural selections [56]. However, many scientists believe that more synonymous mutations are nearly neutrally or weakly selected [57]. One piece of evidences is that more synonymous mutations in exon regions are reported than non-synonymous ones, which is also observed in Fig 7. In addition, from Fig 7 we observed that SNP sites and SNP-taking gene numbers are divergent in synonymous, 3" UTR, non-synonymous and 5" UTR mutations, suggesting genes may simultaneously have several SNPs and InDels in those categories. We analyzed the correlation between gene length and SNP site number and observed a weak correlation coefficient of 0.36 between transcript lengths and SNP sites (S5 Fig), suggesting SNP density within genes were divergent. To investigate functions for SNP-prone transcripts, we performed a GO-enrichment study for the top 800 highest SNP-rate transcripts. We obtained 79 reduced significant GO-terms, with 47 over-represented and 32 under-represented terms (Fig 8). Including the most significant (FDR of 2.2 × 10 −7 ) GO-term of motor activity (GO:0003774), we found massively over-represented items associated with the muscular tissue and the cytoskeleton, such as myosin filament (GO:0032982), skeletal muscle thin filament assembly (GO:0030240), slow-twitch skeletal muscle fiber contraction (GO:0031444), structural constituent of cytoskeleton (GO:0005200), cardiac myofilbril assembly (GO:0055003), myofibril (GO:0030016), tublin complex (GO:0045298) and actin filament binding (GO:0051015), actomyosin (GO:0042643) and striate muscle myosin thick filament assembly (GO:0071688). A two-fold explanation is proposed to understand the enrichment result. First, muscular tissues have been reported to contain elastomeric proteins with disordered regions, such as PEVK domains in titin and twitchin [58]. Sequence polymorphism in a disordered region is unlikely to destabilize the whole structure and stability, and thus under weak natural selection. Second, many immunoglobulin domain families in muscle exhibiting similar structures [59] might also show sequence polymorphism. Meanwhile, we also observed several over-represented terms linked with immunology mechanisms, including chemokine activity (GO:0008009), MHC class I protein complex and binding (GO:0042612, GO:0042288), positive regulation of T cell mediated cytotoxicity (GO:0001916), cellular response to interleukin-4 (GO:0071353), peptide antigen binding (GO:0042605) and antigen processing and presentation (GO:0019882). However, GO-terms enriched on sequence-specific DNA bindings, enzyme activities, cell communications and signal pathways were significantly under-represented. Less sequence polymorphisms on those GO-terms may associate with the elaborate regulation and important function of the proteins in those processes of living cells.

Conclusion
Large yellow croaker is an important economic species in China and Eastern Asian. Based on next-generation sequencing with the Illumina Hiseq2000 platform, we have sequenced mRNA fragments from a composite sample of multiple large yellow croaker individuals from distinct development stages and assembled the transcriptome into 88,103 transcripts. Combining a series of packages and databases, including nt/nr, InterPro, GO and KEGG, 52,782 transcripts have been annotated. 35,321 transcripts without any homologous hits suggest possible noncoding RNA, new genes or species-specific sequences. The gene expression calculation has shown that transcripts exhibit wide expression level distribution. 34,025 transcripts have been identified with potential ORF longer than 300 bp, which have been further annotated with 26 KOG categories.
The polymorphic functional markers have been detected and analyzed on large yellow croaker transcriptome. We have scanned 1,276 polymorphic SSRs, 237,000 SNPs and 24,000 InDels and found a number of polymorphic SSRs and SNPs may be related with essential metabolic pathways and economic traits of large yellow croaker. we have evaluated functional impacts of SNPs on protein sequence and shown that about *75% of SNPs lead to synonymous mutations. The functions of high SNP-rate transcripts are enriched in GO-terms related with the muscle tissue, the cytoskeleton and the immunological system, but are under-represented in sequence-specific DNA binding, enzyme activity and cell communications. The detection and functional analysis of polymorphic SSR and SNP in the transcriptome not only sheds light on functional impacts of markers on economic related genes, but also provide the community valuable resources for the further functional marker studies and molecular aided selection of large yellow croaker.