Sequencing, De Novo Assembly, and Annotation of the Transcriptome of the Endangered Freshwater Pearl Bivalve, Cristaria plicata, Provides Novel Insights into Functional Genes and Marker Discovery

Background The freshwater mussel Cristaria plicata (Bivalvia: Eulamellibranchia: Unionidae), is an economically important species in molluscan aquaculture due to its use in pearl farming. The species have been listed as endangered in South Korea due to the loss of natural habitats caused by anthropogenic activities. The decreasing population and a lack of genomic information on the species is concerning for environmentalists and conservationists. In this study, we conducted a de novo transcriptome sequencing and annotation analysis of C. plicata using Illumina HiSeq 2500 next-generation sequencing (NGS) technology, the Trinity assembler, and bioinformatics databases to prepare a sustainable resource for the identification of candidate genes involved in immunity, defense, and reproduction. Results The C. plicata transcriptome analysis included a total of 286,152,584 raw reads and 281,322,837 clean reads. The de novo assembly identified a total of 453,931 contigs and 374,794 non-redundant unigenes with average lengths of 731.2 and 737.1 bp, respectively. Furthermore, 100% coverage of C. plicata mitochondrial genes within two unigenes supported the quality of the assembler. In total, 84,274 unigenes showed homology to entries in at least one database, and 23,246 unigenes were allocated to one or more Gene Ontology (GO) terms. The most prominent GO biological process, cellular component, and molecular function categories (level 2) were cellular process, membrane, and binding, respectively. A total of 4,776 unigenes were mapped to 123 biological pathways in the KEGG database. Based on the GO terms and KEGG annotation, the unigenes were suggested to be involved in immunity, stress responses, sex-determination, and reproduction. A total of 17,251 cDNA simple sequence repeats (cSSRs) were identified from 61,141 unigenes (size of >1 kb) with the most abundant being dinucleotide repeats. Conclusions This dataset represents the first transcriptome analysis of the endangered mollusc, C. plicata. The transcriptome provides a comprehensive sequence resource for the conservation of genetic information in this species and enrichment of the genetic database. The development of molecular markers will assist in the genetic improvement of C. plicata.


Introduction
Cristaria plicata (Leach, 1815), a well-known "freshwater pearl bivalve", belongs to the order Unionoida and family Unionidae under the phylum Mollusca. The species has restricted geographic distribution in Russia, Japan, Vietnam, Laos Republic, Thailand, Cambodia, and a wider presence in China, where it is used for freshwater pearl farming, medicinal purposes, and as a model for aquaculture industries [1][2][3]. In South Korea, C. plicata is found in the middle and lower sections of the Nakdong River, in Asan Lake in Chungcheongnam-do, and in Gosean Lake in Chungcheogbuk-do. The species has been classified as vulnerable owing to loss of natural habitats caused by river development, reduced host fish populations, and indiscriminate collection. Due to a rapid decrease in its population in recent years, C. plicata has been listed in the Korean Red List of Threatened Species under the endangered wildlife category by the Ministry of Environment and is protected by law. Under the International Union for Conservation of Nature and Natural Resources (IUCN) Red List of Threatened species, C. plicata has been assessed as data deficient with indications of localized decreases in the population [4].
Due to limited sample resources and genomic information, an exhaustive survey of novel candidate genes involved in local adaptation, the immune system, and reproduction for C. plicata is absent. The complete mitochondrial genome sequence and functional analysis of a few oxidative stress and immunity-related genes are the only available reports of C. plicata genetic information [5][6][7][8][9]. Although the available information increases our understanding of the phylogeny and molecular basis of the innate immune response in C. plicata, it is insufficient to address the sustainability and conservation of the species. To identify strategies for the local adaptation of the species, knowledge of the genes and pathways involved in the immune system and reproduction are required. C. plicata molecular markers, which are required for markerassisted selection programs in aquaculture, remain poorly explored. The discovery of molecular markers generally acts as a catalyst for the study of genetic diversity and population structure. The identification of novel genomic resources using a rapid and cost-efficient approach for the conservation of C. plicata in its natural habitat is important.

Construction of the mRNA-seq library and Illumina Sequencing
An mRNA-seq library was constructed using the mRNA-seq sample preparation kit (Illumina, San Diego, CA) following the manufacturer's instructions. Briefly, poly (A) + mRNA was purified from total RNA samples with oligo(dT) magnetic beads and fragmented using an RNA fragmentation kit (Ambion, Austin, TX) prior to cDNA synthesis. The short mRNA fragments were reverse-transcribed into first-strand cDNA using reverse-transcriptase (Invitrogen, Carlsbad, CA) and random hexamer-primers. Second-strand cDNA synthesis was accomplished using DNA polymerase I (New England BioLabs, Ipswich, MA) and RNase H (Invitrogen). The double-stranded cDNA was end-repaired using T4 DNA polymerase (New England Bio-Labs), the Klenow fragment (New England BioLabs), and T4 polynucleotide kinase (New England BioLabs). The end-repaired cDNA fragments were ligated with PE Adapter Oligo Mix using T4 DNA ligase (New England BioLabs) at room temperature for 15 min. The ligated products were purified and separated by size on an agarose gel. DNA fragments of the desired size (200 ± 25 bp) were excised and, after validation, were sequenced on the Illumina HiSeq 2500 sequencing platform.

De novo assembly
Before de novo transcriptome assembly, the raw reads were cleaned by removing adaptor-only reads (recognized adaptor length 13 nucleotides and remaining adaptor-excluded length 35 nucleotides), repeated reads, and low-quality reads (Phred quality score < 20) using the Sickle software tool (http://github.com/najoshi/sickle) [31] and Fastq_filter software (part of the Galaxy toolshed) [32]. The remaining high-quality reads were assembled using the short read assembling program Trinity with 100 GB of memory and a path reinforcement distance of 50 [33]. The Trinity program (the default options and a minimum allowed length of 200 bp) first assembles reads of a certain length that overlap to form longer fragments without gaps called contigs. The total number of contigs and the mean length, N 50 length, and GC% were recorded for the de novo assembly. These contigs were connected using the sequence clustering software TGICL [34] to obtain sequences that could no longer be extended on either end. Such sequences were defined as unigenes. They represent expressed assembled sequences, but are not characterized sufficiently to be represented as a gene.

Transcriptome annotation and discovery
For the annotation profile of C. plicata unigenes, we first constructed a unique reference dataset that combined protein sequence data of Arthropoda, Nematoda, and Mollusca downloaded from the Taxonomy browser of the NCBI nr database. The sequences were converted to multi-FASTA format and stored in the PANM reference database (PANM DB) [35] using the formatdb program (downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.26/ ). PANM DB is freely downloadable from the amino acid database BLAST web-interface of the Malacological Society of Korea (http://malacol.or.kr/blast/aminoacid.html). The assembled C. plicata unigene sequences were searched against the PANM DB reference database using the BLASTx algorithm [36] with an E-value threshold of 1.0E -5 to identify putative functional mRNA transcripts. Subsequently, the BLASTx hits of the assembled sequences against the Uni-Gene DB were also recorded. The BLAST2GO software suite [37] was used to predict Gene Ontology (GO) terms [38], assign the assembled sequences to the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways [39], and to identify protein domains against the Inter-Pro databases using the InterProScan tool [40]. Annotations using BLAST2GO were conducted with 1.0E -6 as the E-value hit filter, 55 as the annotation cut-off and 5 as the GO weight. No HSP-hit coverage cut-off was considered. The GO terms were classified into three categories, hence, we generated separate graphs (pie chart at level 2) for biological process, cellular component, and molecular functions. The unigenes were also predicted by a query of the NCBI Clusters of Orthologous Groups (COG) DB (BLASTx, E-value cutoff of 1E -5) [41].

Identification of candidate genes related to immune responses and reproduction
Identification of candidate C. plicata genes involved in immune responses, sex-determination, and reproduction was performed using a keyword search of our BLASTx annotation results in the PANM DB. A set of keywords, composed of a series of representative innate immunity and oxidative stress genes, was used to predict immune response genes based on annotation results. Similarly, representative sex-determination and sex-differentiation genes were used to search for reproduction-related unigenes from the annotation results. In addition, the GO terms and KEGG classification information were required to identify important candidate genes. GO terms such as "immune system process", "response to stimulus", "signaling" under the biological process domain and "anti-oxidant activity" under the molecular function domain were used to scan for the immune response genes. Similarly, the GO terms "reproduction" and "reproductive process" were used to select candidate genes involved in reproduction of the mollusc. In addition, the KEGG category "immune system" was also used to identify candidate immune response genes.

Microsatellite marker discovery
The assembled C. plicata unigenes were searched for simple sequence repeat (SSR) motifs using the MIcroSAtellite (MISA) software [42]. For SSR identification only >1 kb sequences of unigenes were considered. All types of SSRs, from dinucleotide to hexanucleotide repeats were examined. The searches were run with the minimum repeat number of six for dinucleotide repeats and five for all other repeat motifs.

Results and Discussion
Illumina Hi-Seq 2500 sequencing and assembly evaluation Transcriptome information for the endangered freshwater mollusc, C. plicata, was characterized by constructing a cDNA library prepared from purified mRNA isolated from the visceral mass tissue. The Illumina Hi-Seq 2500 sequencing platform generated a total of 286,152,584 raw reads with 36,055,225,584 bases. The raw reads were filtered to remove adaptor sequences, low quality reads (reads with more than 50% of bases having a Q-value 20), and ambiguous bases. A total of 281,322,837 clean reads (Phred quality Q20) with an average length of 124.1 bases was obtained. The transcriptome sequencing, assembly, and annotation scheme used for C. plicata are depicted in Fig 1. The clean reads obtained using the Illumina Hi-Seq 2500 transcriptome sequencing platform constituted 98.31% of the raw reads. The C. plicata transcriptome sequencing and assembly statistics are shown in Table 1. The Illumina sequence data for C. plicata were deposited in the NCBI Sequence Read Archive (SRA) under accession number SRP062467. The raw, untrimmed Illumina HiSeq2500 data along with the transcriptome assembly have been included as NCBI BioProject PRJNA293023.
High-quality reads generated from the transcriptome sequencing of the C. plicata visceral mass tissue were subsequently assembled using the Trinity program because assembled and annotated genomic sequence information for the Cristaria species were not available. The Trinity de novo assembler is used for the assembly of trimmed reads with an optimal K-mer length of 25. Trinity is the first program designed specifically for de novo transcriptome assembly and utilizes a novel method for the reconstruction of transcriptomes from RNA-seq data using three sequential software modules; namely, Inchworm, Chrysalis, and Butterfly [43]. Other short-read transcriptome assemblers such as Oases [44], Trans-ABySS [45], SOAPdenovo-Trans [46], and Rnnotator [47] are available, which are essentially modifications from the genome assembly. The Trinity assembler generated a total of 453,931 contigs with 331,930,879 bases (N 50 length, 1,254 bp; mean length, 731.2 bp) and a GC% of 36.62%. The contigs were assembled into a total of 374,794 unigenes with 276,264,683 bases. The N 50 length and the mean length of unigenes produced were 1,262 and 737.1 bp, respectively, with a GC% of 36.47. The lengths of the smallest to largest unigenes in the C. plicata transcriptome ranged from 212-68,788 bp. Among these unigenes, 125,484 unigenes (33.48%) were no more than Schematic representation of transcriptome assembly and annotation. A C. plicata visceral mass transcriptome was obtained using an Illumina HiSeq2500 NGS platform. The raw reads obtained were preprocessed using the Sickle software tool (quality: 20, length: 40) and Fastq_filter software to obtain clean reads. Trinity assembly (K-mer, 25; minimum contig length; 200) and TGICL clustering (Identity; 94%; overlap; 30 bp) generated 374,794 unigenes. The unigenes were used for functional annotation using the PANM, Unigene, COG, GO, and KEGG databases and structural annotation for SSR detection. 300 bp, 188,221 unigenes (50.22%) were 301-1,000 bp, 32,700 (8.72%) unigenes were 1,001-2,000 bp, and 28,389 unigenes (7.57%) were greater than 2,000 bp (Fig 2).
Complete coverage of the C. plicata mitochondrial transcriptome was demonstrated with the Trinity assembler (S1 Table). The 100% sequence coverage of the 13 protein-coding genes of the C. plicata mitochondrial genome using only two assembled unigenes (unigene Cp_000887 and unigene Cp_009974) demonstrated the integrity and completeness of Trinity de novo assembler. Several assemblers have been tested to map mitochondrial protein-coding genes in assembled contig sequences with a lesser degree of coverage [25,27]. The coverage of mitochondrial DNA genes is a direct measure of the quality of assembled sequences. An earlier study reported that the Oases analysis pipeline with a K-mer size of 23 was the best program for assembling the de novo transcriptome of Crassostrea virginica compared to the SOAPdenovo-Trans (K-mer sizes of 41 and 51) and Trinity (K-mer size of 25) programs based on the N 50 length of contigs, the number of contigs longer than 500 bp, and the alignment coverage [25]. The Trinity program used for the transcriptome assembly of C. plicata RNA-seq reads with average contig and unigene lengths of 731.2 bp and 737.1 bp, respectively, was found better than or similar to most other Illumina sequenced assemblies: 260 and 434bp average contig lengths in H. midae [26] and R. balthica [27], respectively, and 706 and 580bp average unigene lengths in the endangered species Chinese sturgeon, (Acipenser sinensis) [30] and Chinese salamander (Hynobius chinensis) [29], respectively. A comprehensive summary of molluscan transcriptomes in the last three years using NGS platforms (Table 2) shows a preference for Illumina technology combined with the Trinity assembly process. Transcriptome data on endangered or endemic molluscs obtained using NGS platforms would increase our understanding of their genomic attributes and provide information for species conservation in their natural environment. Thus, the C. plicata transcriptome and annotation of valuable genes will be useful for functional genomics research and the development of molecular markers, and will serve as reference information for closely related species.

Sequence annotation of unigenes
The assembled unigenes in the C. plicata transcriptome were used to conduct a BLASTx search (E-value 1E-5) against the curated PANM DB, UniGene DB, and COG DB for validation and annotation of genes. Of 374,794 unigenes, 79,960 (21.33%), 40,196 (10.72%), and 13,934 (3.72%) unigenes were similar to sequences in the PANM, COG, and UniGene DBs, respectively ( Table 3). The majority of unigenes annotated to homologous sequences in the DBs had lengths 1000 bp. A total of 39,682 (10.59%) and 11,368 (3.03%) unigenes had common homologous matches in the PANM with COG DBs, and the PANM with UniGene DBs, respectively. A total of 9,820 (2.62%) unigenes were annotated simultaneously by all three DBs. In total, 84,274 (22.49%) annotations were found within the clustered unigenes of the C. plicata transcriptome. The non-annotated unigene sequences were less likely to produce BLAST hits in the protein databases possibly due to their shorter sequences and their lack of a representative protein domain.

Homology characteristics and functional annotation of unigenes
Characteristics of the homology search of assembled unigenes against the PANM DB are summarized in Fig 3. The score distribution, which represents the quality of the BLAST alignment, showed that 38,142 (47.70%) unigenes had scores between 50 and 100 and 27,457 unigenes had scores between 100 and 500 ( Fig 3A). Only 4,809 (6.01%) unigenes had a score < 50 reflecting the quality of the assembly and sequence annotation process. The E-value distribution revealed that 55,740 unigenes (69.71%) showed significant homology to deposited sequences (1E-50 to 1E-5, Fig 3B). The identity distribution showed that 33,781 (42.25%) and 16,884 (21.12%) unigenes showed identities of 40-60% and > 60%, respectively, to deposited sequences ( Fig 3C). In addition, the similarity distribution showed that 47,027 (58.81%) unigenes had similarities greater than 60% ( Fig 3D). The lengths of unigenes were directly related to the presence or absence of BLAST hits ( Fig 3E). This is understandable since the longer sequences are more likely to contain protein domain characteristics and are more likely to have BLAST matches in the protein database. Another basis for understanding unigene characteristics is the BLAST top-hit species distribution which shows putative homology of the annotated sequence across species in the PANM DB (Fig 4). Based on our analysis, the highest homology was observed with the oyster, Crassostrea gigas (32,609 unigenes, 40.78%), followed by the owl limpet, Lottia gigantea (12,065 unigenes, 15.09%). As expected, the majority of unigene hits belonged to molluscan and other arthropod proteins. A summary of the top-hit Inter-Pro domains identified 1,374 unigenes with zinc finger, C2H2-like domains. Zinc finger domains participate in important cell processing functions including signal transduction and transcriptional regulation and are a common feature in molluscs, insects, and other crustacean  Table 4. We subjected all unigenes to a search against the COG DB to make functional predictions. The unigenes were distributed among 25 functionally classified categories (excluding the multi category) (Fig 5). Among the 25 COG categories, the "general function prediction" cluster constituted the largest group (9,549; 23.75%), followed by "signal transduction mechanisms" (4,916; 12.23%), "post-translational modification, protein turnover, chaperones" (3,291; 8.19%), "function unknown" (2,557; 6.4%), "transcription" (1,610; 4%), "cytoskeleton" 1,344;   3.3%), and "RNA processing and modification" (1,002; 2.5%). A greater number of unigenes (6,107; 15.19%) were also allocated to the multi assignment category. Gene Ontology (GO)-based annotation is an internationally standardized gene functional classification system that describes gene products in terms of their associated biological processes, cellular components, and molecular functions. To make functional predictions for the C. plicata unigenes, we mapped the associated GO terms to 79,960 unigenes that had BLAST matches. The GO annotation was based on the BLASTx results against the nr database. Protein domains and motif information were retrieved using the InterProScan sequence search tool via BLAST2GO and the annotation was merged with already existing GO terms. After merging, the 23,246 unigenes (11,419 for biological process, 6,391 for cellular component, and 21,189 for molecular function) were assigned one or more GO terms based on sequence similarity. Furthermore, 10,016 (43.09% of the 23,246 unigenes) were annotated with both biological processes and cellular components, 5,058 (21.76%) were annotated with both cellular components and molecular function, 4,484 (19.29%), annotated with both biological process and cellular components, while 3,805 (16.37%) were assigned to all three categories (Fig 6A). A calculation of the number of unigenes associated with GO terms suggested that 7,892 and 7,781 sequences were associated with two or one GO term annotations, respectively ( Fig 6B). As expected, the evidence code distribution showed an over-representation of electronic annotations that have not been created manually and may contain higher false positives. The evidence code 'inferred from electronic annotation' (IEA) like others, such as 'inferred from sequence or structural similarity' (ISS), 'inferred from reviewed computational analysis' (RCA), and 'inferred from genomic context' (IGC) belong to computational source of evidence; those that constitute over 95% of the total GO annotation analysis [63]. Hence, with respect to our GO term annotation results, all of the GO terms are not of equal validity and, based on this, the interpretation of unigenes relates to only the predicted function. Among the 23,246 unigenes for which we obtained GO terms, we observed a wide diversity of functional categories represented on level 2 of the GO database. Fig 7 shows a total of 19 biological process, 10 cellular components, and 12 molecular function GO level-2 classes in which the unigenes were predicted to function. Within the biological processes category,   genes involved in cellular processes (GO:0009987) and metabolic processes (GO:0008152) were represented prominently, followed by single-organism processes (GO:0044699) and biological regulation (GO:0065007) (Fig 7A). In the cellular component category, membrane (GO:0016020), cell (GO:0005623), and organelle (GO:0043226) represented the majority of terms (Fig 7B), while in the molecular function category, the top-represented GO terms included binding (GO:0005488) and catalytic activity (GO:0003824) (Fig 7C). The GO classifications suggested for C. plicata showed similarities with those for sequenced P. yessoensis [21] and C. hongkongensis [48].
We also performed a search of all unigenes against the KEGG database to identify the active biological pathways in C. plicata Using BLASTx, we found 4,776 unigenes that shared homology with known enzymes in the KEGG database (Fig 8). The unigene sequences mapped to 123 KEGG pathways. Among them, 709 unigenes possessing an Enzyme Commission number were assigned to these pathways. The KEGG pathways were related to metabolism (4,315 unigenes), genetic information processing (53 unigenes), environmental information processing (80 unigenes), and organismal systems (328 unigenes). Predominantly, the unigenes were enriched in "nucleotide metabolism pathway" (1,147 unigenes of the 4,776 unigenes) followed by "metabolism of co-factors and vitamins" (951 unigenes), "xenobiotic biodegradation and metabolism" (554 unigenes), and "immune system pathways" (328 unigenes) categories. The C. plicata unigenes annotated to KEGG pathways are presented in S2 Table. Overall, both GO term and KEGG analyses identified, based on similarity to sequences already identified, transcribed region potentially involved in C. plicata stress responses as well as immunity and reproduction that could be crucial for determining adaptation in the natural environment and promoting conservation by means of genetic improvement programs. Genes and pathways related to the innate immune system A combined annotation profile of the various methods allowed for an analysis of the C. plicata unigene sequences involved in immunity and stress mechanisms. We identified unigene sequences associated functionally with innate immunity and oxidative stress mechanisms. A summary of unigene sequences list of immunity and defense mechanisms is presented in S3 Table. A majority of unigenes showed homology to lectins, toll-like receptors (TLRs), and cathepsin, followed by complement C1q, scavenger receptor, caspase, and heat shock proteins (HSP). Overall, the C. plicata unigene sequences covered the major pathways and provided an extensive coverage of the immune gene repertoire in the species. The innate immune system of molluscs consists of cellular and humoral components that operate in a coordinated manner to defend against a multitude of pathogens [64]. Pattern recognition proteins (PRPs) recognize microbial surfaces and initiate a signaling cascade which culminates in the regulatory release of antimicrobial effectors and forms the humoral defense response. The role of lectins as PRPs and in phagocytosis mechanisms has been characterized in molluscs [65,66] and is attributable mainly to the presence of carbohydrate-binding domains (CRDs) [67,68]. The C. plicata transcriptome data shows the presence of putative lectin sequences including tandem-repeat galectin, C-type lectin, sialic-acid binding lectin, fucolectin, and immulectin-3. Tandem-repeat galectins are known to act as an acute-phase protein implicated in the immune defense of R. philippinarum, pearl oyster, and Pinctada fucata against Vibrio species [69,70]. C-type lectins have numerous roles in bivalve organisms such as nonself recognition, microbe agglutination, induction of phagocytosis and encapsulation, and antibacterial properties [71]. Transcriptome analysis of the related species, C. virginica, identified 8 galectins and 140 C-type lectin domain proteins [25]. A cross-species comparison analysis of C-type lectin domain proteins and CRD proteins showed an overrepresentation of such innate immune factors [25]. A comprehensive repertoire of carbohydrate-binding molecules has been analyzed in the common periwinkle, Littorina littorea, with unigene sequences corresponding to C-type lectins, fucolectins, galectins, chitinase-like lectins, and I-type lectins [72]. Fucolectins formed the largest group of CRD proteins in the C. plicata transcriptome. Fucolectins have been characterized in the mussel Mytilus galloprovincialis [73] and the sea cucumber Apostichopus japonicus [74], and have further characterized in detail [75]. Peptidoglycan recognition proteins (PGRPs), apolipophorin, and CD63, which are known for their recognition-mediated immune functions, were also identified among the C. plicata unigene sequences and may confer a selective advantage to the species under threatening conditions. Among the PGRPs, we identified both the short and long forms that bind to and hydrolyze bacterial peptidoglycans and activate the Toll or IMD signal transduction pathways in invertebrates [76,77]. PGRP homologs have been identified in squid (Euprymna scolopes) cDNA library [78] and the assembled transcriptomes of C. virginica [25] and Octopus vulgaris [50]. Apolipophorin III has been implicated in pattern recognition of beta-1, 3 glucans and responds to intracellular pathogens in insects [79,80], but requires a more extensive understanding in molluscs.
TLRs are membrane-bound pathogen recognition receptors implicated in intracellular signaling and they regulate the production of the effector antimicrobial peptides [81]. TLRs contain leucine-rich repeat motifs, a transmembrane region, and a cytoplasmic Toll/interleukin-1 receptor (TIR) domain which interacts with myeloid differentiation factor 88 (MyD88) or other adaptors, leading to activation of intracellular Toll signaling cascades. We identified TLR-2, -3, -4, -6, -7, -13, and other TLR precursors in the C. plicata transcriptome. In M. edulis, 27 TLRs have been described, indicative of diversity and an advanced immune system [19]. With a comprehensive array of TLRs identified from the transcriptome, it would be interesting to explore the innate immunity functions using a targeted gene approach. In addition, genes encoding intracellular Toll pathway proteins including MyD88, IRAK1 protein, relish, Janus kinase (JNK), and p38 have been identified from the rich transcriptome datasets, suggesting that the TLR pathway is conserved in C. plicata. Our results are consistent with TLR pathway genes identified in M. galloprovincialis (Illumina reads) [82], M. edulis (454 contigs and new assemblies) [83], and C. gigas (Illumina reads) [84]. Based on the TLR pathway in the related species C. virginica, the association of the TIR domain of TLR and MyD88 releases signals through IRAK and TRAF6 to induce the p38 signaling pathway (mediated by MEKK1, MKKs, or JNK) or the NF-κB-like relish protein. The identification of these intracellular components will increase our understanding of the TLR mediated immune system in C. plicata. MyD88 is an adaptor in the TLR/IL-1R signaling pathway and is highly expressed in bivalves in response to both Gram-positive and Gram-negative challenge. Recently, five genes encoding MyD88, which acts as an acute phase protein after infection with microbes, were characterized from P. yessoensis [85]. In an oyster model, TLR-based intracellular signaling is well represented, with mediators including MyD88, TNF (tissue necrosis factor) receptor associated factor 6 (TRAF6), and nuclear factor-κB (NF-κB) factors involved in the regulation of antimicrobial function [86]. Many gene families encoding proteins involved in immune responses to biotic and abiotic challenges have been identified from the C. gigas genome project including TLRs, MyD88, C-type lectins, fibrinogen-related proteins (FREP), superoxide dismutase (SOD), and globular head C1q domain containing protein (C1q) [87]. Transcriptome analysis of C. virginica [25] also revealed a rich set of genes related to the TLR pathway including MyD88, SARM, IRAK, TRAF6, MKKs, JNK, p38, AP-1, and NF-κB. We identified immune signaling candidates and immune effectors such as big defensins and defensins in the C. plicata transcriptome. Defensins are effectors of innate immunity present in marine bivalves [88] and some freshwater bivalves such as Lamellidens marginalis [89], Hyriopsis schlegelii [90], and Hyriopsis cumingii [91]. The evolution of antimicrobial peptide genes including defensins has been rapid in Crassostrea species, suggestive of molecular diversification of the effectors to cope with environmental challenges [92].
Cathepsins identified in this study are involved in the maintenance of homeostasis, regulation of antigen presentation and degradation, immune responses, and intracellular protein degradation. Eight putative cathepsin sequences (cathepsin B, C, D, F, I, L, S, and Z) present in the transcriptome may be required for several of highly regulated life processes. Multiple homologs and cDNA sequences of cathepsin L have been identified from C. gigas and P. fucata as well as a cloned cDNA from C. plicata [9,93,94]. The identification of candidate genes for environmental adaptation indicates that the dynamics of species survival should be further explored. Some unigene sequences in the C. plicata transcriptome showed homology to oxidative stress enzymes such as SOD (Mn-SOD and Cu-Zn SOD), glutathione-S-transferase (GST alpha, -mu, -omega, -pi, -sigma, -theta), catalase (CAT), glutathione peroxidase (GPX), and glutathione synthetase. The transcriptional stability of these genes is crucial under anthropogenic and pathogenic stresses and regulates cell homeostasis, as demonstrated in the mussel M. galloprovincialis [95]. SOD, CAT, and GPX were also identified in the C. virginica [25] and L. fortunei transcriptomes [17]. The heat shock protein (HSP) genes are key indicators of the physiological robustness of an organism and provide molecular insights into the response to environmental challenges [96]. The identification of HSP70 multigene family members and small HSP and HSP90 class genes reveals bivalve specialization for environmental sustainability. In the C. plicata transcriptome, we identified unigenes related putatively to HSP60, HSP70, and HSP90, and the small HSPs (HSP10, HSP20, and HSP40). HSP70 is associated with acclimation robustness in molluscs and is found to be highly expressed in response to biotic stressors [17,97]. Differential, tissue-and time-specific expression of HSP70 in C. hongkongensis [98] and HSP60 in the mussel Perna viridis [99] has been reported. HSP chaperones are important factors for the establishment of mussels such as M. galloprovincialis and M. trossulus in North America, and are proposed to serve as a biochemical marker [100].
Analysis of the C. plicata transcriptome also revealed several key genes encoding proteins related to apoptosis and programmed cell death including caspases (caspase-1, -2, -3, -7, -8, -9 like isoform, -10), Bcl-2, and Bax, suggestive of the significance of apoptosis in the immune response of the organism. The abundance of unigenes showing homology to caspases was apparent from the identification of initiator caspase group 2 and 8 and executioner caspase group 3 and 7. The appearance of multiple executioner caspases and the divergence in their sequences compared to the conserved initiator caspases reflects the importance of the executioner phase of apoptosis in bivalves [101]. A similar set of key apoptosis genes was identified in the related species, C. virginica, suggestive of a versatile apoptosis-pathway mediating system [25]. In accordance with an earlier report of the presence of C1q domain proteins in the C. virginica, M. galloprovincialis, and M. edulis transcriptomes, we identified a large number of C1q domain proteins in the C. plicata transcriptome, which supports the expansion of such proteins in bivalve molluscs, possibly in support of adaptation strategies [19,25,102].

Sex-determination and reproduction-related genes
Genes involved in the sex determination process determine the sex of an organism by directing the development of gonadal structures, such as the ovary or testis. Sex-differentiation genes regulate the development of the ovaries and the testis from the undifferentiated gonad. Gonad transcriptome analysis studies have taken advantage of the diverse reproduction strategies of molluscs to identify a large number of sex determination/differentiation genes [48,[103][104][105][106]. The economic advantages of C. plicata for aquaculture industries and its endangered status make exploring the molecular mechanisms underlying gametogenesis important. We identified unigene sequences showing homology to sex determination, sex differentiation, and reproduction-related genes based on a keyword search of the PANM-DB using our BLASTx annotation results (S4 Table). The C. plicata unigenes showed homology to sexdetermination genes including SRY-related HMG-box domain (SOX) family members (SOX5, SOX6, SOX9, SOX11, SOX15, and SOXB2), Doublesex, and mab-3 related transcription factors (DMRT3), WNT member 4 (WNT4), Fem-1 like protein (FEM1), steroidogenic factor 1 (SF1), sex-determining region on the Y-chromosome (SRY), dosage-sensitive sex reversal, adrenal hypoplasia critical region, and on chromosome X (gene 1, DAX1). Sex determination and differentiation genes that were conspicuously absent from the C. plicata transcriptome included Wilm's tumor suppressor gene-1 (WT-1), forkhead box L2 (FOXL2), aromatase (CYP19A1), and anti-mullerian hormone. While WT-1 homologs have been reported in fish (Danio rerio), sturgeons (A. sinensis and A. naccarii), amphibians (H. chinensis), and mouse (Mus musculus), they have not been reported in oyster (C. hongkongensis) [48]. Homologs of aromatase and anti-mullerian hormone from oyster species have also not been reported. FOXL2 is an ovarian determination gene in vertebrates and it functions to suppress genes involved in testis differentiation. Homologs of FOXL2 show higher expression in ovaries of invertebrates, including molluscs [25,105,107]. Homologs may be present in C. plicata, suggestive of a role in the earlier stages of gonad development. In C. hongkongensis, sex determination genes such as DMRT1 and SRY were absent while homologs of other DMRT and SOX family members were present [48]. Genes including DMRT, SOX9, Fem1, and FOXL2 are known to participate in the regulatory processes underlying sex determination/differentiation [108,109]. SOX family proteins are a conserved group of transcription regulators involved in development and differentiation which possess a high mobility group (HMG)-box domain [29,110]. SRY (founding member of the SOX genes and the master switch in sex determination) with SOX9 activates the male determining pathway by inhibiting ovary development through the induction of anti-mullerian hormone in Sertoli cells [111,112]. A DMRT-1 testis-specific (zinc finger DM domain protein) unigene sequence was identified among the C. plicata unigenes, and may (along with other members of the family) promote male-specific development. This finding is supported by reports of testis-specific DMRT genes in other molluscs [105,113,114]. Other genes such as FEM1 and WNT4 are not involved in the sexdetermining pathway or maintenance of mature gonads but may have a role at an earlier stage of sex-specific expression [25].
For the conservation of endangered species such as C. plicata, it is important to understand reproduction-specific unigenes from the transcriptome. We identified unigene sequences homologous to spermatogenesis-associated protein (SPATA1, 4, 5, 6 and 7), sperm flagellar protein 1/2, sperm motility kinase, nuclear autoantigenic sperm protein, spermidine synthase, and spermine oxidase potentially expressed in the male reproductive tissues. As expected, the majority of male reproduction-related genes are associated with sperm motility, which is crucial for the reproductive success of an organism. Moreover, we identified C. plicata unigenes showing homology to oocyte zinc finger protein, ovary development protein, and vitellogenin (Vg) which are other putative reproduction-related genes. Vg has a strong transcript presence in the ovary compared to the testis and is an important protein for oocyte maturation in molluscan species such as C. gigas [115], P. yessoensis [116], and Argopecten purpuratus [117].
Overall, the C. plicata transcriptome was useful for the discovery of homologs related to sexdetermination and reproduction from among the assembled unigene sequences.
Characterization of cSSR markers in the C. plicata transcriptome SSRs are tandem repeated motifs characterized by high-levels of polymorphism and are commonly used as marker systems in genetic diversity assessments, population structure dynamics studies, conservation genomics, and genetic linkage mapping for a variety of organisms [118][119][120][121]. These microsatellite sequences can be important for the characterization of invasive species with reduced genetic diversity [122][123] and can resolve structural dynamics of closely related populations [124]. Unfortunately, due to limitations of time and cost of development, microsatellite isolation from non-model organisms remains limited. Recently, with the availability of genomic and transcriptome sequences using the high-throughput and cost-efficient NGS platforms, de novo screening of large sets of microsatellites such as SSRs and single nucleotide polymorphism (SNPs) has become more common in non-model organisms [125][126], including mollusc species [48,127]. Due to the extensive use of C. plicata as a pearl culture resource in China, few polymorphic microsatellite loci were developed using the dinucleotideenriched genomic library for the improvement of species [128]. To address the conservation efforts of C. plicata, which is endangered in Korea, the development of SSRs is highly desirable. We obtained a set of SSRs from the transcriptomic dataset using the MISA program. A total of 61,141 unigene sequences (>1 kb) containing 17,251 SSRs were identified, with 3269 sequences containing more than one cSSR. We screened for dinucleotide repeats with a minimum of six iterations, and all other repeat types (tri-to hexanucleotides) repeated at least five times. The most abundant SSRs included dinucleotide repeats (11,302), followed by tri-(4381), tetra-(1527), penta- (40), and hexanucleotide repeats (1). Consistent with our results, dinucleotide repeats were the most abundant repeats in the cSSR profiles of M. rosenbergii [16], endangered A. sinensis [30], and Chinese salamander (H. chinensis) [29]. Conversely, in the invasive L. fortunei, tetranucleotide repeats were more common while dinucleotide repeats were less common [17]. A summary of SSRs based on the number of repeat is presented in Table 5. Six tandem repeats (3635) were predominant, followed by five (2479) and seven (2111) tandem repeats. The cSSRs also contained repeats with 20 (216) and 21 (1595) random reiterations consisting mostly of dinucleotide repeats (89.66%). An analysis of the frequency distribution of cSSRs based on motif sequence types is shown in Fig 9. The C. plicata transcriptome is rich in AC/GT (4992; 28.94%), AT/AT (3536; 20.50%) and AG/CT (2720; 15.77) repeats. The abundant repeats in the cDNA-associated SSRs of C. hongkongensis were identified as AG followed by AT, AC, and ATC [48]. The prominent trinucleotide repeat types common among cSSRs of C. plicata included AAT/ATT (1779; 10.31%), followed by ATC/ATG (693; 4.02%), AAC/GTT (647; 3.75%), ACT/AGT (361; 2.09%), and AAG/CTT (295; 1.71%). A tetranucleotide repeat ACAT/ATGT (771; 4.47%) was also identified within the top prominent SSR types. Generally, microsatellite repeat types exhibit species-specific differences, as has been reported for Crustacean species [16,129]. SSRs identified in this study can be valuable for genetic improvement programs and for the quantification of genetic diversity within and among populations of endangered C. plicata.

Conclusions
This study is the first exhaustive investigation of the transcriptome of the endangered freshwater pearl bivalve, C. plicata. Using the Illumina HiSeq 2500 NGS platform and the Trinity assembler, we assembled approximately 79,960 unigenes and assigned them to 23,246 GO and 4,776 KEGG pathway annotations. A number of candidate unigenes involved in immunity, sex determination/differentiation, and reproduction were identified. Transcriptome sequence and annotation data are valuable because C. plicata has been listed as endangered in the Korean Red List of Threatened Species due to a decline in their natural habitat over time. Genetic markers in the form of cSSRs may assist in the development of genetic improvement programs for C. plicata. Table 5. Summary of simple sequence repeat (SSR) types based on the number of repeat units.

Repeat numbers
Motif length Total di-tri-tetra-penta-hexa-Supporting Information S1 Table. Mapping of the annotated unigenes against C. plicata mitochondrial protein genes using BLASTn at a cutoff E-value of 1E-5.