De novo Transcriptome Analysis of Chinese Citrus Fly, Bactrocera minax (Diptera: Tephritidae), by High-Throughput Illumina Sequencing

The Chinese citrus fly, Bactrocera minax (Enderlein), is one of the most devastating pests of citrus in the temperate areas of Asia. So far, studies involving molecular biology and physiology of B. minax are still scarce, partly because of the lack of genomic information and inability to rear this insect in laboratory. In this study, de novo assembly of a transcriptome was performed using Illumina sequencing technology. A total of 20,928,907 clean reads were obtained and assembled into 33,324 unigenes, with an average length of 908.44 bp. Unigenes were annotated by alignment against NCBI non-redundant protein (Nr), Swiss-Prot, Clusters of Orthologous Groups (COG), Gene Ontology (GO), and Kyoto Encyclopedia of Genes and Genomes Pathway (KEGG) database. Genes potentially involved in stress tolerance, including 20 heat shock protein (Hsps) genes, 26 glutathione S-transferases (GSTs) genes, and 2 ferritin subunit genes, were identified. These genes may play roles in stress tolerance in B. minax diapause stage. It has previously been found that 20E application on B. minax pupae could avert diapause, but the underlying mechanisms remain unknown. Thus, genes encoding enzymes in 20E biosynthesis pathway, including Neverland, Spook, Phantom, Disembodied, Shadow, Shade, and Cyp18a1, and genes encoding 20E receptor proteins, ecdysone receptor (EcR) and ultraspiracle (USP), were identified. The expression patterns of 20E-related genes among developmental stages and between 20E-treated and untreated pupae demonstrated their roles in diapause program. In addition, 1,909 simple sequence repeats (SSRs) were detected, which will contribute to molecular marker development. The findings in this study greatly improve our genetic understanding of B. minax, and lay the foundation for future studies on this species.

Introduction generate large amounts of sequence data for characterizing genomes and transcriptomes, which have facilitated the studies on biological processes in organisms [1][2][3]. The transcriptome sequencing could serve as an efficient approach to obtain the genetic information of nonmodel species that lack genome database. Therefore, the transcriptomes of several insect species, such as Bemisia tabaci [4], Liposcelis entomophila [5], Bactrocera dorsalis [6], Monochamus alternatus [7], Blattella germanica [8], and Chrysomya megacephala [9], have been sequenced using next-generation sequencing to identify interesting genes and reveal gene expression patterns.
Another major benefit brought about by next-generation sequencing technology is the discovery of microsatellite markers simple sequence repeats (SSRs). Given the properties of high polymorphism and ease of scoring [10], the molecular marker SSRs have widely been used in population genetic and conservation studies, such as population size, kinship, bottlenecks, and migration rate [11]. The large-scale screen of SSRs necessitates the availability of abundant genetic information. The transcriptome sequencing meets this requirement and thus greatly facilitates the discovery of SSRs.
The Chinese citrus fly, Bactrocera minax (Enderlein), has been recognized as one of the most devastating pests of citrus in the temperate areas of Asia, including Nepal, India, Bhutan, and China [12][13][14]. The oligophagous B. minax specifically damages cultivated and wild species of citrus [15], causing the fruits to ripen prematurely and drop to the soil [16][17][18][19]. Given the economic importance, B. minax has increasingly aroused concerns in citrus-growing regions in China, thus the pertinent studies have been carried out. However, the researches were mainly focused on ecology, biology, and management of this pest [12,[19][20][21][22][23][24][25][26]. Recently, studies involving molecular biology and physiology have increasingly been conducted [27][28][29][30][31][32], but they are still scarce at present partly due to the inability to rear this insect in the laboratory and the lack of genetic information.
One of the limiting factors to rearing B. minax in the laboratory is the long-lasting pupal stage, in which the diapause occurs. It has previously been shown that application of ecdysone 20-hydroxyecdysone (20E) on B. minax pupae can significantly advance the adult emergence [28]. However, the underlying mechanisms remain unknown. In insects, the 20E is synthesized from cholesterol under the manipulation of a set of genes, including Neverland, Spook, Phantom, Disembodied, Shadow, Shade, and degraded to 20-hydroxyecdysonoic acid by catalysis of Cyp18a1 [33,34]. The dynamics of 20E is essential for proper development of insects. Generally, the ecdysone signal in insect is amplified by a cascade of primary and secondary response genes that are activated by rising ecdysone levels [35]. The ecdysone receptor (EcR) and ultraspiracle (USP) have been demonstrated to be ecdysone receptors in the form of heterodimeric complex and mediate the biological activity of ecdysone [36][37][38]. In B. minax, these genes may also regulate the roles of 20E in diapause termination.
To survive the unfriendly environment during diapause, insects have to be tolerant to various abiotic and biotic stresses. Heat shock proteins (Hsps) are a group of well described proteins that are commonly expressed in response to these stresses [39]. Glutathione Stransferases (GSTs) comprise a family of enzymes best known for their ability to catalyze the conjugation of the reduced form of glutathione to substrates for the purpose of detoxification [40]. Ferritins are iron-binding proteins which play key roles in iron transport and storage to prevent oxidative damage caused by iron [41]. It has been found that Hsps [42], GSTs [43], and ferritins [44] involved in diapause programs in insects, which may also be the case for B. minax.
In this study, the Illumina sequencing was conducted to generate transcriptome dataset of B. minax. All assembled unigenes were annotated by BLASTx against databases of NCBI nonredundant protein (Nr), Swiss-Prot, the Kyoto Encyclopedia of Genes and Genomes (KEGG) database, Cluster of Orthologous Groups (COG), and gene ontology (GO). Subsequently, the genes encoding Hsps, GSTs, Ferritins, enzymes in ecdysone biosynthesis pathway, and ecdysone receptors were identified. The expression patterns of ecdysone-related genes were investigated across the developmental stages, and were compared between 20E-treated and untreated pupae to reveal the roles of these genes in 20E signaling. In addition, a large-scale screen of SSRs in B. minax was conducted based on the transcriptome data. Undoubtedly, the transcriptome dataset will be an invaluable resource for future studies on B. minax.

Ethics Statement
The owner of the orchard in Wulong County, Chongqing Municipality, China, provided permissions to collect the samples for our scientific research.

Insects
To obtain a comprehensive transcriptome dataset of B. minax, samples at various developmental stages were prepared, including eggs, early-and late-instar larvae, early-, middle-, and latestage pupae, and female and male adults. All samples were collected from an orchard (Latitude: N29°344', Longitude: E107°546') in Wulong County, Chongqing Municipality, China, and were stored in liquid nitrogen for subsequent RNA extraction.

RNA isolation, library construction and Illumina sequencing
Total RNA was extracted from each sample using TRIZOL Reagent (Life technologies, Carlsbad, CA, US) according to the manufacturer's instructions. RNA quantity was assessed with NanoVue spectrophotometer (GE Healthcare Bio-Science, Uppsala, Sweden). The purity and integrity of RNA was checked on 1% agarose gel electrophoresis. Equal amount of RNA isolated from all samples was pooled for constructing cDNA library.
Poly (A) mRNA was isolated from total RNA using oligo (dT) magnetic beads. Mixed with fragmentation buffer, the mRNA was fragmented to 200-700 bp. These short fragments were transcribed to the first-strand cDNAs with random hexamer primers, followed by the secondstrand cDNAs synthesis using DNA polymerase I (New England BioLabs, Ipswich, MA) and RNase H (Invitrogen, Carlsbad, CA). These cDNA fragments were purified and resolved with EB buffer for end repair, single nucleotide A (adenine) addition, and ligation of adaptors. The suitable fragments judged by agarose gel electrophoresis was collected and used as templates for PCR amplification. The cDNA library was sequenced on Illumina HiSeq™ 2000 using paired-end technology in a single run.

Transcriptome de novo assembly and bioinformatics analysis
The raw reads produced by sequencing instrument were filtered to remove adaptors sequences, low-quality sequences with unknown nucleotides N, and reads with more than 20% low quality bases (base quality < 10) using the NGS QC toolkit package (version 2.3) [45], and to remove rRNA sequence using SortMeRNA [46]. The clean reads data has been deposited in the NIH Short Read Archive (SRA) database (Accession No. SRR1272962). The de novo assembly of clean reads was conducted using the short reads assembling program Trinity [47]. Briefly, clean reads with overlapping sequence were combined to form contigs. The reads were then mapped back to the contigs. The contigs from the same transcript were detected with pairedend reads and assembled using paired-end joining and gap-filling method. The sequences that cannot be extended on either end were defined as unigenes.
All assembled unigenes were first aligned to NCBI Nr database and Swiss-Prot with a cutoff E-value of 10 −5 using BLASTx (http://www.ncbi.nlm.nih.gov/). Subsequently, GO annotation was performed using Blast2GO program, a universal tool for annotation and analysis in functional genomics research [48]. Unigenes were also aligned to the COG database to predict and clarify the gene functions [49]. Lastly, unigenes were assigned to KEGG pathways using the online KEGG Automatic Annotation Server (KAAS)(http://www.genome.jp/kegg/kaas/). The sequence direction of the unigenes was determined by the best alignment results from databases in the priority order of Nr, Swiss-Prot, KEGG, and COG.

Identification and analysis of interesting genes
Unigenes putatively encoding Hsps, GSTs, enzymes in 20E biosynthesis pathway, and 20E receptors were identified by alignment against databases with a cut-off E-value < 10 −5 . The full open reading frames of interesting genes were then determined using DNAMAN version 6 (http://www.lynnon.com) and were further verified by protein BLAST results. The deduced protein sequences of interesting genes were aligned with their counterparts of other insect species. Subsequently, the phylogenetic tree was constructed based on the amino acid sequence alignment using Neighbor-joining (NJ) method in software MEGA4 [51]. The reliability of the branching was tested by performing bootstrap analysis with 1,000 replications.

Expression patterns of identified 20E-related genes
To compare the expressions of identified 20E-related genes among developmental stages, the second-and third-instar larvae, the pre-, early-, middle-, late-, and post-diapause pupae, and the adults were collected. In addition, 20E-treated and untreated pupae were obtained through the method described by Wang et al. [28] to investigate the effect of 20E application on expression patterns of identified 20E-related genes. Total RNA was extracted from each sample using TRIZOL Reagent and the cDNA was synthesized using PrimeScript™ RT Master Mix (Perfect Real Time) Kit (Takara, Shiga, Japan). Subsequently, real-time PCR was conducted using SYBR Premix Ex Taq™ II Kit (Takara) to calculate the relative expression of each gene. The specific primers for real-time PCR were shown in S1 Table. Three biological and technical replicates were performed for each treatment. Relative expression of each gene among developmental stages was compared by one-way analysis of variance (ANOVA) with Tukey's post hoc test for difference among all pairs of test variants, while that between 20E-treated and untreated individuals was compared by two tailed, unpaired t-test. All data was analyzed using software SPSS version 22 (IBM Corp., Armonk, NY).

Results and Discussion
Illumina sequencing and de novo assembly About 5.07 Gb raw data, including 25,088,946 raw reads, was generated from Illumina sequencing platform in a single run. After filtration, a total number of 20,928,907 clean reads, encompassing 4.227 Gb sequencing data, were assembled into 941,005 contigs with a mean length of 80.97 bp. The contigs were further assembled into 33,324 unigenes with a mean length of 908.44 bp (Table 1), which is much longer than that obtained from many other species, such as L. entomophila [5], B. dorsalis [6], and M. alternatus [7], indicating the efficient performance of sequencing and assembly in this study. The saturation curve illustrated that the sequencing data were saturated and sufficient for subsequent analysis (S1 Fig). This Transcriptome Shotgun Assembly project has been deposited at DDBJ/EMBL/GenBank under the accession GBEY00000000. The version described in this paper is the first version, GBEY01000000. Among all unigenes, 8,357 (25.08%) and 3,795 (11.39%) unigenes are longer than 1,000 and 2,000bp, respectively, and only 9960 (29.89%) unigenes are shorter than 300bp (S2 Fig), demonstrating the effectiveness of Illumina sequencing technology in rapidly capturing a large portion of the transcriptome and providing a sequence basis for future studies.

Sequence annotation
A total of 15,735 (47.22%) unigenes were successfully aligned to public protein databases with a cut-off E-value of 1.0E -5 . As a close species to B. minax, the model species fruit fly Drosophila melanogaster has been subjected to transcriptomic analysis and 17,564 genes were annotated [52]. The similar number of annotation between these two studies implies that the high-quality B. minax transcriptome data was obtained. The remaining unigenes failed to acquire annotation, probably because they specifically express in B. minax, correspond to untranslated regions, or were assembled incorrectly.
GO is a standardized gene functional classification system that provides a structured and controlled vocabulary to predict gene functions [53]. Based on their similarity to genes with known functions, 9,950 unigenes were assigned at least one GO term, and 76,291 GO terms in total (Table 2). These GO terms fall into three main categories, 'Cellular component', 'Molecular function', and 'Biological process', which include 18, 18, and 23 subcategories, respectively (Fig 2). Many unigenes were assigned more than one GO terms, indicating they could be involved in various physiological and biochemical processes. Among three main categories, 'Biological process' accounts for the largest proportion of GO terms (38,159, 50.02%), followed by 'Cellular component' (25,451,33.36%), and 'Molecular function' (12,681, 16.62%). In 'Cellular component' category, the 'cell part' (5,116, 6.71%) and 'cell' (5,089, 6.67%) are the most abundant. In 'Molecular function' category, the 'binding' (5,030, 6.59%) and 'catalytic activity'  (4,295, 5.63%) are highly represented. In 'Biological process' category, the 'cellular process' (6,726,8.82%) and 'metabolic process' (5,439,7.13%) are assigned the most frequently (S2 Table). To program diapause in insect, several signaling pathways are activated. For example, the insulin signaling plays vital roles in cell cycle and developmental regulation, lifespan extension, suppressed metabolism and fat hypertrophy, and enhanced stress tolerance in insect diapause [54]. In the present study, 45 unigenes were assigned with 12 insulin-related GO terms (S3 Table). These genes may participate in the programming of diapause in B. minax. Moreover, the 'response to stimulus' is processes that causes changes in state and activity of cells or organisms as a result of stimulus, such as cold or hypoxia/anoxia stress. Genes assigned with this GO term may help diapausing B. minax pupae survive the harsh environment in winter.
In addition, all unigenes were subjected to alignment against the COG database for functional prediction and classification. In total, 4,145 unigenes could be assigned to COG classification and were classified into 24 COG categories ( Table 2; Fig 3). Among all these categories, 'General function prediction only' (1,329, 32.06%) represented the largest group, followed by 'Replication, recombination and repair' (493, 11.89%), 'transcription' (426, 10.28%), and 'Translation, ribosomal structure and biogenesis' (402, 9.70%). Interestingly, no unigene was assigned to 'Extracellular structures', and only one unigene was assigned to 'Nuclear structure'.

Simple sequence repeats discovery
Currently, only a few SSRs of B. minax were isolated in the previous study [58]. In this study, the SSRs were detected in unigenes which are longer than 1 Kb. A total of 1,909 SSRs were identified in B. minax transcriptome. Among all SSRs, trinucleotide repeats accounted for largest proportion (63.12%), followed by dinucleotide repeats (32.48%), trannucleotide repeats (3.46%), hexanucleotide repeats (0.63%), and pentanucleotide repeats (0.31%). Most of these SSRs have the number of repeats under 8 times (Table 3). Among all repeat types, AAC/GTT (29.18%), AC/GT (15.92%), AT/AT(12.94%), AGC/CTG (10.58%), and ACC/GGT (6.29%) are the most abundant ones (S4 Table). The SSRs identified in this study considerably enlarged the SSRs dataset of B. minax and they would be invaluable for the future studies related to population genetic structure, such as genetic variation and gene flow.

Identification of heat shock protein genes
The heat shock proteins (Hsps) are known as stress proteins and molecular chaperones with functions of preventing irreversible denaturation of substrate proteins and promoting protein  folding, degradation, disaggregation, and cell localization. They are important elements in stress response system at the cellular level when exposed to a wide variety of abiotic and biotic stressors, such as heat shock, cold, desiccation, starvation, anoxia, oxidation, osmotic stress, environmental contaminants, bacteria, and virus [39,42,59]. Moreover, Hsps have also been found employed during diapause, but the expression patterns of various Hsps in different species are inconsistent, even opposite [30,42,60]. Hsps represent a super gene family and can be divided into several families based on the molecular weight and homology [39]. In this study, a total of 23 unigenes putatively encoding Hsps were identified by alignment against databases. After manually removing short sequences, 20 unigenes containing full open reading frame were selected for subsequent analysis (S5 Table). Phylogenetic analysis of these Hsps indicated that they were divided into 6 families, Hsp90, Hsp70, Hsp60, Hsp40, Hsp10, and small Hsps (sHsps), encompassing 3, 2, 1, 3, 2, and 9 unigenes, respectively (Fig 4). The features and functions of Hsps vary with families. Briefly, Hsp90s activate and stabilize a wide variety of cytosolic proteins, which involve in important cellular pathways, such as signal transduction, intracellular transport, and protein degradation [61]. Hsp70s are structurally and functionally conserved and respond to stress by tightly binding its protein substrates and preventing them from denaturation or aggregation [62]. Hsp60 family is a group of multi-functional proteins implicated in several cellular processes, including stress response, amino acid transport, signal transduction, replication and transmission of mitochondrial DNA, and cellular metabolism [63][64][65][66]. Hsp40s, also known as DnaJs, interact with Hsp70 in J domain and regulate the ATPase activity of Hsp70s in several cellular processes [67]. Hsp40s usually work in conjunction with Hsp60s [68]. sHsps, with molecular weight ranging from 12 to 43 kDa, are known to bind to the non-native substrate proteins and prevent them from forming irreversible aggregations under stress conditions [69]. The roles of identified Hsps in the stress tolerance of diapausing B. minax remains largely unknown and necessitates further investigation.

Identification of glutathione S-transferase genes
GSTs are a family of enzymes that involved in many cellular physiological activities, such as detoxification of endogenous and xenobiotic compounds, intracellular transport, biosynthesis of hormones and protection against oxidative stress [40,70]. A total of 27 unigenes encoding GSTs were identified in B. minax by alignment against databases, and 26 putative GSTs genes were manually selected for analysis after removing short unigenes (S6 Table). In insects, GSTs fall into several major subclasses: delta, epsilon, omega, sigma, theta, zeta, microsomal and others. Delta and epsilon are two unique classes to insects [70,71]. In this study, 26 unigenes were assigned to seven classes, including delta (6), epsilon (11), omega (1), sigma (1), theta (3), microsomal (3) and others (1). No GST belonging to zeta class was identified (Fig 5). Delta and epsilon occupy over 50% of the entire GSTs, which is consistent with previous studies from other dipteran insects [72]. It has been demonstrated that delta and epsilon classes correlate with detoxification and adaptation to environmental selection pressures [72,73]. The expansion of these classes may help B. minax survive the poisonous or harsh environments. Moreover, the GST levels have been found up-regulated in diapausing insects, which is speculated to protect individuals from oxidative damage as diapausing insects commonly experience hypoxia/anoxia stress [43,74,75]. The identification of genes encoding GSTs is conducive to understanding their potential roles in stress tolerance in B. minax diapause.

Identification of ferritin genes
Iron is not only an indispensable micronutrient, but also a potential toxin in organisms. On one hand, organisms must intake sufficient iron for various physiological processes [76]. On the other hand, organisms must avoid oxidative damage to biomolecules caused by the potential toxic properties of iron [77]. Ferritins are iron-binding proteins and play key roles in iron transport and storage to achieve iron homeostasis in organisms [41]. The active insect ferritin is a complex formed between two types of secretory subunits, a mammalian heavy chain homologue (HCH) that preserves the ferroxidase centers, and a mammalian light chain homologue (LCH) that contains the nucleation center for the formation of ferrihydrite iron core [78]. Insect HCH and LCH genes are arranged 'head to head" and transcribed in opposite directions [79]. It has been found that ferritin genes expressed higher in diapause-destined insects, probably to resist oxidative stress caused by iron [44]. That may also be the case for B. minax. In this study, both HCH and LCH ferritin subunits of B. minax were identified by alignment against databases (S7 Table) and verified by phylogenetic analysis (Fig 6). Both HCH and LCH preserve signal peptide, C residues involved in inter-and intra-subunit disulfide bonds, and

Identification of 20E-related genes
Ecdysone 20E is the key regulator of molting, reproduction, and diapause in insects. The biosynthesis of 20E is mediated by Neverland, Halloween genes, and Cyp18a1. Neverland, a conserved Rieske oxygenase, is responsible for the conversion of cholesterol to 7-dehydrocholesterol, which is the first critical catalytic step in the 20E biosynthesis pathway [80]. Halloween genes are a set of genes encoding cytochrome P450 enzymes, including Spook/ Cyp307A1, Spookier/Cyp307a2, Phantom/Cyp306a1, Disembodied/Cyp302a1, Shadow/ Cyp315a1, and Shade/Cyp314a1. Halloween genes are responsible for 20E biosynthesis from 7-dehydrocholesterol [33]. Cyp18a1 encodes a cytochrome P450 enzyme with 26-hydroxylase activity, catalyzing the conversion of 20E to 20-hydroxyecdysonoic acid. The degradation of 20E is essential for proper development of insects [34]. Unigenes encoding Neverland, Halloween genes, and Cyp18a1 in B. minax were identified by alignment against databases (S8 Table), and verified by phylogenetic analysis (Fig 7).
The ecdysone receptor is a heterodimeric complex consisting of two proteins, EcR and USP, which are members of nuclear receptor superfamily, and the insect orthologs of the mammalian farnesoid X receptor (FXR) and retinoid X receptor (RXR) proteins, respectively [36][37][38]. Upon binding to the ecdysone receptor, the 20E/receptor complex binds to ecdysone-response elements in the promotor region of target genes and activates transcription [81]. Unigenes encoding EcR and USP in B. minax were identified by alignment against databases (S8 Table) and verified by phylogenetic analysis (Fig 8). We previously found that 20E application on B. minax pupae advanced adults emergence [28]. Therefore, the identification of these 20Erelated genes would contribute to revealing the mechanisms underlying 20E induced diapause termination in B. minax.

Expression patterns of identified 20E-related genes
The expression patterns of identified 20E-related genes across the developmental stages were investigated to understand the potential roles of these genes in B. minax development (Fig 9). All genes expressed higher in the third-instar larvae than did in the second-instar larvae, which is expected to meet the requirement of metamorphosis. After pupation, the expressions of almost all genes decreased to lowest levels prior to diapause occurrence. Accordingly, the 20E titer declined to the lowest level at this stage, implying that the low 20E titer may be the prerequisite for diapause occurrence as the injection of exogenous 20E at this stage averted diapause and advanced adult emergence [28]. Once diapause initiated, the expressions of all genes, except Shade and Cyp18a1, significantly increased to highest levels, and the 20E titer elevated as well [28]. In addition, injection of exogenous 20E at this stage did not advance adult emergence (data not shown), indicating that once initiated, the diapause is barely affected by 20E titer. Then, the expressions of these genes, except Shade and Cyp18a1, decreased again at the middle-diapause stage and did not show significant variation thereafter. Interestingly, the expression of Shade, which is responsible for the last step in 20E biosynthesis, remained relatively lower across the diapause stages, suggesting that Shade may control the rate-limiting step in the 20E biosynthesis pathway. Similarly, the expression of Cyp18a1 also remained relatively lower across the diapause stages, probably contributing to the accumulation of 20E, which reaches the highest level at late pupal stage [28].
The expressions of these genes were compared between 20E-treated and untreated pupae to understand the effects of exogenous 20E application on 20E biosynthesis and signaling ( Fig  10). One day after injection, the expressions of EcR and USP were significant higher in 20Etreated pupae compared to those in untreated ones, implying that ecdysone receptors are involved in 20E signaling and thus averting the diapause. Likewise, the expressions of most genes in 20E biosynthesis pathway were up-regulated in 20E-treated pupae. The shade, however, was down-regulated, albeit not significant. As Shade is presumed to control rate-limiting step in 20E biosynthesis pathway, exogenous 20E application may suppress endogenous 20E biosynthesis. Interestingly, the expression of Cyp18a1 was sharply up-regulated in 20E-treated pupae, probably in order to degrade the surplus exogenous 20E. Forty days after injection, the effect of 20E disappeared as all genes did not present significantly different expression.