The De Novo Transcriptome and Its Functional Annotation in the Seed Beetle Callosobruchus maculatus

Despite their unparalleled biodiversity, the genomic resources available for beetles (Coleoptera) remain relatively scarce. We present an integrative and high quality annotated transcriptome of the beetle Callosobruchus maculatus, an important and cosmopolitan agricultural pest as well as an emerging model species in ecology and evolutionary biology. Using Illumina sequencing technology, we sequenced 492 million read pairs generated from 51 samples of different developmental stages (larvae, pupae and adults) of C. maculatus. Reads were de novo assembled using the Trinity software, into a single combined assembly as well as into three separate assemblies based on data from the different developmental stages. The combined assembly generated 218,192 transcripts and 145,883 putative genes. Putative genes were annotated with the Blast2GO software and the Trinotate pipeline. In total, 33,216 putative genes were successfully annotated using Blastx against the Nr (non-redundant) database and 13,382 were assigned to 34,100 Gene Ontology (GO) terms. We classified 5,475 putative genes into Clusters of Orthologous Groups (COG) and 116 metabolic pathways maps were predicted based on the annotation. Our analyses suggested that the transcriptional specificity increases with ontogeny. For example, out of 33,216 annotated putative genes, 51 were only expressed in larvae, 63 only in pupae and 171 only in adults. Our study illustrates the importance of including samples from several developmental stages when the aim is to provide an integrative and high quality annotated transcriptome. Our results will represent an invaluable resource for those working with the ecology, evolution and pest control of C. maculatus, as well for comparative studies of the transcriptomics and genomics of beetles more generally.


Introduction
Beetles (Coleoptera) are by far the most species rich taxonomic order on our planet, containing some 25% of all known animal species, and they show a tremendous range of adaptations to different food resources and habitats [1]. Beetles are generally keystone species in terrestrial ecosystems and many are serious pests in agriculture and forestry and are thus of tremendous economic importance. Yet, only a few beetle genomes have been sequenced so far [2,3], and the genomic resources available to study molecular evolution in beetles are very limited relative to many other groups. Efforts to improve this situation are complicated by the fact that many beetle genomes are fairly large (average C value = 0.7; range 0.2-5.0) and show a high repeat content [4]. Here, transcriptome assembly provides a cost effective remedy [5].
The seed beetle Callosobruchus maculatus Fabr. (Coleoptera: Bruchidae), also known as the cowpea beetle, is a major cosmopolitan pest of a variety of legume crops. It causes an estimated annual crop loss of > 30 million US dollars in Nigeria alone [6]. It is also an emerging model system in several areas of evolutionary biology [7][8][9][10]. Seed beetles have a very rapid life cycle (about three weeks). Females lay their eggs on the external surface of seeds. After a few days, the larvae hatch and burrow inside the seed. They feed and pupate inside the seed, before emerging as adults. Adults requires neither water nor food to reproduce successfully [11].
The genome of C. maculatus is fairly large (1.2 Gb) and shows a very high repeat content [12] and we thus decided to assemble its transcriptome. Over the past few years, there has been a marked improvement in sequence technologies (increasing the sequence reading length) and in de novo transcriptome assembly software tools (assembling transcriptomes without a reference genome). This approach has recently been used to assemble a few beetles transcriptomes de novo [13][14][15]. In the current transcriptome assembly effort, we sequenced samples from C. maculatus larvae, pupae and adults using an Illumina paired-end sequencing platform. Sequences were then assembled using Trinity, a de novo assembly software [5]. More than 145 thousand genes were generated and annotated using the Trinotate pipeline (https://trinotate.github.io/) and Blast2GO [16]. Our aim was to provide the first inclusive annotated transcriptome of C. maculatus. We also assess the degree to which the transcriptome is shared across developmental stages and ask how the size of the transcriptome changes over ontogeny, which is important both for studies of differential expression and comparative purposes. Our results will provide an important resource for gene discovery and comparative genomics of beetles and for future applied and fundamental scientific studies of the seed beetle C. maculatus.

Materials and Methods Samples
Callosobruchus maculatus from the South India SI4 reference population were used here. This is an isogenic stock produced by 5 consecutive generations of full-sib mating to reduce the level of heterozygosity. They were reared on mung beans in laboratory climate cabinets at 29°C, 60% RH and a 12 L: 12 D light cycle. RNA material was extracted and prepared from beetles of several developmental and physiological stages to increase the diversity of expressed transcripts, namely larvae, pupae and mated and virgin adults of both sexes. In total, we prepared 11 different types of samples. A larval sample was prepared by pooling 6 individuals of different larval instars. A pupal sample was prepared by pooling 2 pupal individuals. An adult mix sample was prepared by pooling 4 adult individuals, 2 males and 2 females, randomly chosen from a population containing mated young and old beetles, fed with 5% glucose-water and dried yeast supplement. For the other 8 sample types, adult beetles were collected immediately after emergence from beans and were isolated individually without access to food. Pairs were allowed to mate, after which the sexes were separated and kept with beans for 24h. The virgins were treated in the same precise manner, but were not mated. From these beetles, we created samples representing, in an orthogonal manner, males vs. females, mated vs. virgins and abdomen vs. head/ thorax (i.e., 2 ×2 ×2 = 8 sample types, each replicated three times). All beetles were snap-frozen with liquid nitrogen. For the adult samples, the abdomen was separated from the head and thorax on ice, making sure that the tissues did not thaw. Each sample for RNA extraction was prepared by pooling body parts from 6 individuals.

RNA extraction and purification
RNA was extracted using RNAeasy Mini Kit (Qiagen), following the manufacturer's protocol. DNase digestion was applied using DNase I (RNase-Free DNase set by Qiagen). The RNA quality and quantity was assessed and affirmed using NanoDrop, Qubit and Bioanalyzer.
The RNA-seq libraries were prepared from 1μg total RNA using the Illumina TruSeq stranded mRNA sample preparation kit. At a first step, Poly-A RNA was purified from total RNA using poly-T oligo attached magnetic beads. After the purification step, mRNA is fragmented into small pieces. Fragmented RNA is then reverse transcribed to first strand cDNA using random primers. A second strand cDNA synthesis step with the incorporation of dUTP instead of dTTP is realized to achieve strand specificity. cDNA fragments are then ligated to adapters. At the final step, cDNA are purified and enriched with PCR to create a cDNA library. All sequencing was performed using Illumina HiSeq 2500 sequencing technology with a maximum read length of 2x100 bp. The paired-end library was prepared using the TruSeq stranded mRNA Sample Preparation kit according to the manufacturer's guidelines [17]. The library generation and sequencing were performed by the SNP&SEQ Technology Platform at Uppsala University.

Bioinformatic analyses
Quality trimming. RNA reads obtained from sequencing where quality assessed using FastQC v.0.11.2 [18]. Illumina adapter sequences left in reads were removed using cutadapt v.1.2.1 [19]. Cutadapt will search for a supplied list of adapters in all the reads, a minimum overlap of 15bp between the adapter and the read is required. The adaptor search is done twice in each read to remove adaptors in tandem. Low quality reads towards the 3' and 5' ends of the reads were trimmed with Trimmomatic v.0.3; reads were scanned with a 4 base wide sliding window, and leading or trailing bases with average phred quality score lower than 20 were dropped. Reads with a length lower than 50bp were also discarded [20].
Transcriptome de novo assembly. Digital normalization and transcriptome de novo assembly was conducted using the Trinity 2.0.6 software with a default k-mer size of 25. Trinity is composed of three different modules: Inchworm, Chrysalis and Butterfly. Inchworm builds a K-mer dictionary from the reads, which will lead to the construction of contigs. Chrysalis connects all overlapped contigs into components using a de Bruijn graph approach. In a final step, Butterfly simplifies all the generated graphs to report full-length transcripts and their alternatively spliced form [21]. DeconSeq standalone version 0.4.3 [22] was used to detect and remove sequence contaminations from the assembled transcriptome, using bacterial, fungal, plant, virus and other databases. DeconSeq was run with alignment identity threshold of 95% (-i 95) and alignment coverage threshold of 90% (-c 90). CD-HIT-EST version 4.6.1 (2012-08-27) was subsequently used for clustering of assembled transcripts with the default parameters at two different sequence identity thresholds (100% and 98%).
In order to statically assess the quality of the assembled transcriptomes, we assessed the number of paired-end reads that were present in the assembled transcripts. To achieve this, we used Bowtie (version 0.12.6) [23] to align all raw reads back to the assemblies. In order to avoid an overestimation of transcriptome quality during mapping, only one position was reported for reads that mapped to several locations.

Availability of supporting data
Raw RNA-Seq data is deposited in FASTQ format to the NCBI Sequence Read Archive database (SRA) under the BioProject accession number PRJNA309272. The three separate as well as the combined assembly have been deposited at DDBJ/EMBL/GenBank under the accession numbers GEUD00000000, GEUH00000000, GEUE00000000 and GEUF00000000. The versions described in this paper are the first versions (GEUD01000000, GEUH01000000, GEUE01000000 and GEUF01000000).

Experiment design
For the transcriptome assembly and analysis, RNA material was extracted and prepared from beetles in different ontogenetic stages (larvae, pupae or adults), from both sexes in different reproductive conditions (virgin or mated), using different body parts (head and thorax, abdomen or whole body). In total, 27 different samples (libraries) of 11 different types were sequenced (for details, see material and methods). For adults, technical and biological replicate samples were sequenced separately to allow assessment of sex-specific differential expression, which will be analyzed and reported in a separate study. The RNA-seq libraries were generated using the Illumina TruSeq stranded mRNA sample preparation kit, generating 101 base paired-end reads [34].

Sequencing and de novo assembly
In total, 492 million pairs of 101 base length paired-end reads were generated using Illumina HiSeq sequencing platform. Table 1 summarizes sample statistics for all reads.
All sample reads were combined to generate a single combined reference transcriptome assembly. Before assembling reads into transcripts, however, raw reads were quality filtered. To that aim, we first assessed read quality using the FastQC software. This showed a high Phred quality score (average score over all sequences > 38 Phred) but indicated a small bias in the first 10 bases in reads. To check if this potential bias was due to adaptor sequences being left from the sequencing step or to low quality bases, we applied the Cutadapt and Trimmomatic softwares to clean the reads. As a result, Cutadapt trimmed 42 adaptors sequences from several reads and Trimmomatic removed low quality bases and dropped reads with length lower than 50 bases. In our case, single-end reads were few and we therefore only used paired-end reads for the transcriptome assembly. In the end, a total of 474,9 million pairs of reads (96.6%) where retained and used for the assembly [35].
Due to the high number of reads, a digital normalization step was needed prior to the assembly, to down-scale reads occurring at high coverage and discard reads with an aberrant k-mer abundance profile. This step reduces the number of reads, but increases the quality of the assembly by removing non-uniform and redundant k-mer reads, thus improving the assembly process time. All combined filtered reads were thus digitally normalized and de novo assembled with a default k-mer of 25, using the Trinity assembler [36]. Trinity assembled transcripts were then decontaminated by discarding contaminant sequences (e.g. bacterial, fungal, plant) from the transcriptome, using a combined approach: we first ran DeconSeq to remove contaminant transcripts and then relied on annotation to remove additional contaminant transcripts (for details, see "Annotation and functional assignment"). A total of 9632 (4.2%) transcripts were considered as probable contaminant and discarded. Trinity generated 218,192 contamination-free transcripts which corresponded to 145,883 genes, with a N50 length of 1,818 bases and a mean transcript size of 914 bases. Transcript length ranged from 224 bases to 26,805 bases, with 54,358 transcripts being >1 kb and 283 being >10 kb. These primary assembly statistics imply that transcripts were well assembled and could potentially code for fulllength proteins sequences. GC content of the final assembly was 39%, which is close to the GC content of the raw reads (41-43%). An in-house Perl script was used to extract the longest transcript for each gene, which were considered for the downstream analysis as representatives for each cluster of transcripts. A detailed summary of the assembly statistics is provided in Table 1.
It is well known that de novo transcriptome assemblies produce many more transcripts, especially at high coverage, than a normal annotation [37,38]. De novo transcriptome assemblies are faced with a number of issues, such as coverage variation between highly and low expressed transcripts, polymorphism, alternative splicing, chimeric transcripts and identical sequences repeated in different genes, making the reconstructing of full-length transcripts without redundancy computationally challenging. One possibility is to assess the level of redundancy in a given transcriptome assembly is to perform a cluster analysis based on sequence similarity. In our case, however, clustering (using CD-HIT-EST) reduced the number of transcripts marginally: by 1% and 20% with a sequence identity threshold set to 100% and 98%, respectively. Considering the fact that clustering did not much reduce the number of transcripts, in combination with the fact that clustering risk collapsing valid isoforms, paralogs, and may introduce chimeras, we decided to not rely on clustering but to retain all transcripts [39,40]. Our analysis showed that 84% of the 145,883 genes are represented by one isoform, 7% of the genes are expressed with two isoforms and 9% of the genes are expressed with three or more isoforms. Fifty isoforms was the maximum number of isoforms for a single gene. The high number of genes and isoforms may be due to the fact that the C. maculatus transcriptome was generated by combining reads originating from different development stages. To test this, we generated (in parallel with the main combined transcriptome assembly) separate assemblies for the different development stages (e.g. larvae, pupae or adults) ( Table 2). These more homogenous assemblies showed higher N50 values and mean transcript length, but showed a much lower number of transcripts and genes compared to the combined assembly. We also note that the higher number of isoforms per gene in the combined assembly suggest that a number of genes are expressed with different isoforms in different samples (i.e., in different ontogenetic stages). Finally, the raw paired-end reads of each sample were mapped back to the assembled transcriptome to assess read content, because fragmented or short transcripts may only align to one fragment read of a pair. We found that more than 82% of the reads were correctly mapped as proper pairs (S1 Table). This shows that most genes were properly assembled.

Assessment of completeness
As a complementary approach to assess the quality of the C. maculatus transcriptome, besides statistics such as the N50 value and the number of genes longer than 1kb, we assessed transcriptome completeness in terms of gene content. We searched the transcriptome for the presence or absence of a list of conserved orthologous genes. We used the BUSCO (Benchmarking Universal Single-Copy Orthologs) library of Metazoa orthologous genes [32]. This represents a collection of 843 single-copy metazoan orthologs, well-annotated and conserved.
We obtained 760 (90%) complete BUSCO hits, and duplicate hits to 428 (50%) genes. We found that another 27 (3.2%) were fragmented and 56 (6.5%) were missing. The relatively high number of duplicates may, in theory, represent allelic variation (heterozygosity) in the sample used to construct the assembly, gene duplication and/or mechanisms such as alternative splicing. The fact that heterozygosity is very low indeed in our stock population suggest that allelic variation should contribute little. In an attempt to better understand the origin of our duplicates, we ran the same BUSCO analysis using the three individual assemblies. This analysis still showed a high number of completely recovered genes (87% for larvae, 88% for pupae and 89% for adults), but a markedly lower number of duplicates (26%, 28% and 25% respectively). This suggests that gene duplication and/or alternative splicing, with stage-specific expression of paralogs and/or isoforms, may contribute importantly. In any case, the high number of complete and duplicate genes that was recovered provides an important validation of the depth and completeness of the assembly.

Annotation and functional assignment
The annotation is arguably the most important part of our analysis, as it enables us to evaluate and interpret the content of the C. maculatus transcriptome assembly. We initiated the annotation by blasting the transcriptome, using BLASTx, against the Nr (ncbi non-redundant) database with a cut-off E-value set to 10 −5 . In total, 37,990 (26%) genes showed a significant hit in the Nr database. Almost 90% of the blast hits belonged to Metazoan taxa, the rest representing hits with viruses, bacteria, fungi, and viridiplantae [41]. These genes were considered as likely contaminants and were dropped. In total, we retained 33,216 genes with significant blast hits in our downstream analysis. The blast hits distribution in the Nr database showed most hits with Tribolium castaneum (Coleoptera, Tenebrionidae). More than 40% of the genes showed a similar sequence in T. castaneum. The second top blast hit taxa was Dendroctonus ponderosae (Coleoptera, Curculionidae), with 12% similar genes. This is reassuring, considering that these three beetle species belong to the same infraorder (Cucujiformia). The remaining blast hits showed similarities to other insects in the majority of cases (Fig 2A). The E-value and sequence similarity distribution of the top blast hits for each gene add strength to the blast analysis and to the general quality of the assembled genes; more than 16% of the genes have a blast hit E-values equal to zero (Fig 2B), and more than 59% of genes showed sequence identity higher than 65% with the best Nr database hit (Fig 2C).
We next evaluated the ability of the predicted genes to reconstruct full-length proteins. We scanned all gene sequences for open reading frames (ORFs), using Transdecoder [21]. We obtained 28,744 genes (17,455 with complete ORFs) with ORFs longer than 100 amino acids, with the longest ORF being 8,593 amino acids and the average ORF length being 404 amino acids. No less than 87% of the genes with ORFs show blast hits in the Nr database and these have an average length of 448 amino acids. The large number, and high sequence length, of the predicted ORFs provided a further validation of the quality of the assembly.
Further functional annotation and GO term assignments was conducted using the Blast2Go software and the Trinotate pipeline [16,21]. The Trinotate pipeline incorporates several annotation methods: Blast homology search against SwissProt and Uniref90 databases, Pfam domain prediction, protein signal peptide (SignalP), transmembrane domain (tmHMM) prediction, and comparison to the EggNog database of orthologous group, which also includes the COG database [31]. Blast2GO apply a similar approach, where best Nr (NCBI non-redundant) database blast hits are used to map GO terms, retrieve KEGG pathways and to scan InterProScan signatures for protein domains detection. Blast2GO and Trinotate annotation files are supplied in S1 and S2 Files.

Gene Ontology
Gene ontology (GO) (http://geneontology.org/) is an internationally standardized functional classification system for genes, aimed to describe the properties of genes and their product within an organism using a dynamic-updated controlled vocabulary. GO comprise three main categories: molecular function, cellular component and biological process [42]. Blast2GO mapping was used to obtain the GO annotation based on the gene blast hits from the Nr database, and then completed using Inteproscan and ANNEX annotation. All GO terms were then functionally classified using the WEGO (Web Gene Ontology Annotation Plot) software [43].
Almost all of the GO terms were inferred from electronic annotation (IEA), using gene annotation that had blast matches to proteins belonging to the UniprotKB database [46]. Uni-protKB is a well-annotated database, composed by manually and automatically annotated records sections. These results imply that GO annotation is mainly driven by blast matches found in the UniprotKB databases. Similar results were obtained for GO annotation using Trinotate pipeline.

COG classification
The Clusters of Orthologous Groups (COG) [47], is a database where orthologous gene products are classified into 25 functional categories. It is based on the principle that conserved genes should be classified according to their homologous relationship. Each COG consists of individual orthologous proteins, typically sharing the same general function. COG screening was performed using the EggNog database [31], integrated within the Trinotate pipeline. All genes where aligned to the COG database to predict and classify their functions. In total 5,475 (13% of the genes with Nr blast hits) genes (14,060 transcripts) were assigned to 25 COG functional categories (Fig 4). The largest group is represented by the cluster for general function prediction (1,596, 29%), followed by signal transduction mechanisms (729, 13%), amino acid transport and metabolism (510, 9%), posttranslational modification, protein turnover, chaperones (497, 9%), carbohydrate transport and metabolism (437, 8%), and translation, ribosomal structure and biogenesis (411, 7%). A few other clusters, such us chromatin structures and dynamics, RNA processing and modifications, cell motility, extracellular structures, and The Annotated Transcriptome of Callosobruchus maculatus nuclear structure, are underrepresented or absent. Similar gene COG classifications distributions has been found in other beetles (e.g. pine sawyer beetle, pine shoot beetle) [44,48].

KEGG pathway analysis
To better understand functions and interactions, all annotated genes were mapped against the KEGG database for a pathway-based analysis. As a result, a total of 3,127 genes were assigned to a KEGG pathway. This relatively low number of genes assigned to a pathway is likely the result of imperfect annotation caused by Blast2go, although genes were present in 116 different KEGG pathways. KEGG pathways distribution is summarized in Fig 5. The top 5 pathways are purine metabolism (769, 24.6%), thiamine metabolism (528, 16.9%), pyrimidine metabolism (190, 6%), biosynthesis of antibiotics (162, 5.2%), Aminobenzoate degradation (139, 4.5%).

Transcriptome sharing
Using our inclusive combined assembly, we quantified transcript abundance using the RSEM package through the Trinity pipeline, where all reads were re-aligned to each transcript with Bowtie (short read aligner). Relative abundance of each transcript or gene was reported as Fragments Per Kilobase per Million mapped reads (FPKM). Using this route, we identified 114,249 (78.3%) genes as being actively expressed in all of our samples (FPKM > 0.5). To assess the degree to which the transcriptional profile is shared or private across ontogentic stages (larvae, pupae and adults), we compared the presence and absence of genes between the three of them. For example, genes with a FPKM value > 2 in the larvae stage and with FPKM = 0 in pupae and adults, were considered private to the larvae stage. This comparison identified 51 annotated genes specific to larvae, 63 genes specific to pupae and 171 genes specific to adults (Table 3) (S3 File). Overall, the comparison of the three assemblies suggests that adults shows the highest, and larvae the lowest, transcriptional diversity and privacy. This is also consistent with the fact that the adult assembly contained the largest number of long  Table 2). The increase in the number of expressed genes during ontogeny illustrates the importance of including samples from several developmental stages, of which adults should be one, in order to construct inclusive de novo assemblies. In Table 4, we summarized a selection of the top blast hit genes specific to larvae, pupae, and adults.
Larval genes. Digestive enzymes dominated here. Cathepsin b-like cysteine protease 2 is one of the genes specific to the larval stage. This gene is a critical component of digestive pathways in larvae, and was also found highly expressed in the larval stage of Tribolium castaneum [49]. Other highly expressed genes private to larvae were Glycoside Hydrolase family 1 (FPKM 149.37) and Glycosyl Hydrolase (FPKM 22.89). The Glycoside Hydrolase family comprises other known enzymes, which were also present in larvae such us the Beta-galactosidase-1-like protein 2 (FPKM 3.52). These genes are generally present in gut tissues and are implicated in the chitin degradation process [50]. We also found a larval cuticle protein expressed only in larvae, which is an important gene for the development of the cuticle of the larval body wall [51]. The Annotated Transcriptome of Callosobruchus maculatus Pupal genes. In pupae, we found two private cuticle class genes; Cuticle protein 8-like (FPKM 52.12) and Cuticle protein 7 (FPKM 3.36). These genes are important for the formation and development of insect cuticle [52], a key process during the mid-and late stages of pupation. Other private pupal genes are the Resilin isoform x1 (FPKM 164.68) and myosin-VIIa (FPKM 5). Resilin is an elastomeric protein and a crucial component for wing movement in insects [53] and Myosin is a ATP-dependent motor protein that plays a fundamental role in muscle contraction [54].
Adults genes. The adult stage showed private expression of the Odorant-binding protein 4 gene, which is involved in adult olfaction. Although olfaction is important for most of insects The Annotated Transcriptome of Callosobruchus maculatus [55], odorant binding proteins are known to be important for both host-and mate-finding in adult seed beetles. Further, Tektin-2 and Tubulin alpha-1 chain were not expressed in larvae and pupae but were expressed in adults. Tektin is an essential protein for the development of cilia and flagella. Both Tektin and Tubulin are important components of the cytoskeleton doublet microtubule in insect [56], suggestion that they are important during gamete production in adults. We also found a private bone morphogenetic protein, a member of a group of proteins known to be involved in neural signaling in Drosophila [57].

Conclusions
In this study, we provide a comprehensive assembly of the C. maculatus transcriptome, based on deep Illumina sequencing of diverse samples. The transcriptome covers a large number of genes expressed in all developmental stages. In total, 492 million paired-end reads were assembled into a high number of genes (145,883), of which 33,216 were annotated. We found that including samples from several developmental stages was crucial in order to provide a maximally integrative transcriptome. We believe that this data will provide a valuable resource for future studies of the seed beetle C. maculatus as well as for comparative gene expression and genomic analyses of beetles more generally.