Detailed Transcriptome Description of the Neglected Cestode Taenia multiceps

Background The larval stage of Taenia multiceps, a global cestode, encysts in the central nervous system (CNS) of sheep and other livestock. This frequently leads to their death and huge socioeconomic losses, especially in developing countries. This parasite can also cause zoonotic infections in humans, but has been largely neglected due to a lack of diagnostic techniques and studies. Recent developments in next-generation sequencing provide an opportunity to explore the transcriptome of T. multiceps. Methodology/Principal Findings We obtained a total of 31,282 unigenes (mean length 920 bp) using Illumina paired-end sequencing technology and a new Trinity de novo assembler without a referenced genome. Individual transcription molecules were determined by sequence-based annotations and/or domain-based annotations against public databases (Nr, UniprotKB/Swiss-Prot, COG, KEGG, UniProtKB/TrEMBL, InterPro and Pfam). We identified 26,110 (83.47%) unigenes and inferred 20,896 (66.8%) coding sequences (CDS). Further comparative transcripts analysis with other cestodes (Taenia pisiformis, Taenia solium, Echincoccus granulosus and Echincoccus multilocularis) and intestinal parasites (Trichinella spiralis, Ancylostoma caninum and Ascaris suum) showed that 5,100 common genes were shared among three Taenia tapeworms, 261 conserved genes were detected among five Taeniidae cestodes, and 109 common genes were found in four zoonotic intestinal parasites. Some of the common genes were genes required for parasite survival, involved in parasite-host interactions. In addition, we amplified two full-length CDS of unigenes from the common genes using RT-PCR. Conclusions/Significance This study provides an extensive transcriptome of the adult stage of T. multiceps, and demonstrates that comparative transcriptomic investigations deserve to be further studied. This transcriptome dataset forms a substantial public information platform to achieve a fundamental understanding of the biology of T. multiceps, and helps in the identification of drug targets and parasite-host interaction studies.


Introduction
Taenia multiceps is a taeniid cestode, which inhabits the small intestine of dogs and other canids (foxes, wolves, and jackals), making these definitive hosts a widespread infection reservoir [1]. The coenurus (larva of T. multiceps) parasitizes the central nervous system (CNS) of sheep, occasionally goats, deer, antelopes, chamois, rabbits, hares and horses, and less commonly, cattle [2][3][4][5]. It frequently causes the death of infected animals, and can lead to huge economic losses of sheep/goats, predominantly in developing countries, such as those in Africa and southeastern Asia [6]. The parasite can also cause zoonotic infections in humans, leading to serious pathological conditions in humans, which occur more commonly than previously assumed [6][7][8][9][10][11].
The gravid proglottids of T. multiceps are discharged from infected dogs and are ingested by intermediate hosts (including humans, especially in rural grazing areas where people raise sheep or other ungulates, and keep guard dogs in close proximity) through contaminated food or water [12]. The proglottids then release oncospheres in the intestine and penetrate the intestinal mucosa and blood vessels. After reaching the brain through the bloodstream, they will take 2-3 months to grow into a coenurus causing increased intracranial pressure. This will lead to the onset of clinical signs, such as ataxia, hypermetria, blindness, head deviation, headache, stumbling and paralysis [13][14][15]. Once the tissue of infected sheep or other livestock has been ingested by a definitive host, the lifecycle is completed, and the parasites develop into adult tapeworms in the small intestine of the host [5].
Together with this complex lifecycle, the specific immune evasion traits of parasites and even the randomness of the infection make research and drug or vaccine programs for Taenia species very difficult; consequently, new methods to control this parasite are required. Although traditional methods of control (such as burning out the infected brain and spinal cord of sheep, or deworming infected dogs with anti-parasitic drugs) help to disrupt the lifecycle of this parasite, its global distribution still includes Europe [16][17][18][19], North America [8,9], Africa and Asia [6,20].
Despite its global importance, available gene sequences for T. multiceps remain scarce. Currently, only 101 nucleotide sequences and 103 proteins have been published on NCBI and only the mitochondrial genome of T. multiceps has been sequenced [21]. To improve the control of this parasitic cestode, the identification of molecular targets to develop new effective antiparasitic drugs are necessary. Recent developments in next generation sequencing (NGS) technologies [22][23][24][25] and recent progress in bioinformatics, such as the new Trinity de novo assembling program [26], make it possible to explore the fundamental biology of cestodes in far greater detail than currently available information, and are cheaper than the commonly used Sanger sequencing technique [27]. To date, a fractionated transcriptome of the human parasite, Taenia solium cysticerca, has been revealed by the ORESTES method [28], and a cDNA library has been constructed [29] with 30,700 ESTs available in GenBank. Some vaccine and diagnostic targets were proposed from this study. However, further datasets generated by highthroughput sequencing and comparative transcriptome analysis could bring a more comprehensive understanding of parasite biology. To our knowledge, the transcriptomes of Taenia pisiformis, T. solium, Echinococcus granulosus, Echinococcus multilocularis and Hymenolepis microstoma by NGS technology have already been studied, but only T. pisiformis transcriptome dataset has been published [30,31]. Compared with nematodes [32][33][34][35][36][37][38] and trematodes [39][40][41][42], published transcriptome data of cestodes by NGS remain scarce. An improved understanding of the entire molecular transcriptome of the adult stage of this cestode is necessary. This can provide a platform to execute the identification or validation of required genes and gene products in the design of cestocides aimed at controlling the infection in dogs and disrupting the lifecycle of T. multiceps [19,32].
Here, we used the novel assembler Trinity and Illumina sequencing technology to gather initial insights into the transcriptome of the adult stage of T. multiceps. In addition, a comparative analysis was performed against the transcriptomes of other cestodes, including T. pisiformis, T. solium, E. granulosus and E. multilocularis, and intestinal parasites, including T. spiralis, A. caninum and A. suum, in order to help us to discover essential biological pathways and pathway-related genes for intestinal parasites or cestodes specifically. These would be essential for parasite development and survival, or for parasite-host interactions [33,34,43]. Therefore, we anticipate that a better understanding of parasite development or parasite-host interaction at a molecular level may help to develop new anti-cestode drug targets or candidate vaccines [35].

Illumina Sequencing and Assembly
In order to generate a broad survey of genes involved in T. multiceps survival and development, the RNA of T. multiceps adults was extracted. Using Illumina (paired-end) sequencing technology, we obtained a total of 28.3 million raw reads with an average length of 90 bp, a total of approximately 2.55 Gigabase pairs (Gbp). After the removal of raw reads that only had 3' adaptor fragments, ambiguous reads and low-quality reads, 27.4 million (2.47 Gbp, 96.92% of the raw reads) clean reads with a Q20 percentage of 89.92% and a GC content of 49.04% remained. All clean reads were assembled de novo by Trinity, generating 53,568 contigs without gaps longer than 300 bp (5,218 Mbp), with a mean contig length of 974 bp and a N50 of 1,268 bp (Table 1). A total of 36,538 contigs (68.21%) were longer than 500 bp. To join further sequences and remove any redundant sequences, contigs were clustered using the TIGR Gene Indices clustering tools (TGICL). A total of 31,282 unigenes were produced by the clustering, with an average length of 920 bp and a N50 of 1,206 bp (Table 1). Of these, 20,123 unigenes (64.33%) were longer than 500 bp, and 9,504 unigenes (30.38%) were longer than 1,000 bp. Short contigs and unigenes (less than 300 bp) were removed, the maximum length of both contigs and unigenes was 11,875 bp ( Table 1). The length distribution of these contigs and unigenes is shown in Figure 1A. The gap distribution of contigs and unigenes was analyzed to identify the data quality. All contigs and unigenes showed no gaps, thus demonstrating the high quality of Trinity assembling ( Figure 1B).

Annotation Against Public Databases
In order to obtain and validate sequence-based annotations for all assembled unigenes, we employed Blastx for a sequence similarity search against the Nr database (animal protein of Nr and Drosophila protein of Nr databases), in addition to the Uni-ProtKB/Swiss-Prot, UniProtKB/TrEMBL, COG, and KEGG databases, with an e-value threshold of 1.0 25  When the unigenes were first searched against the animal proteins of the Nr database [44,45], the e-value distribution of the top hits in the animal proteins indicated that 36% of the mapped sequences had a significant similarity with a stringent threshold of less than 1.0 250 , while 64% of the similar sequences ranged from 1.0 25 to 1.0 250 ( Figure S1A). The similarity distribution revealed that 20% of the sequences had a similarity higher than 60%, whereas 80% of the hits had a similarity between 18% and 60% ( Figure S1B). The species distribution showed that 48.73% of the unigenes had top matches and first hit against the sequences of Schistosoma mansoni, followed by Schistosoma japonicum (16.73%) and Danio rerio (2.09%), respectively ( Figure S1C).
The potential involvement in biological pathways of T. multiceps sequences was revealed by mapping against known proteins (equal to enzyme commission/EC number) of the KEGG database. Out of 31,282 assembled unigenes, a total of 11,645 homologous sequences were grouped into six categories, including 'Metabo- To identify domain-based annotations, unigenes were used to search the domain/families according to the InterPro and Pfam databases (e-value ,1.0 25 ). As a result, 25,457 (81.38%) sequences obtained the entry description against InterPro and were categorized into 4,562 domains/families. Most domains/ families were found to contain more than one unigene.

GO Functional Classification and GO Terms Comparison among Taeniidae Cestodes
Of the 17,618 Nr hits, a total of 4,706 sequences were assigned to 2,360 non-redundant GO terms according to the Blast2GO program. All GO terms were allocated into three main GO categories (including three categories: biological process, cellular component and molecular function) and 48 sub-categories (Table 1 and Figure 3, left). Biological process made up the majority (1,578 GO terms, 2,315 unigenes) followed by molecular function (512 GO terms, 2,809) and cellular component (270 GO terms, 3,354). Among the 10 subcategories of cellular components, GO terms were predominantly associated with 'cell' (3,324, 31.51%), 'cell part' (2,999, 28.43%) and 'organelle' (2,036, 19.3%) (Figure 3, left). Furthermore, a comparative analysis of GO terms based on the transcriptome unigenes of T. multiceps and T. pisiformis is shown in Figure 3 [37]. The major parts of each category in GO classification between T. multiceps and T. pisiformis showed the same sub-categories and a similar percentage, but still revealed their qualitative and quantitative differences [34].
To enabled a further analysis of the comparative GO classification with other cestodes, we searched the ESTs from the two homologous species, E. granulosus and E. multilocularis (genus Echinococcus, family Taeniidae), on NCBI. The 9,701 ESTs of E. granulosus received 5,876 GO annotations, whereas

Common Genes Found Among Selected Zoonotic Intestinal Parasites
To find essential genes and understand the biology of intestinal helminths further, we compared four selected zoonotic intestinal parasites (Taenia multiceps, Ancylostoma caninum, Trichinella spiralis and Ascaris suum). The available transcript datasets of four species of zoonotic intestinal parasites were chosen to detect common genes that may be involved in adapting to intestinal parasitic life based on KEGG annotations. Of the 32,895 annotated nucleotide ESTs (40,66% of 80,905) of A. caninum, 11,645 annotated unigenes (37.26% of 31,282) of T. multiceps and 2,110 annotated CDSs (12.88% of 16,380) of T. spiralis, 145 common genes were found from the overlapping annotated part (identity $80%) of A. caninum, T. spiralis and T. multiceps ( Figure 5A). Together with 12,703 annotated contigs (19.26% of 65,952) of A. suum, 109 common genes were found from overlapping parts ( Figure 5B) of the four species of intestinal parasites, which could constitute essential genes or potential drug targets. The 109 common genes IDs with corresponding annotations among T. multiceps, T. spiralis, A. caninum and A. suum are shown in Table S3. Here, one (unigene 18109) of the common genes was identified using RT-PCR, with the accession number GU205474 in GenBank.

Conserved Genes between Taenia and Echinococcus Tapeworms
In order to obtain a more detailed understanding of Taeniidae cestode biology and reveal potential drug target genes, we compared five important cestodes by making use of currently available datasets. After comparing three cestodes of the Taenia genus (31,282 unigenes of T. multiceps, 72,957 unigenes of T. pisiformis and 30,700 ESTs of T. solium), we obtained 5,100 common genes ( Figure 5C). Of these, 3,000 were annotated by the KEGG database (common genes are shown in Table S4). When the 5,100 Taenia common genes were combined with the 1,058 ESTs of Eg+Em (conserved ESTs between E. granulosus and E. multilocularis) [30], 261 conserved genes were detected ( Figure 5D), and 204 obtained KO annotations (Table S5). Some of the common genes were involved in T. multiceps survival and parasite- host interactions, supporting the development of drug targets/ vaccines and phylogenetic relationship analyses between Echinococcus and Taenia tapeworms. In addition, we validated one promising drug target (unigene 1299) using RT-PCR to support the further analysis of its protein structure and function (GenBank accession number: GU205473).

Discussion
In this study, a reliable and substantial transcriptome dataset of the adult stage of T. multiceps was produced by Illumina sequencing and Trinity assembling. The percentage of both contigs ($500 bp) and unigenes ($500 bp) was higher than 60%, and the gap rate of all contigs and unigenes was as low as 0%. Unlike previous studies, our results showed that the mean length (920 bp) of T. multiceps unigenes was shorter than the contigs (974 bp), which might be due to the higher frequency of long sequences in contigs than in unigenes, whereas the number of contigs in each length extent was higher than that of unigenes. The reason for this is that one unigene has multiple transcripts, due to alternative splicing in eukaryotes. Of 31,282 unigenes, 26,110 (83.47%) could be annotated by seven public databases, whereas the remaining 16.53% unaligned unigenes are most likely contain Taenia-or cestode-specific genes [46]. Compared with our previous study on T. pisiformis [30], in which 35.23% of all distinct sequences assembled by de novo SOAP had an annotation against the Nr database, the higher percentage of 56.32% found in this study might be partially due to the higher percentage of long sequences distributed in our unigenes (mean length of 920 bp versus a mean length of 398 bp in T. pisiformis). This is in accordance with a previous report that showed longer contigs are more likely to obtain BLAST matches in protein databases [47]. Moreover, a total of 20,896 CDS for T. multiceps was predicted from this transcriptome dataset, with an approximate 200-fold coverage of the available proteins (for this stage/species) in the public databases. All of these results demonstrate the high quality and effectiveness of Illumina paired-end sequencing technology and the Trinity assembler.
Throughout the biological pathway sub-categories, 'signal transduction' (1,974 unigenes) of the environmental information processing (EIP) category was the most highly represented in the transcriptome of T. multiceps (6.3%). In this sub-category, the 'Wnt signaling pathway' was one of the most abundant pathways and was particularly interesting. A comparative analysis of Wnt signaling components in parasitic and free-living flatworms (including H. microstoma, E. multilocularis, E. granulosus, S. mansoni and Schmidtea mediterranea) has been conducted and a hypothesis of Wnt gene loss in flatworms has been proposed [48]. It was found that Wnt1 was expressed only in adults of H. microstoma, whereas Wnt2 was expressed only in larvae. Wnt1 was known to play a role as a segment polarity gene in adult worms [48]. Surprisingly, neither Wnt1 nor Wnt2 were found in this adult T. multiceps transcriptome annotation. This may have been due to the limitation of Wnt annotations in the currently available databases, or may reveal that the presence of Wnt1 and Wnt2 differ between species. Further research is necessary to validate the role of Wnt genes in the development of T. multiceps and their loss in cestodes.
After the transcript comparison of four species of zoonotic intestinal helminths, we narrowed the scope of the gene group to find genes that might be used for adapting to intestinal parasitic life. The 109 genes in the common part were still imperfect due to the limited available dataset. However, they did contain conserved genes, essential genes for parasite survival or genes related to parasite-host interactions. The majority of these common genes were a-tubulin and actin (see Table S3 for the corresponding unigene number), which were conserved structural proteins and were essential in parasites.
Interestingly, we found that phosphoenolpyruvate carboxykinase (PEPCK: unigene 17356) might be involved in parasite-host interaction. PEPCK is a key enzyme in malic acid disproportionation, which is the principle process for the continuous metabolism of phosphoenolpyruvate (PEP). PEP is generated from anaerobic glycolysis, in order that intestinal parasitic helminths can obtain ATP in anaerobic conditions [49]. Therefore, the lack of this important enzyme would interrupt parasite glycometabolism. As PEPCK has significant differences in the molecular properties between host and parasites, the parasite PEPCK molecular characteristics need further identification as a possible drug target [50]. The 3D structure of A. suum PEPCK was predicted by Verma et al. [50], but further analysis of PEPCK in intestinal helminths will be necessary to investigate whether a parasite-specific drug that is minimally toxic to the host can be found.
T. multiceps, T. solium, E. granulosus, E. multilocularis and T. pisiformis are five important Taeniidae cestodes, of which the former four species are zoonotic parasites that cause huge economic losses and threaten human health [12,29]. The currently available sequences for T. multiceps are poor and less than 10 candidate antigens exist as drug targets, including Tm18 ). In this study, we used relatively large datasets that contained 31,282 unigenes of T. multiceps and 72,597 unigenes of T. pisiformis obtained by highthroughput sequencing, and 30,700 ESTs of T. solium obtained by cDNA library construction, to bring to light the scope of potential drug targets/candidate vaccines. Conserved Taenia genes and essential genes for Taenia survival were contained by this scope of common genes. There might be other new Taenia-specific genes in the remaining 2,100 un-annotated common genes.
Together with the Eg+Em common ESTs, 261 conserved genes were obtained between Taenia and Echinococcus tapeworms. Among 204 annotated conserved genes, we also found PEPCK (unigene 17356). As the five Taeniidae cestodes are also intestinal parasites, the PEPCK might play the same role as discussed above and could be a promising drug target if it is minimally toxic to the host. Furthermore, PEPCK has been considered as a new marker of the phylogenetic relationship within Echinococcus and Taenia tapeworms [51], and this current finding of PEPCK in T. multiceps will help in the study of cestode phylogenetic relationships.
The 261 conserved genes were believed to contain necessary genes for the five Taeniidae cestodes that could help drug target/ vaccine finding. Unigene 1299 (see Table S5) of these conserved genes was particularly important and valuable. Unigene 1299 was annotated as FABP3. FABPs functions to enable these five cestodes to obtain long-chain fatty acids and cholesterol from the host to substitute for their lack of de novo synthesis of most lipids [52]. Two further reasons support the use of FABP as a promising candidate vaccine: 1) all five Taeniidae cestodes, which live in the intestinal tract, simulate the host intestinal mucosal immunity to secrete IgA; and 2) FABP of E. granulosus can induce the host to produce IgA, IgG1 and IgG2a [53]. As a result of the potential of this unigene, we amplified the full-length CDS of T. multiceps FABP3 using RT-PCR.
In this in-depth study, we obtained a broad transcriptome dataset of the adult stage of T. multiceps using Illumina paired-end sequencing technology and a Trinity de novo assembler without a reference genome. A total of 31,282 unigenes was produced with 26,110 sequences having annotations against seven public databases. We have demonstrated the feasibility and advantage of using a Trinity assembler. The common genes found among four zoonotic intestinal parasites (T. multiceps, T. spiralis, A. caninum and A. suum) and the comparative transcript analysis with T. pisiformis, T. solium, E. granulosus and E. multilocularis established a substantial platform for the better understanding of T. multiceps survival and development, further study of parasite-host interactions, and the development of future drug targets/vaccines and phylogenetic relationship analyses within Echinococcus and Taenia tapeworms.

Sample Preparation
The larvae (coenurus) (kindly provided by Yingdong Yang, Panzhihua, China) were collected from the brain of a naturally infected goat at an organic farm from Sichuan, China. The infection experiment was followed by larval morphological identification, and was performed by administering 20 larvae of T. multiceps into two parasite-free beagle dogs. Once the gravid proglottid expelled in the feces of infected dogs appeared on day 28 post-infestation, adult T. multiceps were immediately removed from the small intestine and washed thoroughly in physiological saline solution (37uC) to avoid host contamination; they were then transferred into liquid nitrogen and stored at 280uC until further use. All animals from which specimens were collected were handled in accordance with the animal protection law of the People's Republic of China (a draft of an animal protection law in China was released on September 18 th , 2009). This study was approved by the National Institute of Animal Health Animal Care and Use Committee at Sichuan Agricultural University (approval number 2010-018).

RNA Isolation and Illumina Sequencing
A paired-end transcriptome sequencing (RNA-Seq) [22] was employed. Total RNA was isolated from adult T. multiceps (n = 6) using Trizol (Invitrogen, Carlsbad, CA), following the manufacturer's instructions. The integrity of total RNA was verified using Agilent 2100 with the RNA integrity number (RIN). Polyadenylated (polyA) RNA was purified from 40 ml of total RNA using Sera-Mag oligo (dT) beads, fragmented into small pieces by fragmentation buffer, reverse-transcribed using random hexamers and reverse transcriptase, and then end-repaired with adapter primer attached and adaptor-ligated by the addition of a specific adapter, according to the manufacturer's protocol (Illumina). These ligated products were purified and amplified with PCR to create the final cDNA library [45]. The cDNA library was sequenced by Beijing Genomics Institute (BGI)-Shenzhen, Shenzhen, China, on a HiSeq TM 2000 (Illumina), according to manufacturer's instructions. The transcriptome raw reads dataset has been submitted to the NCBI Short Read Archive (http:// www.ncbi.nlm.nih.gov/Traces/sra_sub/sub.cgi) with the accession number: SRA048944.

Bioinformatic Analysis
A new Trinity de novo transcriptome assembler [26] was selected to assemble the sequence data from Illumina sequencing for T. multiceps. Reads that have a certain length overlap area were joined (the best method of joining was chosen by Trinity) into longer fragments, which are called contigs without gaps. TIGR Gene Indices clustering tools (TGICL) [54] were used to splice sequences and remove redundant sequences, and then unigenes without gaps could be obtained until they could not be further elongated. Reads per kb per million reads (RPKM) [55] were used to show the expression quantity, thus avoiding the influence of sequencing length and differences. The assembled unigenes (longer than 300 bp) are available from the Transcriptome Shotgun Assembly Sequence Database (TSA) at the NCBI with the following accession numbers: JR916739 -JR948020.
Unigene sequences were first aligned to the protein databases Non-redundant (Nr), UniProtKB/Swiss-Prot, UniProtKB/ TrEMBL, Cluster of Orthologous Groups (COG) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) databases by Blastx, using an e-value threshold of 1.0 25 . InterProScan (http:// www.ebi.ac.uk/InterProScan/) [56] and HMMER (http:// hmmer.janelia.org/) were used to obtain domain-based annotation by InterPro (http://www.ebi.ac.uk/interpro/) [57] and Pfam version 25.0 (March 2011, 12,273 families) (http://Pfam.sanger. ac.uk) [58] terms, as previously described (Shi et al. 2011). The unigenes were tentatively annotated according to the known sequences with the highest sequence similarity. The annotated unigenes direction and CDSs were identified by the best alignment results. ESTScan [59] was used to predict the coding sequences (CDS) and the sequence direction when unigenes were unaligned to any of the databases.

Common Genes Found between Taenia and Echinococcus
Based on the previous study of the T. pisifomis transcriptome, we choose 31,282 and 72,597 unigenes from T. multiceps and T. pisiformis [30], and 30,700 ESTs available from T. solium [29] to find the common genes among Taenia spp. When the common Taenia genes were combined with 1,058 ESTs from Eg+Em (see previous study [30]), we obtained the common genes shared between Taenia and Echinococcus. Finally, the common genes found among Taenia (T. multiceps, T. pisiformis and T. solium) and conserved genes between Taenia and Echinococcus (E. granulosus and E. multilocularis) were aligned to the KEGG database using the Blastx algorithm.

Validation of the Transcriptome CDS of T. multiceps
The full-length CDS of two potential unigenes were amplified by RT-PCR using cDNA of adult T. multiceps (primers and annealing temperature are shown in Table S6). Primers of the two unigenes were designed using Primer premier 5.0.