Transcriptome Analysis and Comparison of Marmota monax and Marmota himalayana

The Eastern woodchuck (Marmota monax) is a classical animal model for studying hepatitis B virus (HBV) infection and hepatocellular carcinoma (HCC) in humans. Recently, we found that Marmota himalayana, an Asian animal species closely related to Marmota monax, is susceptible to woodchuck hepatitis virus (WHV) infection and can be used as a new mammalian model for HBV infection. However, the lack of genomic sequence information of both Marmota models strongly limited their application breadth and depth. To address this major obstacle of the Marmota models, we utilized Illumina RNA-Seq technology to sequence the cDNA libraries of liver and spleen samples of two Marmota monax and four Marmota himalayana. In total, over 13 billion nucleotide bases were sequenced and approximately 1.5 billion clean reads were obtained. Following assembly, 106,496 consensus sequences of Marmota monax and 78,483 consensus sequences of Marmota himalayana were detected. For functional annotation, in total 73,603 Unigenes of Marmota monax and 78,483 Unigenes of Marmota himalayana were identified using different databases (NR, NT, Swiss-Prot, KEGG, COG, GO). The Unigenes were aligned by blastx to protein databases to decide the coding DNA sequences (CDS) and in total 41,247 CDS of Marmota monax and 34,033 CDS of Marmota himalayana were predicted. The single nucleotide polymorphisms (SNPs) and the simple sequence repeats (SSRs) were also analyzed for all Unigenes obtained. Moreover, a large-scale transcriptome comparison was performed and revealed a high similarity in transcriptome sequences between the two marmota species. Our study provides an extensive amount of novel sequence information for Marmota monax and Marmota himalayana. This information may serve as a valuable genomics resource for further molecular, developmental and comparative evolutionary studies, as well as for the identification and characterization of functional genes that are involved in WHV infection and HCC development in the woodchuck model.


Introduction
It is estimated that more than 248 million people are chronically infected with Hepatitis B virus (HBV) worldwide [1]. Every year, over 500,000 people die due to HBV-associated liver diseases, such as cirrhosis and hepatocellular carcinoma (HCC) [2,3]. Pegylated interferon-α (IFN-α) and various nucleos(t)ides are currently licensed for the treatment of chronic hepatitis B (CHB), but they rarely lead to cure [4]. Research on the development of new treatment options for CHB is urgently needed.
The progress in HBV research is strongly dependent on the availability of suitable animal models. Except humans, HBV can infect only apes, tree shrews and macaques [5]. However, due to various restraints encountered using these primate animal models for HBV infection, animal models based on a series of HBV-related hepadnaviruses were discovered in the past decades (ducks, geese, herons, woodchucks, squirrels and woolly monkeys) [6]. Among them, the Eastern woodchuck (Marmota monax, M. monax) represents a well established animal model for investigating HBV-related disease and served in preclinical efficacy studies to investigate therapeutic strategies for CHB [7][8][9]. M. monax is naturally infected with woodchuck hepatitis virus (WHV), which is genetically closely related to human HBV, and has a disease course similar to that in HBV-infected patients. Recently, we found that Marmota himalayana (M. himalayana), an Asian animal species closely related to M. monax, is susceptible to WHV infection and can be used as a new mammalian model for HBV infection [10]. It is proposed that M. himalayana and M. monax are derived from a common ancestor, and belong to the same subgenera Marmota [11,12]. Our previous genetic and immunogenetic analysis also indicated that the M. himalayana are closely related to M. monax, as the genes of the immune system of both species are nearly identical in their nucleotide sequences, as analyzed so far [13][14][15]. However, the transcriptional comparison analysis between the two species was restricted to only a few sequenced gene segments due to the lack of accessible genomic resources for both species. The lack of transcriptional sequence information has also hampered the application of the woodchuck model for functional genomics analysis and immunological mechanism studies in HBV and HCC.
In recent years, the use of deep-sequencing technologies has offered a powerful tool for a global and rapid transcriptome analysis via a high-throughput approach [16]. For example, RNA-Seq can generate millions of short cDNA reads from the transcriptome of an organism. The resulting reads are either aligned to a reference genome or reference transcripts, or assembled de novo (without the genomic sequence) to produce a genome-scale transcription map that consists of gene expression profiling, genomics function annotation, single-nucleotide polymorphism (SNP) and simple sequence repeats (SSR) marker identification, as well as protein coding sequence (CDS) prediction [17][18][19].
The first woodchuck transcriptome analysis was carried out by S.P.Fletcher et al using deepsequencing technology (the Roche 454 Genome Sequencer FLX). In the study, the authors briefly described the sequencing, assembly and annotation of the woodchuck transcriptome, and focused on the characterization of the transcriptional response to persistent WHV infection and WHV-induced HCC (S.P.Fletcher et al., Hepatology 2012) [20]. By taking the power of deep-sequencing technologies, the authors were able to establish the translational utility of the woodchuck model and provide new insights into various immune pathways which may play a role in HBV persistence or associated liver diseases. However, the data of basic woodchuck transcriptome analysis of this report, such as CDS information, were still inaccessible. Besides, to our knowledge, there was still no transcriptome information of M. himalayana available so far.
In this study, we report and focus on the sequencing, assembly, annotation, and comparison of the transcriptomes of M. monax and M. himalayana by using Illumina RNA-Seq technology. This transcriptomic resource may serve as a valuable foundation for further molecular, developmental and comparative evolutionary studies, as well as for the identification and characterization of functional genes that are involved in WHV infection and HCC development in the woodchuck model.

Materials and Methods
Animals M. monax were purchased from North Eastern Wildlife (Harrison, ID, USA) and M. himalayana were captured in the area of Tongren county (N35°31 0 25.42@, E102°01 0 33.42), Qinghai province, China. The capture of M. himalayana for this study was performed in 2009 and 2010, and no specific permissions for the capture activities were needed in Qinghai province at that time. Each animal was housed in a single cage with two separate chambers. One chamber was for food and drink, and the other one was for living with its floor covered by sterilized hay. The conditions of experimental facility were kept as follows, the room temperature: 25 ± 1°C, the air humidity: 60% ± 10%, the light cycle: 12:12 hours. The animals were euthanized by intracardiac injection of 1ml T-61. Abdomen incision was made to sufficiently expose the abdominal organs of the sacrificed animal, and the sections of spleen and liver were quickly collected and frozen in liquid nitrogen. This study was carried out in strict accordance with the recommendations in the Guide for the Care and Use of Laboratory Animals of the National Institutes of Health. The protocol was approved by the Committee on the Ethics of Animal Experiments of the University of Tongji Medical College, Huazhong University of Science & Technology, China (Permit Number: 491). All surgery was performed under sodium pentobarbital anesthesia, and all efforts were made to minimize suffering. The information on age, gender, body weight, duration in the laboratory environment and the date of euthanasia of used animals were provided in S1 Table. RNA Extraction, cDNA Library Synthesis and RNA Sequencing Total RNA was extracted from the liver and spleen samples of two M. monax and four M. himalayana (Table 1). RNA was extracted from frozen tissue samples using Trizol total RNA extraction reagent according to the manufacturer's instructions (TaKaRa, Dalian, China). Briefly, 60mg tissue sample was ground with liquid nitrogen into powder and transferred the powder sample into the 2 ml tube containing of 1ml Trizol reagent, and then the total RNA was extracted by protein depletion processing with Chloroform/isoamyl (24:1) alcohol and washing with isopropyl alcohol. At last, the dry RNA was dissolved in 25μL RNase-free water and measuring concentration with NanoDrop Spectrophotometer (Thermo Scientific Inc). RNA-seq library preparation and sequencing was carried out by Beijing Genomics Institute BGI (Hong-Kong, China). In brief, DNase I digestion was performed to remove any traces of genomic DAN and magnetic beads with oligo (dT) were used to isolate mRNA. Following purification, the mRNA was fragmented using divalent cations at elevated temperature. Taking these short fragments as templates, the first-strand cDNA was synthesized using random hexamer primers and Superscript TM III (Invitrogen™, Carlsbad, CA, USA). The second strand cDNA was synthesized using buffer, dNTPs, RNaseH and DNA polymerase I. Short fragments were purified with a QiaQuick PCR extraction kit (Qiagen) and resolved with EB buffer for end reparation and poly(A) addition. The short fragments were then connected using sequencing adapters. The suitable fragments were selected as templates for the PCR amplification. For the quality control steps, Agilent 2100 Bioanaylzer and ABI StepOnePlus Real-Time PCR System were used in quantification and qualification of the sample library. The libraries were sequenced using Illumina HiSeq™ 2000 (Illumina Inc., San Diego, CA, USA).

Filter RNA-Seq raw reads and sequence assembly
Raw reads generated from the RNA-seq were filtered by removing reads with adaptors, reads with unknown nucleotides larger than 5% and low-quality reads which the percentage of lowquality bases (base quality 10) was more than 20%. The filtered reads were assembled into transcripts using Trinity (version: release-20121005) [21], a reference genome-independent assembler which combines three independent software modules: Inchworm, Chrysalis, and Butterfly. In brief, Inchworm assembled the RNA-seq reads into the unique sequences of transcripts (contigs). Chrysalis clustered the Inchworm contigs and constructed complete de Bruijn graphs for each cluster. Each cluster represented the full transcriptional complexity for a given gene (or sets of genes that share sequences in common). Chrysalis then partitioned the full read set among these disjoint graphs. Butterfly processed the individual graphs in parallel, including tracing the paths that reads and pairs of reads take within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms, and teasing apart transcripts that corresponds to paralogous genes. The ultimate result sequences of Trinity were called Unigenes. When multiple samples from the same species were sequenced, Unigenes from each sample's assembly were taken into the further process of sequence splicing and redundancy removing with sequence clustering software [22] to acquire non-redundant Unigenes as long as possible.

Annotation of Unigenes
Unigenes were used for BLAST searches and annotation against an NCBI Nr protein database [23] (NCBI non-redundant sequence database) using an E-value cut-off of 0.00001 (E-value 0.00001). Consensus sequences were further aligned by BLASTX to protein databases such as Swiss Institute of Bioinformatics databases (Swiss-Prot) [24], Kyoto Encyclopedia of Genes and Genomes (KEGG) [25] and Clusters of Orthologous Groups of proteins (COG) [26], retrieving proteins with the highest sequence similarity with the given sequences along with the functional annotations for their proteins. If results of different databases conflicted, a priority order of Nr, Swiss-Prot, KEGG and COG was followed. The Blast2GO program [27] (version: release 2012-10-01) was used to obtain Gene Ontology (GO) [28] annotations for the sequences, as well as for KEGG and COG analysis. The WEGO software [29] was then used to perform GO functional classification of all sequences to view the distribution of gene functions of the species at the macro level. The analysis mapped all of the annotated sequences to GO terms in the database and calculated the number of sequences associated with every term.

CDS prediction of Unigenes
Unigenes were firstly aligned to protein databases and then the sequences with the highest rank in blast results were selected to decide the CDS of Unigenes. For Unigenes that could not be aligned to any database were scanned by ESTScan [30] with software ESTScan (version: v3.0.2) producing nucleotide sequence in (5' − 3') direction and amino sequence of the predicted coding region. The numbers of CDS were summarized, and the length distribution of Unigenes blast and ESTScan CDS were added up.

Unigene SNP and SSR analysis
The SNPs were identified on the consensus sequence through the comparison with the Unigenes using software SOAPsnp. The SSR detection was done with software MicroSAtellite (MISA) using Unigenes as reference.

Similarity Analysis of GO Classification
A total number of M. himalayana Unigenes annotated by GO classification were aligned in consistency with M. monax in four range values, compartmentalized by genetic similarity.

Results and Discussion
Sample preparation and Illumina sequencing  Table 1). After cleaning and quality checks, six independent rounds of Illumina sequencing for each sample were performed and generated 1,663,766,16 raw reads in total. The raw reads were preprocessed to eliminate primer/adaptor contamination and low quality sections of reads because these sequences can substantially compromise de novo assembly. At last, in total 1,498,715,88 clean reads were generated, encompassing 13,488,422,920 total nucleotides (nt) ( Table 1). These data were comprised of high quality reads with the Q20 percentages greater than 96%, hardly any N percentages and nearly half GC percentages for each of the six libraries ( Table 1). The raw sequence data sets are available in the Sequence Read Archive (SRA) database (accession number: SRP074653 and SRP075170).
De novo assembly of sequence reads without a reference genome Transcriptome de novo assembly was carried out with short reads assembling program-Trinity [31], which combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-seq reads. Firstly, the RNA-seq data were assembled into the unique sequences of transcripts by Inchworm to generate Contigs for each of the six libraries. Table 2 shows the general information of Contigs, including total number, total length, mean length and N50 Contig length (50% of the total assembled sequence was contained in Contigs of this length or longer). Fig 1 shows  Moreover, gene family clustering divided the Unigenes into two classes: One class comprised the Clusters, for which the prefix CL followed by the cluster ID and the number of contigs in each cluster was given (S2 and S3 Tables). In any one cluster, there were several Unigenes for which similarity between the consensus sequences was more than 70%. The other class comprised the Singletons, for which the prefix Unigene was given. For M. monax, 20,897 Unigenes were grouped into 6,383 different clusters, and 85,599 Unigenes were Singletons. For M. himalayana, 15,125 Unigenes were grouped into 6,882 different clusters, and 63,358 Unigenes were Singletons (Table 2).

Annotation and classification of Marmota Unigenes
For annotation, the Unigenes were aligned by Blastx (e-value < 0.00001) with protein databases such as NR, Swiss-Prot, KEGG, COG and GO. The KEGG pathway database records networks of molecular interactions in cells, and variants of them, specific to particular organisms. Pathway-based analysis helped to understand further the biological functions of genes. Pathway information for all annotated sequences was obtained from KEGG pathway annotations. COG is a database where orthologous gene products are classified. The whole database is built on genes encoding proteins from species with complete genome sequences as well as the evolutionary relationships between bacteria, algae and eukaryotes. All Unigenes were aligned to the COG database to predict and classify possible functions. GO offers three ontologies of genes: molecular function, cellular component and biological process. The basic unit of GO is the GO-term which belongs to a type of ontology. Based on the NR annotation, the Blast2GO program was used to get the GO annotation of all Unigenes. WEGO software was then used for GO functional classification and to understand the distribution of gene functions of the species at a macro level. In each database, two criteria were used, the score and the e-value. Each gene was analyzed independently, and the annotation was made according to these criteria.  (Table 3). The identity distribution of top hits in the NR database indicated that 69.3% of the M. monax Unigenes and 71.3% of the M. himalayana Unigenes have sequence identity higher than 80% (Fig 3A). Among the annotated sequences, the species with the highest number of best hits was Saimiri boliviensis (8.7% matched genes of M. monax and 8.9% of M. himalayana), followed by Homo sapiens (7.9% of M. monax and M. himalayana) and Otolemur garnettii (6.2% of M. monax and 6.6% of M. himalayana) (Fig 3B).
To further differentiate the annotated sequences at the protein level, all Unigenes were aligned to COG database and were grouped into 25 categories based on their putative     To further identify active biochemical pathways and to know about biological complex behaviors in Marmota, the transcript sequences were mapped to the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. A Total of 29,156 and 24,258 transcripts were mapped in M. monax and M. himalayana respectively, and 258 pathways were predicted for both species. The data from KEGG annotation demonstrated that the largest predicted category was metabolic pathway (11.5% for M. monax and 10.7% for M.himalayana), which contained 3,347 and 2,591 transcripts for M. monax and M. himalayana respectively, and the second largest category was focal adhesion (4.2% and 4.3%). Other pathways predicted in Marmota included the following: pathway in cancer (3.7% and 3.9%); regulation of action cytoskeleton (4.0% and 3.9%); amoebiasis (3.4% and 3.0%); endocytosis (2.9% and 2.8%); MAPK signal pathway (2.8% and 2.7%); and HTLV-I infection (3.3% and 2.9%), et al (S4 Table).
The results of annotation of transcriptome provide a comprehensive fundamental functional classification and pathway identification for the two Marmota breeds, and will be a valuable resource for future researches involving metabolic processes and pathway function in the Marmota animal models.

Protein Coding Region Prediction
Next, the Unigenes were aligned by blastx (evalue<0.00001) to protein databases to predict their coding DNA sequences (CDS) in the priority order of NR, Swiss-Prot, KEGG and COG. Proteins with highest ranks in blast results were taken to decide the CDS of Unigenes, which were then translated into amino acid sequences with the standard codon table. The orientation and CDS of Unigenes that had no hit in blast were predicted using ESTScan. In total, 41,247 CDS (S1 and S2 Files) of M. monax and 34,033 CDS (S3 and S4 Files) of M. himalayana were obtained by blastx and ESTScan (Table 4). Fig 6 shows the length distribution of CDS by blastx of both woodchuck species ranging from 200 bp to over 3,000 bp, and the most abundant CDS were 300 bp and the least abundant were 3000 bp. The most abundant CDS predicted by ESTscan were also 300 bp in length for both species (Fig 7). Detailed CDS information is available in S1, S2, S3 and S4 Files.

Unigene SSR Analysis
Simple Sequence Repeats (SSRs), also known as microsatellites, are repeating sequences of 1-6 base pairs of DNA with conservative flanking sequence,which can generate SSR markers used for genomic mapping, DNA fingerprinting, and marker-assisted selection in many species [32,33]. To identify the SSR for M. monax and M. himalayana, software MicroSAtellite (MISA) was utilized by using Unigenes as reference. In total, 14,392 SSRs for M. monax and 12,983 for  (Fig 8). These SSRs will be useful as molecular markers for assaying the functional diversity in natural populations or germplasm collections.

Heterozygous SNP Analysis
A single nucleotide polymorphism (SNP) is a variation in a single nucleotide that occurs at a specific position in the genome. SNP can be used to follow the inheritance patterns of chromosomal regions from generation to generation and are powerful tools in the study of genetic factors associated with diseases [34]. Here, the software SOAPsnp was utilized to identify the SNPs of all individual animals. In brief, the consensus sequences for the genome of each individual animal were assembled based on the alignment of the raw sequencing reads with the Unigenes of M. monax or M. himalayana. The SNPs were then identified on the consensus sequences by using the corresponding species Unigenes as references. Table 5 shows the numbers of different types of SNPs identified for all six animals used in this study. In all animals, the most frequent SNPs are A-G transition and the least are A-T transversion.

Transcriptome sequences homology analysis
Next, to characterize the transcriptome sequences homology between the two marmota species, we performed a large-scale sequence comparison of the Unigenes with identical annotation   (Fig 9A). GO functional classification was also performed to further analyze the homology distribution of the Unigenes in different functional categories. The Unigenes in category of biological process showed the highest similarity in sequences between the two Marmota species. In all 57 functional classifications, only the Unigenes of extracellular matrix part, nucleoid, receptor regulator activity and translation regulator activity showed relatively lower homology (Fig 9B). Thus, these results indicated that M. himalayana shares a high similarity in transcriptome sequences with M. monax.

Conclusions
Although both M. monax and M. himalayana have been proven to be useful animal models for studying HBV infection and HCC, the lack of genomic sequence information of them strongly limited their application breadth and depth. In this study, we successfully performed Illumina RNA-Seq technology to sequence the transcriptome of liver and spleen samples of M. monax and M. himalayana. Consensus sequences and Unigenes were generated by de novo assembly and functional annotation for both species. The Unigenes were aligned by blastx to protein databases to obtain the CDS information. The SNPs and SSRs were also analyzed for all Unigenes obtained. Moreover, a large-scale transcriptome comparison was performed and revealed a high similarity in transcriptome sequences between the two marmota species. This is the first time that the transcriptome of M. himalayana was fully sequenced and analyzed. Our study provides an extensive amount of novel sequence information for M. monax and M. himalayana. This information may serve as a valuable genomics resource for further molecular, developmental and comparative evolutionary studies, as well as for the identification and characterization of functional genes that are involved in WHV infection and HCC development in the woodchuck model.