De novo assembly and annotation of the retinal transcriptome for the Nile grass rat (Arvicanthis ansorgei)

Cone photoreceptors are required for color vision and high acuity vision, and they die in a variety of retinal degenerations, leading to irreversible vision loss and reduced quality of life. To date, there are no approved therapies that promote the health and survival of cones. The development of novel treatments targeting cones has been challenging and impeded, in part, by the limitations inherent in using common rodent model organisms, which are nocturnal and rod-dominant, to study cone biology. The African Nile grass rat (Arvicanthis ansorgei), a diurnal animal whose photoreceptor population is more than 30% cones, offers significant potential as a model organism for the study of cone development, biology, and degeneration. However, a significant limitation in using the A. ansorgei retina for molecular studies is that A. ansorgei does not have a sequenced genome or transcriptome. Here we present the first de novo assembled and functionally annotated transcriptome for A. ansorgei. We performed RNA sequencing for A. ansorgei whole retina to a depth of 321 million pairs of reads and assembled 400,584 Trinity transcripts. Transcriptome-wide analyses and annotations suggest that our data set confers nearly full length coverage for the majority of retinal transcripts. Our high quality annotated transcriptome is publicly available, and we hope it will facilitate wider usage of A. ansorgei as a model organism for molecular studies of cone biology and retinal degeneration.


Introduction
Rod and cone photoreceptors are the light sensitive cells of the retina that enable the detection of visual stimuli. Rods are responsible for vision under dim light conditions, whereas cones mediate color and high acuity vision. Cone photoreceptors degenerate in a variety of eye diseases, including age-related macular degeneration (AMD), cone-rod dystrophy, and retinitis pigmentosa (RP). Development of therapeutic strategies promoting the survival of cones in these pathological settings has been challenging, due in part to difficulties inherent in studying cones when using common rodent model organisms. Laboratory mice (Mus musculus) and rats (Rattus norvegicus) are nocturnal and have rod-dominant retinas, with cones comprising only~3% and~1% of M. musculus [1] and R. norvegicus [2] photoreceptors, respectively. Thus, these organisms are not ideally suited for studies of cones. With the goal of developing improved small animal models for the study of cones, efforts have been made to identify rodents that contain more cone-enriched retinas. Among the rodent species that have been identified as having cone-enriched retinas are the African Nile grass rat (Arvicanthis ansorgei) [3] and the 13-lined ground squirrel (Ictidomys tridecemlineatus) [4].
From the experimental perspective, A. ansorgei has the advantage over I. tridecemlineatus in that it can be more easily maintained in laboratory colonies. Until recently, I. tridecemlineatus could not be bred under laboratory conditions and thus had to be caught wild. Although a protocol has since been established for maintaining I. tridecemlineatus in laboratory colonies, there are unique challenges related to caring for animals that undergo months of torpor [5].
A. ansorgei has primarily been studied in the context of circadian rhythms [3]. As its retina is comprised of more than 30% cones, it is suitable as a mammalian model for the study of cone biology and pathology [6]. As one example, the N-methyl-N-nitrosourea (MNU) chemical induced retinal degeneration model has been established in A. ansorgei. Structural and functional studies demonstrate that MNU treatment causes a spatiotemporally reproducible photoreceptor degeneration in the A. ansorgei retina [7]. The pattern of degeneration is one in which rod cell death is followed by cone cell death, a pattern of degeneration that is also seen in human RP. Thus, A. ansorgei is well-suited as a model for molecular studies of cone function and degeneration and for the identification of cone specific genes, pathways, and mechanisms that promote homeostasis and survival. A significant hurdle for molecular studies in A. ansorgei, however, is that there is very limited genome or transcriptome data currently available for this organism. Due to insufficient species specific sequence information, the research community has had to characterize genes of interest one at a time or rely on data from M. musculus or R. rattus, which generally does not completely or accurately represent A. ansorgei. Especially in the context of cone photoreceptor studies, there are likely mechanisms in the diurnal A. ansorgei that would be missed when basing such studies on information gleaned entirely from nocturnal rod-dominant model organisms.
As of February 2017, there were less than 400 ESTs in the NCBI repository for the entire Arvicanthis genus, whereas there are 4.9 million ESTs for M. musculus and 1.1 million ESTs for R. norvegicus. The genus Arvicanthis, which has seven recognized species [8], has no genomic sequencing data, and only one RNA sequencing data set has been published, which was for the species A. niloticus [9]. Phylogenetic analysis based on both mitochondrial and nuclear genes has revealed that within the genus, there are two main clades, where A. niloticus is a member of one, and A. ansorgei is a member of the other [10]. The evolutionary event dividing the genus into these two sister monophyletic subgroups is estimated to have occurred more than 5 million years ago [11]. With respect to diversity at the level of the DNA sequence, analysis of the complete sequence of the highly conserved gene encoding cytochrome b has demonstrated that the average degree of sequence divergence between different species of the Arvicanthis genus is 15.5% [11].
To aid in the further development of A. ansorgei as a useful model for studies of cone development, function, and degeneration, we performed RNA-Seq on retinas from adult animals and de novo assembled and annotated the first transcriptome for this species. The assembled and annotated retinal transcriptome is publicly available and will hopefully serve as a resource for downstream molecular studies.

RNA preparation
All animals were maintained in compliance with the guidelines of the Animal Care and Use Committee of Institut des Neurosciences Cellulaires et Intégratives (Chronobiotron UMR 3415). The protocols used in this study were approved by the Comité Régional d'Ethique en Matière d'Expérimentation Animale de Strasbourg (CREMEAS, ethical license reference AL/ 24/31/02/13). A. ansorgei were housed in 22±2˚C rooms under a 12:12 hour light dark cycle, 100 lux white light with lights on at 7 am and lights off at 7 pm. Animals were fed with standard rat chow supplied ad libitum. Young adult (5-6 months) female Arvicanthis (n = 2) were used for this study. Euthanasia was performed by CO 2 inhalation, and all efforts were taken to minimize suffering. Whole retinas were rapidly isolated by cutting across the cornea with a clean scalpel blade followed by retinal extrusion. They were immediately flash frozen in liquid nitrogen and stored at -80˚C until ready for use. Retinas were independently homogenized in Buffer RLT Plus + 1% β-mercaptoethanol, and total RNA was extracted with genomic DNA removal using the RNeasy Plus Mini Kit according to manufacturer's instructions (Qiagen, Germantown, MD, USA). RNA samples were quantified by the RNA 6000 Nano Kit on the 2100 Bioanalyzer (Agilent, Santa Clara, CA, USA).

RNA-Seq library preparation and sequencing
Two high quality RNA samples were used to prepare independent RNA-Seq libraries using previously described methods [12]. First strand cDNA synthesis was performed with 195 ng total RNA using anchored oligo-dT and SuperScript III First-Strand Synthesis SuperMix (ThermoFisher, Waltham, MA, USA). Second strand cDNA synthesis was peformed using RNase H, DNA Polymerase I, and Invitrogen Second Strand Buffer (ThermoFisher, Waltham, MA, USA). Double stranded cDNA was purified using DNA Clean & Concentrator-5 (Zymo Research, Irvine, CA, USA). Tagmentation was performed using the Nextera DNA Library Preparation Kit (Illumina, San Diego, CA, USA). Tagmented DNA was purified using DNA Clean & Concentrator-5 before Nextera PCR amplification. Libraries were cleaned using Agencourt AMPure XP beads according to manufacturer's instructions (Beckman Coulter, Brea, CA, USA). Libraries were evaluated by the High Sensitivity DNA Kit on the 2100 Bioanalyzer. The average size of the library fragments were 705 bp and 561 bp for samples S1 and S2, respectively. They were then sequenced with 93 bp paired ends on an Illumina HiSeq 2000 in high output mode with V3 chemistry.

De novo transcriptome assembly and quantification
FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) was used to assess the quality of the sequencing data. Trimmomatic was used to trim adapters and leading or trailing bases with quality score less than 30, and resultant reads less than 25 bp in length were dropped [13]. Trimmomatic was invoked using the command java-jar trimmomatic-0. 32 . Cleaned paired reads were combined across both samples and passed to Trinity for de novo transcriptome assembly with in silico normalization [15]. Trinity was invoked using the command Trinity-seqType fq-max_memory 480G -CPU 48-normalize_reads -left left.fastq-right right.fastq-output out_dir-grid_conf trinity_conf.txt. RSEM, with Bowtie alignment, was used to quantify Trinity assembled transcript abundance in each sample [16]. The Trinity utility was invoked using the command align_and_estimate_abundance. pl-transcripts Trinity.fasta-seqType fq-left paired_R1.fastq.gz-right paired_R2.fastq.gz-est_method RSEM-aln_method bowtie-trinity_mode.

De novo transcriptome functional annotation
TransDecoder was used to search the Trinity assembled transcripts for open reading frames (ORFs) encoding peptides of at least 100 amino acids in length [15]. Trinity transcripts of any length with ORFs homologous to known proteins or containing protein domains were identified by BlastP (v2.2.30) [17] queries against the Swiss-Prot database [18] and HMMER3 [19] queries against the Pfam database [20], respectively. The final TransDecoder-predicted coding regions include those meeting the minimum length criteria and those of any length with BlastP or Pfam homology. Trinotate was then used for functional annotation [15]. The TransDecoder-predicted coding regions were searched for Pfam protein domains using HMMER3, signal peptides using SignalP 4.1 [21], transmembrane regions using TMHMM [22], rRNAs using RNAMMER [23], homology to known proteins using BlastP (v2.2.30) (E<10 −5 ) against both the Swiss-Prot and the UniRef90 [24] databases, and annotations from gene ontology (GO) [25] and EggNOG [26]. The Trinity transcripts were also searched for homology by BlastX (v2.2.30) (E<10 −5 ) against the Swiss-Prot and UniRef90 database. All annotations were aggregated in a final report.

De novo transcriptome evaluation
Trinity transcripts were searched by BlastN and BlastX against the NCBI RefSeq mRNA and protein databases, respectively, for both M. musculus and R. norvegicus. The percent coverage along the target transcript or protein was determined using the Trinity provided utility invoked with the command analyze_blastPlus_topHit_coverage.pl blast_result.outfmt6 Trinity. fasta blast_db [15]. For each Blast hit in the target database, the best matching Trinity transcript was selected, and the percent of the Blast hit's length covered by the Trinity transcript was determined.
Trinity transcripts were filtered for those with BlastX hits against the Swiss-Prot database and ranked by TPM. Gene Set Enrichment Analysis (GSEA) was used to identify Gene Ontology (GO) annotations enriched among these Trinity transcripts with mean TPM>1 [27]. Cytoscape [28] and Enrichment Map [28] were used to visualize the results.

Phylogenetic analysis
Multiple sequence alignment of the coding sequence (CDS) of representative genes for A. niloticus and other model organisms was performed with MUSCLE using the UPGMB clustering method [29]. A Neighbor-Joining [30] phylogenetic reconstruction was created using the Maximum Composition Likelihood model [31] to compute evolutionary distances. These analyses were performed in MEGA6 (v6.06) [32].

Availability of data and materials
The datasets supporting the conclusions of this article are available in the NCBI Sequence Read Archive (SRA) (https://www.ncbi.nlm.nih.gov/sra/), accession numbers SRR5190211 and SRR5190212, and within the article and its Supporting Information files.

Results
Two RNA samples, S1 and S2, with RNA Integrity Numbers (RINs) of 9.8 and 9.7 (S1  Table 1). The Trinity transcripts had a mean length of 801 nt and a N50 length of 1,457 nt, meaning 50% of assembled bases are part of Trinity transcripts of length  Functional annotation for the Trinity transcripts was performed using the Trinotate annotation pipeline. TransDecoder was used to predict coding regions, which were then searched for Pfam protein domains, signal peptides, transmembrane regions, rRNAs, homology to known proteins in both the Swiss-Prot and the UniRef90 databases, and gene ontology (GO) annotations. The Trinity transcripts were also searched for BlastX homology against the Swiss-Prot and UniRef90 database. Alignment of the RNA-Seq reads to the de novo assembled transcriptome was evaluated. Expression levels of Trinity assembled genes and transcripts were estimated in units of TPM (transcripts per million) using RNA-Seq by Expectation Maximization (RSEM) with Bowtie as the alignment algorithm. The majority of reads aligned to the transcriptome and in proper pairs ( Table 2). The sequences, full annotations, and expression levels for the Trinity assembled de novo transcripts are available as Supporting Information files (S1 File and S2 File).
We next examined the length and level of expression of the identified transcripts. Of the 400,584 assembled Trinity transcripts, 78,915 (19.7%) were greater than 1kb in length, and of the subset of 63,242 Trinity transcripts with Swiss-Prot BlastX homology, 46,038 (72.8%) were greater than 1kb in length (Fig 1A). The subset with Swiss-Prot BlastX homology was also more highly expressed, with 35.2% of annotated transcripts being expressed at a level greater than 1 TPM, as compared to only 10.0% of all Trinity transcripts being expressed at this level ( Fig 1B). The putative coding transcripts were therefore more likely to be both higher in abundance and longer in length than their non-coding counterparts. Pairwise analysis for the 8,866 orthologous genes expressed at a level greater than 1 TPM (A. ansorgei) or 1 FPKM (A. niloticus) [9] showed that global gene expression levels were moderately correlated (Pearson correlation coefficient = 0.63) between these two members of the Arvicanthis genus (S4 Fig).  Phylogenetic analysis was performed using the Trinity assembled CDS for four retinal genes in order to place A. ansorgei in the context of other common model organisms (Fig 2). The topologies of the phylogenetic trees are broadly comparable and place A. ansorgei in closest proximity to R. norvegicus and M. musculus. Consequently, R. norvegicus and M. musculus were used as the references for estimating the completeness of the A. ansorgei de novo assembled transcriptome. To perform this analysis, Blast was used to query each Trinity transcript against the RefSeq databases of mRNAs and proteins for M. musculus and R. norvegicus, and for each hit in the target database, the length of the hit covered by the best matching Trinity transcript was determined (Fig 3).
There are 8,551 M. musculus and 8,486 R. norvegicus RefSeq mRNAs that have Trinity transcripts which align with at least 80% coverage, and there are 14,397 M. musculus and 13,095 R. norvegicus RefSeq proteins that have Trinity transcripts which translate to cover at least 80% of their length. We chose 33 canonical retinal cell specific markers and identified their corresponding Trinity transcripts. The majority show near full length coverage and sequence identity for the coding sequences of the proteins against which they demonstrate BlastX homology (Table 3). Proteins expected to be highly conserved, for example 40S ribosomal proteins, 60S ribosomal proteins, beta actin, and cytochrome c, all demonstrate 100% identity and 100% full length coverage.
BiNGO was used to identify the GO annotations within the GOSlim subgroup enriched among the Trinity transcripts with Swiss-Prot BlastX homology expressed at greater than 1 TPM using R. norvegicus as the reference (Fig 4A). The number of genes corresponding to GOSlim annotations with greater than 2% coverage was assessed. Although the absolute number of genes is lower for A. ansorgei than for either the M. musculus reference or the R. norvegicus reference, the relative rank order for the GO annotation coverage is similar between A. ansorgei and both references (Fig 4B). The top GO annotations are broadly distributed amongst the molecular function, cellular component, and biological process subgroupings.

Discussion
We performed RNA-Seq on adult retinas from the Nile grass rat A. ansorgei and used Trinity to de novo assemble and functionally annotate the first high quality draft transcriptome of this species. The assembly had an N50 length of 1,457 nt and included 400,584 transcripts, of which 46,038 were greater than 1kb in length and demonstrated Swiss-Prot BlastX homology.  Taken together, the findings suggest that our draft transcriptome is high-quality with respect to diversity, contiguity, and coverage. Global scale species specific sequence information was previously non-existent for A. ansorgei, limiting the capacity for molecular based studies. The A. ansorgei retinal transcriptome has now been made publicly available. Our hope is that it may serve the broader research community and provide a foundation for the use of A. ansorgei as a model organism for future cellular and molecular investigations related to cone biology and retinal degeneration and for comparison to other common model organisms, including M. musculus and R. norvegicus.