Novel Positive-Sense, Single-Stranded RNA (+ssRNA) Virus with Di-Cistronic Genome from Intestinal Content of Freshwater Carp (Cyprinus carpio)

A novel positive-sense, single-stranded RNA (+ssRNA) virus (Halastavi árva RNA virus, HalV; JN000306) with di-cistronic genome organization was serendipitously identified in intestinal contents of freshwater carp (Cyprinus carpio) caught by line-fishing in the fishpond “Lőrinte halastó” located in Veszprém County, Hungary. The complete nucleotide (nt) sequence of the genomic RNA is 9565 nt in length and contains two long open reading frames (ORFs) that are in different reading frames and are separated by an intergenic region. ORF1 (replicase) is preceded by an untranslated sequence of 827 nt, while an untranslated region of 139 nt follows ORF2 (capsid proteins). The deduced amino acid (aa) sequences of the ORFs showed only low (less than 32%) and partial similarity to the non-structural (2C-like helicase, 3C-like cysteine protease and 3D-like RNA-dependent RNA polymerase) and structural proteins (VP2/VP4/VP3) of virus families in Picornavirales, especially to those of viruses with a dicistronic genome. Halastavi árva RNA virus is present in intestinal contents of omnivorous freshwater carp, but the origin and the host species of this virus remain unknown. The unique viral sequence and the actual position indicate that Halastavi árva RNA virus may be the first member of a new group of di-cistronic ssRNA viruses. Further studies are required to investigate the specific host species (and spectrum), ecology and role of Halastavi árva RNA virus in nature.


Introduction
Picorna-like viruses are a loosely defined group of non-enveloped, positive-sense single-stranded RNA (+ssRNA) viruses that are major pathogens of humans, animals, insects and plants [1]. These viruses have similar genome features and several conserved protein domains. In 2009, a new order was established within the picorna-like viruses. The order Picornavirales contains several important viruses grouped into 5 families: Dicistroviridae, Iflaviridae, Marnaviridae, Picornaviridae and Secoviridae [1,2].
Based upon genome organization, non-enveloped, positive-sense single-stranded RNA (+ssRNA) viruses fall into two groups: viruses with mono-cistronic and viruses with di-cistronic genomes. Iflaviruses, marnaviruses, picornaviruses and secoviruses have a monocistronic genome structure with one large open reading frame (ORF) coding for a single polyprotein. Dicistroviruses, however, have two non-overlapping large open reading frames (ORF1 and ORF2); the name (dicistro) is derived from this characteristic di-cistronic arrangement of the genome. In dicistroviruses, the structural proteins are located at the 3′ end (ORF2) rather than at the 5′ end as found in iflaviruses, marnaviruses, picornaviruses and secoviruses. The most outstanding characteristic of dicistroviruses is that they have two internal ribosome entry site (IRES) elements: one for translation of the ORF1 replicase, which includes the 2C-like helicase (Hel), VPg, 3C-like protease (Pro) and 3D-like RNA-dependent RNA polymerase (Pol) in the Hel-(VPg)x-Pro-Pol order of conserved replication domains, and the other, the intergenic region (IGR) IRES, for translation of the capsid proteins (VP2-VP4-VP3 and VP1) [3]. Recently, it was proposed that the ORF1 polyprotein precursor of the dicistrovirus Plautia stali intestine virus (PSIV) contains all proteins (2A-C, 3A-D), in the same order as in picornaviruses [4].
The first report of a metagenomic analysis of RNA viruses in a freshwater lake was published in 2009 [16]. That study reported a high number and high diversity of RNA viruses in a freshwater pond. Here we report the serendipitous identification and characterization of a novel, possibly non-enveloped, positive-sense single-stranded RNA (+ssRNA) virus with di-cistronic genome structure, originating from unknown host(s), from intestinal content of freshwater carp (Cyprinus carpio) in Hungary.

Identification and general features of a novel viral RNA genome sequence in intestinal content of carp
Using the non-human entero-R/F screening primers (Table 1), a single PCR product of approximately 1000 nt was seen in agarose gel from one carp intestinal content sample (Fig. 1C). No homologous nucleotide sequence was found in GenBank. However, BLASTp searches of the NCBI database revealed significant amino acid sequence identity to, and characteristic motifs (YGDD and FLKR) of, the RNA-dependent RNA polymerase (replicase) of several members of the picorna-like superfamily, including dicistroviruses, picornaviruses and secoviruses, with the top match being marine JP-B virus (EF198242; E value = 4×10⁻¹⁴, identities = 64/179; 36%). The novel genome sequence was provisionally named Halastavi árva RNA virus (Hungarian for “fishpond orphan”; abbreviation: HalV).
No significant sequence match was found in GenBank based upon the complete nucleotide sequence. The Halastavi árva RNA virus genome is 9565 nt in length (Fig. 1), excluding the polyadenylate (poly(A)) tail, with an 827 nt 5′ untranslated region (UTR) followed by 2 predicted non-overlapping ORFs of 5451 nt (ORF1, nt position 828 to 6278) and 3030 nt (ORF2, nt position 6397 to 9426) separated by an intergenic region (IGR) of 118 nt (Fig. 1A). ORF2 is followed by a 3′UTR of 139 nt (nt position 9427 to 9565). The two large ORFs encode potential polyprotein precursors of 1816 aa (ORF1) and 1009 aa (ORF2). No large ORFs were found in the inverse orientation, suggesting that Halastavi árva RNA virus is a positive-strand RNA virus with dicistronic genome organisation.
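The coordinates given above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original analysis: the region table simply restates the reported positions, and the helper names (`span_length`, `protein_length`) are hypothetical. It assumes that the reported ORF coordinates are 1-based, inclusive, and include the stop codon, which is what the reported nucleotide and amino acid lengths imply.

```python
# Reported genome coordinates (1-based, inclusive), restated from the text.
GENOME_LENGTH = 9565
REGIONS = {
    "5'UTR": (1, 827),
    "ORF1": (828, 6278),
    "IGR": (6279, 6396),
    "ORF2": (6397, 9426),
    "3'UTR": (9427, 9565),
}


def span_length(start, end):
    """Length in nt of an inclusive 1-based interval."""
    return end - start + 1


def protein_length(start, end):
    """Amino acids encoded by an ORF, assuming the interval includes the stop codon."""
    nt = span_length(start, end)
    if nt % 3 != 0:
        raise ValueError("ORF length is not a multiple of three")
    return nt // 3 - 1


lengths = {name: span_length(*coords) for name, coords in REGIONS.items()}

# Fraction of the genome covered by the two ORFs.
coding_fraction = (lengths["ORF1"] + lengths["ORF2"]) / GENOME_LENGTH

# Reading frame (0, 1 or 2) of each ORF relative to genome position 1.
frames = {name: (REGIONS[name][0] - 1) % 3 for name in ("ORF1", "ORF2")}

for name, nt in lengths.items():
    print(f"{name}: {nt} nt")
print("ORF1 protein:", protein_length(*REGIONS["ORF1"]), "aa")
print("ORF2 protein:", protein_length(*REGIONS["ORF2"]), "aa")
print(f"coding fraction: {coding_fraction:.1%}")
print("frames:", frames)
```

Under these assumptions the region lengths sum to the full genome length, and the two ORFs start in different reading frames, in line with the description of ORF2 as not in frame with ORF1.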

Analysis of structural region (ORF2)
ORF2 is in a different reading frame from ORF1 (Fig. 1A). It has an AUG codon at nt 6397 and extends to nt 9426. ORF2 encodes a 1009 aa protein (Fig. 1B). This length falls between those of the shortest (765 aa, Kashmir bee virus, AAR19088) and the longest (1138 aa, Aurantiochytrium single-stranded RNA virus, YP_398835) dicistrovirus ORF2 proteins. Recognizable putative picornavirus-like capsid protein domains were identified by the conserved domain search of the NCBI database (Table 2). Multiple alignments show amino acids (underlined: GRLI, LRIPF, FGFSSP, DEM and YWAGSI, VATPFHAGRLVLAYVP, VWD, GE, DDFSF) that are conserved between Halastavi árva RNA virus and the capsid proteins (VP2, VP4 and VP3) of viruses in the families Dicistroviridae and Picornaviridae. In addition, a putative spherical virus-type peptidase (Conserved Domain Database, pfam12264) with a high E value (0.13) was also identified at the N-terminal end (aa 75-250, in VP2) of the putative ORF2 capsid polyprotein (Table 2). There were no sequence hits for the predicted VP1 region at the 3′ end of ORF2. The sequences with recognizable similarity to the complete ORF2 VP2-VP4-VP3 of Halastavi árva RNA virus were the structural protein regions of Cricket paralysis virus (NP_647482; E value = 8×10⁻³², identities = 166/664; 25%) and Drosophila C virus (NP_044946; E value = 8×10⁻³¹, identities = 161/688; 24%). The amino acid sequence identity of Halastavi árva RNA virus ORF2 to marine RNA virus JP-A is 28% (E value = 2×10⁻²², identities = 94/331). A detailed amino acid similarity analysis of the ORF2 regions (VP2-VP4; VP4-VP3; VP3-VP1) is shown in Table 2. Cleavage sites of the ORF2-encoded polypeptide were not determined experimentally but were estimated by alignment of deduced amino acid sequences.
The putative protease-mediated cleavage sites of the ORF2 polyprotein are RAFGF/SSPPD (a motif highly conserved among dicistroviruses, at aa position 339/340) between capsid proteins VP4 and VP3, and ESMQ/DPY (at aa position 665/666) between VP3 and VP1. The predicted lengths of VP3 and VP1 are 326 aa and 344 aa, respectively. VP2 and VP4 together are 339 aa long. The cleavage site between VP2 and VP4 probably lies between aa 260 and aa 280, possibly at position aa 262/263 (Q/T). On this basis, the predicted VP2 and VP4 minor capsid proteins are 262 and 77 aa long, respectively.
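The predicted capsid protein lengths follow directly from the putative cleavage positions. The sketch below is illustrative only: it takes the three cleavage positions from the text (the VP2/VP4 site at 262/263 being the tentative one), and the function name `capsid_segments` is a hypothetical helper rather than part of any published tool.

```python
# Putative ORF2 polyprotein length and cleavage positions, restated from the text.
ORF2_LENGTH_AA = 1009
# Last residue before each putative cleavage site:
# VP2/VP4 (tentative, Q/T at 262/263), VP4/VP3 (339/340), VP3/VP1 (665/666).
CUT_AFTER = [262, 339, 665]
PROTEINS = ["VP2", "VP4", "VP3", "VP1"]


def capsid_segments(cut_after, total_aa, names):
    """Map each protein name to its (start, end) residues, 1-based inclusive."""
    cuts = sorted(cut_after)
    if len(names) != len(cuts) + 1:
        raise ValueError("need exactly one more protein name than cleavage sites")
    starts = [1] + [c + 1 for c in cuts]
    ends = cuts + [total_aa]
    return {name: (s, e) for name, s, e in zip(names, starts, ends)}


segments = capsid_segments(CUT_AFTER, ORF2_LENGTH_AA, PROTEINS)
segment_lengths = {name: end - start + 1 for name, (start, end) in segments.items()}

for name in PROTEINS:
    start, end = segments[name]
    print(f"{name}: aa {start}-{end} ({segment_lengths[name]} aa)")
```

Changing the first entry of `CUT_AFTER` to another position in the aa 260-280 window shows how the VP2 and VP4 lengths would shift if the tentative site were placed elsewhere.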

Analysis of the non-coding region: untranslated regions (UTRs) and intergenic region (IGR)
ORFs 1 and 2 represent 88.7% of the RNA genome. The other 11.3% consists of non-coding regions (UTRs and IGR). The 5′UTR is 827 nt in length and it predicted an extensive [...] not shown). Similar non-coding sequences were not found in GenBank. Figures 2 and 3 show the phylogenetic analysis of the amino acid sequences of the complete ORF1 (Fig. 2) and complete ORF2 (Fig. 3) regions of Halastavi árva RNA virus and representative members of the highly diverse order Picornavirales. Halastavi árva RNA virus showed no close relationship to any existing virus, an observation consistent with the lack of close resemblance of its genome architecture to that of other known RNA viruses.

Nucleotide and dinucleotide composition
The base composition of Halastavi árva RNA virus is 26.6% A, 27.4% C, 20.3% G, and 25.7% U; this results in a G+C content of 47.7%. In common with other RNA viruses (and host genomes), Halastavi árva RNA virus shows a variety of under- and over-representations of dinucleotide frequencies compared to those expected from its mononucleotide base composition (Fig. 4A). For example, the frequency of the UpA dinucleotide was only 57% of the expected value, clustering closely with other picorna-like RNA viruses. However, in contrast to most picorna-like viruses infecting mammals and plants, Halastavi árva RNA virus showed no under-representation of the CpG dinucleotide, with an observed/expected ratio greater than 1. In this respect, its genome composition was more similar to those of insect and fish viruses (blue and yellow data points).
Nucleotide composition analysis (NCA) was used to predict the possible host of Halastavi árva RNA virus. In contrast to previous use of this method [17], an additional host group (fish) was included as a further control set in the analysis. Host predictions for the control sequences showed 95% concordance with assignments based on sequence annotations, including the reliable differentiation of fish-derived sequences from insect viruses (which also lack the CpG deficiency observed in animal and plant viruses; Table S1). Using this method, the host assignment of Halastavi árva RNA virus was fish. Figure 4B shows a series of projections of the first three canonical factors (CFs) calculated by discriminant analysis. While differentiation of fish viruses from insect-derived viruses was limited using CF1 (which was primarily based on CpG; data not shown), CF3 showed strong differentiation of these two hosts, creating two distinct clusters (blue and yellow ellipses); Halastavi árva RNA virus fell into the fish-derived cluster, consistent with its assignment by discriminant analysis.

Detection of Halastavi árva RNA virus in a further carp sample
Using specific primers designed for the Halastavi árva RNA virus 2C-like helicase/3A? region (between nt positions 3265 and 3714), a specific 450-nt-long PCR product of Halastavi árva RNA virus was also detected by RT-PCR and sequencing in the intestinal content of a second carp, caught in the same fishpond at the same time. The two nucleotide sequences were identical.

Discussion
A novel, possibly non-enveloped, positive-sense single-stranded RNA (+ssRNA) virus (Halastavi árva RNA virus; HalV) with di-cistronic genome organization, originating from the intestinal content of freshwater carp (Cyprinus carpio) in Hungary, was identified and characterized. Based upon the results of the sequence and phylogenetic analyses, this virus belongs to the order Picornavirales. Halastavi árva RNA virus has no nucleotide sequence match in GenBank based upon the complete nucleotide sequence, and it has only low and partial amino acid sequence identity to known virus families in Picornavirales. In addition, the putative non-structural protein domains and capsid proteins show a mosaic pattern of relationships with sequences from viruses in this order, especially in the family Dicistroviridae. However, Halastavi árva RNA virus differs from the presently known virus families in its possible host organism, its nucleotide/amino acid sequence identity, the possibly unique structures of its UTRs/IGR and its AU/GC ratios compared to other viruses in Picornavirales. These data suggest that Halastavi árva RNA virus may be the first member of a possible new, previously unknown family of ssRNA viruses.
Inferring the host of genetically distinct viruses is problematic, especially if they are found in feces. Feces are known to contain viruses that infect host cells and/or bacteriophages, as well as viruses of dietary origin from consumed plants, insects, and animals [17]. As freshwater carp (Cyprinus carpio) are omnivorous, host assignment of Halastavi árva RNA virus recovered from the intestinal contents of these fish is particularly problematic. As a specific example, we have previously shown that viruses recovered from feces of a child from Afghanistan, which showed an unexpected genome organization similar to that of insect dicistroviruses, could be predicted by NCA to have an insect host, in this case also potentially deriving from a dietary source [17]. NCA was used in the current study to predict the host origin of Halastavi árva RNA virus by including a further group, fish (primarily containing available sequences from nodaviruses), in the analysis. The resolution of this method was clearly limited by the low number of available viral sequences from fish that are sufficiently long for robust composition analysis and by the restricted diversity of the sequences used. Consequently, the conclusion of a fish host origin (in contrast to insect or potentially other arthropod or nematode origins) has to be regarded as provisional pending the availability of further fish-derived sequences that can be used by NCA as controls. The future incorporation of several picornaviruses recently identified in fish (Knowles et al., personal communication) will be of value in this regard. Therefore, there was no evidence of a close relationship with viruses infecting humans or other mammals among the order Picornavirales. Further specific studies are required to investigate the specific host species (and spectrum), ecology and role of Halastavi árva RNA virus in nature.
The AU ratios of the three diatom viruses with ssRNA genomes range from 60.4% to 63.7%. In insect dicistroviruses, AU ratios range from 60 to 64%, while those of HaRNAV and SssRNAV are much lower: 53.1% and 50.2%, respectively. Interestingly, the base composition of Halastavi árva RNA virus (52.3%) is more similar to that of this last group. On the other hand, Halastavi árva RNA virus appears to have a polycistronic genome organization similar to that found in viruses in the family Dicistroviridae. Several of these viruses contain an internal ribosome entry site (IRES) that positions the ribosome on the genome, actuating translation initiation even in the absence of known canonical initiation factors [3]. The exact secondary structures (and functional parts) of the untranslated regions (5′UTR, 3′UTR and a rather short IGR) of Halastavi árva RNA virus remain unknown. The conserved IGR-IRES-like structural elements of insect dicistroviruses, previously reviewed by Nakashima and Uchiumi [3], were not detected. This observation implies that Halastavi árva RNA virus probably has another IRES-like RNA structure.
During the conserved domain search, a putative spherical virus-type peptidase was identified at the N-terminal end of the ORF2 polypeptide. This type of peptidase is responsible for the cleavage of the viral polyprotein into individual proteins in Rice tungro spherical virus [18], although in that virus the enzyme is encoded in the non-structural region. This finding raises the possibility of a dual function for VP2 (a structural protein with enzymatic function). To our current knowledge, there is no information on the presence of a peptidase in the capsid proteins of picornaviruses. However, the peptidase function could be important during the cleavage of the capsid polyprotein. The presence of the 5′UTR-IRES and IGR-IRES in viruses with di-cistronic genome organization could suggest distinct initiation times for the ORF1 and ORF2 transcriptional processes. In this case, the peptidase function of the N-terminal end of the capsid polyprotein would be useful for capsid protein cleavage. In support of this hypothesis, we found the same putative spherical virus-type peptidase at the N-terminal end of the Himetobi P dicistrovirus capsid polypeptide (BAD27585.1, aa 88-248) by conserved domain search (E value = 10⁻⁵). Verification of the presence of this putative peptidase at the N-terminal end of the capsid polyprotein in viruses with di-cistronic genome organization warrants future investigation.
An unexpectedly high number of known and novel viruses, particularly +ssRNA viruses, have been identified in marine and, recently, in freshwater samples by metagenomic methods [16]. In addition, there are likely to be many as yet undiscovered viruses with dicistronic genome structure. This is the first report of the identification and complete genetic characterization of a virus with di-cistronic genome organization from freshwater communities, possibly from a fish, which, if confirmed, will represent the first report of this family of viruses infecting a vertebrate species.

Sample collection
Intestinal contents were freshly collected from the intestines of two freshwater carp (Cyprinus carpio) during fish processing. No further samples are available from the fish. The carp were caught by line-fishing in the fishpond “Lőrinte halastó” located in Veszprém County, Hungary, on Apr 18, 2010. The specimens were stored at −20 °C until RNA isolation.

RNA isolation, RT-PCR
[...] Carlsbad, CA). Generic Non-HumanEntero-5′UTR PCR primers (Table 1) were used for the detection of any picornaviruses by RT-PCR. The generic primers were designed for a conserved nucleotide sequence of the 5′UTR of the known non-human, non-simian enterovirus reference strains obtained from GenBank. All reagents were purchased from Promega (Madison, WI) unless otherwise specified. The cDNA synthesis was carried out in 50 µl final volume containing 5 µl of RNA extract; 10 mM dNTP, 5 µl 10× PCR buffer (Sigma, St. Louis, MO), 1 µl 25 mM MgCl2 solution, 10 pmol of the generic antisense 5′UTR primer, and 50 U M-MLV Reverse Transcriptase. The reverse transcription was performed at 40 °C for one hour. The PCR was conducted in 100 µl final volume using the entire volume of the RT reaction mixture. The PCR mix contained 10 pmol of the generic sense 5′UTR primer and 2.5 U of DuplA-Taq DNA polymerase (Zenon-Bio, Hungary). The PCR was conducted under the following conditions: 1 cycle at 94 °C for 1 min, 40 cycles of 94 °C for 30 s, 57 °C for 30 s and 72 °C for 1 min, followed by a final elongation step at 72 °C for 5 min.
To determine the complete genome of the positive-sense single-stranded RNA virus, a series of 5′ and 3′ RACE reactions were conducted using the 3′/5′ RACE system (Roche Diagnostics, Mannheim, Germany). In brief, for the 3′ RACE reaction, cDNA was synthesized from total RNA using an Oligo dT-anchor primer (Table 1). The Oligo dT-anchor-tailed cDNA was then amplified with gene-specific sense primers (s1, s2) and a PCR anchor primer (Table 1). In brief, for the 5′ RACE reaction, cDNA was generated from total RNA in 20 µl final volume using oligonucleotide primers (as1a, as2a, as3a, as4a) for the four reactions. The RNA template was degraded with RNaseH, and the cDNA was purified. The 3′ end of the cDNA was polyadenylated using terminal deoxynucleotidyl transferase and dATP. The poly(A)-tailed cDNA was then amplified in 50 µl final volume using Pfu DNA polymerase, the Oligo dT-anchor primer and primers as1b, as2b, as3b and as4b, respectively. The PCR of the 3′ and 5′ RACE experiments was conducted using the following temperature conditions: 1 cycle at 94 °C for 30 s, 35 cycles of 94 °C for 35 s, 50 °C for 1 min and 72 °C for 5 min, followed by a final elongation step at 72 °C for 10 min. The amplicons were subjected to a second PCR using upstream antisense primers (as1c, as2c, as3c, as4c), the PCR anchor primer and the same thermal program as in the first PCR round (Table 1). The amplification products were separated on a 1.0% agarose gel stained with ethidium bromide.

Sequence and phylogenetic analysis
Samples with any visible amplicons were sequenced directly with the BigDye Terminator Cycle Sequencing Ready Reaction Kit (Applied Biosystems, Warrington, UK) using the PCR primers and primer walking, and run on an automated sequencer (ABI PRISM 310 Genetic Analyzer; Applied Biosystems, Stafford, USA). Amino acid sequences (reference strains were collected from the GenBank database) were aligned by ClustalX (version 1.81), and similarity analysis was performed using GeneDoc 2.7 software [19]. Phylogenetic analysis based on amino acid alignments was conducted using the minimum evolution method of MEGA software (version 4) with the Poisson model [20]. The secondary structures of the untranslated regions (UTR; 5′UTR, IGR and 3′UTR) were predicted using the Mfold program [21]. The percentage amino acid identities of the Halastavi árva RNA virus protein regions to GenBank sequences were determined using blastp with the following algorithm parameters: expect threshold: 10; matrix: BLOSUM62; gap costs: existence: 11, extension: 1. The borders of the polypeptide sections were chosen according to the predicted cleavage sites of the recognizable viral proteins presented in Figure 1. The complete genome and amino acid sequences of the novel RNA virus (Halastavi árva RNA virus; HalV) were submitted to GenBank under accession number JN000306.

Nucleotide composition analysis (NCA)
A set of 352 complete virus genome or segment sequences longer than 3000 bases, selected to be representative of different species, genera and families of positive-stranded RNA viruses classified among the picorna-like viruses, was used for NCA [17] (sequences listed in reference [17]). Each was annotated by order, family and genus, along with host range. Mononucleotide and dinucleotide frequencies for each sequence were determined using the program “Composition Scan” in SSE version 1.0 (Simmonds, manuscript in preparation). Dinucleotide biases were determined as the ratio of the observed frequency of each of the 16 dinucleotides to its expected frequency, the latter calculated by multiplying the frequencies of the two constituent mononucleotides.
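The observed/expected calculation described above can be illustrated with a short script. This is a minimal sketch of the general technique, not the SSE “Composition Scan” implementation, whose internal details are not given here; the function name `dinucleotide_bias` is hypothetical, and the sketch assumes an unambiguous sequence (A, C, G, U or T only) read as overlapping dinucleotides on one strand.

```python
from collections import Counter
from itertools import product

BASES = "ACGU"


def dinucleotide_bias(seq):
    """Observed/expected ratio for each of the 16 dinucleotides.

    Expected frequency = product of the two mononucleotide frequencies.
    Assumes an unambiguous RNA or DNA sequence; T is read as U.
    """
    seq = seq.upper().replace("T", "U")
    unexpected = set(seq) - set(BASES)
    if unexpected:
        raise ValueError(f"unsupported characters: {sorted(unexpected)}")
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two bases")

    n = len(seq)
    mono = Counter(seq)
    # Overlapping dinucleotides: positions (1,2), (2,3), ...
    di = Counter(seq[i:i + 2] for i in range(n - 1))

    ratios = {}
    for a, b in product(BASES, repeat=2):
        expected = (mono[a] / n) * (mono[b] / n)
        observed = di[a + b] / (n - 1)
        # Undefined when one of the bases is absent from the sequence.
        ratios[a + b] = observed / expected if expected > 0 else float("nan")
    return ratios


if __name__ == "__main__":
    # Toy sequence for illustration only (not viral data).
    toy = "AUGGCUAGCUAGGAUCCGAUCGAUUAGCGCUA"
    for dinuc, ratio in sorted(dinucleotide_bias(toy).items()):
        print(f"{dinuc}: {ratio:.2f}")
```

A ratio below 1 (for example, the UpA value of 0.57 reported for Halastavi árva RNA virus) indicates under-representation relative to the mononucleotide composition, and a ratio above 1 indicates over-representation.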
NCA used the discriminant analysis program in the statistical package SYSTAT with default parameters. Sequences were assigned to four host categories, mammal (n = 117), insect (n = 63), plant (n = 167) and fish (n = 5), and the frequencies of each mononucleotide and dinucleotide were used as predictive factors to infer the host ranges of unknown virus sequences from the current study.