A planarian nidovirus expands the limits of RNA genome size

RNA viruses are the only known RNA-protein (RNP) entities capable of autonomous replication (albeit within a permissive host environment). A 33.5 kilobase (kb) nidovirus has been considered close to the upper size limit for such entities; conversely, the minimal cellular DNA genome is in the 100–300 kb range. This large difference presents a daunting gap for the transition from primordial RNP to contemporary DNA-RNP-based life. Whether or not RNA viruses represent transitional steps towards DNA-based life, studies of larger RNA viruses advance our understanding of the size constraints on RNP entities and the role of genome size in virus adaptation. For example, emergence of the largest previously known RNA genomes (20–34 kb in positive-stranded nidoviruses, including coronaviruses) is associated with the acquisition of a proofreading exoribonuclease (ExoN) encoded in the open reading frame 1b (ORF1b) in a monophyletic subset of nidoviruses. However, apparent constraints on the size of ORF1b, which encodes this and other key replicative enzymes, have been hypothesized to limit further expansion of these viral RNA genomes. Here, we characterize a novel nidovirus (planarian secretory cell nidovirus; PSCNV) whose disproportionately large ORF1b-like region including unannotated domains, and overall 41.1-kb genome, substantially extend the presumed limits on RNA genome size. This genome encodes a predicted 13,556-aa polyprotein in an unconventional single ORF, yet retains canonical nidoviral genome organization and expression, as well as key replicative domains. These domains may include functionally relevant substitutions rarely or never before observed in highly conserved sites of RdRp, NiRAN, ExoN and 3CLpro. Our evolutionary analysis suggests that PSCNV diverged early from multi-ORF nidoviruses, and acquired additional genes, including those typical of large DNA viruses or hosts, e.g. Ankyrin and Fibronectin type II, which might modulate virus-host interactions. PSCNV's greatly expanded genome, proteomic complexity, and unique features–impressive in themselves–attest to the likelihood of still-larger RNA genomes awaiting discovery.

The single-ORF genome organization of PSCNV presents a distinctive challenge for defining boundaries of three genome regions evident in the multi-ORF nidoviruses. We defined two boundaries, tentatively equivalent to the ORF1a/ORF1b and ORF1b/3'ORFs, in vicinity of the protein motifs universally conserved in all nidoviruses and PSCNV. As result, three regions were defined as follows: ORF1a-like, from the first nt of the start codon of the main ORF to the 18512 nt, the predicted -1PRF site 240 nt upstream of the codon encoding absolutely conserved lysine residue of the NiRAN An motif; ORF1b-like, from the 18513 nt to the 28346 nt, which is 260 nt downstream of the codon encoding catalytic glutamate residue of O-MT; 3'ORFs-like, from the 28347 nt to the last nt of the main ORF stop codon.

RNA virus proteins.
For the purpose of this study (Fig. 5), we compiled a list of RNA virus proteins larger than 1000 amino acids and expressed without PRF, based on the information available from the NCBI Viral Genomes Resource on 2017.04.13 [12] and RefSeq entries [11] specified there.
Virus discovery and genome sequencing timelines. The number of viral genomes that were sequenced each year, starting from 1982, was estimated using NCBI Entrez query [13], as the number of GenBank Nucleotide database (2018.01.02) entries belonging to the "Viral sequences" division and containing the phrase "complete cds" in the title, with publication dates within the year of interest [14]. To plot timelines of discovery of viruses with largest RNA and DNA genomes, those viruses were identified and associated information was retrieved for each year using NCBI Viral Genomes Resource on 2017.04.13 [12] and the relevant literature. We used poliovirus (PV), and nidoviruses avian bronchitis virus (IBV), mouse hepatitis virus (MHV), beluga whale coronavirus SW1 (BWCoV), and ball python nidovirus (BPNV) to highlight the longest RNA virus genome at 1981 and from 1987 onward, respectively, in Fig. 1A (see Table S1 for the genome sizes of the above nidoviruses).
Multiple sequence alignments of proteins. Multiple sequence alignments (MSAs) of 3CLpro, NiRAN, RdRp, ZBD, HEL1, ExoN, N-MT and O-MT protein domains were prepared for individual nidovirus families using the Viralis platform [15] and assisted by the HMMER 3.1 [16], Muscle 3.8.31 [17] and ClustalW 2.012 [18] programs in default modes. For each domain, MSAs of different nidovirus families and PSCNV were later combined using ClustalW in the profile mode, with subsequent manual local refinement. MSAs of RNase T2, FN2, and ANK domains and PSCNV tandem repeats were prepared using MAFFT v7.123b [19].
Identification of ORFs. PSCNV genome was scanned for ORFs in six reading frames by ORFfinder (https://www.ncbi.nlm.nih.gov/orffinder/) using the standard genetic code and minimal ORF length of 150 nt.

Identification of PSCNV polyprotein sequence regions enriched in particular amino acid residues.
To identify polyprotein regions enriched in a given amino acid residue, we calculated the distribution of that residue along the polyprotein and compared it to that of permuted sequences within a statistical framework that was applied to each residue type separately. Specifically, we calculated the cumulative count of a particular residue type within the ever expanding [1, i] window, where 1 is the first position and i is each position from the 1st to the last 13,556th in the polyprotein. The produced discrete data were approximated by R function "smooth.spline" with default parameters, and the first derivative of the approximation was obtained for each i value [30]. The procedure was then applied to 100 random permutations of the polyprotein sequence, and mean μ and standard deviation (SD) σ of the resulting derivative values were used to define significance threshold T=μ+Z(1−0.05/L)*σ=μ+4.5*σ, where Z( ) is a quantile function of the standard normal distribution and L is the polyprotein sequence length. Protein sequence regions with derivative values larger than the threshold (4.5 SD above the mean) were considered enriched in the amino acid residue. To avoid artefacts of the approximation, we excluded data corresponding to the N-and C-terminal 100 amino acids of the polyprotein.

Prediction of disordered protein regions.
Intrinsically disordered regions of the PSCNV polyprotein were predicted by DisEMBL 1.5 using Remark465 predictor with default parameters [31].
Prediction of transmembrane regions. Transmembrane (TM) regions of proteins were predicted using TMHMM Server v.2.0 (http://www.cbs.dtu.dk/services/TMHMM/) with default parameters. To conform to the input sequence length limitation (8000 aa), PSCNV polyprotein sequence was split into consecutive 8000 and 6556 aa fragments, with a 1000 aa overlap; predictions belonging to the overlap region were accepted even if supported only for one of the fragments.
Prediction of signal peptides. To predict signal peptides, SignalP 4.1 [32] was used. Prediction was made for all PSCNV polyprotein sequence fragments of length 70 aa with default parameters. A Dscore threshold of 0.75 was applied to predictions; when predicted signal peptides overlapped, the one with the highest D-score was selected.
Prediction of N-glycosylation sites. N-glycosylation sites were predicted using NetNGlyc 1.0 Server (http://www.cbs.dtu.dk/services/NetNGlyc/) with default parameters. Only predictions with potential above 0.75, supported by all nine networks were accepted. Predictions where potentially glycosylated asparagine is followed by proline, and predictions overlapping with TM helices were discarded. To conform to the input sequence length limitation (4000 aa), PSCNV polyprotein sequence was split into 4000 aa fragments, with 1000 aa overlaps starting from the N-terminus (the most C-terminal fragment was 1556 aa long; 5 fragments in total); predictions belonging to the overlaps were accepted even if supported only for one of the fragments.
Prediction of furin cleavage sites. Furin cleavage sites were predicted by ProP 1.0 Server [33] in default mode and with the PSCNV polyprotein sequence submitted as overlapping fragments as described for the N-glycosylation sites prediction.

Identification of protein sequence repeats.
To search for repeats in PSCNV polyprotein, its sequence was compared to itself using an in-house version of HHalign 2.0.16 with the following parameters: SMIN score threshold 5, E-value threshold 10, local alignment mode, realignment by the MAC algorithm not applied, up to 1000 alternative alignments allowed to be shown [34].

Identification of protein domains conserved in PSCNV and other viruses or hosts.
We used HHsearch 2.0.16 [35] to query databases scop70_1.75, pdb70_06Sep14 and modified pfamA_28.0 [36][37][38] with the PSCNV polyprotein fragments using iterative procedure. The modified pfamA_28.  (Table S1). This modification facilitates statistical evaluation of similarity between the PSCNV polyprotein and the nidovirus conserved domains within a framework that is used for the pfamA domains. During the first iteration of the procedure, polyprotein was split into fragments by TM clusters (TM helices separated by less than 300 aa), tandem repeats and Thr-rich region. Overlapping hits characterized by Probability above 95% were clustered, clusters were used to split polyprotein into smaller regions that served as HHsearch queries on subsequent iteration. Procedure was repeated until iteration during which no hits satisfying the 95% Probability threshold were detected. Finally, regions of polyprotein without hits were split into successive fragments of 300 aa length starting from N-and C-termini (shorter regions were discarded), which were again scanned for hits by HHsearch. To evaluate the statistical significance of HHsearch hits, we used two measures, E-value and Probability (estimates probability of the query being homologous to the target). We considered homology to be established for PSCNV regions and a database entry that were connected by hits with Probability >95%, and made additional considerations when evaluating hits with Probability ≤95%, as advised in the HH-suite User Guide [35]. In this subsequent analysis, we considered rank, size, and E-value of hits, and conservation of key functionally important residues in the query.
Search for the closest homologs of PSCNV protein domains not previously described in nidoviruses. PSCNV protein domains that were not previously described in nidoviruses (RNase T2, FN2, ANK) were compared with Uniprot (2017.01.16) [39] and Smed Unigene (2015.02.17) [20] databases using blastp (BLAST+ v2.2.29) [2]. Domains were extended by 100 amino acids at N-and C-termini in order to capture homology extending beyond that identified by HHsearch. The FN2a domain was not extended at the N-terminus because of the low-complexity Thr-rich domain located immediately upstream. For searches in Smed Unigene database, effective length of the search space was made equal to that of the search in Uniprot with the same query, in order to make E-values comparable. Domain composition of Smed Unigene hits was obtained from this database, while that of Uniprot hits -from InterPro database [40].

Identification of individual ankyrin repeats.
Full alignments corresponding to Ank and Ank_3 families of Pfam 28.0 [38], each representing individual ankyrin repeat, were combined. The resulting alignment was converted to HMM profile by HHmake 2.0. 16 (Table S1) and PSCNV, as well as an outgroup consisting of viruses of two species prototyping the astrovirus genera (Avastrovirus 1, Y15936.2; Mamastrovirus 1, L23513.1) [42]. Phylogeny was reconstructed using BEAST 1.8.2 package [43] with the model of amino acid replacement selected by ProtTest 3.4 [44] (Akaike information criterion and Bayesian information criterion employed for model selection; maximum likelihood (ML) tree topology optimization strategy utilizing subtree pruning and regrafting moves). Both strict clock and relaxed clock with uncorrelated log-normal rate distribution were tested, and a better-fitting model was selected based on Bayes factor estimate. Markov Chain Monte Carlo (MCMC) chains were run for 10 million iterations and sampled every 1000 iterations; the first 10% iterations were discarded as burn-in. Mixing and convergence were verified with the help of Tracer 1.5 (http://beast.bio.ed.ac.uk/Tracer). Results were summarized as maximum clade credibility (MCC) tree. R package APE 3.5 was used to calculate percentage of trees in the Bayesian sample, characterized by various phylogenetic positions of PSCNV [45]. The same procedure was used to reconstruct 1.) a phylogeny based on the MSA of five nidoviruswide conserved domains (3CLpro, NiRAN, RdRp, ZBD, HEL1; 1569 columns, 1065-1227, 1740-1881, 1958-2356, 2373-2427, 2520-2774 aa in the EAV pp1ab CAC42775.2 of X53459.3) including one representative of each nidovirus species (Table S1) and PSCNV; 2.) a phylogeny based on the MSA of PSCNV ANK and its closest cellular homologs (Fig. S16, from first to last column without gaps).
Ancestral state reconstruction. BayesTraits V2, MCMC method was used to test support for one ancestral state over the other at a given node [46]. A sample of phylogenetic trees, reconstructed by BEAST as detailed above, was utilized. State "1", single ORF, was assigned to PSCNV, while state "0", multiple ORFs, was assigned to all other viruses in the phylogeny. We also run a version of the analysis where state "-", that is the lack of information about genome organization, was assigned to astroviruses. To derive prior distributions for the rate parameters of the model, we calculated a ML estimate of the rate parameters on each tree in our sample, and set mean and variance of the gamma priors to conform to those of the obtained distributions. MCMC chains (10 million iterations, first 1% iterations discarded as burn-in) were run with the node of interest fossilized in both states. The Harmonic Mean value was recorded at the final iteration of each chain. Log Bayes Factor (BF) was calculated as twice the difference between Harmonic Mean values of the better and the worse fitting models. The procedure was repeated three times and, if the same model was favored every time, only the smallest value of the Log BF was reported. Preference for a state at a node was considered statistically significant only if Log BF exceeded 2 [47].
Identification of putative transcription-regulating sequences (TRSs). Nidoviruses utilize nonadjacent nucleotide repeats (conserved signals) in the 5'-UTR and the second half of the genome to regulate synthesis of subgenomic (sg) mRNAs (transcription). These repeats are known as leader and body transcription-regulating sequences, lTRS and bTRS, respectively. To search for potential TRSs, the 5'-UTR sequence was compared with the PSCNV genome using blastn (BLAST+ v2.2.29) [2].
RNA secondary structure prediction. RNA secondary structure prediction for PSCNV genome regions encompassing lTRS and bTRS (1-9000 nt and 20441-29440 nt, respectively) was assisted by the Mfold web server [48]. Only the top-ranking predictions with the lowest free energy were considered. Maximal distance between paired bases was set to 150 nt. Free energy for fragments of the prediction was calculated using http://unafold.rna.albany.edu/?q=mfold/Structure-display-and-freeenergy-determination. PRF site prediction. Analysis was conducted for PSCNV and SARS-CoV. KnotInFrame [49] was applied to a 1000 nt genome region immediately upstream of the region encoding the NiRAN An motif. Only the top prediction for each virus was considered.