Genome of Cnaphalocrocis medinalis Granulovirus, the First Crambidae-Infecting Betabaculovirus Isolated from Rice Leaffolder to Be Sequenced

Cnaphalocrocis medinalis is a major pest of rice in South and South-East Asia, and insecticides are the main means farmers use to manage it. A naturally occurring baculovirus, C. medinalis granulovirus (CnmeGV), has been isolated from the larvae and has potential for use as a microbial control agent. Here, we describe the complete genome sequence of CnmeGV and compare it to other baculovirus genomes. The CnmeGV genome is 112,060 base pairs in length with a G+C content of 35.2%. It contains 133 putative open reading frames (ORFs) of at least 150 nucleotides. One hundred and one (101) of these ORFs are homologous to other baculovirus genes, including the 37 baculovirus core genes. Thirty-two (32) ORFs are unique to CnmeGV, with no homologues detected in GenBank. Fifty-three (53) tandem repeats (TRs) with sequence lengths from 25 to 551 nt and six (6) homologous regions (hrs) are interspersed throughout the genome. Hr2 contains 11 imperfect palindromes and has a high A+T content (about 73%). The unique ORF28 contains a coiled-coil region and a zinc finger-like domain of 4–50 residues characterized by two C2C2 zinc finger motifs that putatively bind two atoms of zinc. ORF21 encodes a chit-1 protein, suggesting horizontal gene transfer from an alphabaculovirus. The putative protein presents two carbohydrate-binding module family 14 (CBM_14) domains, whereas the homologues detected in other betabaculoviruses contain only one chitin-binding region. Gene synteny maps showed the colinearity of the sequenced betabaculoviruses. Phylogenetic analysis indicated that CnmeGV groups within the genus Betabaculovirus, with a close relationship to AdorGV. The cladogram obtained in this work grouped the 17 complete GV genomes in one monophyletic clade. CnmeGV represents a new virus species of the genus Betabaculovirus isolated from a crambid host and is most closely related to AdorGV. 
The analyses and information derived from this study will provide a better understanding of the pathological symptoms caused by this virus and its potential use as a microbial pesticide.


Introduction
The rice leaffolder, Cnaphalocrocis medinalis Güenée (Lepidoptera: Crambidae), is an important migratory insect pest of rice in Asia [1,2]. The larvae fold the leaves and feed on the photosynthetic tissue inside the folds, and this damage can reduce rice yields [3]. In China, frequent outbreaks in rice production regions have caused yield losses and led farmers to overuse insecticides. Insecticides are the main control measure used by farmers in China, and the pest has developed resistance to some of them [4]. CnmeGV, a member of the family Baculoviridae, was recently isolated from infected caterpillars collected from fields in China. Bioassays showed that CnmeGV is a highly virulent baculovirus, suggesting its potential as a more environmentally friendly microbial agent for future rice leaffolder management [5].
Baculoviridae is a family of rod-shaped viruses with circular, covalently closed, double-stranded DNA genomes, members of which have been successfully applied to the control of some agricultural and forest insect pests [6]. Based on phylogeny and host specificity, Baculoviridae is divided into four genera: Alphabaculovirus (lepidopteran-specific nucleopolyhedroviruses, NPVs), Betabaculovirus (lepidopteran-specific granuloviruses, GVs), Gammabaculovirus (hymenopteran-specific NPVs) and Deltabaculovirus (dipteran-specific NPVs) [7]. Alphabaculovirus is further subdivided into groups I and II according to phylogenetic analysis of the lef-8, lef-9 and polh/gran genes [8]. Betabaculovirus is classified into three types based on tissue tropism [9]. To date, the complete genomes of more than 51 NPVs and 17 GVs have been published or are available in GenBank.
GVs, which have been reported only from Lepidoptera, are more host-specific than NPVs [10]. Partly because of the difficulty of establishing cell lines that are permissive for GV infection, the molecular biology and genetics of GVs have been less well studied than those of NPVs [11]. CnmeGV is a new isolate and an effective baculovirus pathogen, but it has been little studied and genomic information is lacking. In this paper we present the complete sequence and morphological characterization of the CnmeGV genome and compare it to other baculoviruses using genomic and phylogenetic analyses. This is the first completely sequenced betabaculovirus isolated from a crambid host to be reported.

Results and Discussion
Sequence analysis of the CnmeGV genome
A total of 53,359 reads from post-filter sequencing libraries were used for genome assembly by the hierarchical genome-assembly process (HGAP). The CnmeGV genome was registered in GenBank as the first complete sequence of a crambid-infecting betabaculovirus (Accession number KP658210). The genome consists of 112,060 bp, which is within the size range of the 17 sequenced betabaculovirus genomes, from 99,657 bp in AdorGV [12] to 178,733 bp in XcGV [13] (Table 1). The G+C content of the CnmeGV genome is 35.2%, close to the lower end of the range for betabaculoviruses, which runs from 32.5% in CrleGV to 45.2% in CpGV. However, no correlation was found between these data and the biological properties. Using the criterion of methionine-initiated ORFs of at least 50 codons with minimal overlap with other ORFs [14], 133 putative ORFs were identified and numbered in a clockwise direction starting from the ATG start codon of the granulin gene (S1 Table). Coding sequences represent 85.1% of the CnmeGV genome, similar to CpGV [15]. Seventy (70) ORFs are in the same orientation as the granulin ORF and 63 are in the opposite orientation, indicating that CnmeGV ORFs have no obvious preferred orientation. Helicase (ORF79) is the longest gene, encoding 1162 amino acids, while ORF8 is the shortest in the CnmeGV genome. The circular map of the CnmeGV genome is shown in Fig 1. The putative proteins of these ORFs were predicted by BlastX searches at NCBI with an E-value of less than 10^-6. In total, 101 of the 133 putative ORFs encode proteins similar to those found in other organisms, while 32 were shown to be unique. Core genes are a set of genes strongly conserved in the family Baculoviridae because they provide the essential functions needed to complete the virus cycle [16]. All 37 core genes described for the genus Betabaculovirus [17] were found in the CnmeGV genome, representing the essential functions for replication and transcription; cell cycle arrest and/or interaction with host proteins; packaging and assembly; viral release; and oral infectivity. Baculovirus repeated ORFs (bro genes) are a striking feature of many baculovirus genomes. Two bro genes were identified in the CnmeGV genome (ORF65 and ORF94) and were designated bro-a and bro-b, respectively, based on their order in the genome. This highly repetitive and conserved family might function as DNA-binding proteins that influence host DNA replication or transcription and improve the infection capability of the virus [18,19].
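The summary statistics quoted above (G+C content and the percentage of the genome covered by coding sequences) can be recomputed from a sequence and a list of ORF coordinates. The following is a minimal Python sketch, not the pipeline used in this study; the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and no ORF is assumed to span the origin of the circular genome.

```python
def gc_content(seq: str) -> float:
    """Return the G+C percentage of a DNA sequence, ignoring ambiguous bases."""
    seq = seq.upper()
    g_c = seq.count("G") + seq.count("C")
    a_t = seq.count("A") + seq.count("T")
    if g_c + a_t == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * g_c / (g_c + a_t)


def coding_fraction(orfs, genome_length: int) -> float:
    """Return the percentage of the genome covered by at least one ORF.

    orfs: iterable of (start, end) pairs, 1-based inclusive (assumption).
    Overlapping ORFs are merged so that shared bases are counted once.
    """
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(orfs):
        if cur_end is None or start > cur_end:
            # Close the previous merged interval and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return 100.0 * covered / genome_length
```

Applied to the deposited sequence (KP658210) and the coordinates in S1 Table, these functions would be expected to give values close to those reported, although the exact coding percentage depends on how overlaps and stop codons are counted.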

Replication genes
The core genes of CnmeGV involved in DNA replication, alk-exo (ORF104), dnapol (ORF119), lef-1 (ORF62), lef-2 (ORF25), helicase (ORF79), were detected. Other replication genes that belonged to lepidoptera baculovirus conserved genes discovered in CnmeGV were

Fig 1. Circular map of the CnmeGV genome. ORFs and transcription direction are indicated as arrows. Core genes are indicated by red arrows, genes present in other baculoviruses by pink arrows, unique genes by blue arrows and hrs by yellow squares. The innermost circle shows GC skew, which indicates possible locations of the DNA leading strand, lagging strand, replication origin, and replication terminus during DNA replication. Below-average GC skew is light orange and above-average GC skew is dark orange. The next innermost circle is a GC plot, with light green representing below-average GC content and dark green indicating above-average GC content.
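GC skew, as plotted in the innermost circle of the genome map, is conventionally computed as (G - C)/(G + C) over sliding windows. The sketch below illustrates that calculation in Python; the window and step sizes are arbitrary assumptions rather than the settings used for the figure, and the sequence is treated as linear, so windows do not wrap around the origin of the circular genome.

```python
def gc_skew(seq: str, window: int = 1000, step: int = 500):
    """Return a list of (window_start, skew) pairs, with skew = (G - C) / (G + C).

    window_start is a 0-based offset. Windows containing no G or C
    are assigned a skew of 0.0.
    """
    seq = seq.upper()
    result = []
    last_start = max(len(seq) - window + 1, 1)
    for start in range(0, last_start, step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        skew = (g - c) / (g + c) if (g + c) else 0.0
        result.append((start, skew))
    return result
```

Sign changes in the cumulative or windowed skew are what is commonly read as a hint of replication origin and terminus.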

Unique ORFs
Thirty-two (32) ORFs appeared to be unique to CnmeGV compared to the rest of the members of Baculoviridae (ORF9, 10, 11, 12, 13, 14, 15, 19, 27, 28, 50, 51, 52, 53, 55, 56, 66, 86, 89, 92, 93, 102, 107, 108, 114, 115, 116, 121, 123, 124, 127, 131). The predicted proteins were peptides with no significant similarity to any other sequences in GenBank. Among these unique ORFs, 14 were in the same orientation as the granulin ORF and 18 in the opposite orientation. Three (3) ORFs (ORF15, 27 and 131) presented a late promoter motif (GATA), suggesting expression at a late stage of viral infection. Early promoter motifs, including CAKT and TATAWAW, were also detected upstream of the start codon in other ORFs. ORF14, the longest of the unique ORFs, encodes a putative protein of 271 aa. It had no significant BlastP hits and has an early promoter element (TATAAAT) upstream of the first ATG. ORF55 encodes the shortest of the unique hypothetical proteins (52 aa). An early promoter motif (ATTTATA) was also found 57 nt upstream of ORF55. The proteins encoded by the other unique ORFs also showed no significant BlastP hits. CnmeGV thus appears to carry a comparatively large number of unique genes. Whether these are functional ORFs of CnmeGV will require further experimentation.
The SMART program detected 15 unique ORFs encoding at least one region with a special domain or a restricted amino acid composition. Three (3) ORFs (ORF107, 116, 124) were found to contain trans-membrane helix regions by the TMHMM v2.0 program. Thirteen (13) ORFs (ORF13, 14, 28, 55, 56, 66, 86, 92, 107, 108, 114, 123, 131) were found to contain low-complexity regions (LCRs) by the SMART program. ORF56 encodes three LCRs, the longest of which spans 61 aa (residues 62-122). ORF28 was found to contain a coiled-coil region by the COILS program. Coiled-coil segments lie in areas that possibly play a functionally important role [30]. Interestingly, the predicted protein of ORF28 also contains a zinc finger-like domain of 4-50 residues characterized by two C2C2 zinc finger motifs that putatively bind two atoms of zinc. This domain has been hypothesized to be involved in protein dimerization [31], to act as a ubiquitin ligase [32], or to be necessary for DNA binding and zinc-dependent repression [33]. In addition, zinc fingers are typical motifs of DNA/RNA regulatory proteins, whereas the coordination of heavy metals is in some cases a characteristic of different metallothioneins [17]. These assumptions will need further experimentation.
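Degenerate promoter motifs such as CAKT and TATAWAW use IUPAC codes (K = G or T, W = A or T), so searching for them upstream of a start codon amounts to expanding the codes into character classes. The following Python sketch illustrates one way to do this; it is not the procedure used in this study. The function name, the 120 nt search window and the convention of measuring distance from the first base of the motif to the A of the ATG are all assumptions, and only forward-strand ORFs are handled.

```python
import re

# Subset of IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "K": "[GT]", "W": "[AT]", "R": "[AG]", "Y": "[CT]",
    "M": "[AC]", "S": "[CG]", "N": "[ACGT]",
}


def find_upstream_motifs(genome: str, orf_start: int, motif: str, window: int = 120):
    """Return distances (nt) from each motif occurrence to the start codon.

    genome:    forward-strand sequence
    orf_start: 0-based index of the A of the ATG (forward-strand ORFs only)
    motif:     IUPAC motif, e.g. "TATAWAW"
    window:    how far upstream to search (assumed value, not from the paper)
    """
    lo = max(0, orf_start - window)
    upstream = genome[lo:orf_start].upper()
    pattern = "".join(IUPAC[base] for base in motif.upper())
    # A lookahead is used so that overlapping occurrences are all reported.
    hits = re.finditer("(?=" + pattern + ")", upstream)
    return [orf_start - (lo + m.start()) for m in hits]
```

For ORFs on the opposite strand, the same search could be run on the reverse complement of the genome after converting the coordinates.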

Tandem repeats (TRs) and Homologous regions (hrs)
Tandem repeats (TRs) are DNA sequences in which the repeat units lie directly adjacent to one another, reflecting their origin in local duplications. These ubiquitous, unstable elements were found to combine characteristics of genetic and epigenetic changes that might facilitate organismal evolvability [34]. In the genome of PhopGV, 134 TRs were detected at a frequency of 7.65% of the genome, the highest TR content among the betabaculovirus genomes sequenced to date. The fewest TRs were detected in the genome of CalGV, which has 4 TRs at a frequency of 0.24% of the genome (Table 1). In a screening of the CnmeGV genome for repeated sequences with Tandem Repeats Finder [35], 53 TRs were found with sequence lengths from 25 to 551 nt. Fifteen (15) TRs were located in coding regions, 22 in non-coding regions and 16 spanned both. Together, the 53 TRs accounted for 3.83% of the CnmeGV genome. TRs in baculovirus genomes enhance transcription from early gene promoters and act as mediators of rapid phenotypic change in coding sequences [34,36].
Mutations in these repeats often have fascinating phenotypic consequences [36]. Changes in the number of repeat units, recombination and replication slippage all give rise to mutations in TRs [37]. Tr51, the TR with the fewest repeat units in the CnmeGV genome, contained 3 repeat units. A hairpin-loop secondary structure was predicted by DNAMAN 8 (minimum free energy of the structure -14.01 kcal/mol) (Fig 2B). Such structures contribute to TR variability [36]. In addition to inherent instability, TR mutation can also be affected by external factors [38]. For example, CAG repeat stability is modulated by the chaperone protein Hsp90 in human cells. Hsp90 function can be overwhelmed by severe environmental stress, so that it mediates an influence of the environment on TR mutation rates [39]. In the CnmeGV genome, Tr6, a CAG repeat with 12.3 repeat units, was found in a coding region (Fig 2C) that encodes a hypothetical protein of 333 amino acids. It is possible that the number of CAG repeat units, and hence the stability of the hypothetical protein, responds to the environment. This assumption will need further verification.
In the baculovirus genomes sequenced so far, it is common to find 1 to 16 homologous regions (hrs) [40]. Generally, the hrs in baculoviruses are intergenic repeats that play putative or demonstrated roles as enhancers of transcription and origins of replication [41]. Twenty-seven (27) imperfect palindromes, distributed among 6 hrs, were identified in the CnmeGV genome, whereas only 4 hrs were identified in the ClanGV genome sequence [42]. Alignment of these sequences revealed a typical palindrome structure, and alignment of the shorter palindromes shows that they share a conserved 10 bp inverted repeat (Fig 2A). Similar palindromic hr structures have been found in numerous GVs (EpapGV, CrleGV, AdorGV, and ChocGV) [43,44,15]. The largest intergenic region containing imperfect palindromes was found between ORF26 and ORF27 in CnmeGV. It is 757 bp long and has a high A+T content (about 73%). Eleven (11) imperfect palindromes were identified in this region, which was designated hr2, but it revealed no significant homology in tBlastX searches.
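A fractional copy number such as 12.3 arises when a run of complete repeat units is followed by a partial unit. The Python sketch below illustrates this for perfect repeats of a known unit such as CAG. It is a simplified illustration with a hypothetical function name, not the algorithm of Tandem Repeats Finder, which also tolerates mismatches and indels and discovers the repeat unit itself.

```python
import re


def perfect_tandem_runs(seq: str, unit: str, min_copies: int = 3):
    """Find runs of a perfectly repeated unit and report fractional copy numbers.

    Returns a list of (start, end, copies) with 0-based, end-exclusive
    coordinates. A trailing partial unit is included in the copy number,
    e.g. 12 full copies of CAG followed by a lone C count as 12.3 copies.
    """
    seq, unit = seq.upper(), unit.upper()
    pattern = re.compile("(?:%s){%d,}" % (re.escape(unit), min_copies))
    runs = []
    for m in pattern.finditer(seq):
        end = m.end()
        # Extend the run base by base while it still matches the start of the unit.
        partial = 0
        while (partial < len(unit) and end + partial < len(seq)
               and seq[end + partial] == unit[partial]):
            partial += 1
        copies = (end - m.start() + partial) / len(unit)
        runs.append((m.start(), end + partial, round(copies, 1)))
    return runs
```

For example, `perfect_tandem_runs("TT" + "CAG" * 12 + "CTT", "CAG")` reports a single run of 12.3 copies.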
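A DNA palindrome is a sequence that equals its own reverse complement, and an imperfect palindrome is one that does so apart from a few mismatched positions. The following Python sketch shows how candidate imperfect palindromes and the A+T content of a region could be computed. The function names, the fixed window length and the mismatch threshold are illustrative assumptions; this is not the REPuter algorithm used in this study.

```python
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def palindrome_mismatches(s: str) -> int:
    """Count base pairs (i, n-1-i) that are not Watson-Crick complementary.

    Returns 0 for a perfect palindrome such as GAATTC. The central base of
    an odd-length sequence is ignored.
    """
    s = s.upper()
    n = len(s)
    return sum(PAIR.get(s[i]) != s[n - 1 - i] for i in range(n // 2))


def find_imperfect_palindromes(seq: str, length: int, max_mismatch: int = 2):
    """Return (start, subsequence) for every window within the mismatch limit."""
    seq = seq.upper()
    return [
        (i, seq[i:i + length])
        for i in range(len(seq) - length + 1)
        if palindrome_mismatches(seq[i:i + length]) <= max_mismatch
    ]


def at_content(seq: str) -> float:
    """Return the A+T percentage of a sequence (counted over its full length)."""
    seq = seq.upper()
    return 100.0 * (seq.count("A") + seq.count("T")) / len(seq)
```

In practice, overlapping windows that pass the threshold would still need to be merged and inspected by alignment, as was done for the hr palindromes here.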

ORF21, with double chitin-binding domains
Chitin is an important component of the insect cuticle and the peritrophic matrix (PM) lining the gut epithelium. The baculovirus chitinase gene is usually expressed in the late phase of virus replication; the enzyme hydrolyzes chitin in the body of the insect and thereby promotes terminal host liquefaction [45]. CnmeGV ORF21, which encodes a predicted protein of 173 aa with a size of 19.85 kDa, is homologous to the chit-1 gene. A baculovirus consensus late promoter motif, TTAAG, was found 8 nt upstream of the start codon ATG, indicating that ORF21 may be expressed in the later stages of the infection cycle. The ORF21 protein contains a trans-membrane helix region as detected by the TMHMM v2.0 program (Fig 3B). The region starts at residue 7 and ends at residue 24. Moreover, two special domains, CBM_14A and CBM_14B, belonging to the carbohydrate-binding module family 14 (CBM_14), were located by SMART at residues 40-95 and 99-154 of the protein (Fig 3A).
The CBM_14 family is also known as the peritrophin-A domain and is found in chitin-binding proteins, particularly the PM proteins of insects and animal chitinases [46–48]. Homologous genes were also found in some other betabaculoviruses, but these genes contain only one chitin-binding region (Fig 4). All the chitin-binding domains are characterized by a six-cysteine-containing motif: C-x(13,20)-C-x(5,6)-C-x(9,19)-C-x(10,14)-C-x(4,14)-C [49]. Comparing the homologous genes among sequenced betabaculoviruses using BlastP and subsequently aligning them with the ClustalX program, we found high similarity among the homologues of HearGV, XcGV, SpfrGV, ChfuGV, ChocGV and EpapGV, but not those of CnmeGV and PsunGV (Fig 4). Chitin-binding proteins encoded by baculoviruses might be involved in virus-host interactions during the infection cycle [50]. Binding of the protein GP37 to the chitin of the Spodoptera litura PM was found to facilitate virus infection by targeting the chitin component of the PM [51]. The double chitin-binding domains of CnmeGV ORF21 might be more beneficial to virus infection and host liquefaction.
Homologous proteins of ORF21 were also found in 5 alphabaculoviruses: DekiNPV, BmNPV, AcMNPV, RaouNPV, and PlxyNPV. A phylogenetic tree was reconstructed based on these conserved domains (Fig 3C). The viruses were divided into two major groups, GV and NPV. CnmeGV, although a member of the genus Betabaculovirus, grouped with the NPVs rather than with the GVs, which suggests that ORF21 of CnmeGV may have been acquired by horizontal gene transfer (HGT). Two hypotheses are proposed for the introduction of ORF21 into CnmeGV: (1) CnmeGV acquired the ORF21 gene from an NPV during co-infection of C. medinalis; and (2) CnmeGV acquired the gene from the host itself, with viruses possibly acting as vectors of HGT between insects or animals [52]. Although no transposable elements (TEs) were detected in ORF21, a polyglutamine tract encoded by a trinucleotide repeat ([CAA]13) was found at residues 156-168. Polyglutamine-containing proteins appear to be over-represented among spliceosome components [53].

Relationships with other baculoviruses
Gene colinearity was analyzed by comparing CnmeGV to all the other sequenced GVs and to AcMNPV, the type species of the genus Alphabaculovirus, using the Artemis Comparison Tool. Syntenic maps of CnmeGV and other baculovirus genomes were constructed through tBlastX comparison between genomes, with blue stripes indicating inversions and colour intensity reflecting the percentage identity [20]. The conserved gene colinearity of all 17 GV genomes and the poorly conserved synteny between GVs and AcMNPV are shown in Fig 5. Gene order was clearly conserved among betabaculovirus species and differed from that of the alphabaculovirus, with a high gene order correlation between CnmeGV and the other GVs apart from some inversions and shifts. Nevertheless, CnmeGV differs from the rest of the GVs by two main gene block inversions of about 19.8 kb and 16.8 kb. These inversions of large portions of the genome were observed in the regions nt 17931-37024 and nt 81922-98743, which contain eight major ORFs: p74, ubi, p106, odv-e66, alk-exo, dna ligase, lef-9 and dnapol. Most of the differences in organization among GV genomes could be explained by insertions and deletions that contribute to the plasticity of the viral population [54,55].
The neighbor-joining (NJ) and unweighted pair-group method with arithmetic means (UPGMA) trees were generated using the concatenated amino acid sequences of the partial polh/gran, lef-8 and lef-9 genes from 68 baculovirus genomes. The UPGMA tree showed higher bootstrap values (Fig 6 and S1 Fig). The resulting cladogram reproduced the grouping into four genera, reflecting the current systematic assignment of the virus family [56]. As expected, CnmeGV was grouped in the genus Betabaculovirus. A close relationship between CnmeGV and AdorGV was supported by a high bootstrap value. In previous reports, the betabaculoviruses were mainly divided into two well-separated monophyletic clades, Clade "a" and Clade "b". GVs of Clade "a" were isolated mainly from noctuid hosts, while those of Clade "b" were isolated from other hosts [57]. However, the cladogram obtained in this work, based on 17 complete GV genomes, did not support the division of the betabaculoviruses into two separate monophyletic clades. The same result was also shown by other authors who constructed trees using all core genes or the polh/gran, lef-8 and lef-9 genes [58,17]. Additionally, comparison with the evolution and phylogeny of Lepidoptera shows no direct correlation between the classification of the insect hosts and that of their viruses [59].
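UPGMA builds a tree by repeatedly merging the two closest clusters and recomputing distances as size-weighted averages. The Python sketch below illustrates only that clustering step on a pairwise distance table. It does not reproduce the MEGA analysis in this study (no sequence alignment, distance model, branch lengths or bootstrap resampling), and the function name and input format are assumptions.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by UPGMA and return the topology as nested tuples.

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) to the distance between a and b
    """
    sizes = {name: 1 for name in names}  # cluster -> number of taxa it contains
    d = dict(dist)
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance to every remaining cluster is the size-weighted average.
        for c in sizes:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))
```

With invented distances in which two taxa are much closer to each other than to a third, the two close taxa are joined first and the third is attached last.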

Conclusion
In this study, CnmeGV, the first betabaculovirus isolated from a crambid host, was sequenced and characterized. Its genome encodes 133 putative ORFs, including the 37 baculovirus core genes. In addition, it contains 32 unique genes of unknown function that are not shared with the rest of the family. The unique ORF28 protein contains a specialized zinc finger-like domain and a coiled-coil region hypothesized to be involved in special functions. Fifty-three (53) TRs and 6 hrs were identified interspersed throughout the genome. ORF21 presents two peritrophin-A domains of the CBM_14 family that might be beneficial to virus infection and host liquefaction. There was also evidence of an HGT event from Alphabaculovirus to Betabaculovirus. Phylogenetic analysis revealed that CnmeGV is a new Betabaculovirus species closely related to AdorGV. The cladogram obtained in this work grouped the 17 complete GV genomes into one monophyletic clade.

Virus and viral DNA separation
This study was carried out on private land (E: 119.388888, N: 32.479142), and the owner permitted us to conduct the study on this site. CnmeGV was isolated from a larva of the rice leaffolder C. medinalis collected in 2008 and stored in the lab. The virus is not an endangered or protected species. It was multiplied in the laboratory by feeding second instar larvae with CnmeGV OBs. Infected larvae were homogenized with ddH2O, filtered through four layers of gauze, and centrifuged at 7727 x g for 10 min. The pellet was suspended in 0.5% (w/v) SDS and the centrifugation steps were repeated 5 times until the liquid became clear. Then, the pellet was suspended in a 40-65% sucrose gradient and centrifuged at 400 x g for 10 min to remove the debris of larval tissue. Finally, the OBs were collected in ddH2O. Viral DNA was extracted according to Wang et al. [60]. Its integrity and identity were analyzed with a NanoDrop and an Agilent 2100 Bioanalyzer.

DNA sequencing and analysis
The CnmeGV genomic DNA was sequenced with a PacBio RS II at Nextomics in Wuhan, China, and assembled de novo using HGAP2.2.0 [61,62]. Annotation was performed using RAST [63] to identify ORFs starting with a methionine codon (ATG). The criterion for defining an ORF was a size of at least 150 nt (50 aa) with minimal overlap. Homology searches were done using BlastP against the NCBI database. The complete genome was compared with other betabaculovirus genomes using the Artemis Comparison Tool (ACT) [64] (http://www.sanger.ac.uk/resources/software/act/) and the tBlastX program. Tandem Repeats Finder [35] (http://tandem.bu.edu/trf/trf.html) was used to locate and analyze tandem repeats. The REPuter program [65] (http://bibiserv.techfak.uni-bielefeld.de/reputer) was applied to analyze homologous repeat regions. The secondary DNA structure and the alignment of these sequences were predicted with the DNAMAN 8 and ClustalX programs [66].
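The ORF criterion stated above (an ATG-initiated reading frame of at least 150 nt, i.e. 50 codons) can be illustrated with a simple six-frame scan. The Python sketch below is not the RAST annotation pipeline used in this study. It assumes a linear sequence (ORFs spanning the origin of the circular genome are missed), counts the start codon among the 50 codons, takes the first ATG after each stop as the start, and does not apply the "minimal overlap" filter.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 50):
    """Return sorted (start, end, strand) tuples for ATG-initiated ORFs.

    Coordinates are 0-based, end-exclusive, given on the forward strand,
    and include the stop codon. min_codons counts sense codons only
    (start codon included, stop codon excluded).
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    strands = (("+", seq), ("-", seq.translate(COMPLEMENT)[::-1]))
    for strand, s in strands:
        for frame in range(3):
            start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif codon in STOP_CODONS:
                    if start is not None and (i - start) // 3 >= min_codons:
                        end = i + 3
                        if strand == "+":
                            orfs.append((start, end, "+"))
                        else:
                            # Map reverse-strand coordinates back to the forward strand.
                            orfs.append((n - end, n - start, "-"))
                    start = None
    return sorted(orfs)
```

Candidate ORFs found this way would still need to be filtered for overlap and checked against homology searches, as described in this section.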
The phylogenetic analysis was based on the amino acid sequences of 3 core genes (polh/gran, lef-8, lef-9) from CnmeGV and the other 67 baculoviruses listed in the NCBI genome database. NJ and UPGMA phylogenetic trees (1000 bootstrap replicates) were inferred from the amino acid sequence alignments using MEGA, version 5.1.