The Discovery, Distribution, and Evolution of Viruses Associated with Drosophila melanogaster

Drosophila melanogaster is a valuable invertebrate model for viral infection and antiviral immunity, and is a focus for studies of insect-virus coevolution. Here we use a metagenomic approach to identify more than 20 previously undetected RNA viruses and a DNA virus associated with wild D. melanogaster. These viruses not only include distant relatives of known insect pathogens but also novel groups of insect-infecting viruses. By sequencing virus-derived small RNAs, we show that the viruses represent active infections of Drosophila. We find that the RNA viruses differ in the number and properties of their small RNAs, and we detect both siRNAs and a novel miRNA from the DNA virus. Analysis of small RNAs also allows us to identify putative viral sequences that lack detectable sequence similarity to known viruses. By surveying >2,000 individually collected wild adult Drosophila we show that more than 30% of D. melanogaster carry a detectable virus, and more than 6% carry multiple viruses. However, despite a high prevalence of the Wolbachia endosymbiont—which is known to be protective against virus infections in Drosophila—we were unable to detect any relationship between the presence of Wolbachia and the presence of any virus. Using publicly available RNA-seq datasets, we show that the community of viruses in Drosophila laboratories is very different from that seen in the wild, but that some of the newly discovered viruses are nevertheless widespread in laboratory lines and are ubiquitous in cell culture. By sequencing viruses from individual wild-collected flies we show that some viruses are shared between D. melanogaster and D. simulans. Our results provide an essential evolutionary and ecological context for host–virus interaction in Drosophila, and the newly reported viral sequences will help develop D. melanogaster further as a model for molecular and evolutionary virus research.


Introduction
Viral infections are universal, and virus-mediated selection may play a unique role in evolution [1]. Viruses are also responsible for highly pathogenic diseases, and the detection, treatment, and prevention of viral disease are important research goals. The model fly, Drosophila melanogaster, provides a valuable tool to understand the biology of viral infection [2,3] and antiviral immune responses in invertebrates [4,5], as well as the interaction between viruses and their vectors [6]. Drosophila has also helped elucidate the role of RNA interference (RNAi) as an antiviral defence [7,8] and has shown that endosymbiotic Wolbachia can protect against viruses [9,10]. Recently, the Drosophilidae have been used to address important questions in virus evolution, including determinants of host-range and disease emergence [11][12][13]. However, although Drosophila virus research has a long history, few D. melanogaster viruses are known in the wild [4,14], and experiments using non-natural Drosophila pathogens may bias our understanding of immune function and its evolution [12].
Following the discovery of Sigma Virus in D. melanogaster (DMelSV, Rhabdoviridae; reviewed in [15]), classical virology surveys in the 1960s and 1970s uncovered Drosophila C Virus (DCV, Dicistroviridae), Drosophila A Virus (DAV, related to Permutotetraviridae), Drosophila X Virus (DXV, Birnaviridae), DFV (Reoviridae), DPV, and DGV (unclassified) in D. melanogaster [4,14]. Subsequent transcriptomic studies of D. melanogaster identified D. melanogaster Nora Virus (unclassified Picornavirales; [16]), and analyses of small RNAs individuals of D. melanogaster, D. ananassae, D. malerkotliana, and Scaptodrosophila latifasciaeformis. Two collections (S and T) were sampled from fruit baits in southern England, and aliquots pooled for total RNA sequencing represented around 3,000 D. melanogaster (<1% other Drosophila). In total 0.5% of all reads mapped to DAV, DCV, DMelSV, and Nora Virus (S1 Fig and S1 Table). Viral read numbers varied dramatically between samples, with DCV absent from pool EIK and DAV absent from pool ST (verified by qPCR; S2 Fig). Only three reads mapped to DXV, and no reads mapped to Drosophila Totivirus, Drosophila Birnavirus, or American Nodavirus (S1 Fig and S1 Table). As DXV is reported to be a cell culture contaminant [14], and the other unmapped viruses were described from cell culture [17], this suggests that these viruses may be rare in wild flies.
To discover novel viruses, we assembled all RNA-seq reads de novo using Trinity [50] and Oases [51], and used the Basic Local Alignment Search Tool (BLAST) [52] against the Genbank protein reference sequences [53] to identify virus-like sequences. Raw de novo metagenomic contigs are provided in S2 Data. Virus-like contigs were supplemented with PCR and targeted Sanger sequencing to improve completeness, and in total we identified more than 20 partial viral genomes. Those sequences that could be unambiguously associated with D. melanogaster rather than other Drosophila species were provisionally named according to collection locations. Based on sequence similarity, these "BLAST-candidate" viruses included two Reoviruses ("Bloomfield Virus" and "Torrey Pines Virus"), three Flaviviruses ("Charvil Virus", and two others), a Permutotetravirus ("Newfield Virus"), a Nodavirus ("Craigie's Hill Virus"), a Negevirus, a Bunyavirus, two Iflaviruses ("La Jolla Virus" and "Twyford Virus"), two Picornalike viruses ("Thika Virus" and "Kilifi Virus"), a virus related to Chronic Bee Paralysis Virus ("Dansoman Virus"), a virus related to Sobemoviruses and Poleroviruses ("Motts Mill Virus"), six Partitiviruses, and a Nudivirus ("Kallithea Virus"). These novel viruses constituted a further 3.3% of RNA-seq reads, taking the viral total to 3.8% of all reads (S1 Fig). Further details of the new viruses are given in Table 1 and S2 Table, and virus sequences have been submitted to Genbank as KP714070-KP714108 and KP757922-KP757936. We note that fragments of Bloomfield Virus and Thika Virus were simultaneously identified in [54], there denoted DRV and DUV, respectively.
To place the virus sequences in a phylogenetic context, we subjected conserved regions to phylogenetic analysis along with known viruses and uncurated viral sequences from the NCBI Transcriptome Shotgun Assemblies [55]. Kallithea Virus, a DNA virus, was closely related to Nudiviruses from Drosophila innubila [56] and the beetle Oryctes rhinoceros [57]. Most of the new RNA viruses were relatives of known or suspected insect pathogens (Table 1 and S3 Fig). For example, based on polymerase sequences, Torrey Pines Virus is distantly related to Aedes pseudoscutellaris reovirus and several lepidopteran Cypoviruses, while Dansoman Virus is related to Chronic Bee Paralysis Virus. Others were distantly related to arthropod viruses, but were close to uncurated transcriptome sequences (Fig 1). For example, Bloomfield Virus is closely related to transcriptome sequences from the flies Delia antiqua and Teleopsis dalmanni (63% AA identity in the replicase), and distantly related to Nilaparvata lugens reovirus. Similarly, Motts Mill Virus is related to the recently described Ixodes scapularis (tick) associated viruses 1 and 2 [58], but is closer to uncurated transcriptomic sequences from three bees and the bobtail squid Euprymna scolopes. Together these sequences appear to represent a novel clade related to plant Sobemoviruses and Poleroviruses (Fig 1). Strikingly, the six unnamed Partitivirus-like polymerases were not related to any known insect pathogens (known Partitiviruses are pathogens of plants and fungi), but were related to uncurated sequences from arthropod transcriptomes-again suggesting a novel lineage of insect viruses (Fig 1).
The close relationship between the newly identified virus-like sequences and known arthropod viruses suggests that they are likely to be Drosophila pathogens, and some may be previously described viruses that lack sequence data. For example, Kilifi Virus or Thika Virus (related to the Picornavirales; S3 Fig) may correspond to DPV-which was reported to have a 25-30 nm particle and infect gut tissues [59]. Similarly, either Torrey Pines Virus or Bloomfield Virus could correspond to Drosophila F Virus [14], Drosophila K Virus or other reoviruses [21]. However, as those viruses were described only from capsid morphology, density, and serology, it would be challenging to conclusively link them with the novel sequences presented here.

Genomic Data and Small RNAs Imply Active Viral Infection
Although phylogenetic analyses show that these sequences are viral in origin, they may represent Endogenous Viral Elements integrated into the Drosophila genome ("fossil" viruses, or EVEs [60]), or they may derive from gut contents or surface contamination rather than active infections. To exclude the possibility that these sequences represent EVEs segregating in To test if these sequences derive from active viral infections, we additionally sequenced all 17-29 nt small-RNAs from the EIK and ST pools, reasoning that virus-like sequences will only be processed into 21 nt siRNAs (viRNAs, derived from replicative intermediates by Dicer-2) if they represent active infections within host cells [49,62]. In total ca. 7% of all 17-29 nt small-RNAs derived from DAV, DCV, DMelSV, and Nora Virus, and ca. 9% derived from the new "BLAST-candidate" viruses. As expected, for most viruses the viRNA size distribution was tightly centred on 21 nt reads and included reads from both the genomic and complementary strands, consistent with active viral infections processed by antiviral RNAi in Drosophila (Figs 2, S4 and S5).
Even if viral sequences do not represent EVEs or inactive contaminants, they could instead be active infections of Drosophila-associated microbiota rather than Drosophila. However, the 21 nt viRNAs observed for the majority of these viruses are inconsistent with viral infection of likely parasites or parasitoids such as hymenoptera, chelicerata, or nematodes, which have predominantly 22 nt viRNAs [63,64]. And, while 21 nt viRNAs could derive from viral infections of some fungi [65] or from eukaryotes with uncharacterised antiviral RNAi, in most cases the phylogenetic position of these viruses, high read numbers, and/or their appearance in laboratory fly stocks and cell culture (below) argue in favour of Drosophila as the host. Bars plotted above the x-axis represent reads mapping to the positive strand, and bars below represent reads mapping to the negative strand. All peak at 21 nt, except Kallithea Virus (a large DNA Nudivirus encoding a 22 nt miRNA) and Twyford Virus (Iflavirus), which peak at 22 nt. Note that DCV, Kilifi, Thika, and Nora Virus show a wide shoulder, up to 28 nt for positive-strand reads, which may indicate a difference in small RNA processing or retention. Bars are coloured according to the proportion of reads with each 5'-base (A-green, C-blue, G-yellow, U-red), and all except Twyford Virus show the expected pattern, including a slight bias against G at the 5' position. Counts used to plot this figure are provided in S1 Data.

Small RNAs Are Markers for Novel Virus-Like Sequences
In addition to the viral candidates identified by BLAST, we reasoned that contigs which lack BLAST similarity to reference sequences, but which display a signature of Dcr-2 processing (high levels of 21-23 nt siRNAs and low levels of 25-29 nt piRNAs), may also be viral in origin (Fig 3; This approach was very recently advocated in [54], but see also [17,65]). Using these small RNA criteria we identified a list of "siRNA-candidate" contigs that are potentially viral in origin, but could not be placed within a phylogeny of known viruses. Several siRNA-candidate viruses identified in this way were subsequently attributable as fragments of the BLASTcandidate viruses by other means (below). Of those that could not be attributed to viral genomes, two were provisionally named (Chaq Virus and Galbut Virus) and the remaining 57 Small RNA and RNA-seq reads were mapped to all de novo Trinity and Oasis contigs (regardless of origin), and the numbers of 21-23 nt RNA reads (viRNA-like) and 25-29 nt RNA reads (piRNA-like) were recorded relative to the number of RNAseq reads. For small RNAs all four of the mixed ST and EIK pools were combined, while for RNAseq data a single rRNA-depleted, non-normalised, mixed pool was used. Contigs with at least one small RNA read are shown in pale grey, and contigs are classifiable as more "virus-like" toward the top right as the total number of siRNAs increases and the ratio of siRNA:piRNA reads increases. As expected, DAV, DCV, Nora Virus, and DMelSV (yellow squares) have large numbers of siRNAs and few piRNAs, as do named BLAST-candidate viruses (blue circles), and other contigs with sequence similarity to known viruses (white circles). Based on the sector delimited by these viruses, we identified contigs with no BLAST hits in either Genbank refseq_rna or refseq_protein that we designate "siRNA candidate viruses" (dark grey). Two of these were given provisional names (Galbut Virus and Chaq Virus: red triangles). Note that each virus may appear more than once in the figure if more than one contig derives from that virus. Outliers include (A) a miRNA-bearing fragment from Kallithea virus, (B) unknown sequence with no strong blast hit, and (C) a Jockey-like retrotransposon, expected to be a source of piRNAs. The read-mapping count data are provided in S1 Data and raw de novo contigs are provided in S2 Data.
contigs were submitted to Genbank as uncultured environmental virus sequences (KP757937-KP757993). These unnamed siRNA-candidate viruses contribute 0.2% of all RNAseq reads, and 1% of 17-29 nt small RNAs in the EIK and ST pools (S1 Fig and S1 Table). As with the BLAST-candidate viruses, these siRNA-candidate viruses were absent from D. melanogaster genomic reads and their siRNAs were predominantly 21 nt in length and derived from both strands (S6 Fig), again suggesting that they represent active viral infections of Drosophila. Although the siRNA-candidate viruses displayed no strong BLAST similarity to known viruses, Galbut Virus and siRNA-candidate 24 display weak similarity to Nilaparvata lugens Commensal X Virus (a satellite virus with unknown helper [66]), and Galbut and Chaq viruses appear to be related to uncurated sequences present in a diverse set of arthropod transcriptome shotgun datasets (Fig 1 and

RNA Viruses Differ in the Numbers and Properties of Small RNAs
To test whether viruses vary in their small RNA profile, we analysed small RNAs from the EIK and ST pools, and from two libraries for each of the five collections (E, I, K, S, and T), with and without "High Definition" ligation adaptors designed to reduce ligation bias [67]. Overall, the number of viRNA reads per RNAseq read varied substantially among viruses ("viRNA ratio;" Figs 3 and S7), with Twyford Virus and Motts Mill Virus giving rise to more than a 1,000-fold additional 21-23 nt viRNAs than DCV, and DMelSV giving rise to nearly 7,000-fold more (S7 Fig). For viruses present in both EIK and ST, the viRNA ratios were highly correlated between pools (rank correlation ρ > 0.99; S7 Fig), suggesting that they are repeatable. This may reflect differences in the proportion of non-replicating viral genomes (e.g., encapsidated viruses in the gut lumen) which can contribute to RNAseq but are not actively processed by Dcr-2. Alternatively, differences could result from the action of viral suppressors of RNAi (VSRs), such as those encoded by DCV and Nora Virus [7,38].
The sizes and strand bias of small RNAs also varied substantially among viruses, although small RNA reads from the majority of RNA viruses were biased toward the positive strand (50%-70%) and to 21 nt in length (Fig 2), as expected for virus-derived small RNAs in Drosophila [49]. Exceptions included Nora Virus, DCV, Kilifi Virus, and Thika Virus, which showed a stronger positive-strand bias (85%-95% of reads) and a broad size range of positivesense viRNAs peaking at 21 nt, with a wide "shoulder" from 23 to 27 nt (Fig 2). This size distribution is not seen in most DCV infections of cell culture [49], and although 26-30 nt viral piR-NAs have been reported in OSS cell culture [17], the 25-28 nt reads identified here did not display the 5 0 U bias expected of piRNAs (Fig 2) [68]. As these four picorna-like viruses also displayed low numbers of viRNAs (Fig 3; S1 Table and S4 Fig), this is consistent with a difference in the way that the viral genome and/or viRNAs are processed-perhaps reflecting a higher fraction of non-specific degradation products for these viruses. However, because we found that viRNA properties were reproducible across sequencing libraries (S4 Fig), and because viRNAs were sequenced from large pooled samples composed of mixed infections, these profiles cannot result from idiosyncrasies of RNA extraction or library preparation.
In contrast to the other RNA viruses, Twyford Virus displayed unusual viRNAs. Although Twyford Virus is an Iflavirus (S3 Fig) and thus has a positive sense ssRNA genome, the strand bias was strongly negative and the viRNAs peaked sharply at 22-23 nt (cf. 21 nt for other RNA viruses). In addition, although other viruses displayed no strong 5 0 base-composition bias except for a weak bias against 5 0 G, most Twyford Virus viRNAs were 5 0 U, as is seen for piR-NAs (but lacking the A at position 10 expected of Drosophila piRNAs; S5 Fig). This bias is not due to small sample size or low sequence diversity, as we saw >9,000 unique sequences, and 3,500 of those were seen more than once. Comparison with the virus genome showed the 5 0 U bias was driven by differential production or retention of viRNAs, rather than subsequent editing or addition. Interestingly, the 3 0 position was also slightly enriched for U (S7 Fig), and a substantial fraction of 3 0 U were non-templated, indicative of 3 0 uridylation as seen in D. melanogaster and other species in the absence of the Hen-1 methytransferase [69].
These observations suggest that Twyford Virus may not be processed by the Drosophila Ago2-Dcr2 pathway, and could instead represent viral infection of an unknown eukaryotic commensal. Nevertheless, potential arthropod parasites such as chelicerata have not been reported to display this pattern of viRNAs, and although the 5 0 U is reminiscent of the 21U piRNA of C. elegans and related nematodes [70], neither Rhabditid nor Tylenchid nematodes could be detected by PCR in individual wild-caught D. melanogaster carrying Twyford Virus. Similarly, while 22 nt 5 0 -U small RNAs are known from the filamentous fungus Neurospora [71], those are derived from the host genome and are not associated with viral infection. Thus, although speculative, if these 22-23U viRNAs cannot be explained by a non drosophilid host, then it is possible that they reflect a previously unrecognised tissue-specific phenomenon in Drosophila, or one associated with the action of a novel suppressor of RNAi (e.g., suppression of Hen-1). The siRNA-Candidate 14 (KP757950) shows a similar pattern (22-23 nt peak in viRNAs, 5 0 U), suggesting it forms part of the same virus and/or is processed in the same way (S6 Fig).

Kallithea Virus Is Targeted by RNAi and Encodes a miRNA
We also identified many 21 nt viRNAs widely dispersed around the Kallithea Virus genome. This is consistent with an antiviral RNAi response against this DNA Nudivirus, as has been previously reported for Invertebrate Iridovirus 6 artificially infecting D. melanogaster [72]. As DNA viruses often encode miRNAs [73,74], and miRNAs have been implicated in the establishment of latency in Heliothis zea Nudivirus [75], we screened small RNAs from Kallithea Virus for potential virus-encoded miRNAs. One 22 nt RNA sequence was highly abundant (S5 Fig), and is predicted to represent the 5 0 miRNA from a pre-miRNA-like hairpin (miR-Deep2 [76]; S1 Text). The predicted mature 5 0 miRNA (AUAGUUGUAGUGGCAUUA AUUG) represented >35% of all small RNAs derived from Kallithea Virus, while the 3 0 RNA "star" sequence represented 0.3% of reads. This sequence was less highly represented, relative to 21nt viRNAs from Kallithea Virus, in an oxidised library, consistent with the absence of 2 0 O-methylation at the 3 0 end, as expected for miRNAs in Drosophila (S4 Fig) [77]. The seed region displays no obvious similarity to known miRNAs (although positions 5-17 are similar to dme-miR-33-5p), but a survey of potential binding sites in D. melanogaster using miRanda 3.3a [78] identified 522 genes with at least one potential binding site in the 3 0 -UTR (miRanda score 150). These were highly enriched for Gene Ontology terms that might be associated with viral function (including, amongst others, Regulation of Gene Expression, Cell development, mRNA binding, and Plasma Membrane; S3 Table). This miRNA could alternatively regulate virus gene expression [75], and 21 potential binding sites were identified in the Kallithea Virus genome. While predicted miRNA target sites include many false positives [79], and experimental work would be required to confirm a biological role, it is interesting to note that Kallithea Virus' closest relative (Oryctes rhinoceros Nudivirus) does not encode a detectable homolog of the miRNA and contains only four predicted binding sites (miRanda score 150).

Virus Prevalence Is High in Laboratory Stocks and Cell Culture
To detect the BLAST-candidate and siRNA-candidate viruses in previously published Drosophila studies, we mapped up to 2 million reads from each of 9,656 publicly available fly and cell culture RNAseq and small-RNA datasets to the new and previously described Drosophila viruses. Around 33% of "run" datasets contained viral reads above a threshold of 100 reads per million (rpm), representing 39% of "samples" and 58% of submitted "projects" (Figs 4 and S8 and S5 Data). The proportion of positive samples varied with log-threshold, so that 53% had at least one virus at 10 rpm, but 17% of runs had at least one virus at 1,000 rpm. These rates are slightly lower than, but not dissimilar to, previous estimates from serial passage of fly stocks, which found around 40% of fly stocks were infected [14]. BLAST-candidate and siRNA candidate viruses were both found, and by noting their co-occurrence, we were able to identify several siRNA-candidates as component parts of other virus genomes (e.g., segments of Bloomfield Virus and the second segment of Craigie's Hill Virus were initially identified as siRNA Candidate Viruses). The presence of BLAST-candidate and siRNA-candidate viruses in laboratory cultures supports these as bona fide infections of Drosophila and demonstrates the utility of the viRNA signature as a marker for viruses. Virus-derived small RNA reads in publicly available RNA datasets. To discover the distribution of viruses in laboratory cultures (including cell culture) the first 2 million reads from each of 9,656 publicly available RNAseq and siRNA datasets were mapped to all previously reported and newly discovered RNA viruses (S8 Fig and S5 Data). This heatmap illustrates the virus community across 62 small RNA sequencing projects (averaging across read sets within projects) and the metagenomic discovery pools E, I, K, S, and T. Cells are coloured by the number of viRNAs per million total mappable reads, and rows and columns have been re-ordered to cluster similar viruses and similar samples. Only datasets and viruses with >100 rpm in at least one cell are included here. This analysis shows that many of the viruses, including newly discovered ones such as Newfield Virus, are extremely common in laboratory culture (including cell cultures) and demonstrates they are able to replicate in Drosophila. The analysis also suggests a clear difference in the virus community between the wild collected flies and laboratory flies. Previously described viruses are named in blue, new viruses and virus-like sequences in black. Summary data for this figure are provided in S1 Data, and raw counts in S5 Data. Virus, Galbut Virus, DMelSV) were extremely rare, appearing in 12 or fewer datasets. It is widely known that Drosophila cell culture harbours many viruses [21], and we identified multiple viruses and occasionally high numbers of viral reads in cell culture datasets. For example, reads mapping to nine different viruses were found in datasets SRR770283 and SRR770284 (Piwi CLIP-Seq in OSS cells) [80] and 70% of reads from SRR609669-SRR609671 were viral in origin (total RNA from piRNA-pathway knock-downs) [81]. Virus presence/absence for widely used Drosophila Cell cultures [82] is presented in S9 Fig Given the presence of viruses in public RNAseq and siRNA datasets, we selected a subset of 2,188 datasets to perform viral discovery by de novo assembly. In adult D. melanogaster this identified a novel Picorna-like virus (present in three datasets among 9,656) and a novel Negevirus (present in 10 datasets among 9,656). We have provisionally named these Berkeley Virus and Brandeis Virus, respectively (Table 1, S2 Table and S6 Data). The survey also identified a novel Totivirus and several (possibly fragmentary or non-coding) Reovirus segments in cell culture, at least one of which is widespread (214 datasets; Figs 4 and S8 and S5 Data). The small number of novel viruses we identified in fly stocks over and above those described previously, may suggest that few further Drosophila viruses remain to be found regularly infecting laboratory stocks.

Drosophila Viruses Are Common and Widely Distributed in the Field
To infer viral prevalence and distribution in wild flies we used reverse transcription PCR (RT-PCR) to assay for the presence of 16 different viruses in a total of 1,635 D. melanogaster and 658 D. simulans adults sampled from 17 locations across the world (Fig 5 and S4 Table).  Table). Although sampling locations varied substantially in overall viral prevalence (in Athens GA >80% of D. melanogaster carried a virus; in Marrakesh less than 10%) there was no clear geographic structure in viral prevalence (Figs 5 and S10), and for most viruses prevalence was not correlated between D. melanogaster and D. simulans from the same location. Excluding siRNA-candidate viruses, around 30% of D. melanogaster individuals and 13% of D. simulans individuals carried at least one virus, and over 6% of D. melanogaster individuals carried more than one virus (S11 Fig). We are unable to explain the unusually high viral prevalence in D. melanogaster sampled from Athens (Georgia, US)-flies were separated to individual vials within a few hours, and D. simulans and D. melanogaster were netted together, but D. simulans did not show unusually high virus prevalence (S10 Fig).
In contrast to the other viruses, the novel siRNA-candidate viruses Galbut Virus and Chaq Virus often displayed extremely high prevalence. In D. melanogaster, Galbut Virus ranged from 13% in Plettenberg to 100% in Accra, and Chaq Virus from <5% in Edinburgh to 35% in Porto. In D. simulans, Galbut Virus ranged from 57% (Athens) to 76% (Torquay, Australia), although Chaq Virus was not highly prevalent (only 2/7 locations, at low prevalence). These rates are much higher than for the other viruses, and the inclusion of siRNA candidate viruses in overall infection rates brings many populations to 70% of flies carrying at least one virus (Figs 5 and S11). The high prevalence of these "siRNA-candidate" viruses is surprising and could perhaps imply an alternative (non-virus) origin for the sequences. However, we believe a viral origin is supported by a combination of the viral-like siRNA signature, their absence from D. melanogaster genomic reads, their close relationship to unclassified insect transcriptome sequences (Figs 1 and S3), and their occasionally low prevalence (S10 Fig).  For DMelSV, which is the only virus previously surveyed on a large scale [15], our data agree closely with earlier estimates based on other assays: here 4.6% of D. melanogaster infected versus 2.8% in [22] and 5.0% in [23] (DMelSV is absent from D. simulans). Nevertheless, our estimates more generally should be treated with some caution as RT-PCR assays are unlikely to be reliable for all virus genotypes, leading to PCR failure for divergent haplotypes and thus potentially underestimation of prevalence.
Virus prevalence was strikingly different between publicly available RNA datasets and our field survey (Fig 4). For example, we identified Newfield Virus and DCV in many fly stocks and cell cultures but very rarely in the field (Fig 4; also compare Figs 5 and S8), whereas Kallithea Virus and Motts Mill Virus were common in the wild (4.6% and 6.7% global average prevalence in D. melanogaster; Fig 5) but absent from DNA and RNA public datasets. The case of DCV is particularly striking as early surveys assayed by serial passage in laboratory cultures suggested DCV may be common in the field [14], as did PCR surveys of recently established laboratory stocks [83]. In the case of Newfield Virus and DCV, a downward collection bias (for example, if high titre flies do not get collected) may explain the result. However, for viruses that have higher prevalence in the field than the lab such as Kallithea Virus and Motts Mill Virus, and also the siRNA candidates Galbut Virus and Chaq Virus, it seems likely that differences in (e.g.) transmission ecology between the lab and field may explain the disparity.

Wolbachia Prevalence Is Not Detectably Correlated with Virus Prevalence
Infection with the bacterial endosymbiont Wolbachia pipientis has previously been shown to confer protection against secondary infection by some RNA [9,10] but not DNA [84] viruses in insects. We therefore surveyed Wolbachia prevalence by PCR in the wild flies, and tested whether Wolbachia was correlated with virus presence. We found Wolbachia at detectable levels in all populations, ranging from 1.6% of individuals (Accra) to 98% (Edinburgh) in D. melanogaster and from 82% (Plettenberg) to 100% (Marseille) in D. simulans (Figs 5 and S10). As expected [85], overall Wolbachia prevalence was higher in D. simulans (approximately 90%) than in D. melanogaster (approximately 50%). We could not detect any consistent correlation across populations between the prevalence of Wolbachia and that of any virus (S12 Fig), nor could we detect any association between Wolbachia and viral infection status within populations (contingency table tests with p-values combined across populations using Fisher's method; all p > 0.05). While this may indicate that Wolbachia-mediated antiviral protection is not an important determinant of viral infection in the field, our relatively small sample sizes (median n = 63 flies per location) mean that our power to detect an interaction is low. In addition, it is unclear whether a positive or negative association should be expected, as some Wolbachia-host combinations appear to protect via increased tolerance rather than increased resistance [86], not all strains are highly protective [87], virus titre may correlate with Wolbachia titre even if there is no presence-absence relationship, and for at least one DNA virus, resistance is decreased [84].

Timescales of Virus Evolution
To provide a preliminary examination of the evolutionary and ecological dynamics of previously described and novel viruses we used a time-sampled Bayesian phylogenetic approach to infer dates of common ancestry and substitution rates [88,89]. To allow for differences in data coverage across (partial) viral genomes, and to allow for potential recombination, we divided alignments into blocks. These blocks were permitted to vary in mutation-rate (substitutionrate) and date parameters. For Nora Virus and DAV, we were able to use early genomic samples [20,90] to aid time calibration, and we estimated the all-sites mutation rate for these viruses at approximately 4 to 8 × 10 −4 nt -1 yr -1 (Nora Virus; range across blocks, posterior medians) and 2 to 6 × 10 −4 nt -1 yr -1 (DAV). However, short sampling timescales means that these rates are estimated with low precision, resulting in a spread of 95% highest posterior density (HPD) credibility intervals across blocks of 2 to 12 × 10 −4 nt -1 yr -1 and 1 to 9 × 10 −4 nt -1 yr -1 , respectively. This range falls within the expected range for single-stranded RNA viruses [91].
These mutation rates imply dates for the most recent common ancestors of DAV and Nora Virus (Most Recent Common Ancestor [MRCA] for known extant lineages) of approximately 100-200 years ago (range across blocks; range of credibility intervals across blocks 70-320 years) and 50-300 years ago (range across blocks; range of credibility intervals across blocks 33-630 years), respectively. This rapid movement of viruses is consistent with the rapid global movement that has been seen in transposable elements [92], Wolbachia [93], and even Drosophila genomes [94]. Nevertheless, inferred dates of a few hundred years in the past should be treated with caution, as weak constraint can lead to substantial underestimates of the true time to common ancestry [95]. For the other nine RNA viruses analysed, sample sizes were smaller and dated samples were only available from a 5 year window, making unconstrained estimates of mutation rate and MRCA dates difficult so that quantitative estimates for these viruses should be treated with caution (results inferred using a strong prior on mutation-rate are presented in S13 Fig and S8 Data).

Some Viruses Move Repeatedly Between D. melanogaster and D. simulans
Viruses detected in both D. melanogaster and D. simulans might either be reciprocally monophyletic, with substantial divergence between host-specific lineages, or the hosts may be distributed across the virus phylogeny, if between-host movement is common. By incorporating information on host species (D. melanogaster versus D. simulans) in the phylogenetic analysis, we were able detect past movement of RNA viruses between these host species (e.g., Fig 6) and obtain rough estimates of host switching rate [88,89] (S13 Fig and S8 Data). These rate estimates have very low precision, associated with the relatively broad posterior estimates for substitution rate, and should be treated with caution. Nevertheless, there appeared to be substantial variation in host-switching rate amongst the seven RNA viruses that were detected in both hosts, with Chaq Virus showing the lowest rate and La Jolla Virus the highest (Fig 6; S13  Fig and S8 Data). A similar analysis of inter-continental geographical movement suggests substantial genetic structuring between geographic regions may be present in some viruses (S13 Fig and S8 Data), although again further time-sampled data are required for the analyses to provide useful precision.

DAV and Nora Virus Proteins Are Highly Conserved with Little Evidence for Positive Selection
To quantify patterns of selection acting on protein-coding sequences, we applied a phylogenetic approach to infer relative rates of synonymous (dS) and non-synonymous (dN) substitution [96] in 11 of the viruses (quantified as dN-dS or dN/dS; S13 Fig, S9 Data, and S14 Fig). For most viruses, sample sizes were insufficient to obtain robust inferences of selection; however, in general the protein sequences were strongly constrained (S13 Fig). DAV and Nora Virus, which had the most comprehensive sampling, showed among the lowest mean dN/dS ratios (0.25 and 0.24, respectively), although only Nora Virus displayed codons with strong evidence of positive selection. There were no clear patterns of dN/dS variation across genes within DAV or Nora virus genomes (S14 Fig), although three of the four positively selected codons were in the Nora Virus capsid [97], perhaps indicative of host-mediated selection. There was no evidence for positive selection or an elevated dN/dS ratio in the viral suppressor of RNAi encoded by Nora Virus [38], despite apparently rapid host-specialisation in this gene [12] and the rapid adaptive evolution seen in its antagonist, Drosophila Ago2 [47]. Nevertheless, within-species sampling may be unlikely to detect species-wide selective pressure [98] and the disparity in evolutionary timescales of host and virus may mean that even strong reciprocal arms races are not reflected in elevated viral dN/dS ratios (i.e., the time of common ancestry for the viruses is too recent for the host to have driven multiple substitutions within the sampling timeframe [99]).

Conclusions
We have identified around 20 new viruses associated with D. melanogaster in the wild and shown that some of them are common in the laboratory and the field. A substantial fraction of the virus lineages described here are newly (or only recently [33,58]) reported to infect melanogaster blue, D. simulans red) and branches are coloured by the inferred host, suggesting approximately seven host-switches in the tree. An estimated timescale is given in years before present, with 95% credibility intervals for node dates. However, the absolute timescale is tightly constrained by the prior assumption that RNA virus mutation rates fall between 1 × 10 −4 and 11 × 10 −4 mutations per site per year, and given the limited temporal information in the data, should be treated with caution. Note that while sequences generally cluster by location, there is substantial mixing on a global scale. The underlying BEAST xml file (including the alignment and model specification), along with the resulting maximum clade credibility tree file and summaries of posterior distributions, are provided in S7 Data and S8 Data. Viruses of Drosophila melanogaster arthropods. These include the novel "siRNA candidate" viruses with uncertain affiliation, and new lineages related to Partitiviruses, Sobemoviruses and Poleroviruses, and Picornavirales. The presence of these viruses in cell culture and laboratory stocks, their absence from fly genomes, and the presence of virus-derived 21 nt small RNAs all support the majority as bona fide Drosophila infections. Drosophila-associated viruses include many with positive sense RNA genomes (Dicistroviridae, Permutotetraviridae, Flaviviridae, Iflaviridae, Nodaviridae, Negeviruses, and others), negative sense RNA genomes (Rhabdoviridae; relatives of Bunyaviridae), and double-stranded RNA genomes (Birnaviridae, Totiviridae, Partitiviridae, Reoviridae), but few with DNA genomes (only the Kallithea and Drosophila innubila Nudiviruses [56] to date) and no retroviruses (cf. "Errantivirus" endogenous retroviruses [100]; retro-elements or retrotransposons that are usually transmitted as genomic integrations). It therefore seems increasingly unlikely that the apparent wealth of RNA viruses and paucity of retroviruses and DNA viruses in Drosophila represents a sampling artefact, and this may instead reflect underlying Drosophila ecology or immune function.
This newly discovered diversity of Drosophila viruses will prove valuable for the use of D. melanogaster as a model for viral infection and antiviral immunity. First, these data will facilitate the curation and management of laboratory stocks and cell cultures. Second, they will facilitate the isolation and culture of viruses for future experimental work, including natural pathogens of Drosophila that are closely related to medically or economically important viruses such as Flaviviruses and Bunyaviruses, or Chronic Bee Paralysis Virus. Third, our analyses of virus prevalence, distribution, and dynamics will provide an informed evolutionary and ecological context to develop D. melanogaster and its relatives as a model system for viral epidemiology and host-switching. In combination, we hope this work will provide a key reference point for future studies of Drosophila-virus interaction.  Table for details). In each case flies were aspirated or netted from fruit bait at intervals of 24 hours or less, and maintained individually in isolation for up to 10 days on a sugar/hard-agar medium prior to freezing at -80°C. In addition, a small number of laboratory-stock flies were provided by David Finnegan (University of Edinburgh) in February 2008 and February 2010.

Metagenomic and Small-RNA Sequencing
All raw reads have been submitted to the Sequence Read Archive under project accession SRP056120. RNA was extracted using Trizol (Life Technologies) and DNAse treated (Life Technologies) according to the manufacturer's instructions. For metagenomic RNAseq and small-RNA sequencing, aliquots of the samples E, I, K, S, and T were initially mixed into two pools: Dmel/UK (denoted ST) and mixed-species/non-UK (EIK). RNAseq was performed by Edinburgh Genomics (Edinburgh). Following ribosome-depletion using two cycles of RiboMinus (Life Technologies), around 40 million high-quality paired-end 100 nt Illumina reads were generated from a single sequencing lane of each library (accessions SRR1914484 and SRR1914527). Small-RNA sequencing from the same two pools was performed with a periodate oxidation step to reduce the relative ligation efficiency of miRNAs, which do not carry 3 0 -Ribose 2 0 O-methylation [77], and sequenced by Edinburgh Genomics (Edinburgh). An equivalent library, without periodate oxidation, was constructed and sequenced by BGI (Hong Kong) These libraries generated between 12 and 20 million small-RNA reads each (SRR1914671, SRR1914716, SRR1914775, SRR1914792). After preliminary analysis, RNA aliquots of all five pools were mixed (mix denoted EIKST) and RNAseq was repeated by BGI (Hong Kong) following either double-stranded nuclease normalization (around 65 million high-quality pairedend 90 nt Illumina reads; SRR1914412), or ribosome depletion using Ribo-Zero (Illumina; around 65 million high-quality paired-end 90 nt Illumina reads; SRR1914445). Finally, individual small-RNA libraries were constructed for each of the five samples E, I, K, S, and T, without periodate oxidation using both NEBnext adaptors (New England Biolabs) (datasets SRR1914946, SRR1914952, SRR1914955, SRR1914957, SRR1914959), and equivalent "High Definition" adaptors (datasets SRR1914958, SRR1914956, SRR1914954, SRR1914948, SRR1914945), which seek to reduce ligation bias by incorporating random tetramers into the ligation adaptor [67]. These ten libraries were sequenced by Edinburgh Genomics (Edinburgh). Any small RNAs produced by synthesis (e.g., nematode 22G small RNAs) carry a 5 0 triphosphate and are unlikely to be ligated by this protocol.
To estimate species-composition of the UK (ST) and non-UK (EIK) pools, RNAseq datasets were quality trimmed using ConDeTri2.0 and mapped using Bowtie2 [101]  To estimate the contribution of each of several potential sources of RNA, reads were also mapped to known Drosophila viruses, the D. melanogaster, D simulans, and D ananassae genomes, Wolbachia from D. melanogaster, D. simulans, D. willistoni, D. suzukii, and D. ananassae, and the bacteria Pseudomonas entomophila, Providencia sneebia, P. rettgeri, P. burhodogranariea, P. alcalifaciens, Gluconobacter morbifer, Enterococcus faecalis, Commensalibacter intestini, Acetobacter pomorum, A. tropicalis, and A. malorum (S1 Table). The percentage of mapping reads was quantified relative to total high-quality reads.

Identification of Virus-Like Contigs by BLAST
All paired-end RNAseq datasets were combined and de novo assembled using three different approaches. First, all reads were assembled using Trinity (r2013-02-25; [50]). Second, data were digitally normalised using the "normalize-by-median-pct.py" script from the Khmer package, and then assembled using Trinity. Finally, this normalised dataset was also assembled using the Oases/Velvet pipeline (version. 0.2.08; [51]). Contigs are provided in S2 Data. Using the longest inferred contig for each putative locus and a minimum contig length of 200 nt, contigs were compared to the Genbank protein reference sequence database using BLASTx [52] with default search parameters and an e-value threshold of 1 × 10 −10 to identify the top 10 hits. Contigs with no BLAST hit at this threshold were translated to identify the longest open reading frame in the contig, and the resulting amino acid sequence compared to Genbank protein reference sequence using BLASTp with default parameters and an e-value threshold of 1 × 10 −5 to identify the top 10 hits. For each contig with BLAST hits, the phylogenetic hierarchy of the best hits was traversed upward to identify the lowest taxonomic classification displaying a 75% majority taxonomic agreement, and this was used to guide subsequent cross-assembly and manual curation using the SeqManPro assembler (DNAstar). The resulting BLAST-candidate virus sequences were then used as targets for (RT-)PCR and Sanger sequencing confirmation (both on discovery samples EIKST, and on individual flies), with primer design guided by assuming synteny with close relatives. Final sequences for phylogenetic analysis and submission to NCBI were generated either by PCR and Sanger sequencing, or from a majority consensus derived by re-mapping RNA-seq reads to the combined de novo/PCR contig (sequences have been submitted to Genbank as KP714070-KP714108 and KP757922-KP757936).

Identification of Virus-Like Contigs by Small-RNA Profile
The Drosophila Dcr2-Ago2 RNAi pathway processes double-stranded RNA derived from viruses and transposable elements into 21 nt siRNAs [7,8,102]. In the case of viruses, Dcr2 may process both replicative intermediates and fold-back structures. However the presence of replicative strand reads (i.e., negative strand reads in positive sense RNA viruses) demonstrates at that replicative intermediates are a major substrate for Dcr2. Thus, the presence of virusderived 21 nt siRNA sequences from both strands is diagnostic of an antiviral response against replicating viruses. The piRNA pathway also generates small RNAs (25-29 nt piRNAs from transposable elements [68]), however, although viral piRNAs are found in Drosophila OSS cell culture expressing Piwi [17], these have not been detected in whole flies. Therefore the presence of large numbers of 21 nt small RNAs, relative to total RNA or to 25-29 nt small RNAs, provides corroborating evidence that virus-like sequences are recognised by the Drosophila immune system.
In addition, we reasoned that this signature of 21 nt siRNAs could also be used to identify viral sequences amongst the substantial fraction of contigs that displayed no BLAST similarity to any sequence in the NCBI reference protein database. To quantify the siRNA profile of the de novo contigs we mapped the 19-29 nt small RNAs from the ST and EIK datasets to all de novo contigs, recording all potential mapping locations. As a proxy for total RNA, we also remapped combined EIKST RNAseq data using the dataset generated by BGI (Hong Kong) with Ribo-Zero rRNA depletion. For each contig this resulted in a count per unit length of the number of 21-23 nt siRNAs, the number of 25-29 nt putative piRNAs, and the number of RNAseq reads. Using the well-studied Drosophila viruses DCV, DAV, Nora virus, and DMelSV as threshold a guide, we then used a high ratio of siRNA:RNAseq reads and a high ratio of siRNA: piRNA reads to corroborate the BLAST-candidate viruses outlined above, and to propose novel siRNA-candidate viruses amongst contigs lacking BLAST hits (Fig 3). Final siRNA-candidate sequences for phylogenetic analysis and submission to Genbank were generated either by RT-PCR and Sanger sequencing, or from a majority consensus derived by re-mapping RNA-seq reads to the each contig (Submitted to Genbank as KP757937-KP757993).

Viral Phylogenetic Analyses
We used translations from the novel viruses to perform a BLASTp similarity search of the Genbank protein reference database to identify relatives and suitably conserved genomic fragments for phylogenetic inference. In addition, we searched the Genbank transcriptome shotgun assemblies (TSA: database tsa_nt) using tBLASTn [52] to identify potential virus sequences (i.e., sequences currently unannotated as viral) in recently generated high-throughput transcriptomes. Protein sequences from known viruses, novel viruses, and TSA candidate viruses were then aligned using T-Coffee "psicoffee" [103], and some poorly aligned regions at the 5 0 and 3 0 ends of the selected fragment were manually identified and trimmed. Phylogenetic trees were inferred using MrBayes [104] under a protein substitution model that allowed model jumping between widely advocated amino acid substitution models, and gamma-distributed rate variation among sites. Two independent MCMC chains were run, sampling every 100th step until the posterior sample of tree topologies reached stationarity (i.e., standard deviation in split frequencies between chains dropping to <0.01) and the effective sample size of all parameters was >300 (after 25% burn-in). Maximum clade-credibility consensus trees were prepared by combining both MrBayes chains using TreeAnnotator from the BEAST package [89]. For most of the novel RNA viruses, the phylogeny was inferred from the RNA polymerase, which tends to be highly conserved, but for Kallithea Virus (a DNA Nudivirus) we selected five loci that have previously been used in Nudivirus phylogenetics [105] and the phylogeny was inferred from their concatenated alignments.

Properties of the siRNA Profile and Novel Viral miRNAs
To quantify the relative number of small RNAs produced from each virus, RNAseq reads and small RNAs from the EIK and ST pools were re-mapped using Bowtie2 [101]. To quantify the length distribution and 5 0 base-composition, the small RNAs from 14 different sequencing runs were mapped to known miRNAs (mirBase [106]), known transposable elements (Flybase [107]), all viral genomes, and other parts of the D. melanogaster reference genome (Flybase). The 14 siRNA datasets comprised: each of E, I, K, S, and T ligated using the NEBnext protocol according to manufacturer's instructions; each ligated using the NEBnext protocol replacing oligos with "High Definition" equivalents that incorporate four additional random bases [67]; and the EIK and ST pools sequenced by Edinburgh Genomics with and without periodate oxidation (BGI) to reduce miRNA ligation efficiency [77]. To identify potential miRNA sequences in the large dsDNA genome of Kallithea Virus, all small-RNAs were combined and mapped to the Kallithea Virus genome using Bowtie2, and novel miRNAs were inferred from read numbers and predicted hairpin-folding using miRDeep2 with default parameters [76].

Identification of Virus-Like Sequences in Public RNAseq and Genomic Datasets
Several viruses are known to be endemic in laboratory stocks and cell cultures of D. melanogaster [14,21], and we expected that many of the novel viruses identified here may also be detectable in those datasets. We therefore searched the European Nucleotide Archive for Illumina, SOLiD, and Roche454 sequencing "run" datasets that derive from the Drosophilidae and subordinate taxa. This resulted in 14,123 DNA sequencing "run" datasets and 9,656 RNAseq and siRNA sequencing "run" datasets (as of 9 May 2015), from each of which we were able to download and map the first 2 million forward-orientation reads to new and previously known RNA viruses using Bowtie2 [101] (Bowtie for SOLiD colour-space reads). To allow for mismapping, an arbitrary threshold of 100 reads per million (rpm) was chosen as a detection limit for viruses to be recorded as present. This threshold suggests at least one virus was present in 33% of runs, but the proportion of positive samples decreased with increasing log(threshold), suggesting that true infection rates may be higher. Across the RNAseq and siRNA datasets the co-occurrence of BLAST-candidate viruses with siRNA-candidate viruses allowed us to tentatively assign some of the siRNA-candidate viruses as components of BLAST-candidate virus genomes (e.g., Bloomfield Virus and Craigie's Hill Virus).
To detect additional novel RNA viruses present in these public datasets, but absent from the literature or from our metagenomic sequencing of wild flies, we further selected 2,188 Illumina RNAseq datasets for viral discovery. The first 20 million read pairs (or as many as were available, if fewer) from each dataset were mapped to the D. melanogaster genome (Bowtie2, local alignment), and unmapped reads digitally normalised (Khmer) and de novo assembled using Trinity [50] as above. Resulting contigs of >3.5 kbp in length were compared to known viral proteins using BLASTx, and the top hits manually curated to create a second group of BLASTcandidate viruses (such sequences cannot be accepted by Genbank, and are instead provided in S6 Data).
We reasoned that the absence of virus-like sequences from D. melanogaster genomes can provide evidence that these sequences do not represent "fossilised" endogenous viral elements (EVEs). However, if only a minority of individuals carry a particular EVE, and if genomes are derived primarily by mapping to a reference, the EVEs may not appear in assembled datasets. We therefore mapped all reads (forward and reverse reads mapped separately) from 779 Drosophila NEXUS genomic read datasets (representing 527 D. melanogaster genomes [61]) to known Drosophila viruses, BLAST-candidate viruses, and siRNA-candidate viruses. Because a small number of reads may mismap to viral sequences, potentially leading to false inference of an EVE, read numbers were quantified relative to target sequence length, and normalised to a 7 kbp single-copy region of the D. melanogaster genome (as a segregating EVE present in a genome should generate reads at a rate approaching that of other loci in that genome).

Population Prevalence
We used RT-PCR to survey wild populations for the presence of four previously published Drosophila viruses (DCV, DAV, DMelSV, and Nora Virus), ten novel BLAST-candidate viruses (Bloomfield Virus, Craigie's Hill Virus, Dansoman Virus, Kallithea Virus, La Jolla Virus, Motts Mill Virus, Newfield Virus, Thika Virus, Torrey Pines Virus, and Twyford Virus), and two novel siRNA-candidate viruses (Chaq Virus and Galbut Virus). In addition, we surveyed the same RNA extractions by RT-PCR for the presence (but not titre) of Wolbachia pipientis, using primers for the Wolbachia surface protein (wsp_F1 GTCCAATARSTGATG ARGAAAC and wsp_R1 CYGCACCAAYAGYRCTRTAAA).
This survey comprised a total of 1,635 D. melanogaster individuals and 658 D. simulans individuals sampled across 17 locations (see "Sample Collections" above), with between 2 and 236 individuals assayed for each species at each location (median n = 63). RNA extractions and random hexamer reverse transcription was performed on single flies (males and females) or small bulks of 2-20 flies (usually 10 males, with species initially assigned based on morphology). Fly species was confirmed using a competitive 3-primer PCR for the single-copy nuclear gene Ago2, which gives a 300 bp product in D. simulans and a 400 bp product in D. melanogaster (Primers: Mel_SimF CCCTAAACCGCAAGGATGGAG, Dsim303R GTCCACTTGTGC GCCACATT, Dmel394R CCTTGCCTTGGGATATTATTAGGTT).
For each of these 16 viruses we designed up to four PCR assays based on conserved positions in the metagenomic RNAseq datasets, and these were tested on the E, I, K, S, and T metagenomic discovery RNA pools and on a selected subset of wild-collected individual flies. Those primer sets that consistently gave repeatable bands in agreement with other assays for the same virus, and which could be sequenced to confirm band identity, were selected for use as assays for the wider geographic sampling (some designed as multiplex assays; details of primers and conditions are given in S2 Text). To allow for the possibility of PCR failure caused by primersite variants not detected in virus discovery or preliminary sequencing, we designated flies as nominally positive for a virus if at least one assay for that virus resulted in an RT-PCR product. Nevertheless, the potential for unknown sequence variants to prevent PCR amplification means that our estimates of viral prevalence could be considered lower bounds.
As only one fly needs to carry a virus for the bulk to test positive for that virus, the proportion of "positive" single-fly and/or small-bulk assays does not directly estimate the frequency of virus carriers. We therefore used a maximum likelihood approach to infer the underlying prevalence (and its log-likelihood confidence intervals). Briefly, we assumed that for each species/ location combination there was a single underlying prevalence (i.e., fraction of flies carrying detectable virus) and that the number of flies testing positive or negative by PCR assay would be binomially distributed given this prevalence. Then, given the number of flies included in each small bulk, and the number of small bulks or single flies testing positive, we searched across the range of possible prevalence values (zero to one) to identify the prevalence which maximised the likelihood of the observed set of positive/negative PCR assays. It is possible that differences in the sex ratio between individual-fly assays (majority female) and small-bulk assays (majority male) could make this approach misleading if prevalence differs between males and females. However, based on likelihood ratio tests (identical prevalence versus small bulks differ from single flies) we were unable to identify any widespread significant differences between the sampling regimes. Specifically 4.4% of tests were nominally "significant" at α = 5% (as expected), and only two were "significant" using the Benjamini-Hochberg approach to limit the false discovery rate to 5% (Galbut Virus in Accra-1, Wolbachia in Plettenberg).

Viral Phylodynamics and Patterns of Sequence Evolution
Eleven of the viruses with high prevalence (DAV, Nora Virus, DMelSV, Craigie's Hill Virus, Dansoman Virus, La Jolla Virus, Motts Mill Virus, Thika Virus, Torrey Pines Virus, Chaq Virus, and Galbut Virus) were selected for further sequence analyses. PCR products amplified during the prevalence survey were supplemented with additional amplicons and sequenced in both directions using BigDye V3.1 (Life Technologies). Reads were assembled into partial or near-complete genomes using SeqManPro (DNAstar) and all variable sites within or between chromatograms were examined by eye. Virus sequences that displayed substantial within-sample (within-fly or small bulk) heterozygosity for any amplicon, and therefore deemed likely to represent mixed infections, were discarded. Sequences were submitted to Genbank under accessions KP969454-KP970120.
Sequence alignments were divided into blocks, minimising missing data within blocks, and these alignment blocks were tested for evidence of recombination using GARD [108] from the HyPhy package under a GTR model of nucleotide substitution with inferred base frequencies and 4-category gamma-distributed rate variation between sites. Those blocks that displayed any evidence of recombination were further subdivided at the inferred recombination break points, to result in multiple alignment blocks for each virus.
Alignment blocks (containing no detectable recombination) were subjected to phylogenetic analysis using BEAST [89] under a strict-clock model with HKY substitution using inferred base frequencies, and 4-category gamma-distributed rate variation between sites. Separate substitution models with relative clock rates were used for 1st, 2nd, and 3rd codon and for noncoding positions, with substitution models linked across alignment blocks. Time calibration information was incorporated using a uniform step prior on tip sample-dates based on the value and precision of sample freezing date, and molecular clock rate was unlinked between alignment blocks. As the limited sampling timespan meant that little rate information was available for most viruses, we chose to implement a strong prior on nucleotide clock-rate, based on published virus estimates [91]. For RNA viruses this was a log-normal distribution with mean 4 × 10 −4 site -1 year -1 and 95% of the prior density between 1 × 10 −4 and 11 × 10 −4 . When greater time-sampling depth was available (DAV and Nora Virus) we ran models using a hierarchical prior for global nucleotide clock rate, in which clock rates for individual segments are drawn from a shared log-normal distribution. To infer rates of host-switching between D. melanogaster and D simulans, and rates of movement among broadly-defined sampling locations (Europe and North Africa, sub-Saharan Africa, North America, Australia, Laboratory Stocks), tips were labelled by location and host species and treated as discrete traits [88,109]. All analyses assumed a standard neutral coalescent in a constant-size population, and other parameters took default values and priors. Two analyses were run for each model for up to 10 8 steps sampling every 50,000 steps, and convergence was inferred for all parameter values both visually from stationary distributions, and from effective sample sizes of >1,000. In one case (Thika virus), time and date parameters mixed poorly, and the analyses were re-run using a hard upper limit to the nucleotide clock rate for one alignment block (2.5 × 10 −3 site -1 yr -1 ). Posterior parameter estimates and maximum clade credibility trees were inferred from the two independent parallel runs combined after discarding 10% of steps of each as burn-in.
Patterns of synonymous (dS) and nonsynonymous (dN) substitution, and evidence for positive selection (dN-dS > 0) were inferred using the FUBAR [96] analysis from the HyPhy package. The in-frame viral coding sequence alignment blocks (as used above in phylogenetic reconstruction) were analysed based on maximum clade-credibility tree topologies inferred using BEAST, with substitution parameters and branch lengths inferred by HyPhy. FUBAR provides a fast approximate approach to infer dN-dS and sites under selection by pre-computing a grid of conditional likelihoods for dS and dN (where the expectation of dS is 1) and reusing the precomputed values to infer the posterior distribution of dN-dS without recalculating likelihoods. The analysis used a 20 × 20 grid with ten independent MCMC chains each providing 1,000 subsamples from the posterior (each 5 × 10 8 steps after 5 × 10 8 burn-in steps). Sites were considered to display evidence of positive selection if more than 95% of the posterior sample for that site supported dN>dS. S2 Data. Metagenomic contigs. Raw metagenomic contigs derived from Trinity [50], Trinity with normalised data, and Oases [51] with normalised data, are provided in fasta format. The majority of contigs are likely to derive from Drosophila and associated microbiota. The contigs are uncurated and are likely to include chimeric assemblies. As such, they are not suitable for submission to Genbank and should be treated with caution.   (17-29nt) reads that map to characterised D. melanogaster transposable elements, Drosophila miRNAs, the remainder of the D. melanogaster genome, Wolbachia and other bacteria, the four "classical" viruses (DAV, DCV, DMelSV, and Nora Virus), the new named viruses, other BLAST-candidate viruses, and siRNA-candidate viruses. Counts exclude unmapped reads, and reads mapping to Drosophila ribosomal sequences. EIKST refers to the metagenomic pools or mixtures thereof, "BGI" and "EG" indicate sequencing was performed by the Beijing Genomics Institute or Edinburgh Genomics (respectively), "HD" indicates the use of "High Definition" ligation adapters for small RNA sequencing, "RM" and "RZ" indicate the use of rRNA depletion by RiboMinus or Ribo-Zero, and "DSN" indicates double-stranded nuclease normalisation. Raw counts data are provided in S1 Table. (TIF) and T, quantified by qRT-PCR relative to the Drosophila ribosomal protein gene RpL32. Note that DCV and Twyford Virus were each detectable in a single pool. Error bars represent 80% credibility intervals for means, based on two or three replicates per sample and assuming the efficiencies inferred from dilution series (S2 Text). Virus presence/absence agrees well with small RNA data (S1 Fig) but qRT-PCR quantification may be unreliable given the high sequence diversity in these virus pools. Raw CT data are given in S1 Data. For mixed pools (blue background) the reads in the lower rows were sequenced by Edinburgh Genomics following an oxidation step (to reduce miRNA representation), and the upper rows by BGI without an oxidation step. For unmixed pools (white background) the reads in the lower rows were sequenced using a "High Definition" ligation protocol (both without oxidation). Note that the relative number of longer (23-27 nt) reads seen in DCV, Nora Virus, Kilifi Virus, and Thika Virus appears to be reduced in the presence of an oxidation step, as is the number of Kallithea miRNA reads. Raw count data for this figure are provided in S1 Data.  (17-29 nt) for those unnamed siRNA-candidate loci that have >80 small RNA reads. Inset is the total number of small RNA reads summed across all libraries. Bars plotted above the x-axis represent reads mapping to the positive strand, and bars below the x-axis represent reads mapping to the negative strand. Bars are coloured according to the proportion of reads with each 5 0 -base (A-green, C-blue, G-yellow, U-red). All peak at 21 nt and show the expected 5 0 -base composition (a slight bias against G), except for siRNA Candidate 14 (KP757950) which shows the 22-23 nt 5 0 U-rich peak seen in Twyford Virus (KP714075). Count data are provided in S1 Data. (TIF) S7 Fig. Relative small RNA production. To quantify differences in the rate at which small RNAs are generated from different viruses, we calculated the relative viRNA production for each of 16 different viruses in the ST and EIK metagenomic sequencing pools. The bars show the ratio between the relative number of 21-23 nt small RNAs mapping to each virus (normalised by number of reads of the abundant Drosophila miRNA miR-34-5p), and the number of virus RNA-seq reads (relative to non-viral RNAseq reads, excluding rRNA reads). Ratios were then normalised to the lowest rate (Kilifi Virus in sample EIK) to give relative rates, such that 10 4 -fold more viRNAs are derived from DMelSV than from Kilifi Virus. Where viruses were present in both the EIK and ST pools, the correlation between the two datasets is high (rank correlation coefficient >0.99), suggesting that this variation is repeatable. Normalised data are provided in S1 Data. Only datasets that have at least one virus present at 100 viral reads per million total reads were included, and only viruses present in at least one dataset at 100 viral reads per million total reads were included. Note that different parts of the segmented viruses generally co-occur within datasets, which allowed us to provisionally associate siRNA-candidate sequences with BLAST-candidate sequences (e.g., Nodaviruses and Bloomfield Virus). The viruses from DAV to Drosophila melanogaster Birnavirus were reported previously, the others are newly described here. Counts are provided in S5 Data. (TIFF) S9 Fig. Virus distribution across widely used Drosophila cell cultures. A single exemplar RNAseq dataset was selected for each of 26 widely-used Drosophila cell culture lines, and all forward reads were mapped to viruses. Twenty-four of the datasets were drawn from ModEncode cell culture sequencing [82], and two that were not available (OSS and Kc167) were taken from datasets SRR070269 and SRR500470. The colour scale illustrates the fraction of reads that were viral in origin (virus reads per million total reads) and viruses are sorted in ascending order of viral read fraction (from ML-DmD32 with <1% viral reads, to ML-DmBG1-c1 with approximately 50% viral reads). Note that the virus population is likely to differ between different subcultures, and these values should only be considered illustrative. Counts are provided in S5 Data. (TIF) Missing bars indicate that the host species was absent from the collection location (for collection details and prevalence, see S4 Table). Panels Inset are rank correlation coefficients, and a simple linear regression (virus~Wolbachia) for illustration. The analysis does not account for any spatial autocorrelation in prevalence, and does not correct for multiple testing. Points are coloured by location (see S10 and S11 Figs for colours) and plotted with 2 log-likelihood intervals. Nominal "significance" at p < 0.05 is shown using asterisks. Values are provided in S4 Table. (TIF) S13 Fig. Evolutionary properties of Drosophila viruses. Each panel shows parameter estimates of viral evolution for the eleven RNA viruses for which sufficient sequence data were available. Panels are (A) mutation rate, (B) date for the most recent common ancestor, (C) the range of movement between geographic regions (defined as continents and laboratory), (D) the rate of switching between D. melanogaster and D. simulans, and (E) the relative rates of synonymous (dS) and nonsynonymous (dN) substitution (dN-dS > 0 implies positive selection). For panels A-D points are the median of the posterior sample and 95% credibility intervals, panel E shows the median and range across codons. For all viruses except DAV and Nora Virus, a strong prior was placed on mutation rate (Panel A, grey box), and mutation rates and MRCA dates were inferred separately for each alignment block. The underlying BEAST XML files (including alignments and model specifications) are provided, along with the resulting mcc tree files and summaries of posterior distributions, in S7 Data and S8 Data. The underlying FUBAR batch files and per-site parameter estimates are provided in S9 Data. (TIF) S14 Fig. dN/dS in DAV and Nora Virus. Point estimates of dN/dS are shown for each codon in open reading frames of DAV and Nora Virus (ratios of the posterior estimates, not the posterior estimates of the ratios). Bar colours illustrate posterior support that the site is constrained (blue for strong support that dN < dS) or positively selected (red for strong support that dN > dS). Positions coloured grey have little support for either positive selection or constraint. The dashed horizontal line indicates neutrality (dN = dS), so that bars higher than this are candidate sites for positive selection. The solid horizontal line shows the mean of the dN/dS estimates for that open reading frame. dN/dS estimates greater than 3 have been truncated to 3 for clarity. FUBAR batch files and parameter estimates are provided in S9 Data. (TIF) S1  Table. Potential binding sites for Kallithea Virus miRNA. Potential miRNA binding sites in D. melanogaster 3 0 UTRs predicted by miRanda, and Gene Ontology enrichment analyses for genes with a miRanda score >150. (XLSX) S4 Table. Survey collection details, locations, and virus prevalence. Collection data (sample sizes, city, country, latitude and longitude, date) and virus prevalence for each species at each location (maximum likelihood estimate with 2 log-likelihood confidence intervals, all rounded to two significant figures). (XLSX) S1 Text. Evidence for a micro-RNA in Kallithea Virus. Output from the software package mirDeep [76] showing the proposed pre-miRNA hairpin, read numbers, and predicted folding pattern and energy. Reads are summed across all small-RNA libraries. (PDF) S2 Text. PCR and qPCR primers and conditions. (DOCX)