A partial genome assembly of the miniature parasitoid wasp, Megaphragma amalphitanum

Body size reduction, also known as miniaturization, is an important evolutionary process that affects a number of physiological and phenotypic traits and helps animals conquer new ecological niches. However, this process is poorly understood at the molecular level. Here, we report genomic and transcriptomic features of arguably the smallest known insect–the parasitoid wasp, Megaphragma amalphitanum (Hymenoptera: Trichogrammatidae). In contrast to expectations, we find that the genome and transcriptome sizes of this parasitoid wasp are comparable to other members of the Chalcidoidea superfamily. Moreover, compared to other chalcid wasps the gene content of M. amalphitanum is remarkably conserved. Intriguingly, we observed significant changes in M. amalphitanum transposable element dynamics over time, in which an initial burst was followed by suppression of activity, possibly due to a recent reinforcement of the genome defense machinery. Overall, while the M. amalphitanum genomic data reveal certain features that may be linked to the unusual biological properties of this organism, miniaturization is not associated with a large decrease in genome complexity.


Introduction
Miniaturization in animals is an evolutionary process that is frequently accompanied by structural simplification and size reduction of organs, tissues and cells [1,2]. The parasitoid wasp Megaphragma amalphitanum (Hymenoptera: Trichogrammatidae, subfamily Oligositinae) is one of the smallest known insects, whose size (250 μm adult length) is comparable with unicellular eukaryotes and even some bacteria (Fig 1). Parasitoids from the genus Megaphragma parasitize greenhouse thrips Heliothrips haemorrhoidalis (Thysanoptera: Thripidae) developing on the shrubs Viburnum tinus (Adoxaceae) and Myrtus communis (Myrtaceae) [3], and possibly a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 Thripidae) collected in Santa Margherita, Northern Italy (44.32, 9.20). Unfortunately, we could collect only a dozen M. amalphitanum individuals because their habitats are difficult to detect (culture in the laboratory is currently impossible), the imago life span is short (5 days), and the animal is extremely small. With several insects we could cleanly recover, we were therefore able to obtain only around 1-5 ng of genomic DNA for the each paired-end DNA and cDNA libraries. DNA was extracted from ten individuals (males and females) using NucleoSpin Tissue XS kit (Macherey-Nagel, Germany) for each DNA-library. Three DNA libraries (DNA-library1 -whole insects; DNA-library2 -thorax and abdomen; DNA-library3head) were constructed using Ovation Ultralow Systems V2 kit (NuGEN, USA). Limited amount of biological material and low quantity of starting material (1-3 ng) did not permit construction of mate-paired libraries. Genome libraries were sequenced using Illumina HiSeq 1500 (Illumina, USA) with 150 bp paired-end reads. RNA was extracted from ten M. amalphitanum individuals (males and females) using the Trizol reagent (Thermo Fisher Scientific, USA) by a standard protocol, and cDNA libraries were constructed using Ovation RNA-Seq System V2 kit (NuGEN, USA) with poly(A) enrichment.
Genome de novo assembly. The output from Illumina sequencing of the genomic DNA library (source format � .fastq) was used for de novo genome assembly. To assemble the genome of M. amalphitanum, we used 102,188,833 paired-end reads. Genome assemblies were constructed using different assembly algorithms, and their performance was compared to each other (S2 Fig). Then, M. amalphitanum reads were mapped to the final partial assembly with 92.3% conformity. Additionally, genomic DNA-libraries from thorax and abdomen (DNA-library2) of M. amalphitanum (SRR5982987) and from head (DNA-library3) of M. amalphitanum (SRR5982986) were prepared. In total, 79,317,970 (paired-end sequencing: 2×100 bp) and 85,409,775 (single-end sequencing: 50 bp) DNA reads were sequenced and used for M. amalphitanum coverage increase and as additional evidence during the search for potentially missing genes (S1 Table). Then, these reads were used for de novo building of the M. amalphitanum genome sequence by the SPAdes assembler (v.3.6.1) [20].
Transcriptome de novo assembly. Illumina RNA sequencing generated a total of 59,790,973 paired-end reads. Transcriptome de novo assembly was conducted using the default k-mer size in the Trinity software package (v. 2.4.0) [21], which combines three assembly algorithms: Inchworm, Chrysalis and Butterfly. Annotation of the M. amalphitanum transcriptome assembly was performed using the Trinotate pipeline [22].
Transposable element (TE) de novo identification and analysis. For de novo TE library construction, we used the REPET package [23] which combines three mutually complementing repeat identification tools (RECON, GROUPER and PILER), yielding a combined repeat library with the average consensus sequence length of 1.66 kb (ranging from 157-14,640 bp). The outputs were subject to additional classification with the RepeatClassifier tool from the RepeatMasker package (www.repeatmasker.org), which was also used to build the corresponding TE landscape divergence plots.

Genome and transcriptome sequencing and assembly of M. amalphitanum
To gain insight into the genomic signatures of miniaturization that would distinguish M. amalphitanum from other Hymenoptera, we performed whole-genome shotgun sequencing of DNA (DNA-library1) isolated from ten adult individuals (males and females), using the Illumina platform (S1 Table). The resulting partial genome assembly (PRJNA344956) has a cumulative length of 346 megabases (Mb), with a scaffold N50 of 10,296 bp. The total genome coverage is 88.6-fold. Thus, the genome of M. amalphitanum is comparable in size with other Chalcidoidea wasps, such as Copidosoma floridanum, T. pretiosum or Nasonia vitripennis [24,25]. The best-performing combination of assembly software yielded contig N50 of 4,285 bp and allowed us to assemble 94,687 scaffolds from the low amounts of starting DNA material (Table 1; S2 Table; S1 Fig).
The M. amalphitanum genome assemblies were evaluated with the BUSCO v3 (benchmarking universal single-copy orthologs) Hymenoptera gene set [26], which uses 4,415 near-universal single-copy orthologs to assess the relative completeness of genome assemblies. Through this analysis, 7.55% of the conserved genes were initially identified in the M. amalphitanum assembly as putatively missing (S3 Table). More detailed information on our extensive search for the missing genes in M. amalphitanum genome is presented below.
We also performed whole-body transcriptome analysis using RNA extracted from ten M. amalphitanum individuals (males and females). Transcriptome de novo assembly (PRJNA34 4956) was performed using the Trinity software [21]. A total of 46,841 contigs were assembled with a mean length of 586 bp and an N50 of 633 bp from the low amounts of starting RNA material (S4 Table). The Illumina paired-end RNA-Seq data from M. amalphitanum were mapped to the previously assembled genome using Bowtie2 [27]. Inspection of the alignments revealed that 79.95% of reads could be mapped to the genome. The BUSCO v3 statistics for the transcriptome assembly is also presented in Table 1; S3 Table. The BUSCO analysis shows the low completeness of the present partial genome and transcriptome assemblies, with 28-29% of BUSCO genes listed as fragmented. This may be caused by inability to use mate-paired DNA-libraries or single-molecule sequencing (because of low amount of starting DNA material); possible high heterozygosity and/or significant structural variation between different parasitoid wasp individuals that were used for genome and transcriptome assemblies; BUSCO database incompleteness; and other factors. An additional factor in poor transcriptome completeness could be a high number of short and chimeric isotigs: while nearly 80% of transcriptome reads map to the genome, only 24% of assembled contigs are represented in the complete BUSCO set.

Gene ontology analysis
We used Gene Ontology (GO) analysis terms to describe characteristics of M. amalphitanum gene products in three independent categories: biological processes (S2 Fig Table 2). All M. amalphitanum transcripts were matched to the Clusters of Orthologous Groups (COG) database to predict and classify their functions. In total, 8,810 genes were assigned to 25 COG functional categories. One of the largest groups is represented by the cluster for posttranslational modification, protein turnover, and chaperones (988 counts; 10.7%), followed by intracellular trafficking, secretion, and vesicular transport (659 counts; 7.2%), DNA replication, recombination and repair (606 counts; 6.6%), signal transduction mechanisms (599 counts, 6.5%) and transcription (587; 6.4%) (S5 Fig).
To better understand incorporation of genes into diverse pathways, all annotated transcripts were mapped against the KEGG database for pathway-based analysis. As a result, 6,130 transcripts out of a total of 46,841 were assigned to a KEGG pathway, and were present in 328 different KEGG pathways. The KEGG pathway distribution is summarized in S6 Fig Table. The Trinotate statistics for annotation of M. amalphitanum, C. solmsi, D. alloeum, F. arisanus, C. vestalis and T. pretiosum transcriptome assemblies is presented in S6 Table. Potentially missing genes and missing or rapidly evolving gene clusters in the M. amalphitanum genome Given the incomplete nature of the M. amalphitanum genome assembly (BUSCO coverage of 80%, Table 1), we could perform only a preliminary assessment of potentially missing genes  [28]. The core gene set of all the hymenopteran species was composed of 6,278 gene clusters, 122 gene clusters were unique to the chalcid clade. 262 gene clusters were not detected in any of the chalcids analyzed (Supplementary Dataset 2; NCBI BioProject: PRJNA344956), but found in all the other hymenopterans, consistent with a similar recent analysis [29]. Our findings suggest that that the loss of these genes apparently occurred in the last common ancestor of chalcids, or point to the possibility of parallel genome evolution across these species. Interestingly, the missing/rapidly evolving genes include homologs of genes that have important roles in embryonic patterning and development in other insects (e.g., krueppel-1, knirps or short gastrulation [29]).
To determine whether miniaturization in M. amalphitanum is associated with significant gene loss that could be detected even in a partial genome assembly, we used genomic data of six larger hymenopteran species (T. pretiosum, C. vestalis, C. floridanum, F. arisanus, N. vitripennis, and N. giraulti), as well as the well-annotated genome of the honeybee (A. mellifera) as reference (body sizes are presented in S5 Table). We mapped the M. amalphitanum (DNA-library1), T. pretiosum, C. vestalis, C. floridanum, F. arisanus, N. vitripennis, N. giraulti DNA reads on the A. mellifera genome sequence (PRJNA13343, PRJNA10625) (S10 Fig), and detected 115 genes that were not represented by M. amalphitanum sequencing reads but were present in other parasitoid wasps. We then increased the coverage of the M. amalphitanum genome to 146.8-fold by adding the reads from additional libraries (DNA-library2 and DNA-library3) (S1 Table) and observed the apparent absence of 114 of the 115 genes. An additional TBLASTX search identified 36 of these genes as present, yielding a total of 78 putatively missing genes (S7 Table). However, querying the M. amalphitanum genome with the corresponding amino acid sequences from the closest wasp ortholog (N. vitripennis or T. pretiosum) in TBLASTN searches reduced the number of putatively missing genes to just five: centrosomin, phosphoglycerate mutase 5, phosphoglycerate mutase 5-2, 26S proteasome complex subunit DSS1, and mucin-1/nucleoporin NSP1-like. We detected short M. amalphitanum genome sequences encoding protein fragments (~8-23 amino acid residues) with some similarity to four of them, suggesting that they may be in the process of degeneration in this species. Despite a thorough search, we were unable to find any homologous sequence related to centrosomin (cnn) gene either in the partially assembled genome or in our cDNA libraries. Although cnn is regarded as rapidly evolving [30], sequence homology can be readily discerned and orthologs are present in every other insect, including the parasitoid T. pretiosum, suggesting that this gene is specifically absent in M. amalphitanum. In Drosophila melanogaster, Cnn has important roles at the centrosome in mitotic spindle formation, cytoskeleton organization and neuronal morphogenesis [31,32], although these functions may not be indispensable because this species (and possibly other insects) possesses centrosome-independent mechanisms for spindle nucleation [33]. A fungal homolog of Cnn is involved in nuclear migration [34][35][36]. Since the presented genome assembly has only partial BUSCO coverage, the absence of cnn remains tentative. Globally, however, the analysis of the available genome assembly argues for relatively little gene loss in M. amalphitanum. Confident identification of true gene losses in this species will require additional DNA sequencing and improved genome assembly.

Chemosensory genes in the M. amalphitanum genome
Chemosensory receptors are encoded by some of the largest gene families in insect genomes, reflecting their important and wide-ranging roles in detection of environmental odors and tastants. We asked how these gene families have evolved in M. amalphitanum, whose central and peripheral nervous systems are highly reduced [2,14]. The highly divergent sequences of chemosensory receptors and relatively short genomic contig lengths available for M. amalphitanum precluded accurate annotation of full-length sequences in this species for the majority of loci. Nevertheless, comparison with chemosensory receptor repertoires of other insects allowed us to define probable orthologous relationships with receptors of known function in other species and obtain initial estimates of the size of each family.
The most deeply conserved family of chemosensory receptors in insects are the Ionotropic Receptors (IRs), which are distantly related to ionotropic glutamate receptors [37,38]. IRs function in heteromeric protein complexes comprising more broadly-expressed co-receptors with selectively expressed "tuning" IRs that determines sensory specificity. We identified orthologs of each of the co-receptors (Ir8a, Ir25a (two paralogs), Ir93a and Ir76b), as well as four genes encoding tuning IRs related to acid-sensing receptors in other species. We also identified orthologs of IR68a, which functions in hygrosensation [39] and IR21a, which functions in cool temperature-sensing [40,41]. Overall, the repertoire of IRs in M. amalphitanum is therefore very similar in size and content to that of N. vitripennis [38].
Insects possess a second superfamily of chemosensory ion channels-distinguished by a heptahelical protein structure-comprising Odorant Receptor (OR) and Gustatory Receptor (GR) subfamilies, which generally function in detection of volatile and non-volatile stimuli, respectively [42][43][44][45]. Similar to IRs, ORs function in heteromeric complexes of a conserved co-receptor (ORCO) and a tuning OR. We identified an M. amalphitanum ortholog of Orco and 83 additional Or-related sequences. We caution that many of these Or sequences are small fragments (often located near the end of the assembled contigs), so it is currently difficult to determine whether these are intact genes or pseudogenes. Within the GR repertoire, we identified genes encoding proteins related to GR43a, a sensor of both external and internal fructose [46], two others similar to other insect sugar-sensing GRs [47], and 25 additional Gr gene fragments. The sizes of these repertoires are smaller than in N. vitripennis (300 Ors (including 76 pseudogenes) and 58 Grs (including 11 pseudogenes) [48]), but similar to non-miniaturized parasitoid wasps Meteorus pulchricornis and Macrocentrus cingulum [49,50]. However, precise comparison with the latter two species is difficult, as receptors in these wasps were identified from antennal transcriptomes, thereby representing only one of these insects' chemosensory organs.
In sum, these analyses reveal that despite drastic nervous system reduction, M. amalphitanum has retained the conserved chemosensory receptors of larger wasps (and other insects), and appears to have numerous additional order-or species-specific receptors to allow detection of environmental chemical cues.

Venom components in the M. amalphitanum transcriptomic data
Parasitoid wasps often use venom to modify the metabolism of their hosts; toxins and their known or presumed biological functions are described in various species [51]. We investigated the presence of homologs of N. vitripennis toxin constituents in M. amalphitanum and other parasitoid wasps (Megastigmus spermotrophus, N. vitripennis, C. solmsi, T. pretiosum), using previously published venom data [52,53] and the transcriptomes of chalcid wasps (S5 Table). We identified 28 transcripts encoding putative venom proteins (Fig 2; S8 Table); homologs of these are found in all investigated Chalcidoidea species (Table 3). Assuming that most of these candidates are truly conserved venom proteins among Chalcidoids, M. amalphitanum venom diversity does not seem to have been significantly affected by size reduction.

M. amalphitanum transposable elements and genome defense
Transposable elements (TEs) constitute a measurable fraction of virtually all eukaryotic genomes, and can play important roles in their function and evolution. In insects, TE activity has been implicated in evolution of eusociality, based on comparison of ten bee genomes with increasing degrees of social complexity [56]. We performed de novo TE identification and comparative analysis of TE dynamics in M. amalphitanum and in a representative set of larger wasp genomes for which TE content has previously been reported: the parasitoid N. vitripennis and two primitively eusocial aculeate wasps Polistes canadensis and Polistes dominula [12,25,57]. Additionally, we analyzed TEs in the genomes of parasitoid wasps T. pretiosum from the family Trichogrammatidae and D. alloeum from the family Braconidae.
For uniformity of measurements, we applied the same workflow to all genomes, without relying on pre-existing repeat libraries. We employed the REPET package for de novo TE identification (also used in [56]), and RepeatMasker for repeat classification and construction of TE landscape divergence plots. Comparison of the overall repeat content across six wasp species did not reveal substantial differences between four species (18.5% in M. amalphitanum vs. 18.1%, 17.7% and 14.2% in P. canadensis, P. dominula and T. pretiosum, respectively). The N. vitripennis genome was 32.5% repetitive, in close agreement with the published estimate [25], and D. alloeum was highly repetitive at 52.8% (pie charts in Fig 3; S11 Fig). TE dynamics over time, which is shown on the corresponding TE landscape divergence plots, was found to differ substantially for M. amalphitanum, which displayed a pronounced decline in recent TE activity after an initial increase, a pattern that is rarely observed in other hymenopterans [58,59] (Fig 3).
While TE dynamics may be affected by different factors, the observed drop in active TE content in M. amalphitanum may be relevant to the unique biology of this highly miniaturized insect. Its closest relative, T. pretiosum, is about 2-fold larger in body length. Wolbachia infection, which typically results in T. pretiosum parthenogenesis, can afterwards indirectly affect TE mobility in the host as a consequence of asexual reproduction, resulting in proliferation of specific TE families [58,60,61]. Other wasps do not display notable drops or spikes in current TE activity; TE inactivation was reported in two asexual mites [58], however it appears to be ancient and may have occurred prior to the abandonment of sex. Overall, the continued decline in M. amalphitanum TE activity over the span of several million years-not observed in T. pretiosum which shares the most recent common ancestor with M. amalphitanum-represents a rather unusual genomic feature compared to other hymenopteran we examined, including ants (not shown). We note, however, a recent comprehensive study [59] described two hymenopterans with a similar decline in recent TE activity (see below). No traces of Wolbachia infection or other representatives of the Rickettsiaceae family were found in M. amalphitanum individuals [62], while the sequenced T. pretiosum carries the Wolbachia symbiont [63]; the sequenced Nasonia strain was maintained on antibiotics to cure it of infection.
To gain insights into possible reasons for reduction in TE activity after the initial burst, we investigated the major components of the genome defense machinery in M. amalphitanum, including Dicer (Dcr)-like and Argonaute (Ago)/Piwi-like protein-coding genes. In insects, Ago-1 and Dcr-1 homologs represent the key components of the miRNA pathway; Ago-2 and Dcr-2 mediate antiviral RNA interference; and Piwi and Ago-3/Aub suppress TE activity in the germline [64]. Both M. amalphitanum and T. pretiosum possess equal numbers of Dcr-1 and Dcr-2 homologs, as well as Ago-2 and Ago-3 homologs (S12 Fig). However, in M. amalphitanum, the Ago-1 and the Piwi/Aub homologs underwent a relatively recent duplication in comparison to T. pretiosum (Fig 4). This may indicate additional layers of enforcement in the miRNA and piRNA pathways of M. amalphitanum, both of which should result in suppression of TE activity. Indeed, after inspecting the genomes of two other sequenced hymenopteran species showing recent declines in TE activity (Leptopilina clavipes and Solenopsis invicta; [58,59]), we found that they also display relatively recent duplications of Piwi-like proteins (Fig 4). The drop in TE activity is also evident from the transcriptome analysis. The GO radar plot (S7 Fig) shows a substantial number of short contigs related to DNA integration, most of which upon inspection were found to represent separate fragments of gypsy-like and copialike LTR retrotransposons, and a few belong to Polinton, P and Ginger DNA TEs. Transcriptionally active copies fall into two groups: first, those which apparently proliferated during the burst of TE activity and have since accumulated debilitating mutations making them incapable of transposition, but still retain a certain level of transcriptional activity; second, those that originate from recent infections by retrovirus-like TEs and contain uninterrupted ORFs, but are not actively proliferating and are present at very few genomic loci. Comparison of BLASTN hits for M. amalphitanum integrase-related TE transcripts showed that high-copy hits represent MITEs (S13 Fig). We hypothesize that actively proliferating TE copies represent recent arrivals, possibly brought about by viruses or host-parasite interactions [65].

Concluding remarks
Our study provides a first view of the genomic content of one of the smallest insects currently known, the parasitoid wasp M. amalphitanum. In contrast to the expectation that the small body size, in combination with the parasitic lifestyle, should lead to significant reduction in the amount of genomic DNA and in gene content, we do not observe a drastic reduction in the overall genome size or in the number of expressed genes in comparison with larger parasitic wasps. However, the multiple experimental constraints described above limit the quality of genome and transcriptome assemblies. In the future, improved genomic studies in this species (and other Hymenoptera) will be essential to confidently assess specific genetic adaptations that may be linked with body miniaturization.
Interestingly, transposable element dynamics over time were found to differ substantially between the analyzed wasp species, with M. amalphitanum displaying a relatively recent decline in TE activity preceded by a burst, a pattern not observed in most other parasitoid wasps. The decline in TE activity may have been associated with evolution of additional Ago and Piwi copies, not present in T. pretiosum, which could have reinforced the genome defense machinery to prevent uncontrolled TE expansion. This hypothesis is strengthened by identifying duplications of Piwi-like proteins accompanied by a decline in TE activity over time in two additional species of Hymenoptera; by contrast, most other hymenopterans show no such decline. The relationship between body size and genome size has been discussed for a long time. Significant correlations of these values have been described for flatworms and copepods [16]; by contrast, such correlations were not found in ants [66]. Our results show that body size reduction in hymenopterans is not accompanied by greatly decreased transcriptomic and genomic complexity. This observation begs the question of how miniaturization is encoded genetically. We hypothesize that changes in regulatory sequences, rather than gene content, were important in the process of body size reduction, similar to mechanisms of morphological evolution that have driven adaptive diversification in all animals, great or small [67].
Supporting information S1 Fig. M. amalphitanum genome assembly statistics using ABySS, SPAdes, and Velvet software. K-mer sizes were matched for ABySS, SPAdes and Velvet. Note: CLC Genomics Workbench does not use k-mer size; CLC assembly was performed with default settings, and the statistics are given in S2 Table. (PNG)   C. solmsi, M. spermotrophus, T. pretiosum, N. vitripennis. (DOCX) and analyzed. This work was carried out using high-performance computing resources of federal center for collective usage at NRC "Kurchatov Institute", http://computing.kiae.ru/. Support of this project was provided by the Russian Scientific Foundation (RSF) grant #14-24-00175 and RSF grant #14-50-00060. B.L. was supported by the Rosenthal Brown-MBL internship and the REU supplement to NSF MCB-1121334 to I.A. R.B. was supported by a European Research Council Consolidator Grant (615094).