Insect Innate Immunity Database (IIID): An Annotation Tool for Identifying Immune Genes in Insect Genomes

The innate immune system is an ancient component of host defense. Since innate immunity pathways are well conserved throughout many eukaryotes, immune genes in model animals can be used to putatively identify homologous genes in newly sequenced genomes of non-model organisms. With the initiation of the “i5k” project, which aims to sequence 5,000 insect genomes by 2016, many novel insect genomes will soon become publicly available, yet few annotation resources are currently available for insects. Thus, we developed an online tool called the Insect Innate Immunity Database (IIID) to provide an open access resource for insect immunity and comparative biology research (http://www.vanderbilt.edu/IIID). The database provides users with simple exploratory tools to search the immune repertoires of five insect models (including Nasonia), spanning three orders, for specific immunity genes or genes within a particular immunity pathway. As a proof of principle, we used an initial database with only four insect models to annotate potential immune genes in the parasitoid wasp genus Nasonia. The results identify 306 putative immune genes in the genomes of N. vitripennis and its two sister species N. giraulti and N. longicornis. Of these genes, 146 were not found in previous annotations of Nasonia immunity genes. Combining these newly identified immune genes with those in previous annotations, Nasonia possesses 489 putative immunity genes, the largest immune repertoire found in insects to date. While these computational predictions need to be complemented with functional studies, the IIID can help initiate and augment annotations of the immune system in the plethora of insect genomes that will soon become available.


Introduction
The innate immune system arose early in the evolution of multicellular life, while the adaptive immune system evolved in the ancestor of the vertebrate lineage [1]. Thus, in insects and other invertebrates, the innate immune system not only combats foreign invaders, but is also employed in wound healing, stress responses, and the management of microbial symbiont populations [2]. The versatility of the insect innate immune response is in part reflected in the ability of insects to colonize diverse ecological niches across the planet while defending against the pathogens that inhabit those niches [3]. Indeed, immunity genes in general evolve at a faster rate than the genome as a whole [4], which is in part explained by the persistent selective pressures posed by a flux of new pathogens.
With the advent and growth of next-generation sequencing technology, rapid genome sequencing of non-model organisms is now feasible. The “i5k” initiative, launched in 2011, aims to sequence 5,000 insect genomes by 2016 [5], generating vast amounts of data for comparative studies among insects. Annotation of immunity genes in these novel insect genomes will not only provide valuable insight into the diverse mechanisms insects employ for defense, but may also contribute to the development of new insecticides for the control of agricultural pests. To facilitate the annotation of immunity genes in insects, including our own model system of Nasonia parasitoid wasps, we have generated an open-access database called the Insect Innate Immunity Database (IIID, http://www.vanderbilt.edu/IIID) to serve as a starting point for researchers interested in using comparative biology to identify potential immune genes in insects. The database contains the immune repertoires of five insect models (including Nasonia) that span several orders, and each gene is categorized based on the pathway it participates in and the role it plays in that pathway. The intuitive web interface allows researchers to search for specific immunity genes by name, retrieve all immunity genes in the database for a particular species, pathway or class, and find putative homologs for a gene of interest using an internal BLAST tool.
The jewel wasp Nasonia is a genus of haplodiploid parasitoid wasps (Order: Hymenoptera) composed of four closely related species: N. vitripennis, N. giraulti, N. longicornis, and N. oneida. Nasonia is a model system for studying the genetics of interspecific differences, including host-microbe interactions [6][7][8], development [9][10][11], and behavior [12][13][14][15]. Recently, the genomes of the first three species were sequenced [16]. An initial characterization of immune genes in N. vitripennis was conducted as part of the Nasonia genome project [16] using two sets of Hidden Markov Models (HMMs). The first set of HMMs was generated based on alignments of select immune-related protein families from Aedes aegypti, Anopheles gambiae and Drosophila melanogaster [17], and the second set was compiled using A. aegypti immune genes as seeds to find orthologous genes from five vertebrate and five insect species [16]. Scanning the N. vitripennis gene set with these HMMs produced a total of 270 putative immunity genes (http://cegg.unige.ch/nasonia_genome). This number is likely an underestimate, given that not all immune genes from the three Dipteran species above were used to generate the first set of HMMs. The second set of HMMs expanded the number of species incorporated in the models, but only for those immune genes present in A. aegypti. Furthermore, only the N. vitripennis genome was examined; no study has attempted to identify immune genes in the sequenced sister species, N. giraulti and N. longicornis. Using the genes within the IIID to perform homology searches against the Nasonia genomes, we independently describe 306 putative immune genes in each of the Nasonia species, of which 146 genes were not found in previous annotations of N. vitripennis [16].

Initial Construction of the IIID
To facilitate the annotation of innate immunity genes in insects, we initially created an Insect Innate Immunity Database (IIID) composed of the published immune repertoires of four insect models spanning three orders: Drosophila melanogaster (Diptera) [18,19], Anopheles gambiae (Diptera) [16,20], Apis mellifera (Hymenoptera) [17,21], and Acyrthosiphon pisum (Hemiptera) [22]. Our criteria for inclusion were that the species have a complete, publicly available genome sequence, that its innate immune genes have been previously identified in computational or molecular studies, and that an extensive review of its global immune pathways is available as a resource. Sequence information was obtained through NCBI for the 105 immunity genes described for Acyrthosiphon pisum [22], 317 genes for Anopheles gambiae [20,23], 379 genes for Drosophila melanogaster [18,19], and 174 genes for Apis mellifera [17,21]. In total, 975 genes were included in the dataset used to analyze the Nasonia genomes. Each gene was categorized into its primary, secondary and tertiary pathways of putative function (e.g., Toll pathway, IMD pathway, humoral response, JAK/STAT, and cell cycle regulation) and into discrete functional classes based upon its putative role in an immune response. These classes include recognition (identifying potential pathogens and stressors), signaling (communicating between recognition and response), and response (molecules that interact with the pathogen or stressor).
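The categorization described above can be sketched as a simple record type. This is an illustration only, not the actual IIID schema, which the text does not describe: the class name `ImmuneGene`, its field names, the helper `genes_in_pathway`, and the example entry are all hypothetical. Only the per-species gene counts and the three functional classes are taken from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Per-species gene counts as reported in the text.
GENES_PER_SPECIES = {
    "Acyrthosiphon pisum": 105,
    "Anopheles gambiae": 317,
    "Drosophila melanogaster": 379,
    "Apis mellifera": 174,
}

# Functional classes named in the text.
FUNCTION_CLASSES = ("recognition", "signaling", "response")


@dataclass(frozen=True)
class ImmuneGene:
    """Hypothetical record layout; the real IIID schema is not given in the text."""
    name: str
    species: str
    primary_pathway: str
    function_class: str
    secondary_pathway: Optional[str] = None
    tertiary_pathway: Optional[str] = None

    def __post_init__(self):
        # Reject classes outside the three described in the text.
        if self.function_class not in FUNCTION_CLASSES:
            raise ValueError(f"unknown class: {self.function_class}")


def genes_in_pathway(genes, pathway):
    """Return genes whose primary, secondary or tertiary pathway matches."""
    return [
        g for g in genes
        if pathway in (g.primary_pathway, g.secondary_pathway, g.tertiary_pathway)
    ]


# Sum of the four repertoires: 105 + 317 + 379 + 174.
total_genes = sum(GENES_PER_SPECIES.values())

# Illustrative entry only; this annotation is an assumption, not an IIID record.
example = ImmuneGene("PGRP-SA", "Drosophila melanogaster", "Toll pathway", "recognition")
```

Storing up to three pathway labels per gene mirrors the primary, secondary and tertiary assignment in the text, and a pathway query then only has to check all three fields.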

Comparative Analysis of N. vitripennis Immunity Genes
To validate the utility of this database, we used a sequence similarity BLASTx approach to mine for putative homologs of the 975 protein sequences in the IIID within the N. vitripennis transcriptome (OGS v1.2). A total of 18,941 unique transcripts were obtained from NasoniaBase (http://hymenopteragenome.org/nasonia/). For the BLASTx analyses, we used the BLOSUM62 matrix with a word size of 3 and a gap cost of 11, −1. The results were filtered to contain only hits with an E-value <1e-10 and a bit score ≥30. A total of 1,206 N. vitripennis transcripts were similar to entries in the IIID (Table S1). To eliminate redundancies in the dataset, a reciprocal BLASTx analysis for each of the 1,206 Nasonia transcripts was conducted against each of the four insect immunity gene datasets. This analysis resulted in 306 unique immune gene identifiers in Nasonia vitripennis (Table S2).
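The filtering and redundancy-reduction steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes the hits are available in the standard 12-column tabular BLAST format (E-value in column 11, bit score in column 12), and it uses a reciprocal-best-hit rule as one common way to collapse redundant matches, since the text does not specify the exact reciprocal procedure. All function names and the toy identifiers are hypothetical.

```python
import io

MAX_EVALUE = 1e-10   # threshold from the text: E-value below 1e-10
MIN_BITSCORE = 30.0  # threshold from the text: bit score of at least 30


def parse_hits(handle):
    """Yield (query, subject, evalue, bitscore) from 12-column tabular BLAST output."""
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        yield cols[0], cols[1], float(cols[10]), float(cols[11])


def filter_hits(hits):
    """Keep only hits that pass both thresholds."""
    return [h for h in hits if h[2] < MAX_EVALUE and h[3] >= MIN_BITSCORE]


def best_hit_per_query(hits):
    """Map each query to its highest-scoring subject."""
    best = {}
    for query, subject, _evalue, bitscore in hits:
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {q: s for q, (s, _) in best.items()}


def reciprocal_pairs(forward, reverse):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


def row(query, subject, evalue, bits):
    """Build one tabular line; unused columns get placeholder values."""
    return "\t".join([query, subject, "50.0", "100", "0", "0",
                      "1", "100", "1", "100", str(evalue), str(bits)])


# Toy data: transcripts (tx*) searched against database genes, and the reverse search.
forward_text = "\n".join([
    row("tx1", "geneA", 1e-50, 200),
    row("tx1", "geneB", 1e-20, 90),
    row("tx2", "geneA", 1e-30, 120),
    row("tx3", "geneC", 1e-5, 45),   # fails the E-value threshold
])
reverse_text = "\n".join([
    row("geneA", "tx1", 1e-50, 200),
    row("geneA", "tx2", 1e-30, 120),
    row("geneB", "tx1", 1e-20, 90),
])

forward = best_hit_per_query(filter_hits(parse_hits(io.StringIO(forward_text))))
reverse = best_hit_per_query(filter_hits(parse_hits(io.StringIO(reverse_text))))
pairs = reciprocal_pairs(forward, reverse)
```

In the toy data, two transcripts match the same database gene, but only the pair that is each other's best hit survives, which is how a reciprocal step removes redundant entries.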

Analysis of N. giraulti and N. longicornis Immunity Genes
Since the immune genes in the sister species N. giraulti and N. longicornis had not yet been evaluated, we conducted independent BLASTn analyses of the 489 N. vitripennis immunity genes (IIID predictions and previously annotated immune genes) against the N. longicornis (NCBI assembly name Nlon_1.0) and N. giraulti (NCBI assembly name Ngir_1.0) scaffolds [16]. The parameters for the BLASTn search were as follows: E-value <1e-10, word size 11, low-complexity filter, and a gap cost of 5, −2. For each species, the best hits for the 489 genes were manually assessed for E-value and bit score, as described above, and nucleotide sequences were compiled for each gene in N. giraulti (Table S3) and N. longicornis (Table S4).
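For readers who wish to run a comparable search, the parameters listed above could be expressed as a command line roughly as follows. This is a sketch under several assumptions: the text does not state which BLAST implementation or version was used, so the NCBI BLAST+ `blastn` program is assumed here; the gap cost is read as gap-open 5 and gap-extend 2; and the file and database names are hypothetical. The code only assembles and prints the commands and does not execute them.

```python
import shlex


def build_blastn_command(query_fasta, database, out_path):
    """Assemble a BLAST+ blastn call mirroring the parameters listed in the text."""
    return [
        "blastn",
        "-task", "blastn",      # assumption: classic blastn, whose default word size is 11
        "-query", query_fasta,
        "-db", database,
        "-evalue", "1e-10",     # E-value cutoff from the text
        "-word_size", "11",     # word size from the text
        "-dust", "yes",         # low-complexity filter
        "-gapopen", "5",        # assumption: gap cost read as open 5, extend 2
        "-gapextend", "2",
        "-outfmt", "6",         # tabular output, convenient for best-hit parsing
        "-out", out_path,
    ]


# Hypothetical file names; one search per sister-species assembly.
commands = {
    assembly: build_blastn_command(
        "nvit_immune_genes.fasta",
        f"{assembly}_scaffolds",
        f"{assembly}_hits.tsv",
    )
    for assembly in ("Ngir_1.0", "Nlon_1.0")
}

for cmd in commands.values():
    print(shlex.join(cmd))
```

Keeping each command as a list of arguments rather than one string makes it straightforward to pass to `subprocess.run` later without shell-quoting problems.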

Results
The initial IIID was compiled using the immune repertoires of D. melanogaster, A. gambiae, A. pisum, and A. mellifera for a combined total of 975 genes. Using this dataset to perform homology searches against the N. vitripennis transcriptome, we identified 306 putative immune genes. Of these genes, 138 were previously reported as immune genes in the Nasonia genome (Nvit_1.2) paper, which identified a total of 270 putative immune genes using HMMs for protein domains common in immunity gene families [16]. We also manually searched the N. vitripennis official gene set (v1.2) and the Nasonia literature [24][25][26] for genes with annotations similar to those of conserved immunity genes in other insect species. In total, our manual search found 66 genes that were not reported in Werren et al. [16]. Importantly, 146 of the 306 genes identified using the IIID were not previously described in any of the Nasonia literature. Furthermore, using the IIID, we were able to assign names to 28 genes that were not previously annotated in the N. vitripennis gene set (Nvit_1.2). Conversely, a total of 183 immune genes identified previously in the Nasonia literature are absent from the IIID analyses of the N. vitripennis genome (see Discussion).
Combining the immune genes identified using the IIID with the additional genes described in the literature, N. vitripennis possesses a total of 489 putative immunity genes (Table S2). This is the largest predicted immune repertoire found in insects to date. None of the genes found in N. vitripennis were missing in either N. giraulti (Table S3) or N. longicornis (Table S4).
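The way the two sources combine can be sketched with set operations. The function name `combine_repertoires` and the toy gene identifiers are hypothetical; only the counts of 306, 183 and 489 come from the text.

```python
def combine_repertoires(iiid_genes, literature_genes):
    """Split two gene-identifier collections into shared, unique and combined sets."""
    iiid_genes, literature_genes = set(iiid_genes), set(literature_genes)
    return {
        "shared": iiid_genes & literature_genes,
        "iiid_only": iiid_genes - literature_genes,
        "literature_only": literature_genes - iiid_genes,
        "combined": iiid_genes | literature_genes,
    }


# Counts reported in the text.
IIID_TOTAL = 306          # genes identified with the IIID
LITERATURE_ONLY = 183     # literature genes absent from the IIID analyses
COMBINED_TOTAL = IIID_TOTAL + LITERATURE_ONLY  # 306 + 183

# Toy identifiers to show the bookkeeping on a small scale.
toy = combine_repertoires({"g1", "g2", "g3"}, {"g2", "g3", "g4", "g5"})
```

Because the 183 literature genes are by definition absent from the IIID set, the two groups are disjoint and their counts simply add to the combined repertoire of 489.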

Discussion
Using the IIID, we increased the putative Nasonia immune repertoire by 58% in comparison to the number of immune genes originally published with the Nasonia genomes [16], while recovering only 46% of the immune genes originally published. The missing genes are of interest. It is important to note that the Nasonia immune gene set in the genome paper [16] was identified using Hidden Markov Models (HMMs) that search for genes with protein domains common in immunity genes. One problem with this approach is that not all members of a gene family with an immunity-related protein domain necessarily have a biological role in innate immunity, because the domain can also function in other processes. Thus, using only HMMs to find immune genes increases the likelihood of false positives for any protein family in which only a subset of members is involved in immune pathways. For example, sixty-four of the innate immunity genes in the original Nasonia genome annotation are not found in our annotation using the IIID; these genes are classified as serine proteases. Several serine proteases play important roles in insect innate immune pathways, specifically the Toll pathway and the prophenoloxidase signaling cascade leading to melanization [17][27][28][29][30][31][32]. However, the serine protease family is highly diverse, and most of its members function in other aspects of insect physiology [33][34][35][36]. An HMM that identifies conserved serine protease domains may simply find any serine protease, regardless of its biological function or relevance to insect immunity. Using the IIID for sequence similarity searches partially avoids this source of error because each search uses an entire gene that has been identified as part of the innate immune system in another insect species, not just a protein domain. For example, the IIID predictions identified only 38 serine proteases, while the HMMs found 97. Nevertheless, further experimental approaches are needed to determine whether the genes that we have identified actually function in the Nasonia immune system.
The other obvious limitation of using a sequence-similarity-based approach to find immune genes in a specific gene set is that the analysis misses any species-specific genes. For example, thirty-nine genes from our manual search of the literature (that were not detected by the BLASTx analysis) are antimicrobial peptides (AMPs) unique to the Nasonia genus, which were predicted computationally based on structural properties common to AMPs [24,25]. Sequence similarity searches are also constrained by the reference species used to generate the database. Genes in the Nasonia immune repertoire that are shared only with insect species absent from the IIID would also be missed, even though they are not unique to Nasonia.
In total, 489 unique genes have been described as potential immune genes in N. vitripennis (Table S2) when all previously published studies [16][24][25][26], manual annotations, and sequence similarity searches using the IIID are combined. To our knowledge, this list is the most complete set of insect immunity genes currently available and the first to include those from N. giraulti and N. longicornis. While future studies are needed to confirm the functionality of these genes in the Nasonia immune response, the list will provide a stepping-stone for comparative analyses within the Nasonia genus and between Nasonia and other insect species. More importantly, the IIID will provide one more tool in the effort to annotate complete immune gene repertoires in other insect genomes. Based on our investigation, we recommend using multiple annotation tools to obtain the most comprehensive set of in silico predictions, which can then be analyzed for their biological roles in vivo.