Integrating Computational Biology and Forward Genetics in Drosophila

Genetic screens are powerful methods for the discovery of gene–phenotype associations. However, a systems biology approach to genetics must leverage the massive amount of “omics” data to enhance the power and speed of functional gene discovery in vivo. Thus far, few computational methods for gene function prediction have been rigorously tested for their performance on a genome-wide scale in vivo. In this work, we demonstrate that integrating genome-wide computational gene prioritization with large-scale genetic screening is a powerful tool for functional gene discovery. To discover genes involved in neural development in Drosophila, we extend our strategy for the prioritization of human candidate disease genes to functional prioritization in Drosophila. We then integrate this prioritization strategy with a large-scale genetic screen for interactors of the proneural transcription factor Atonal using genomic deficiencies and mutant and RNAi collections. Using the prioritized genes validated in our genetic screen, we describe a novel genetic interaction network for Atonal. Lastly, we prioritize the whole Drosophila genome and identify candidate gene associations for ten receptor-signaling pathways. This novel database of prioritized pathway candidates, as well as a web application for functional prioritization in Drosophila, called Endeavour-HighFly, and the Atonal network, are publicly available resources. A systems genetics approach that combines the power of computational predictions with in vivo genetic screens strongly enhances the process of gene function and gene–gene association discovery.


FlyBase
The first tool is FlyBase [1] itself, from which HIGHFLY uses a number of data sources, namely Gene Ontology (GO) and phenotypes. FlyBase offers a QueryBuilder tool that retrieves all genes matching an expert-chosen query. We used QueryBuilder to retrieve all genes that are annotated with "relevant" GO terms for the process under study. Relevant terms were chosen based on the current GO annotation of Atonal itself. A second type of query we performed with QueryBuilder retrieved all genes that are known to be expressed in "relevant tissues" for our process. Again, relevant tissues were chosen based on the tissues where Atonal is known to be expressed (given in FlyBase's "Gene Expression Report"). These types of queries result in a list ("bag") of genes, but this list is not ranked according to similarity. This means that all candidates have to be tested in the genetic assay, which makes the procedure less suited for candidate gene selection in knowledge-guided genetic screens when the query yields too few or too many candidates.
Here is the query we used for GO:
FlyMine is another useful web application that makes FlyBase data and other functional genomics data available. However, we did not include FlyMine in this analysis because the FlyMine project has unfortunately announced it will no longer be updated after December 2008. FlyMine allows building queries similar to those we performed with FlyBase QueryBuilder, and allows several more genomic data sources to be used in the query. However, HIGHFLY's main advantages, such as the use of training sets and the generation of combined rankings, are not available in FlyMine.

UCSC Gene Sorter
The second tool we used was UCSC Gene Sorter [3]. This very efficient tool ranks all genes in the genome (for which data is available in the chosen data source) according to one chosen data source and one query gene. The ranked list can also be filtered. In our case we used all genes in our positive deficiency regions as the filter. Many of the data sources in the Gene Sorter are the same as those we use in ENDEAVOUR-HIGHFLY (e.g., GO, gene expression from microarray data, protein-protein interactions, protein sequence similarity, protein domain similarity). We chose three data sources as an illustration, namely GO, expression, and protein-protein interactions. An important difference from FlyBase QueryBuilder, when using GO, is that Gene Sorter calculates a GO similarity rather than only retrieving genes that are annotated with the same GO term. Therefore, this tool is better suited for candidate gene selection for genetic screens. However, as already mentioned, this tool does not allow combining the different data sources into a single fused ranking, nor does it allow a set of training genes to be used as the query.
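The difference between an unranked "bag" of genes and a similarity-ranked, filtered list can be made concrete with a minimal sketch. Everything in it is an assumption made for illustration: the gene names and GO identifiers are placeholders, the function names are hypothetical, and Jaccard overlap merely stands in for a GO similarity measure; it is not claimed to be the score that Gene Sorter computes.

```python
# Illustrative sketch only. Gene names and GO identifiers are invented
# placeholders, and Jaccard overlap is an assumed stand-in for a GO
# similarity score, not the measure used by UCSC Gene Sorter.

def jaccard(a, b):
    """Jaccard similarity between two sets of GO term identifiers."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_by_go_similarity(query_gene, annotations, candidates):
    """Rank the candidate genes by GO-term overlap with one query gene."""
    query_terms = annotations[query_gene]
    scored = [
        (gene, jaccard(query_terms, annotations[gene]))
        for gene in candidates
        if gene != query_gene and gene in annotations
    ]
    # Highest similarity first; ties broken by name for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


annotations = {
    "query": {"GO:A", "GO:B", "GO:C"},
    "gene1": {"GO:A", "GO:B", "GO:C"},
    "gene2": {"GO:A", "GO:D"},
    "gene3": {"GO:E"},
    "gene4": {"GO:A", "GO:B"},
}
# The filter plays the role of "genes in the positive deficiency regions";
# gene4 is deliberately left outside it.
candidates = {"gene1", "gene2", "gene3"}

ranking = rank_by_go_similarity("query", annotations, candidates)
for gene, score in ranking:
    print(gene, round(score, 2))
```

Because every candidate receives a score, a screen can start from the top of the list instead of testing the whole bag; note that, as in Gene Sorter, this sketch uses a single data source and a single query gene.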

STRING
The last tool we used was STRING [4]. This tool shares an important feature with our method, namely the integration of data from various heterogeneous sources, both experimental data (e.g., gene expression, protein-protein interactions) and derived data (e.g., text-mining). STRING can be used to detect known and predicted associations with a query gene or a list of query genes. The results are presented as a network, which can be saved as a text file together with the confidence scores. This way, one can retrieve a list of predicted associations ranked by confidence score. In the first analysis we used "Atonal" as the query gene and retrieved all 228 predicted associations. Unfortunately, STRING does not allow filtering on a subset of the genome, so we compared these 228 associations offline with our candidate set of 1056 genes from our positive deficiency regions. Also, to circumvent STRING's automatic mapping of gene identifiers (we used CG gene identifiers as input), we downloaded the fasta file, which also contains the CG number. We found an overlap of 13 genes, of which 2 were positive in our genetic assay.
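The offline comparison described above amounts to intersecting the STRING partners with the candidate set while keeping the confidence-based order. The sketch below illustrates this under stated assumptions: the identifiers and scores are invented placeholders, `ranked_overlap` is a hypothetical helper, and the (partner, score) pairs are assumed to have already been read from the saved network text file, whose exact layout is not reproduced here.

```python
# Illustrative sketch only. Identifiers and confidence scores are invented
# placeholders; parsing of the saved STRING network file is assumed to have
# happened already and is not shown.

def ranked_overlap(associations, candidates):
    """Keep associations whose partner is a candidate, best score first."""
    kept = [(gene, score) for gene, score in associations if gene in candidates]
    # Sort by descending confidence; ties broken by identifier.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# (partner identifier, confidence score) pairs for a single query gene.
associations = [
    ("CG0001", 0.91),
    ("CG0002", 0.45),
    ("CG0003", 0.78),
    ("CG0004", 0.60),
]
# Candidate genes, standing in for the genes in the positive deficiency regions.
candidates = {"CG0002", "CG0003", "CG0099"}

overlap = ranked_overlap(associations, candidates)
print(overlap)
```

Matching on one identifier scheme throughout (here CG-style identifiers) is the design choice that matters; mixing identifier types would silently shrink the overlap.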
In a second analysis we used STRING's multiple gene input function. Note that the input of multiple genes may resemble our use of a training set, but an important difference is that STRING returns individual interactions with and among the input genes, while HIGHFLY integrates the training genes to build summarized data models across that training set. To compare these two approaches, we used the same training set as multiple gene input in STRING. Unfortunately, the maximum number of allowed interactions in STRING is 500. Using this threshold, we retrieved 500 associations with the genes in our training set, of which 35 fall into our positive deficiency regions, and of which 5 are positive Ato-interactors.
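To put the two STRING analyses side by side, the counts reported above can be turned into the fraction of overlapping candidates that were positive in the genetic assay. This is a plain ratio computed from the numbers in the text (2 of 13 and 5 of 35); the function name `positive_fraction` is hypothetical, and the ratio is offered as a reading aid rather than as a statistic reported by the analysis itself.

```python
# Illustrative arithmetic only: the counts come from the text above, and the
# ratio is a simple descriptive fraction, not a reported statistic.

def positive_fraction(positives, overlapping):
    """Fraction of overlapping candidate genes that were positive."""
    if overlapping == 0:
        raise ValueError("no overlapping genes to evaluate")
    return positives / overlapping


single_query = positive_fraction(2, 13)    # Atonal as the only query gene
multi_query = positive_fraction(5, 35)     # training set as multiple gene input

print(f"single query: {single_query:.3f}")
print(f"multiple gene input: {multi_query:.3f}")
```

Under these counts the two fractions are close to each other (roughly 15% and 14%), while the absolute number of recovered interactors differs (2 versus 5).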