Using Evolutionary Conserved Modules in Gene Networks as a Strategy to Leverage High Throughput Gene Expression Queries

Background Large-scale gene expression studies have not yielded the expected insight into genetic networks that control complex processes. These anticipated discoveries have been limited not by technology, but by a lack of effective strategies to investigate the data in a manageable and meaningful way. Previous work suggests that using a pre-determined seed-network of gene relationships to query large-scale expression datasets is an effective way to generate candidate genes for further study and network expansion or enrichment. Based on the evolutionary conservation of gene relationships, we test the hypothesis that a seed network derived from studies of retinal cell determination in the fly, Drosophila melanogaster, will be an effective way to identify novel candidate genes for their role in mouse retinal development. Methodology/Principal Findings Our results demonstrate that a number of gene relationships regulating retinal cell differentiation in the fly are identifiable as pairwise correlations between genes from developing mouse retina. In addition, we demonstrate that our extracted seed-network of correlated mouse genes is an effective tool for querying datasets and provides a context to generate hypotheses. Our query identified 46 genes correlated with our extracted seed-network members. Approximately 54% of these candidates had been previously linked to the developing brain and 33% had been previously linked to the developing retina. Five of six candidate genes investigated further were validated by experiments examining spatial and temporal protein expression in the developing retina. Conclusions/Significance We present an effective strategy for pursuing a systems biology approach that utilizes an evolutionary comparative framework between two model organisms, fly and mouse. Future implementation of this strategy will be useful to determine the extent of network conservation, not just gene conservation, between species and will facilitate the use of prior biological knowledge to develop rational systems-based hypotheses.


Introduction
The emergence of system-wide approaches ('-omics'; e.g., genomics, proteomics, metabolomics, etc.) and related technologies to quantify molecular changes that accompany biological processes or disease states has resulted in an explosion in the amount of data collected by researchers. Investigators across all areas of biology have designed large scale experiments to capture a broader systems-based understanding of gene or protein expression changes that accompany their process of interest. However, many have found that such datasets are too large to be immediately informative, and extracting useful information from these datasets is dependent upon additional analysis.
One strategy to analyze such data is to generate gene network models using one of several analytical frameworks [1][2][3][4][5]. In theory, these network approaches have two advantages: they should accelerate the rate of novel discoveries by automating data analysis and they should be more immune to experimenter bias. This use of computational strategies will potentially lead to discoveries from omics data without a priori knowledge of the system. However, these computational approaches require a tremendous amount of biological data. For example, if an investigator wants to understand which genes function together during a particular developmental process, she might profile changes in gene expression over developmental time. Ideally the number of conditions (e.g., ages, experimental perturbations) under which gene expression is measured should be much larger than the number of genes being profiled in order to obtain an accurate estimate of the covariance matrix upon which the network of all genes is based [6]. Thus, for a microarray experiment that measures the expression of 5000 genes, one should measure the expression of each gene under more than 5000 different conditions. Even collection of 20% of the ideal amount of data for robust analyses is both time and cost prohibitive for most investigators. As a consequence, the majority of biologists collect datasets that are too small for effective computational analysis and too large for systematic and efficient consideration of candidate molecules. This data limbo is a limiting factor to the growth of the field of systems biology.
While it is essential that the development of computational tools and approaches continue, it is also essential that efforts are made to establish 'biological heuristics' that will allow benchtop investigators to perform meaningful analyses on the sometimes limited amounts of data they are capable of collecting. A key first step in this process is to consider the development of strategies to efficiently query omics data, as opposed to exhaustively analyzing it. The use of biological heuristics is a flexible strategy, which utilizes prior biological knowledge of the system to design queries. These queries ask specific questions about relatively small groups of interacting genes and return manageable numbers of candidate genes for further analysis at the bench.
Our approach to querying high-throughput data utilizes prior biological knowledge by starting with a 'seed-network' of genes, and is based on the paradigm that the expression of genes that function together will change in similar ways over time (i.e., their expression will be correlated). The basic assumption is that if a gene is correlated with one member of the seed network, it may be involved in the process of interest; however, if the same gene is correlated with multiple members of the seed-network it much more likely to be involved in that process (e.g., retinal cell fate determination). One of us has demonstrated previous success identifying gene candidates in development of rod photoreceptors by using a seed-network-based heuristic to query high throughput data [7], and this success motivated our efforts to further develop strategies to identify effective seed networks to query large datasets.
Here we employ our seed-network approach to a genetic comparison of two important models in the study of retinal development: the fly, Drosophila melanogaster, and the mouse, Mus musculus. Despite the morphological and developmental disparity of the fly compound eye [8,9] and the mouse camera-type eye [10,11], gene conservation during both fly and mouse retinal development is well-documented [12][13][14][15][16] and there is an implicit assumption of gene regulatory network conservation as well [17,18]. However the networks are not completely congruent [19]. We test the hypothesis that gene relationships established in the developing fly retina can be identified in correlation networks generated using gene expression data from the developing mouse retina. Further, we hypothesize that the resulting mouse network will be an effective tool to discover candidate genes and gene networks that function during mammalian retinal development. In this report, we take advantage of two biological systems by constructing a 'comparative seed-network' based on studies of retinal determination in fly and use it to query gene expression data from the developing mouse retina. Our study was guided by three objectives: 1) to construct a literature-based seed network representing the relationships between genes involved in retinal determination in the fly; 2) to determine whether the network relationships of fly genes are identifiable among homologous mouse genes in expression correlation networks generated from the developing mouse retina; and 3) to assess whether this strategy, based on evolutionary comparison between model organisms, is a useful method to identify biologically relevant candidate genes important in retinal determination. Based on these objectives, our results demonstrate successful application of this strategy within our experimental system and provide a clear framework to evaluate this approach in other biological areas.

Seed network construction in fly
Seed networks are graphs that represent relationships among genes during a biological process, such as retinal determination. These relationships may be physical interactions or causal relations by direct or indirect means, and are represented as edges in the graph or connections (links) in the gene network. We used the results of published experimental studies on eye differentiation in fly to identify a set of 18 genes implicated in fly retinal development, which was built off of the fly retinal determination gene network (RDGN) [12]. We integrated these data into a comprehensive fly 'seed network' (Figure 1) based on the work described in File S1.   The mouse expression datasets are: I [20]; II [21]; III [22], IV [23]. Numbers in parentheses are the positive or negative correlation coefficient of seed genes in each mouse datasets. ''-'' indicates that the seed gene is present in the dataset, but is not correlated with other seed genes. ''NA'' indicates that the seed gene is not present in the dataset. doi:10.1371/journal.pone.0012525.t002  [20,28] Retina/Brain development Transferase activity; Expressed in developing lens and retina and may play a role in apoptosis suppression [24,25] Retina Probably distantly related to glutathione S transferase S1 (GstS1); no mu homolog in Drosophila

Extraction of homologous seed network from mouse datasets
To determine whether the gene relationships represented in the fly seed network are represented in the developing mouse retina, we first converted the literature-based fly seed network into a mouse gene network of putative homologs (Table 1). Then we used BioNet Workbench [http://bionetworkbench.sourceforge.net/] to query four previously published gene expression datasets (I-IV) from mouse [20][21][22][23]. The datasets were queried for pairwise correlations of .|0.65| between all mouse genes that are homologous to fly seed network members (''seed genes''). A summary of the seed genes and their pairwise correlation values (if .|0.65|) in each of the mouse datasets are given in Table 2. The result was a mouse seed network ''extracted'' from published gene expression datasets for mouse.
Based on the finding that a subset of relationships from the fly seed network appear to be conserved in the developing mouse retina, we hypothesized that the extracted seed-network (ESN) of mouse gene relationships would be useful for querying the mouse gene expression data to identify additional candidate gene network members. To identify candidates, genes correlated .|0.65| with each gene in the extracted seed-network were retrieved from each dataset and the lists were compiled. Lists of genes correlated with each ESN gene were analyzed to identify genes that correlated with more than one gene in the ESN. Based on the paradigm that genes correlated with multiple ESN genes are likely to have a functional relationship to the gene network, we focused our analysis on 46 candidate genes that were correlated with three or more ESN members (Table 3). Among these 46, 39 genes were correlated minimally with Eya1, Notch1 and Six3. We evaluated the relevance of candidate genes identified by this comparative seed-network approach in three ways.
First, we performed a manual literature search to find reports of candidate genes' association with the retina, retinal development, brain development or other developmental processes. Results from this manual search are given in Table 3. Forty out of 46 (86%) candidate genes have been previously reported to be associated with one or more of these topics [20,. Additionally, eight candidate genes (17%) are associated with retinal ganglion cells (RGCs) or RGC development in previous experimental studies [26,30,36,48,53,55,58,62,64,67,73,78,81] (Table 3).
Second, we examined the spatial and temporal expression of six candidate genes in the developing mouse retina. We chose to examine candidates that had been previously reported to be associated with the developing brain, but not the developing retina, and had commercially available antibodies. Using immunohistochemistry in retinal tissue sections from mice ages embryonic day (E)13, E15, E17 and postnatal day (P)0, P5 and P10, we characterized the expression of APLP2, DPYS14, NDN,   Table 3. Cont.
PAFAH1B3, PSME1 and TMSB10. Candidates were considered highly relevant if they were: 1) expressed in the developing retina, 2) exhibited specific (as opposed to diffuse) localization in the developing retina, and/or 3) the localization of the immunoreactivity changed as the retina matured. Based on these criteria, five of the six candidate genes tested that had not been previously associated with retinal development (aplp2, ndn, pafah1b3, psme-1 and tsmb10) were considered good candidates for further investigation (Figures 2, 3, 4, 5 and 6).
Third, we used the biological process GO annotations Nervous System Development (0007399) and Neural Retina Development (0003407), to statistically evaluate our candidate list. In the list of 46 candidates identified by using our seed-network approach, seven of the genes had a Nervous System Development GO annotation. By using a Fisher's exact test we determined that Nervous System Development is over-represented among the group of candidate genes. The p-value for this test was 0.026, which represents the probability of seeing seven or more Nervous System Development genes in a list of 46 genes randomly selected from the 8544 genes represented on the Murine Genome U74Av2 array. Because it would be unlikely to see so many Nervous System Development genes in our candidate list of 46 genes by chance, our results suggest that Nervous System Development genes were overrepresented in our candidate list.
In summary, our analysis identified a network of 46 highly correlated candidate genes. Expression of 22 (,47%) of these candidate genes has been previously reported in the retina or developing retina (see references in Table 3), although their specific relationship to genes within the retinal determination gene network has not been reported. We examined six candidate genes that had previously been associated with brain development, and determined that five of these genes have dynamic spatial and temporal expression in the developing mouse retina. Finally, of these 46 mammalian genes, 42 (,91%) have homologs in fly, making them potential candidates for studies of fly retinal development as well. These findings demonstrate the powerful advantages of integrating evolutionary comparisons and systems approaches, even when approaching well-studied biological questions.

Discussion
The compound eye of Drosophila is an outstanding model system to study the molecular basis of eye specification, in part, because retinal development is an organized, step-wise process with clearly demarcated regions of cell differentiation and patterning [8,87]. These properties of the fly model have facilitated the elucidation of genetic networks involved in retinal cell differentiation and the Comparative studies between model organisms [12,18] led to discoveries that homologous genes play important and similar roles in fly and mammalian retinal development and many of these key genes have similar connectivity in gene networks [19]. This principle of gene network conservation has motivated our development of the seed-network strategy, which we have presented here, and provides a way to validate our novel heuristic approach.
We tested our strategy using gene expression datasets from the developing mouse retina. The results from this study support our hypothesis that gene relationships in the developing fly retina are identifiable in correlation networks generated using gene expression data from the developing mouse retina. While not all gene relationships in the fly network were identified in the mouse ESN, this is not unexpected. Our results provide support for the assumption that there will be a degree of conservation within genetic networks of homologous genes, even between highly divergent species such as fly and mouse. Complete congruence between the RDGN of fly and mouse would be surprising given that these organisms possess highly divergent eye morphologies. Our results also support the second hypothesis that the mouse network derived from relationships between homologous genes from the fly RDGN (i.e. our extracted seed network [ESN]), would be an effective way to discover high quality candidate genes involved in retinal development in mouse. Our queries identified a reasonable number (46) of candidates, when compared to the hundreds or thousands of genes that correlate with a single gene of interest. The majority of our candidate genes were correlated (positively or negatively) with the same three seed genes (Notch1, Eya1 and Six3) suggesting that these three seed genes are at the functional core of this network regulating retinal development in mouse.
At the heart of our approach is the development of biological heuristics to focus queries of relatively sparse (albeit typical) expression datasets from the developing mouse retina. It is important to note that this approach is intended to facilitate the formulation of hypotheses by providing a mechanism to integrate prior biological knowledge, but not intended to make conclusions about the function or assign significance to the candidates. The use of relationships among genes as a biological heuristic to query high-throughput data, as opposed to queries based on single genes, appears to be more profitable and efficient for the identification of additional candidates. Thus, the candidate genes we identified in this study are not end points, but are the basis of hypotheses to guide future experimental work. Traditional wet-lab experiments will be required to test these hypotheses of the specific role of each candidate gene and its placement in the gene regulatory network during mouse retinal development.
From a comparative evolutionary perspective, our results underscore the importance of looking for conservation of networks, and not just conservation at the level of individual genes. While gene orthologs may function in a similar way in a complex process or a disease state in different organisms, it is the conservation of not only the gene, but of its relationships to other genes in a network, that dramatically increases the likelihood that the gene, in fact, functions similarly. Although it has been directly demonstrated in only a few cases [19,[88][89][90][91], regulatory network conservation has long been the rationale for the use of model organisms to study human diseases. Comparative studies that investigate the extent of conservation in developmental regulatory networks (and of characteristics, such as modularity, connectivity, etc.) are beginning to identify common themes in networks that direct organogenesis, e.g., [92]. While it is unreasonable to expect that genetic regulatory networks controlling the development of organs in highly divergent organisms will be conserved in their entirety, application of the approach proposed here to identify conserved network modules should allow systems biologists to better capitalize on what is known in one species to advance discovery in another.

Materials and Methods
Our biological heuristic strategy is described below and summarized in Figure 7.

Construction of the fly seed network
We identified 18 genes in Drosophila that are involved in the retinal determination gene network (RDGN), based on published literature. Previous researchers have identified individual relationships of these genes to one another and these experimentally-determined relationships among the 18 genes were the basis of our fly seed network (see descriptions and citations in File S1).

Homolog identification of seed network genes
Homolog identification can be difficult when comparing genomes across great evolutionary time as a result of sequence evolution and paralogous duplication events within a lineage. Because of these issues, we identified putative mouse orthologs of the Drosophila seed network manually, using a combination of approaches, including examination of the genomic databases FlyBase . Additional assignment of orthology between fly and mouse genes was based on experimental data. For example, the mouse has three Teashirt (tsh)-like genes, Tshz1, Tshz2 and Tshz3, all of which can rescue tsh null mutants and induce ectopic eyes in the fly [98]. Likewise, we designated the Pax6 isoform, Pax6(5a), found in humans and mouse, as the ortholog for fly eyg because the genes are structurally related [99], and we treated the mouse gene Math5 (Atonal7) as the homolog to the fly gene Atonal based on others' work [100,101] reviewed in [19]. Finally, qualitative and functional comparisons of the mouse genes Six3 and Six6 to so and optix in fly, suggest that optix should be treated as an ortholog of Six3 and Six6 [102], reviewed in [19]. Table 1 lists fly seed network genes and their mouse homolog assignment based on these data. Description of mouse data sets and construction of ESN Published, freely available datasets measuring gene or protein expression in the developing mouse retina at multiple time points were collected and preprocessed as described in Hecker et al. [7]. Each dataset represents expression data collected from developing mouse retinae at multiple time points and includes: a SAGE (serial analysis of gene expression) of whole retina from Blackshaw et al. [20] was downloaded from online supplementary material; one cDNA microarray of whole retina from Zhang et al. [21] was downloaded from online supplementary material; and two Affymetrix microarrays of whole retina, the Mu74Av2 chip from Liu et al. [22] was downloaded from GEO (GDS 1845) and the Mu74Av2_1 chip from Dorrell et al. [23] was downloaded from http://www.scripps.edu/cb/friedlander/gene_expression/). These mouse datasets were designated as I, II, III, and IV, respectively, and were saved in BioNet Workbench [http://bionetworkbench. sourceforge.net/] for analysis.
We calculated Spearman Rank pairwise correlations in each mouse expression dataset using BioNet Workbench to construct the extracted seed network (ESN) for mouse. Correlation networks provide a visual representation of pairwise associations between genes in large data sets consisting of expression measurements for hundreds or thousands of genes. In a gene or protein expression correlation network, the nodes represent the genes or proteins and weighted links model interactions between them. The weight associated with a link between a pair of nodes models the correlation estimated from measurements of expression (e.g., mRNA or protein) levels of the corresponding genes across a set of experimental conditions or time points. The Spearman rank correlation measure, which assumes only an arbitrary monotonic, not necessarily linear, relationship between variables being correlated, has been demonstrated to be effective for detecting functional relationships between genes [103]. Correlation coefficients using time-course expression data are calculated by a measure of how the expression levels between any given pair of genes changes over time. Genes that are perfectly correlated with one another have a correlation coefficient of 1. Gene pairs whose expression is exactly the opposite of one another have a correlation coefficient of 21. Two genes whose expression is not correlated (no different than random) have a correlation coefficient of 0. In cases where multiple mouse paralogs for a single fly gene are present, each paralog was queried separately. Not all mouse seed genes were present in all datasets.
A link between a pair of seed network genes is supported by a dataset if the corresponding genes are positively or negatively correlated in that dataset, with the absolute value of correlation greater than or equal to 0.65 in one of the mouse datasets (I-IV). Our choice of the threshold of 0.65 for correlation was influenced by similar choices in previous studies [104][105][106] that have revealed biologically relevant links between co-expressed genes. It should be noted that we do not assign statistical significance to the value of the correlation coefficient, but rather consider the value a flexible tool that can be used at the discretion of the investigator to filter gene lists to a manageable number of candidates. This is appropriate, as our strategy is designed to facilitate the generation of hypotheses, not conclusions. Generating candidate gene lists To identify candidate genes that may be involved in the gene network controlling mouse development, we used genes from the extracted seed-network (ESN; mouse homologs of fly RDGN genes whose pairwise expression correlation coefficients were .|0.65| in at least one dataset) to query large-scale gene expression datasets of the developing retina (I-IV). Each of the 17 genes from the ESN (a.k.a. seed genes) in mouse was examined separately in datasets I-IV to develop candidate gene lists. Lists were compiled by identifying genes that were correlated with individual seed genes, with a correlation coefficient greater than |0.65|. Then, gene lists for all seed genes were compared to identify candidate genes that correlated with more than one seed gene. Genes that correlated with more than three seed genes were investigated for potential biological relevance in mouse retinal development.
Biological  [95,96]. When necessary, paralogous genes were included to more fully capture gene homology between the two organismal models.
In order to determine if the GO annotation Nervous System Development (0007399) was over-represented among the 46 candidate genes, a Fisher's exact test was used. Over-representation was declared if the number of genes with the GO annotation of interest on our candidate list was significantly higher than would be expected by chance, i.e., if the observed number of Nervous System Development genes was greater than would be expected when randomly selecting 46 genes from a collection of 550 Nervous System Development genes mixed with 7994 other genes. Information from version 2.4.1 of the R statistical software annotation package mgu74av2.db [http://www.bioconductor.org/] and release 30 of the NetAffx annotation file for the Murine Genome U74Av2 array [http://www.affymetrix.com/] was combined in order to perform the Fisher's exact test. Due to relatively frequent changes in probe set annotations and gene symbols, and also due to slight disagreements between the two annotation sources, a conservative analysis was performed. Although there are likely more genes represented on the Murine Genome U74Av2, the total number of genes represented was declared to be 8544. Similarly, the number of genes with Nervous System Development annotation was declared to be 550, although this number is likely high. Using an under-estimate of the total number of genes and an over-estimate of the genes with Nervous System Development annotation results in a higher p-value and thus more conservative results than if the true values were used.
It should also be noted that no probe set was mapped to the gene Prrt1 in either of the current annotation sources (although it Figure 6. Dynamic protein expression of TSMB10 in developing mouse retina. TSMB10-IR in the E13 and E15 mouse retina was distributed throughout the retina (A, B). In the E17 mouse retina, TSMB10-IR was more intense in the inner one-third of the retina (C). By P0 TSMB10-IR in the mouse retina was largely restricted to the IPL, GCL and OFL (D). Similarly in the P5 and P10 retinas, TSMB10-IR was observed in the IPL, GCL and OFL (E, F). Abbreviations same as in Figure 2. Bars, 30 mm. doi:10.1371/journal.pone.0012525.g006 was in previous annotations), but this gene was included in the analysis because of its inclusion in the analysis of previous papers. Removing this gene from the analysis does not affect the significance of our results.

Examining candidate gene expression with immunohistochemistry
To investigate the biological relevance of the candidate genes correlated with the ESN, we examined the spatial and temporal expression of six candidate genes in the developing mouse retina. Tissue was prepared from C57BL/6 mice in a colony maintained at Iowa State University. The gestational period of C57BL/6 mice is approximately 19 days and date of birth is designated as postnatal day 0 (P0). The developmental time series investigated included pups from embryonic days 13, 15 and 17 and postnatal days 0, 5, and 10. Mice were euthanized and their heads were removed and immersion fixed in 4% paraformaldehyde in 0.1M PO4 buffer (pH 7.5). The tissue was cryoprotected in a 30% sucrose solution in 0.1 M PO4 buffer (pH 7.4) and embedded in OCT mounting media. Tissue was sectioned at a thickness of 20 mm on a cryostat and the sections were thaw-mounted onto microscope slides and stored at 220uC. All animal procedures had the approval of the ISU committee on animal care.
Frozen tissue sections were rinsed in 0.5M KPBS and incubated in blocking solution consisting of KPBS containing 1% bovine serum albumin (Fisher, Pittsburgh, PA), 0.4% Triton-X 100 (Sigma), and 1.5% normal donkey serum (Invitrogen) for 2 hours. Cells were incubated in primary antibody overnight at 4uC. The following day slides (tissue or cells) were washed in KPBS containing 0.02% Triton-X 100 after which fluorescent secondary antibody was applied for 2 hours. After washes in KPBS containing 0.02% Triton-X 100, the slides were incubated in 300 mM DAPI diluted in KPBS. The slides were rinsed in KPBS before cover-slipping with Vectashield fluorescence mounting medium (Vector Laboratories, Burlingame, CA). The antibody sources and concentrations used for the immunohistochemical analysis are summarized in Table S1.