Candidate Genes Detected in Transcriptome Studies Are Strongly Dependent on Genetic Background

Whole genome transcriptomic studies can point to potential candidate genes for organismal traits. However, the importance of potential candidates is rarely followed up through functional studies and/or by comparing results across independent studies. We have analysed the overlap of candidate genes identified from studies of gene expression in Drosophila melanogaster using similar technical platforms. We found little overlap across studies between putative candidate genes for the same traits in the same sex. Instead, there was a high degree of overlap between different traits and sexes within the same genetic backgrounds. Putative candidates found using transcriptomics therefore appear very sensitive to genetic background, and this can mask or override effects of treatments. The functional importance of putative candidate genes emerging from transcriptome studies needs to be validated through additional experiments. For future studies, we suggest a focus on the genes, networks and pathways that affect traits consistently across backgrounds.


Introduction
In Drosophila, an increasing number of whole genome expression studies relating gene expression to genetic differences in stress resistance traits and longevity have now been carried out [1-7]. These studies focus on identifying candidate genes and genetic networks of importance for lifespan and for resistance to stressful conditions including heat, cold and desiccation. However, with recent advances in transcriptomics, putative candidate genes are accumulating much faster than they can be verified in detail. Few candidate genes detected in Drosophila studies have so far been validated by studies on knockout or over-expression lines, or by functional genomics studies using sequencing or SNP association mapping (for exceptions see [7]). Although whole genome expression studies have proved fruitful in some organisms [8-11], it is still unclear to what degree transcriptomic studies will be valuable and relevant for candidate gene identification [12].
As multiple whole genome transcriptomic studies aiming to identify genes and pathways that explain variation in similar traits become available, it becomes possible to evaluate the repeatability of changes in transcriptomic patterns across studies. Any similarity among studies might well depend on the effect of 1) genetic background and standing genetic variation (there might be more than one way to obtain similar phenotypes), 2) inbreeding/genetic drift effects on genome-wide gene expression patterns, and 3) impacts of environmental conditions that may vary between laboratories.
Two main strategies are used to detect candidate genes in D. melanogaster. Lines can be selected in the laboratory for increased stress resistance/longevity and compared to control flies that differ in the phenotype of interest. Alternatively, phenotypic variation in traits of interest in highly inbred isogenic lines can be associated with gene expression in these lines. Results from the different studies make it possible to investigate to what degree genetic background or inbreeding influences the lists of candidate genes detected.
In this paper we compare the gene lists from four different whole genome transcriptome studies on D. melanogaster investigating overlapping traits [1,4-6]. To further evaluate whether inbreeding per se influences patterns, we included two studies on the effect of inbreeding on the transcriptome [13,14]. We found a much larger proportion of significant overlaps between traits within a genetic background than between similar traits investigated in different genetic backgrounds. There was also a tendency for inbreeding to affect transcription in a directional manner. In the light of our results we conclude that transcriptome studies should be interpreted cautiously and that it is advisable, where possible, to validate the functional relationship between candidate genes from transcriptome studies and the specific trait in question. This also has implications for the emerging transcriptome studies in non-model species [15,16], where functional validation of candidate genes will be difficult. Additionally, studies could be designed to include a focus on networks of genes that are differentially expressed across several independent genetic backgrounds.

Materials and Methods
We reanalysed and compared gene expression datasets from six studies on gene expression in D. melanogaster [1,4-6,13,14]. Table 1 summarises the traits and sexes investigated in these studies. In all studies global gene expression was assayed using Affymetrix Drosophila (version 1 or 2) microarrays. Data from Ayroles et al. [4] were reanalysed with the sexes separate (data kindly provided by T.F.C. Mackay). The array data were analysed using applications based on R (version 2.9.0; http://www.r-project.org/). The raw data were GC-RMA normalised with the BIOCONDUCTOR application for R [17] as implemented in the 'Affy' package for R (version 1.22.1). For the data from Ayroles et al. [4], the t-test statistics were generated from the association between the organismal phenotypes and the expression data across 40 inbred lines. We used the gene list generated in [14], while the remaining data sets were analysed by contrasting the selected or inbred lines with control lines.
The significance of all datasets was re-evaluated following [4], with a cut-off at P < 0.01 and no FDR correction, to equalise the methodology. The resulting lists of significant genes were used as the basis for the analyses. To compare among different versions of Affymetrix gene chips, Entrez IDs were used as the common identifier for all genes. We identified the overlap among gene lists and used Monte Carlo simulations to estimate the probability that the overlap of differentially expressed genes differed from the number expected by chance. In each simulation, the gene list for each treatment was permuted and the random overlap among gene lists was recorded. This procedure was repeated 100,000 times. The empirical P-value for the observed overlap was determined as the fraction of all permutations in which the random overlap was larger than or equal to the observed overlap.
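The simulation procedure can be illustrated with a minimal Python sketch. This is not the code used in the study: the function name `overlap_p_value`, the `universe` argument (the set of Entrez IDs shared by the array versions) and the choice to redraw both lists at their original sizes are assumptions made for illustration. The sketch counts simulations in which the random overlap is at least as large as the observed one, the usual convention when testing for an excess of shared genes.

```python
import random


def overlap_p_value(list_a, list_b, universe, n_sim=10_000, seed=0):
    """Return (observed overlap, empirical P-value) for two gene lists.

    Assumption: each simulated list is a random draw, without replacement,
    of the same size as the real list from a common gene universe.
    """
    rng = random.Random(seed)
    universe = list(universe)
    set_a, set_b = set(list_a), set(list_b)
    observed = len(set_a & set_b)

    at_least_observed = 0
    for _ in range(n_sim):
        # Randomise which genes appear on each list, keeping list sizes fixed.
        rand_a = set(rng.sample(universe, len(set_a)))
        rand_b = set(rng.sample(universe, len(set_b)))
        if len(rand_a & rand_b) >= observed:
            at_least_observed += 1

    return observed, at_least_observed / n_sim


# Hypothetical example: integer IDs stand in for Entrez IDs.
genes = range(1, 1001)
obs, p = overlap_p_value(range(1, 101), range(51, 151), genes, n_sim=2000)
print(obs, p)
```

With a finite number of simulations the smallest non-zero P-value that can be resolved is 1/n_sim, which is one reason to use a large number of repeats such as the 100,000 reported here.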

Results
The 253 contrasts investigated showed large differences in gene overlaps (Table 2). The generated lists of significant genes from each study contained between 165 and 1944 genes (average 528), and the overlaps ranged between 1 and 249 genes (average 34.8).
Of the 253 individual contrasts, 113 showed an overlap significantly larger than expected by chance. One noticeable result was the lack of significant overlap among studies looking for candidate genes for the same traits (Table 2). This was true for starvation resistance, chill coma recovery time and female lifespan. Only for male longevity did we detect a significant gene overlap between the studies of Sarup et al. [1] and Ayroles et al. [4]. In general, the overlap was not larger among similar traits (chill coma recovery, starvation and longevity/lifespan) than among traits not expected to be functionally correlated.
A clear pattern was the apparent similarity among sexes in cases where both sexes were investigated for the same trait in the same genetic background (5 significant overlaps out of 7 comparisons); the only exception was longevity where we did not find a significant overlap within the study of Ayroles et al. [4] or between the genes that were found studying males [1] and females [6].

Genetic background
We found a high number of overlaps of candidate gene lists within the same genetic background (65 significant overlaps out of 102 comparisons) compared to the overlaps between genetic backgrounds (37 out of 151). This difference in the frequency of overlaps was larger than expected by chance (Figure 1A, χ² = 19.3, P < 0.001).
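A comparison of two proportions of this kind can be framed as a test on a 2x2 table of significant versus non-significant overlaps. The sketch below computes a plain Pearson chi-square statistic without continuity correction. The text does not say which test variant produced the reported statistic, so this is an illustration of the type of comparison and is not expected to reproduce the reported value. The function names are hypothetical.

```python
import math


def pearson_chi_square(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def chi_square_p_value(stat):
    """Upper-tail P-value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Within background: 65 significant, 102 - 65 = 37 not significant.
# Between backgrounds: 37 significant, 151 - 37 = 114 not significant.
stat = pearson_chi_square(65, 37, 37, 114)
print(stat, chi_square_p_value(stat))
```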

Inbreeding
The proportion of significant overlaps between the study that associates organismal traits with gene expression in inbred lines [4] and the studies of inbreeding effects on the transcriptome [13,14] (16 out of 24) was higher than the proportion of significant overlaps between the studies of inbreeding effects on the transcriptome and the studies on outbred lines [1,5,6] (9 out of 20), although this difference was not significant (Figure 1B). However, the studies of Kristensen et al. [13], Sørensen et al. [6] and Sarup et al. [1] share a common genetic background, so this comparison was confounded by effects of genetic background and inbreeding. Omitting the study of Kristensen et al. [13], there were 12 comparisons between the study that associates organismal traits with gene expression in inbred lines [4] and Ayroles et al. [14], of which 10 showed significant overlaps, and 10 comparisons between the remaining studies [1,5,6] and Ayroles et al. [14], of which 3 were significant. There was a significant difference between the study using inbred lines [4] and those using outbred lines [1,5,6] in the proportion of significant overlaps with the study on the effects of inbreeding depression on the transcriptome [14] (Figure 1C, χ² = 6.7, P < 0.01).

Table 2. Numbers above the diagonal denote overlapping genes between lists, below the diagonal are P-values, and on the diagonal is the number of unique genes in the lists. NS, non-significant. The letters in parentheses specify genetic background (see Table 1). Comparisons between same traits are in italics and same genetic background are in bold. M: Males, F: Females, C30: Heat 30°C, CCR: Chill coma recovery, Co: Cold resistance, DS: Desiccation resistance, Fit: Fitness, H: Heat resistance, KD: Heat knock down, Loc: Locomotor activity, Long: Longevity, Mate: Mating activity, Starv: Starvation resistance, I: Inbreeding. Numbers denote paper codes (see Table 1). Letters denote genetic background (see Table 1). doi:10.1371/journal.pone.0015644.t002

Discussion

Genetic background
If genetic background has a large impact on the list of candidate genes generated from full genome transcriptomic studies, we expect a high degree of overlap between traits in common genetic backgrounds. This is what we observe: contrasts performed on the same genetic background (Table 1) [1,4,6,13] have a high proportion of significant overlaps (Table 2), independent of whether the same traits or different traits are considered. Genetic background effects are a likely cause of the discrepancy between the frequent overlap within backgrounds and the rare overlap for the same trait across studies, although other factors such as laboratory-specific environmental conditions and inbreeding/genetic drift might also contribute. This calls for caution in extrapolating results from one transcriptomic study to another and also highlights the general importance of genetic background in evolutionary studies (see also [18,19]). Based on our findings we suggest that future studies aiming to identify candidate genes/pathways should consider validating detected genes/pathways across different backgrounds.
The population-specific nature of candidate genes detected via transcription studies might reflect the fact that a candidate gene can only be detected in association or selection studies if there is variation in relevant loci, either in the base population or arising from mutations during the selection/line establishment process. Moreover, due to genetic drift, allelic variation present within the base population might differ between replicate lines in selection experiments or between the inbred lines often used in Drosophila association studies. Thus 'false candidate genes' may be detected due to genetic drift. To rule out this explanation, effective population sizes should be high in base populations/replicate lines.

Inbreeding
A high level of inbreeding results in increased homozygosity and expression of deleterious recessive alleles not expressed to the same extent in large natural populations. Inbreeding depression is known to affect multiple traits including lifespan and stress resistance traits in Drosophila [20-22], and inbreeding per se can also result in changes in the expression of hundreds of genes [13,14,23,24].
Ayroles et al. [4] associated organismal phenotypes (chill coma recovery, starvation, lifespan, fitness, mating time and locomotion) with gene expression in 40 highly inbred D. melanogaster lines. Based on these associations, a number of candidate genes for the investigated traits were proposed. A future challenge is to determine whether some alleles of importance for the traits in question have been purged or lost due to drift during the inbreeding process, and whether variation in organismal phenotype and transcription patterns might be partly due to some lines suffering more from inbreeding depression than others.
We need more studies to improve our understanding of the underlying genetic structure of stress resistance and longevity traits, and to determine to what extent the overlap among gene lists from studies of the same trait in the same sex is affected by different genetic backgrounds, by the influence of inbreeding/genetic drift on the transcriptome, or by a combination of these factors. More studies are required that investigate the response of the transcriptome to selection in both sexes, as such studies could help elucidate whether the large overlap between sexes in Ayroles et al. [4] (Table 2) was caused by genetic background and/or inbreeding. Finally, we need to test whether the few genes that show consistent changes across studies are those most likely involved in trait variation. This could be achieved by functional studies of those genes compared to genes specific to particular studies and genetic backgrounds.