Skip to main content
  • Loading metrics

Tandem duplications lead to novel expression patterns through exon shuffling in Drosophila yakuba


One common hypothesis to explain the impacts of tandem duplications is that whole gene duplications commonly produce additive changes in gene expression due to copy number changes. Here, we use genome wide RNA-seq data from a population sample of Drosophila yakuba to test this ‘gene dosage’ hypothesis. We observe little evidence of expression changes in response to whole transcript duplication capturing 5′ and 3′ UTRs. Among whole gene duplications, we observe evidence that dosage sharing across copies is likely to be common. The lack of expression changes after whole gene duplication suggests that the majority of genes are subject to tight regulatory control and therefore not sensitive to changes in gene copy number. Rather, we observe changes in expression level due to both shuffling of regulatory elements and the creation of chimeric structures via tandem duplication. Additionally, we observe 30 de novo gene structures arising from tandem duplications, 23 of which form with expression in the testes. Thus, the value of tandem duplications is likely to be more intricate than simple changes in gene dosage. The common regulatory effects from chimeric gene formation after tandem duplication may explain their contribution to genome evolution.

Author summary

The enclosed work shows that whole gene duplications rarely affect gene expression, in contrast to widely held views that the adaptive value of duplicate genes is related to additive changes in gene expression due to gene copy number. We further explain how tandem duplications that create shuffled gene structures can force upregulation of gene sequences, de novo gene creation, and multifold changes in transcript levels. These results show that tandem duplications can produce new genes that are a source of immediate novelty associated with more extreme expression changes than previously suggested by theory. Further, these gene expression changes are a potential source of both beneficial and pathogenic mutations, immediately relevant to clinical and medical genetics in humans and other metazoans.


Tandem duplications are known as a source of genetic novelty that can contribute new genes with novel functions [1, 2]. For example, duplication of homeobox loci has been associated developmental changes across vertebrates [3]. The globin gene families have achieved functional differences via copy number expansion in mammals [4]. Venom proteins in snakes are derived from paralogs of phospholipases [5]. Copy number changes are associated with pesticide resistance in Drosophila (reviewed in [6]). In spite of the many case studies showing adaptive changes, theoretical arguments suggested that functional divergence would be difficult to attain via whole gene duplication [7]. If more than one substitution was required to produce novel functions across paralogs many generations would be required to facilitate functional divergence [2, 7]. The expected long wait times to develop new functions raise the risk that duplicate genes may be eliminated via non-functionalizing mutations before they can evolve new functions, even in large populations where effects of drift are limited [7]. Indeed, loss appears to be the prevailing fate of duplicate and chimeric genes [8, 7, 9]. This observed contradiction between the role of duplicates in adaptive evolution and models that led to their erosion of genes from the genome was perceived at the time as a problem for duplicate gene theory. How might these duplicate genes with novel functions accumulate if loss was swift and functional divergence was slow?

In this context, proposals arose that might explain forces that could preserve duplicate genes in genomes long enough to contribute to genome evolution. One solution proposed for how duplicate genes might accumulate in genomes given these limitations is the duplication-degeneration-complimentation model [7]. If duplicate genes accumulated even a very few mutations in regulatory sequences, they might partition expression profiles of duplicate copies using very few mutations [7]. This expression divergence might drive a situation where neither copy could be eliminated, resulting in long term preservation in the genome [7]. Similar models might explain neofunctionalization as well [10]. An alternative hypothesis to explain the utility of newly formed duplicates invoked adaptive changes in gene expression [6]. Based on theoretical arguments, it was suggested that newly formed duplicate genes may contribute to expression variation through additive changes in gene expression due to gene dosage [6]. Here, newly formed duplicates could produce immediate changes of gene expression. If such expression changes were adaptive, they might offer immediate phenotypic consequences that would circumvent the long wait times for functional divergence [6]. Selection for dosage changes might preserve duplicate genes in genomes long enough to accumulate point mutations that might lead to functional divergence [6].

Although this ‘dosage’ hypothesis was viewed as a compelling solution, it remained untested in a genome wide setting for years. More recently it has become possible to survey natural variation in gene expression at duplicated loci, in order to better distinguish the factors that contribute to the utility and maintenance of duplicate genes in the genome. With the advent of Illumina sequencing, we can now test this ‘dosage’ hypothesis by examining empirical data. We can also survey other types of constructs that are produced by duplications to see how they may contribute to regulatory and protein sequence diversity in nature. Chimeric genes and novel recruited UTRs can cause expression changes in novel tissues through the shuffling of regulatory elements [11, 12, 13, 14]. Yet, previous surveys have simply looked at presence and absence of transcripts in tissues with no systematic survey of quantitative changes or have focused on small numbers of candidate genes. Similarly studies of CNVs in D. melanogaster have identified a role in eQTLs [15], but with assays in whole adult flies that do not resolve different types of regulatory changes or the precise mechanisms of such changes. Systematic, genome wide surveys of the effects that tandem duplications produce on gene expression is essential as a first step toward understanding how duplicate genes may contribute to regulatory variation in natural populations. D. yakuba offers an excellent genetic model to examine changes in genome architecture and genome content in natural populations. Comparisons across the Drosophila genus indicate that D. yakuba has experienced a large number of changes in genome structure [16], and population level surveys have identified large numbers of duplications that are polymorphic in comparison with sister species [17].

Here, we describe a genome wide survey of polymorphic variation for tandem duplications in natural populations of D. yakuba and the types of regulatory changes that they can facilitate. We further describe biases in the ancestral expression patterns of genes that are duplicated. We show that whole gene duplications rarely produce effects on expression. In order to survey the detailed changes in gene expression produced by chimeric genes, gene fragments and recruited non-coding sequence, we introduce a hidden Markov model to assay site specific changes in gene expression, independent from gene annotations. These mutations form new gene structures not reflected in reference genome annotations, requiring an alternative approach from existing differential expression testing software. Using this new model, we identify 30 cases where duplications result in de novo gene origination, with an excess of new genes appearing with expression in the testes. Tandem duplications associated with chimeric constructs, novel UTRs, and recruited non-coding sequence are commonly associated with regulatory changes. These findings are consistent with previous studies showing testes bias [18]. The results presented here suggest that complex changes in gene structures will be an important source of mutations of major effect and that the value of whole gene duplications is unlikely to lie in additive changes in transcript levels due to gene copy number.


Here, we describe expression data for tandem duplications as a first step to elucidate the extent to which the molecular impacts of tandem duplications may explain their functional and evolutionary impacts. Using high coverage genomic sequence data we previously identified tandem duplications in population genomic samples for D. yakuba, with high validation rates of 97%, for duplications ranging from 74 bp to 25,000 bp in length [17]. We performed RNA-sequencing for adult male and female soma and reproductive tissues in 15 sample strains of D. yakuba as well as three replicates of the D. yakuba reference, which contains none of these tandem duplications. We have assayed transcript levels in new RNA-seq data for 15 of the 20 sample strains from Rogers et al, 2014 [17] as well as previously published data for 3 replicates of the reference strain [19] to obtain a portrait of regulatory changes that complex mutations can produce. Among strains assayed with RNA-seq data, we have identified 1116 tandem duplications in total. Among the 1116 duplications, 112 capture solely intergenic sequence while 1004 tandem duplications capture a total of 1306 genes or gene fragments based on new RNA-seq based gene annotations [20]. Among these, we identify 66 whole gene duplications, 76 chimeric genes, and 30 cases of recruited non-coding sequences that might potentially contribute to de novo gene formation.

Scarce support for the dosage hypothesis

One commonly proposed source of adaptive variation suggests tandem duplications may cause two-fold changes in transcript levels, resulting in quantitative phenotypic change via “gene dosage” [21, 15, 22, 6]. This “dosage” hypothesis offers one putative genetic mechanism for immediate evolutionary change prior to pseudogenization and loss. However, we observe scarce support for changes in RNA levels within tissues in response to duplication using both quantile normalized expression data (Fig 1, S1 Fig) and FPKM normalized expression data (P ≥ 0.37; S2 Fig). Using the Tophat/Cufflinks differential expression testing suite, we assayed 52 whole gene duplications (including UTRs) that had gene models that passed cuffdiff quality filters. In every tissue, the number of genes with significantly increased expression levels compared to the reference strain was not significantly different from genome wide expectations (S1 Table). In all of these cases, expression levels did not reflect additive two-fold changes in expression levels but rather indicated much greater fold change (S3 Fig, S2 Table). When we require at least 1 kb of upstream and downstream sequence, we do not observe any evidence of additive changes in gene expression. This is equally true when restricting duplications to cases where reference expression level is FPKM≥ 2. Cufflinks is fully capable of detecting low level changes in gene expression [23]. The whole gene duplications with upregulated expression here are associated with several different functions with no clear functional enrichment. Functional categories represented among whole gene duplications include testes expressed endopetidases, a metalloendopeptidase, a chorion protein, and two metabolism genes: sorbitol dehydrogenase, giberellin oxidase (S3 Table). However it is not clear that any of the high-magnitude expression changes observed at whole gene duplications are the product of duplication. High frequency duplications may be older and have secondary modifications on expression levels. They may also be filtered by selective pressures in comparison with low frequency duplications, possibly weeding out genes with expression changes. We examined 33 singleton variants that are expected to reflect primarily newly formed duplications, including detrimental (but not lethal) variants. Qualitatively, results remained unchanged, with no significant excess of expression changes for whole gene duplications (S4 Table). We additionally find no statistical support for increases in gene expression due to duplication in any of the four tissues, even when comparing mean-fold change using only whole gene duplications that have been validated using PacBio long molecule sequences (P ≥ 0.2). This comparison indicates that the results are not driven by false positives. Thus, there appears to be little support for this gene dosage hypothesis for duplicate genes in adult tissues.

Fig 1. Mean fold change in sample strains vs. reference for strains containing chimeras or whole gene duplicates (red) and unmutated sample strains for the same regions (grey) in A) ovaries B) female carcass C) testes D) male carcass.

Chimeric genes are more likely to result in high mean fold change than unmutated counterparts in all tissues. Whole gene duplicates create multifold expression changes more rarely.

One hypothesis for the lack of increased expression is that silencing of additional copies via secondary mutations might subdue expression changes produced by whole gene duplication. We identified 52 whole gene duplications with at least one ‘heterozygous’ SNP mutation present that might differentiate duplicate copies based on genomic sequencing. We filtered out SNPs that display asymmetric expression in non-duplicate strains, which would indicate allele-specific expression independent of duplication. This leaves a remaining 11 candidates that might represent asymmetric expression of duplicate genes in at least one tissue (S5 and S6 Tables), though the possibility of allele specific expression at a single locus cannot be ruled out. These numbers represent a minority of whole gene duplications. Thus, we conclude that whole gene duplication with dosage-sharing is common, even if asymmetric expression cannot be excluded.

Gene expression changes from alternative gene structures

In light of these surprising results, we determined to take a closer look at the expression impacts of these tandem duplications, especially alternative gene structures beyond whole gene duplication. Chimeric gene structures, gene fragments, and cases of recruited non-coding sequence all reflect partial gene changes, not present in reference GFF files. Precise breakpoints for most tandem duplications cannot always be determined [17] even with high confirmation rates in PacBio long molecule data. To identify more detail with respect to changes in gene expression for alternative gene structures whose precise breakpoints remain unresolved, we developed a hidden Markov model to identify changes in gene expression for individual sites in the genome. This HMM allows for differential expression testing for segments of chimeric genes, gene fragments, and cases of recruited non-coding sequence. The method is agnostic with respect to size of genetic constructs assayed and it does not require perfect knowledge of duplication breakpoints, in contrast with standard differential expression testing software. To establish a baseline for comparison, we used the HMM to identify gene expression changes at whole gene duplications. In total, a maximum of 5 out of 66 whole gene duplications that capture both UTRs display signals of increased expression for 50% or more of total exonic sequence (S3 Fig; Table 1) whereas the majority of genes remain unchanged (e.g. GE18452, Fig 2). Most promoters in Drosophila lie within 50 bp of gene sequences [24]. Restricting whole gene duplications to cases where 100 bp of upstream and downstream of both UTRs where the promoter is likely to be captured, 5 out of 58 sequences display expression changes. Both with and without upstream regions the likelihood of upregulation is not significantly different from the background rate of 5.26% as estimated from HMM-identified upregulated sites genome wide (S7 Table; , P = 0.7787; binomial test , P = 0.2324). The HMM used to identify expression differences is fully capable of detecting 2x expression changes (S4 Fig), suggesting that the lack of genes with expression changes is not solely due to a lack of power. Both the number of whole gene duplications identified as upregulated and the background rates of upregulation are lower than results from cuffdiff, but both methods suggest that whole gene duplication is not associated with additive increases in expression where two copies of a gene produce a greater number of transcripts. Only one gene is identified as upregulated in male carcass, and this locus also exhibits upregulation in female carcass. Hence, it is unlikely that the use of paired end reads in male tissues has a strong influence to produce higher power in the HMM. No gene ontology functions are overrepresented among the five genes (S3 Table).

Fig 2. Chimeric gene structures result in novel expression patterns.

A tandem duplication that does not respect gene boundaries unites the 5′ end of GE18453 with the 3′ end of GE18451 to produce a chimeric gene on chromosome 2L. Plot shows quantile normalized coverage in RNA seq data for sample (red) and reference (grey) with HMM output (blue) on chromosome 2L for female carcass. The chimera displays a change in transcript levels, while transcript levels for parental gene sequence are not altered. Sites with upregulated or downregulated sequence as defined by HMM output is shown in blue, using the right axis. HMM state calls for sites with unchanged expression are not shown. The region spanned by the tandem duplication is shaded in grey. The region spanned by the chimeric gene shows high-level upregulation. The whole gene duplication of GE18452 does not display a significant change in mRNA levels but rather falls within the bounds of expression profiles for reference replicates (Ref FPKM = 19.9; Sample FPKM = 24.5; uncorrected P = 0:52; corrected P = 1:0).

We observe one case where a duplication followed by a secondary deletion (S5 Fig) [17], has resulted in upregulation of a gene fragment only at the modified locus, not the faithfully copied parental gene, showing that complex mutations can produce regulatory changes when RNA-level is unaltered at the unmodified paralog (Fig 3). Coverage from whole genome Illumina sequencing libraries of genomic DNA [17] shows a two-fold to three-fold increase in coverage for the portion of the duplicated segment not affected by the deletion, indicating that this segment is not multi-copy to a level that would explain the observed expression change (S5 Fig). Tandem duplications that do not respect gene boundaries can also create chimeric gene sequences via exon-shuffling [25] (S6A Fig). In contrast to whole gene duplications, chimeric gene structures often result in expression changes. Among the 15 lines we identified 76 chimeric genes arising from tandem duplication. Of these a total of 24 chimeras display increased expression for 50% or more of exonic sequence within the duplicated gene segment (either 5′ or 3′). These numbers are significantly different from random expectations given a background rate of 5.26% (binomial test , P = 5.16 × 10−13). The high mean fold change across all sites captured in chimera formation indicates high levels of upregulation independently from HMM results regardless of the tissue assayed (Fig 1).

Fig 3. Duplication followed by secondary deletion, as indicated by a total of 104 long-spanning read pairs, leads to an expression change in a gene fragment of GE21202 on chromosome 3L.

Plot shows normalized coverage in RNA seq data for sample (red) and reference (grey) with HMM output (blue) on chromosome 3L. Only the sample strain with the deletion shows such upregulation. Transcript levels increase by greater than two-fold, beyond changes that would be produced by additive changes in gene dosage. Sites with upregulated or downregulated sequence as defined by HMM output is shown in blue, using the right axis. HMM state calls for sites with unchanged expression are not shown. HMM output for upregulated regions match well with the predicted gene structures formed by this complex mutation. The region spanned by the tandem duplication is shaded in grey.

These changes in gene expression are not consistent with additive effects of gene dosage, but rather reflect gene upregulation above two-fold changes due to the shuffling of regulatory elements in 5′ and 3′ segments of the gene. Plots of RNA-seq coverage and HMM output for these regions reflect the changes in gene structure, with only regions matching to chimeras exhibiting expression changes, not parental genes (Fig 2). These results suggest that expression changes are a direct product of chimera formation, not of environmental variation or secondary mutations that alter gene expression. Even with substantially less stringent criteria allowing for any expression change at least 50 bp in length, chimeric genes have a larger percentage of expression effects than whole gene duplications, an indication that the greater number of upregulated chimeras is not the product of gene sequence length (S8 Table). Thus, we suggest that chimeric constructs and other complex mutations that shuffle regulatory elements commonly alter expression. Therefore, they are likely to be a force that can produce immediate and drastic changes in RNA levels. In contrast, whole gene duplications rarely produce expression effects in adult gonads and soma studied here. Tandem duplications that form chimeric genes are more likely to be found at lower frequency in comparison to whole gene duplications (Wilcoxon rank sum test W = 2452.5, P = 0.03881), suggesting predominantly detrimental impacts. However, chimeras have been shown to be more likely to show signals of selection favoring their spread in natural populations [14]. The observed role of chimeric genes as mutations that can produce non-neutral impacts, especially in comparison to whole gene duplications, is at least partially explained by their ability to produce large magnitude changes in gene expression.

Recruitment of non-coding sequence and de novo gene origination

In addition to chimeric gene structures, duplicated gene fragments that capture the 5′ portion of a transcript have the potential to activate neighboring sequences that were previously untranscribed, thereby creating the potential for de novo genes (S6B Fig). We observe signs consistent with putative de novo gene origination through the combination of 5′ gene sequences with untranscribed regions during tandem duplication. We observe 43 cases of putative recruited non-coding sequence, 15 of which do not inherit a start codon from the parental gene. Among tandem duplications, we observe 30 cases associated with activation of transcription in neighboring regions that were previously untranscribed. These new genes are typically associated with duplication within a transcript or through the union of a 5′ UTR and neighboring non-transcribed sequence (Fig 4, Table 1).

Fig 4. Tandem duplication creates a de novo gene on chromosome 3R.

The 5′ end of GE24349 is duplicated and placed adjacent to formerly untranscribed sequence, producing transcription and putative de novo gene creation. The reference strain does not show transcription in the region (grey) and no other sample strain exhibits upregulated sequence across the region. Sites with upregulated or downregulated sequence as defined by HMM output is shown in blue, using the right axis. HMM state calls for sites with unchanged expression are not shown. The region spanned by the tandem duplication is shaded in grey. The tandem duplication activates a previously untranscribed region from roughly 14703500–14705000 bp. There is also upregulation in some exons for GE24349, possibly indicating a longer fusion transcript that reads through to the end of the nearest adjacent 3′ UTR.

In the absence of information about genome structure these will appear to be de novo gene creation, but with clearly defined boundaries of tandem duplications we can clarify that shuffling of 5′ segments of transcripts is one potential mechanism for activation of previously untranscribed regions. Among these putative cases of de novo activation, 23 are identified in the testes (Table 1), consistent with the out-of-the-testes hypothesis observed for new genes [26, 18]. The mean size of these de novo expressed regions is 385 bp, with no evidence of significant size differences across tissues (F = 0.798, df = 2 P = 0.458; S9 Table). For single transcripts, however, there can be variation in length across tissues, possibly reflecting isoform switching across tissues or general imprecision (S9 Table). Reference genome expression level for parental genes that contribute to de novo gene formation are given in S10 Table. These results offer one potential molecular mechanism to explain previously observed de novo gene origination, which is expected to have widespread results on evolution of new genes [27] and potential contribution to disease. Given the large number of sequences identified in such a small fraction of the genome that is spanned by tandem duplications, we would suggest that tandem duplicates can be a powerful force for new gene creation and neofunctionalization as well as contributors to pathogenic misexpression. While the predominant fate of new proto-genes is eventual loss [13, 7, 9, 28], such variants are expected to contribute a steady stream of new transcripts.

Duplication of ancestrally carcass-expressed genes

To determine whether ancestral expression patterns of genes influence their propensity for tandem duplication, we compare genes that are captured by duplications with those that are not. Three replicates of the D. yakuba reference were previously assayed for differential expression across tissues [20]. These reference strains contain none of the tandem duplications described here and should reflect the unmutated ancestral state. Among genes captured by duplications, 195 are biased toward ovary in the ancestral state whereas 345 are biased toward female carcass based on comparisons of ovary vs. carcass. In male somatic and germline comparisons, 168 genes captured by tandem duplication are biased toward testes in the ancestral state, and 131 are biased toward the male carcass. Based on resampling of genes in the reference, there is an excess of genes with biased expression toward female carcass (one-sided P < 10−4) and a deficit of genes that are duplicated with biased expression toward the ovaries in the ancestral state (one-sided P = 0.002). In males we observe an excess of genes that are duplicated with biased expression toward the carcass (one-sided P = 0.0029) but no bias with respect to testes expressed genes (one-sided P = 0.1443). Genes that duplicate have higher expression level in reference strains in every tissue (Fig 5,S11 Table), pointing to the potential for biases in tandem duplicate formation or putatively selection to retain genes. Tandem duplications that are present only in 1 or 2 sample strains are expected to be newly formed, with little room for selection to bias relationships. When we limit analyses to rare variants present only in 1 or 2 sample strains, the excess of expressed genes is equally true (S12 Table), suggesting that biases in formation toward transcribed regions certainly contribute to a large portion of the expression difference for duplicated sequence.

Fig 5. Expression levels (in FPKM) for unduplicated ancestral state for three D. yakuba reference replicates for genes that are duplicated in sample strains compared to expression levels for all genes.

FPKM values are indicative of ancestral expression patterns prior to duplication. Duplicated genes have higher mean and median ancestral expression compared to non-duplicated genes in female tissues (A) and male tissues (B). Genes that are duplicated have lower median expression in ovary compared to carcass in females (A) but there is no difference in expression in reproductive vs. somatic tissue in males (B). Plots shown exclude outliers.


Little evidence of expression differences due to whole gene duplication

One hypothesis to explain the phenotypic impacts of duplicate genes is that changes in transcript levels due to gene copy number result in novel phenotypes [6]. In contrast to these common assumptions about the molecular impacts of tandem duplications, we observe little evidence for increased expression in response to duplication, with 7.6% or fewer duplicated genes showing evidence for increased expression in each tissue. These numbers are not significantly different from the random expectation based on the frequency of upregulation across the genome as a whole (S1 Table). Results based on the HMM which uses site specific criteria show qualitatively similar results, with no enrichment for expression differences compared with background rates. The concordance with genome wide background rates points to the possibility of secondary mutations modifying expression or environmental effects on gene expression in spite of controlled growth conditions. Similar expression buffering has been observed for deleted genes and for ubiquitously expressed duplicated genes (but not for non-ubiquitously expressed duplicates) for large chromosomal abnormalities in a small number of Drosophila mutants [29]. Ubx deletions often exhibit buffered phenotypes [30]. The results described here suggest that these early results for lab mutants directly reflect patterns that can be observed in natural populations.

The observed lack of expression changes is consistent with previous results showing that expression changes at CNVs are not commonly targets of natural selection [31]. Furthermore, many such expression changes appear to be qualitative changes that are not compatible with the notion that duplication commonly results in two-fold increases in expression. The majority of genes show no evidence for asymmetrical expression of duplicates, suggesting that dosage sharing is common. These results are compatible with the hypothesis that many genes are subject to tight regulatory control and that transcription is not the limiting factor in protein production for many genes. Alternatively, it may be that promotors and full transcripts including UTRs are not sufficient to drive gene expression, implying strong cis-regulatory effects beyond the promoter. Together, these results suggest that the phenotypic impacts of tandem duplications are more complex than additive changes in transcript abundance due to copy number. Previous work has suggested that selection to maintain total expression levels across ohnologs might lead to expression subfunctionalization [32]. If the majority of genes are subject to feedback loops in whole genome duplication as we observe for whole gene duplications here, it might partially explain these results. Rather than genes increasing expression due to additive changes, then having to evolve back toward lower levels, we would suggest that genes initially might be held at that same constant level through regulatory feedback loops.

Similarly low rates of expression changes for CNVs in humans [33] and rodents [34] imply that these results are likely to be general across many organisms. In humans, copy number changes are associated with a large number of diseases. For some genes, especially those where relative dosage is more likely to matter, the phenotypic and selective impacts may be different and we might expect to see different patterns for this small minority of genes [35, 22, 6]. Pesticide resistance genes have been reported to have changes in gene dosage after duplication (reviewed in [6]). The most highly expressed genes, which may be more likely to be transcription limited may be more likely to exhibit such expression changes from gene dosage. One of the most highly expressed genes in Drosophila is Adh. Recent transgenic experiments using the highly expressed gene Adh show transcription levels respond in response to higher copy number [36]. At the moment we do not have sufficient duplications for very highly expressed genes to test this hypothesis in a meaningful way. Further experiments to assay differences in expression levels for various classes of genes will be necessary to determine the relationship between expression levels and regulatory changes produced by duplication. Hemizygous deletions in D. melanogaster suggest that expression effects for many genes are mediated by robust regulatory architecture, but with larger effects from copy number reduction in the most highly expressed genes [37]. Ohnologs, retained in the genome after whole genome duplication, also appear to be more sensitive to copy number changes than general CNVs, suggesting qualitative differences in certain gene’s responses after copy number changes [38]. If similar phenomena affect single copy number variants, there may be classes of genes that behave differently from the majority of genes. The whole gene duplications with upregulated expression here encompass diverse functional roles, including a testes-expressed endopeptidase, metabolism peptides, and a chorion protein. Yet, given the rarity of regulatory changes due to increases in gene copy number presented here, we suggest that alternative mechanisms are necessary to explain the role tandem duplications play in generating pathogenic phenotypes [39].

Recent work has found some evidence for increases in expression at CNVs, in contradiction with the data presented here [40]. However, this previous study also found that two-fold upregulation is rare [40]. The two datasets identify duplications using different bioinformatic methods with different false positive rates and potentially with different accounting for precise breakpoints of duplications [40]. D. melanogaster CNVs may also include dispersed duplications as well as tandem duplications. If these two classes behave differently from one another, it could potentially yield differences in the regulatory consequences observed. It is also possible that their filters only for highly expressed genes focus on genes that are more likely to be limited by transcription. The molecular and statistical methods used to establish regulatory changes also differ between the two manuscripts. Here, we have used RNAseq data, with cufflinks, an HMM, and comparisons of mean fold change between sample strains and the unduplicated reference for adult gonads and soma. Cardoso-Moreira et al. [40] uses a permutation test on microarray data comparing duplicated lines to a pool of sample strains using whole adult males. Finally, it is possible but unlikely that these differences in observations may represent species-specific differences in the regulatory consequences of gene duplications. It is unclear which of these explanations may ultimately clarify the discrepancy between the results for D. melanogaster and the data presented here for D. yakuba.

Regulatory novelty from exon shuffling

In contrast with unaltered expression patterns among whole gene duplications, chimeric genes, UTR shuffling, and recruitment of non-coding sequence often produce changes in expression with extreme up-regulation. These variants are polymorphic, and expression effects are seen even among genes at low frequency in the sample, suggesting that many of these constructs are very young with little time to accumulate secondary mutations that might explain patterns observed. Furthermore, such changes in gene expression reflect the chimeric and fragmented gene structures produced, indicating that they are the direct product of chimera formation, not environmental effects or other spurious signals. Regulatory modules for genes can be complex, with promoters and enhancers located at 5′ or 3′ ends of genes. Additionally, transcripts may carry motifs or secondary structures that are part of regulatory feedback loops via degradation pathways [41, 42]. Because chimeric genes shuffle the 5′ and 3′ ends of gene sequences, they can recombine diverse regulatory elements to generate novel expression patterns. Similarly, gain or loss of regulatory elements for gene fragments or genes that recruit non-coding sequences could produce novel combinations, resulting in altered transcript levels. Here, we observe a regulatory novelty in chimeric constructs, analogous to novel combinations of functional domains that result from exon shuffling [43, 44, 25]. This regulatory novelty may explain one mechanism to generate immediate regulatory divergence between tandem duplications that can contribute to genome evolution and population level variation.

One hypothesis to explain the evolution of network structure after whole gene duplication involves loss of expression or interaction after polyploidy [45]. However, we have found that upregulation, not silencing, is a common result of tandem duplication, indicating that such results reflect either major differences between polyploidy and gene expression or that present interaction and expression information does not perfectly reflect ancestral states. Previous results have suggested that duplications produce dosage changes in transcript levels [21, 15, 6]. However, such results are likely the product of limited ability to detect tissue-specific changes in whole adult flies, with no tissue level resolution (for associated data description [46, 47]). Separation of tissues is critical to establishing effects on gene expression, as upregulation in a single tissue that is only a fraction of the biomass will give a false signal of minor expression changes. Given the limited effect of gene copy number for whole gene duplication and the extreme expression changes associated with alternative gene structures, we suggest that such additive models of duplicate gene evolution do not reflect the full complexity of regulatory pathways or the fundamental nature of mutation.

We have observed regulatory changes and misexpression of gene fragments as a product of chimera formation, recruitment of non-coding sequence, and deletions that proceed rapidly after duplication to create variants with unusual gene structures. De novo proto-genes are commonly found in subtelomeric regions in yeast [28] and changes in genome structure are common in these regions as well [17] possibly explaining a portion of the pattern. One mechanism for origination of de novo genes that has been proposed is antisense transcription from divergent promotors [48, 49]. These results offer a second mechanism that relies on canonical promoters, transcription start signals, and translation start signals with genome shuffling to serve as drivers of new gene sequences. These newly originated exons outside annotated gene sequences have a mean length of 385 bp. These are slightly shorter than previous assays of de novo genes [27], although these numbers do not include length of copied gene fragments.

We observe no clear evidence of divergent promoters generating new genes at the tandem duplicates surveyed here, suggesting that the two mechanisms operate independently to serve as sources of new gene sequences. Many of the de novo transcript sequences that are newly formed may have abnormal translation products, and most new genes that form are expected to be eventually lost [28]. However, a portion of such new proto-genes can be modified by selection to form fully functional genes [28]. Thus, the tandem duplications described here are expected to serve as a steady source of new gene sequences, and a minority of these are expected to be sources of novel functions [50, 28, 13, 14, 51, 52, 53, 27]. RNA-seq based annotations in D. yakuba have identified 1340 lineage specific genes based on the D. yakuba reference, which do not have orthologs in other Drosophila genomes [20]. The observed high rates of de novo gene formation are likely to explain a significant portion of this signal.

Previous work has found qualitatively similar results for small numbers of genes and such mutations have potential to cause other types of qualitative changes in gene regulation beyond the limited amount captured in the current study. Chimeric genes can produce differences in presence or absence of transcripts in tissues or timepoints [14, 13], and a synthetic lab-generated chimera produces differential regulation in spatial patterning of hox gene expression during development [12]. Although differing methods of regulatory feedback mechanisms in mammals might be thought to render different effects, there are three case studies of chimeric gene formation in humans associated with expression changes, suggesting that the phenomenon deserves more careful study in human datasets. First, a chimeric gene that forces novel expression in the brain is associated with schizophrenia in humans [11]. Second, a newly formed chimeric gene is known to have novel expression in human testes [54], suggesting that these results are likely to be generally applicable to studies of human health. Finally, one known case of de novo gene origination through chromosomal rearrangement is know to have formed a new testis-expressed gene in humans [55]. Our data strongly suggest shuffling of modular genomic units can be a powerful force to develop novel regulatory profiles or unique expression patterns that has not been fully explored. We therefore suggest that these genes with altered transcription patterns are a prime source for genetic novelty, immediate neofunctionalization, and genes with widespread potential for non-neutral effects well deserving of future study in model and clinical systems.

Mutations of major effect

Young whole gene duplications are expected to be highly similar and modification of amino acid sequences through point mutations can take many generations. Barring changes in transcript dosage, these new faithfully copied whole gene duplications are unlikely to have extreme and immediate phenotypic effects. Mutations that shuffle UTRs, recruit non-coding sequence, or combine separate coding sequence can produce regulatory changes and protein sequence changes immediately upon formation and a priori are more likely to produce phenotypic effects. Although many such effects are likely to be pathogenic [39, 56, 57, 19, 58, 59, 60], they may often be adaptive as well [13, 14, 51, 52, 53]. Indeed, chimeric genes that combine segments of two or more coding sequences are more likely to be involved in selective sweeps immediately after formation in comparison to whole gene duplications and are a richer source of genetic novelty [14]. Because many of these variants capture only portions of gene sequences [17], high-throughput use of gene models in reference strains will underreport expression differences, thereby missing a large portion of variation in gene expression that could potentially explain phenotypic variation. The use of gene-model free expression testing in high coverage data, as we have presented here, offers greater power to assay gene expression changes at abnormal gene structures and could have important impacts even in organisms outside Drosophila. Similar approaches can readily complement standard differential expression testing software to gain additional information in studies for the genetic basis of adaptation, quantitative genetics, and studies of pathogenic phenotypes.

We have previously described large numbers of deletions that appear rapidly after duplication [17] which here are found to be associated with expression changes. CNV identification methods that do not account for secondary deletions, or that cluster all putatively duplicated loci too broadly thereby misidentifying breakpoints will lose important information with respect to gene structure. Such missing information can have a detrimental impact on the ability to correctly identify variation, associated expression effects, and regulatory changes associated with gene fragmentation. Although common CNVs at a frequency ≥10%, which are well tagged by SNPs, are unable to explain missing complex trait and disease heritability in humans [61] the majority of tandem duplicates described here appear to be at low frequency and tandem duplicates modified by secondary deletions will be rarer still [17]. Especially given the difficulties of identifying variants where linked SNPs are more common than causative mutations [62], the inability to identify modified duplicates may explain some portion of failure to identify causative variants or eQTLs in GWAS and other clinical studies [39, 58]. Here, the precision that is available in Drosophila allows greater resolution than has been previously provided in non-model systems, allowing inferences concerning the nature of mutation that are well worth exploring in future studies of phenotype and disease in more complex genomes, including humans.

Ancestral expression patterns of duplicated genes

We observe elevated ancestral expression level in the unduplicated reference strain for genes that are captured by duplications in at least one sample strain, suggesting that genes that are originally highly expressed are more likely to be associated with duplications (Fig 5, S11 Table). Even limiting the genes surveyed to genes that are identified in only one or two strains, expression still appears to be elevated above the genome wide background (S11 Table). Thus, we suggest that genes that duplicate are more likely to be expressed or are more highly expressed in the unduplicated ancestral state compared to the genome wide average. This pattern is observed in male and female somatic and reproductive tissues as well as low-frequency variants, making it unlikely that selection on a single functional category or gene family is responsible for the duplication of transcribed genes. Although mutations in somatic tissue are not passed on to progeny, we observe high expression for duplicate genes in somatic tissue. It is possible that selective pressures or correlations between expression in germ line and soma could lead to such results.

Tandem duplications can form through several mechanisms, including replication slippage, ectopic recombination, aberrant DNA break repair, and non-homologous end joining. Transcription-coupled repair and the avoidance of repair in regions bound by nucleosomes is commonly invoked to explain mutational patterns for SNPs in mammals and yeast [63, 64]. However there is no strong evidence for such transcript coupled repair in Drosophila [65, 66]. Genes that are transcribed are often members of open chromatin, and it is possible that the correlation between actively transcribed genes and chromatin states might promote greater recombination and repair and thereby explain the excess of transcribed genes among tandem duplications. We observe equal levels of upregulation for chimeric gene segments in female germline as in male germline, but lower fold-change in the testes (Fig 1). Because many genes are already expressed in the testes, chimeric portions which are already highly expressed are less likely to show high level upregulation under a scheme of non-additive expression effects from shuffling of regulatory elements. Similarly, widespread transcription of parental genes in the ancestral state rather than selection is likely to explain the overabundance of novel gene expression we observe in the testes due to a simple abundance of testes-driving promoters. This widespread transcription may be due to spurious, non-functional transcription in the testes, which combined with tandem duplication can be a fortuitous but powerful source of new genes.


Identifying tandem duplications and gene expression changes

We identified tandem duplications using paired-end Illumina genomic sequencing, as previously described [17]. Briefly, tandem duplications were defined by three or more divergently oriented read pairs that lie within 25 kb of one another. We excluded duplications indicated with divergent read pairs in the reference strain, which are indicative of technical challenges or reference mis-assembly. We also excluded duplicates which were present in D. erecta, resulting in a high quality data set of newly derived tandem duplications that are segregating in natural populations. Duplications were clustered across strains within a threshold distance of 200 bp and the maximum span of divergently oriented reads across all strains were used to define the span of each duplication. We then identified gene sequences captured by tandem duplications using RNA-seq based gene models previously described in Rogers et al [20].

RNA-seq samples were prepared from virgin flies collected within 2 hrs. of eclosion, then aged 2-5 days post eclosion before dissection. We dissected ovaries and headless carcass for adult females, and testes plus glands for adult males. Samples were flash frozen in liquid nitrogen and stored at -80℃ before extraction in trizol. Illumina sequencing libraries were prepared using the Nextrera library preparation kit, and were sequenced on an Illumina HiSeq 2500. Fastq data were aligned to the D. yakuba reference genome using Tophat v.2.0.6 and Bowtie2 v.2.0.2 [67]. Site specific changes in gene expression were determined using a Hidden Markov Model that implements the underlying statistical model of the Cufflinks suite [23]. Sequence data are available in the NCBI SRA under PRJNA269314 and PRJNA196536. Code is available at

Sample preparation and RNA-sequencing

We gathered RNA-seq data for 15 samples and the reference genome (S13 Table). Fly stocks were incubated under controlled conditions at 25℃ and 40% humidity. Virgin flies were collected within 2 hrs. of eclosion, then aged 2-5 days post eclosion before dissection. We dissected samples in isotonic Ringers solution, using female ovaries and headless gonadectomized carcass from two adult flies as well as testes plus glands and male headless gonadectomized carcass for four adult flies for each sample RNA prep. We collected three biological replicates of the D. yakuba reference, and one replicate per sample strain for 15 samples of D. yakuba. Samples were flash frozen in liquid nitrogen immediately after dissection, and and stored in 0.2ml Trizol at -80℃. All samples were homogenized in 0.5ml Trizol Reagent (Invitrogen) with plastic pestle in 1.5ml tube, mixed with 0.1ml chloroform, and centrifuged 12,000g 15min at 4°C, as Trizol RNA extraction protocol. The RNAs in the supernatant about 0.4ml were then collected and purified with Direct-Zol RNA MiniPrep Kit (Zymo), followed the protocol. The total RNAs were eluted in 65μL RNase-Free H2O. About 1μg purified RNAs were treated with 2μL Turbo DNase (Invitrogen) in 65μL reaction, incubated 15min at room temperature with gentle shaking. These RNAs were further purified with RNA Clean and Concentrator-5 (Zymo). One extra wash with fresh 80% ethanol after the final wash step was added into the original protocol. The treated RNAs were eluted with 15μL RNAse-Free H2O, and stored at -80℃.

The amplified cDNAs were prepared from 100ng DNase treated RNA with Ovation RNA-Seq System V2 (Nugen) and modified protocol. The preparations followed the protocol to the step of SPIA Amplification (Single Primer Isothermal Amplification). The amplified cDNAs were first purified with Purelink PCR Purification Kit (Invitrogen, HC Binding Buffer) and eluted in 100μL EB (Invitrogen). These cDNAs were purified again to 25μL EB with DNA Clean and Concentrator -5 Kit (Zymo) for Nextera library preparation. About 43ng cDNAs were used to construct libraries with Nextera DNA Sample Preparation Kit (Illumina) and modified protocol. After Tagmentation, Purelink PCR Purification Kit with HC Binding Buffer was used for purification and eluted with 30μL EB or H2O. The products (libraries) of final PCR amplification were purified with DNA Clean and Concentractor-5 and eluted in 20μL EB. The average library lengths roughly 500bp were estimated from profiles of Bioanalyzer (Agilent) with DNA HS Assay. All libraries were normalized to 2-10nM based on real-time PCR method with Kapa Library Quant Kits (Kapa Biosystems). The qualities and quantities of these RNAs, cDNAs and final libraries were measured from Bioanalyzer with RNA HS or DNA HS Assays and Qubit (Invitrogen) with RNA HS or DNA HS Reagents, respectively. Samples were barcoded and sequenced in 4-plex with 76 bp reads on an Illumina HiSeq 2500 using standard Illumina barcodes, resulting in high coverage with thousands of reads for Adh, the most highly expressed gene in Drosophila (S7 Fig). We sequenced one replicate per sample strain as well as three biological replicates of each reference strain for all tissues. Female tissues for sample strains and one replicate of the reference genome were sequenced with single end reads, while two replicates of reference genome female tissues and all male tissue samples were sequenced with paired end reads.

Reference expression patterns

Expression patterns in the reference genome, indicative of the ancestral, unduplicated state, were established according to Rogers et al. [20]. Briefly, sequences were mapped to the genome using Tophat v.2.0.6 and Bowtie2 v.2.0.2, using reference annotations as a guide, ignoring reads which fell outside reference annotations (-G). We estimated transcript abundances and tested for differential expression at an FDR ≤ 0.1 using Cuffdiff from Cufflinks v. 2.0.2 with quantile normalized expression values (-N), again using only reads which aligned to annotated gene sequences. All other parameters were set to default. We compared female ovaries to female carcass and male testes to male carcass for the reference strain replicates to determine tissue biased expression prior to duplication. Overrepresentation and underrepresentation of genes with tissue biased expression were established by resampling 10,000 replicates of randomly selected genes.

Duplicated gene sequences

We used gene models developed from RNA-seq guided reannotation of the D. yakuba reference genome [20]. The maximum span of divergently oriented reads was considered the bounds of duplication, similar to previous analysis [17] using FlyBase gene models [16]. These revised gene models include 5′ and 3′ UTRs, and are essential to correctly establish the effects tandem duplicates will have on gene structures. These revised gene models show greater concordance with D. melanogaster, resulting in an additional 1000 D. melanogaster genes with an ortholog in D. yakuba compared to previous gene annotations [20]. We additionally identify 1340 lineage specific genes in D. yakuba, hundreds of which display expression bias across tissues [20].

Differential expression testing using cuffdiff

Sequences for each reference replicate and barcoded sample strain were mapped to the genome using Tophat v.2.0.6 and Bowtie2 v.2.0.2, using reference annotations [20] as a guide on the D. yakuba r1.3 reference genome, ignoring reads which fell outside reference annotations (-G). We estimated transcript abundances and tested for differential expression in an all-by-all comparison at an FDR ≤ 0.1 using Cuffdiff from Cufflinks v. 2.0.2 with quantile normalized expression values (-N), again using only reads which aligned to annotated gene sequences with all other parameters set to default. Reference replicates were grouped for differential expression testing in Cuffdiff. For each tissue the total number of duplications displaying increases in expression for whole gene duplication and for background rates based on expression changes for unduplicated genes were compared using a chi-squared test with 1 degree of freedom.

Test of dosage-sharing

One hypothesis for the lack of gene expression changes among whole gene duplications is that secondary mutations might result in asymmetric silencing of one duplicate copy. If duplicate copies have differentiated from one another, this should be apparent in large numbers of seemingly heterozygous sites in the genomic SNP data. To test for differential expression among copies of whole gene duplication, we identified all putatively ‘heterozygous’ sites that might indicate differentiating SNPs across copies. Using samtools mpileup (v. 1.3) and bcftools consensus caller (v.1.3) with parameters set to default, we identified all putatively heterozygous sites in the genomic sequences for each strain. We then generated SNP calls using identical criteria for RNA sequencing data. The number of reads supporting heterozygous calls for the reference sequence and SNP sequence were then compared using a Fisher’s exact test. Only SNPs with at least 10 reads covering the site in both genomic and RNA sequencing datasets were used for differential expression testing. Sites which exhibited significant differential expression of SNPs in at least one strain that housed a duplication were considered candidates for differential expression of duplicate copies. Similar signals could be produced by allele specific expression even at unduplicated sites. We filtered out all sites that displayed such allele specific expression in strains that did not contain the duplication in question, as these are unlikely to reflect processes specific the duplication.

HMM for expression patterns

Coverage in mapped RNA-seq data per site for each strain was calculated using samtools depth. Sample strains show variable FPKM based on cuffdiff analysis (S8 and S9 Figs), which might potentially influence power to detect differential expression. To reduce the influence of coverage differences across samples and generate more robust expression calls [68], we quantile normalized each chromosome in R so that coverage per site across all strains has the same mean and variance for a given chromosome in a given tissue. Mean quantile-normalized coverage among regions corresponding to annotated exon sequences was 61 X. This quantile normalized coverage depth per site was used as input for a Hidden Markov Model (HMM) to identify site specific changes in gene expression, offering differential expression testing independent of gene models and exon annotations. This gene-model free expression testing is essential for discovering the regulatory impacts of complex mutations such as chimeric genes, recruited non-coding sequence, and duplication-deletion constructs all of which do not respect gene boundaries. This HMM also performs comparative hypothesis testing, choosing the most likely expression state for each site, rather than simply testing adherence to a null statistical model, an important methodological advantage.

The HMM attempts to identify three underlying states: decreased expression, stable expression, and increased expression. Initial state probabilities were set according to π0 and transition probabilities were set according to T, where row and column indices 0, 1, 2 are indicative of decreased, stable, and increased expression, respectively. Initial probabilities are set such that the singleton state is initially most likely and states are initially most likely to remain constant during transitions.

Very low transition probabilities can have a chilling effect on output of HMMs, which might potentially bias results away from detecting expression changes, a major hypothesis that is tested in the current work. However, results with alternate transition matrices defined by the Baum-Welch algorithm do not differ qualitatively from those presented in the main text (S14 Table). This is equally true for de novo genes.

Emission probabilities were modeled as follows: We compare the ratio of quantile normalized coverage per site for each sample strain to the mean for the three reference replicates. We assume the natural log of the fold change is normally distributed. Under a null model of no expression change, we can assume mean and variance in the sample will be equal to the mean and variance in the reference replicates, and use the delta method to approximate the variance, a common method of variance estimation in differential expression testing [23]. Under such an approximation, the variance of the natural log of the fold change is equal to where σ2 is the observed variance in quantile normalized coverage for the reference variance and μ is the observed mean quantile normalized coverage in the reference replicates. For stable expression, the distribution of the natural log of the mean fold change should be centered about 1, corresponding to no expression difference.

For increased expression we again assume a normal distribution for the log fold change, but assuming a true mean quantile normalized coverage at the upper critical value of the distribution under no difference in gene expression. For decreased expression we again model the log fold change as a normal distribution, but assume a true mean of quantile normalized coverage at the lower critical value of the distribution under no difference in gene expression. We model the likelihood of the data given no change in expression as the probability of a test statistic with an absolute value as large or larger than the observed, given a normal distribution of the log mean fold change. For sites with increased expression, we model emission probabilities as the probability of a test statistic at least as high as that observed. For sites with decreased expression, we model emission probabilities the probability of a test statistic at least as low as that observed.

The log fold-change distribution for emission probabilities is unable to accurately assign likelihood of upregulated expression if the mean coverage in all reference strains is close to zero. In cases where the reference strain mean for three replicates was less than 0.5, if sample strains exhibited coverage greater than 5 or more reads, we assigned a probability of upregulation of 0.95 as these indicate clear signs of upregulation of silenced sequence, but otherwise assigned a probability of stable expression of 0.95. State decoding was performed using the Forward-Backward algorithm, which maximizes the number of correctly predicted states [69]. The choice to maximize predictions per site rather than the most likely path (using the Viterbi algorithm) is important to maintain decoding of independent results across sites given the use of the HMM in site-specific differential expression testing. The use of high coverage RNA-seq data is essential for accurate performance of the HMM to detect site specific changes in expression and applications in lower coverage sequencing may have reduced power. Plots of HMM output with quantile normalized RNA-seq data show that the HMM detects increased and decreased expression for modest expression differences (S4 Fig).

For each chimeric gene and whole gene duplication, we used the HMM output by tissue to define genes where duplicated sequence has been significantly upregulated in response to tandem duplication. We require that each gene or gene fragment have at least 50% of annotated exon sequence upregulated, considering only blocks of upregulated sequence 50 bp or longer. For putative cases of de novo gene creation, we identified blocks of upregulated sequence 50 bp or longer which do not overlap with annotated exons, and which do not have quantile normalized coverage above 2.0 in the three reference replicates. We then retained only cases that spanned at least 200 bp of the tandem duplication, in accordance with methods used by Zhao et al. [27]. Performance of the HMM to call sites with increased and decreased expression is shown in S4 Fig. Genes with signals of expression changes in at least one strain were considered to be upregulated.

Mean fold change comparisons

To further establish regulatory profiles for each chimeric gene and whole gene duplication, we additionally estimated the mean fold change across all sites. This data are independent of HMM performance and gives a detailed portrait of the quantile normalized coverage data. We estimate mean coverage per site across all sites in sample and reference for a given chimera segment in a given strain. We consider segments independently as parental genes may have differing levels of ancestral expression in the reference strain. The ratio of mean coverage in the sample to mean coverage in the reference is then recorded as mean fold change per site, placing a lower bound on reference coverage level of one read per site. The mean fold change for each chimeric gene and each duplicate gene is plotted in Fig 1. The mean fold change for chimeric genes were compared to the mean fold change at the same gene fragments in strains that lacked the duplication in question in individual tissues using a Wilcoxon rank sum test.

Supporting information

S1 Table. Genes upregulated using cuffdiff by tissue.


S2 Table. Whole gene duplications with upregulated expression using cuffdiff.


S3 Table. Functions of whole gene duplications with upregulated expression.


S4 Table. Genes upregulated using cuffdiff tissue, singleton variants only.


S5 Table. SNPs in whole gene duplications with significantly asymmetric expression in male tissues.


S6 Table. SNPs in whole gene duplications with significantly asymmetric expression in female tissues.


S9 Table. Length of ‘de novo’ gene segments.


S10 Table. FPKM for recruited non-coding parental genes.


S12 Table. Ancestral expression patterns for variants in ≤ strains.


S14 Table. Upregulated genes using Baum-Welch transition probabilities.


S1 Fig. Mean fold change for chimeric genes in sample strains vs. reference for strains containing chimeras or whole gene duplicates (red) and unmutated sample strains for the same regions (grey).

Chimeric genes are more likely to result in high mean fold change than unmutated counterparts in all tissues. Whole gene duplicates create multifold expression changes more rarely.


S2 Fig. Mean fold change using FPKM normalized data for chimeric genes in sample strains vs. reference for strains containing chimeras or whole gene duplicates (red) and unmutated sample strains for the same regions (grey).

Chimeric genes are more likely to result in high mean fold change than unmutated counterparts in all tissues. Whole gene duplicates create multifold expression changes more rarely.


S3 Fig. Expression change in a sample strain containing a whole gene duplication of GE26133 (reference FPKM = 22.0725, sample FPKM = 58.6217, uncorrected P = 0.00263417, corrected P = 0.0420917).

The tandem duplication also captures the entire gene sequence of GE26134, as well as portions of GE26132 and GE24588. The duplicate exhibits greater than two-fold expression of GE26133 in the sample strain containing the duplication. It is unclear whether the expression change is a direct consequence of duplication, secondary mutation, environmental effects, or other stochastic variation in expression.


S4 Fig. HMM performance in quantile normalized coverage data.

Quantile normalized coverage in a single sample vs. the mean of quantile normalized coverage in the reference for sites with upregulated sequence are plotted in red, while that of down regulated sequence is shown in blue for 500,000 bp beginning at 6.5 Mb on chromosome 3L for sites with quantile normalized coverage ≤ 500. Sites with no expression change identified using the HMM are not shown. The case of equal expression is shown with the black solid line, while two-fold coverage increase in the sample are indicated with the dashed line. Even modest increases in expression can be identified with the HMM, suggesting that its ability to detect site level differences in high coverage RNA-seq data is high.


S5 Fig. Genomic DNA sequencing coverage in the sample (red) and resequenced reference (grey) [17] and RNA-seq HMM expression output for a region experiencing a secondary deletion after duplication.

The deleted segment is supported by a decrease in genome coverage as well as 104 long-spanning Illumina sequencing reads. Coverage increases two-fold to three-fold in the duplicated segment, and is not supportive of higher level copy number that might explain the increase in expression as defined by RNA-seq data. HMM output for the region with increased expression in RNA-seq data is shown in blue, for comparison. The region the gene segment with the expression change corresponds well with the region displaying elevated genomic sequencing coverage given the structure of ancestral gene models (see Fig 3).


S6 Fig. Formation of alternative gene structures through tandem duplications.

A) A tandem duplication captures the 5′ segment of GE18453 and the 3′ segment of GE18451. The tandem duplication unites these gene segments to form a novel open reading frame distinct from the parental genes. Shuffling of regulatory elements in the 5′ and 3′ ends results in a new regulatory profile for the chimera. The tandem duplication also copies the full gene sequence of GE18452. B) A tandem duplication captures the 5′ end of GE24349, placing it next to previously untranscribed sequence. The promoter and UTR of GE24349 drives expression of a previously untranscribed region.


S7 Fig. Normalized coverage in RNA-seq data for Adh in 15 sample strains and 3 replicates of the reference.

RNA-seq data shows differentiation between intron and exon sequence and spans the entire length of the the transcript.


S8 Fig. FPKM values in RNA-seq data in female tissues for 15 sample strains.

Coverage varies across strains, but is generally high with thousands of reads for the most highly expressed genes. To reduce variability in coverage and generate more robust differential expression calls, we quantile normalized coverage inputs for the HMM.


S9 Fig. FPKM values in RNA-seq data in male tissues for 15 sample strains.

Coverage varies across strains, but is generally high with thousands of reads for the most highly expressed genes. To reduce variability in coverage and generate more robust differential expression calls, we quantile normalized coverage inputs for the HMM.



We would like to thank Rahul Warrior for providing temperature and humidity controlled incubator space and Rachel Martin for an emergency source of liquid nitrogen. Andrew Foran provided technical assistance. Jaleal Sanjak provided helpful discussions on statistics and simulations.

Author Contributions

  1. Conceptualization: RLR.
  2. Data curation: RLR KRT.
  3. Formal analysis: RLR.
  4. Funding acquisition: RLR KRT.
  5. Investigation: RLR LS.
  6. Methodology: RLR KRT LS.
  7. Project administration: RLR KRT.
  8. Software: RLR.
  9. Supervision: KRT.
  10. Writing – original draft: RLR KRT LS.
  11. Writing – review & editing: RLR KRT.


  1. 1. Ohno S (1970) Evolution by gene duplication. London: George Alien & Unwin Ltd. Berlin, Heidelberg and New York: Springer-Verlag.
  2. 2. Conant GC, Wolfe KH (2008) Turning a hobby into a job: how duplicated genes find new functions. Nat Rev Genet 9: 938–950. pmid:19015656
  3. 3. Wagner GP, Amemiya C, Ruddle F (2003) Hox cluster duplications and the opportunity for evolutionary novelties. Proceedings of the National Academy of Sciences 100: 14603–14606. pmid:14638945
  4. 4. Hardison RC (2012) Evolution of hemoglobin and its genes. Cold Spring Harbor perspectives in medicine 2: a011627.
  5. 5. Lynch VJ (2007) Inventing an arsenal: adaptive evolution and neofunctionalization of snake venom phospholipase a 2 genes. BMC evolutionary biology 7: 2. pmid:17233905
  6. 6. Kondrashov FA, Kondrashov AS (2006) Role of selection in fixation of gene duplications. J Theor Biol 239: 141–151. pmid:16242725
  7. 7. Lynch M, Conery JS (2000) The evolutionary fate and consequences of duplicate genes. Science 290: 1151–1155. pmid:11073452
  8. 8. Rogers RL, Bedford T, Hartl DL (2009) Formation and longevity of chimeric and duplicate genes in Drosophila melanogaster. Genetics 181: 313–322. pmid:19015547
  9. 9. Hahn MW, Han MV, Han SG (2007) Gene family evolution across 12 Drosophila genomes. PLoS Genetics 3: e197. pmid:17997610
  10. 10. Huminiecki L, Wolfe KH (2004) Divergence of spatial gene expression profiles following species-specific gene duplications in human and mouse. Genome research 14: 1870–1879. pmid:15466287
  11. 11. Rippey C, Walsh T, Gulsuner S, Brodsky M, Nord AS, et al. (2013) Formation of chimeric genes by copy-number variation as a mutational mechanism in schizophrenia. Am J Hum Genet 93: 697–710. pmid:24094746
  12. 12. Rowe A, Akam M (1988) The structure and expression of a hybrid homeotic gene. EMBO J 7: 1107–1114. pmid:16453833
  13. 13. Rogers RL, Bedford T, Lyons AM, Hartl DL (2010) Adaptive impact of the chimeric gene Quetzalcoatl in Drosophila melanogaster. Proc Natl Acad Sci USA 107: 10943–10948. pmid:20534482
  14. 14. Rogers RL, Hartl DL (2012) Chimeric genes as a source of rapid evolution in Drosophila melanogaster. Mol Biol Evol 29: 517–529. pmid:21771717
  15. 15. Zichner T, Garfield DA, Rausch T, Stutz AM, Cannavo E, et al. (2013) Impact of genomic structural variation in Drosophila melanogaster based on population-scale sequencing. Genome Res 23: 568–579. pmid:23222910
  16. 16. Drosophila Twelve Genomes Consortium (2007) Evolution of genes and genomes on the Drosophila phylogeny. Nature 450: 203–218.pmid:17994087
  17. 17. Rogers RL, Cridland JM, Shao L, Hu TT, Andolfatto P, et al. (2014) Landscape of Standing Variation for Tandem Duplications in Drosophila yakuba and Drosophila simulans. Mol Biol Evol 31: 1750–1766. pmid:24710518
  18. 18. Reinhardt JA, Wanjiru BM, Brant AT, Saelao P, Begun DJ, et al. (2013) De novo ORFs in Drosophila are important to organismal fitness and evolved rapidly from previously non-coding sequences. PLoS Genet 9: e1003860. pmid:24146629
  19. 19. Rogers RL, Cridland JM, Shao L, Hu TT, Andolfatto P, et al. (2014) Tandem duplications and the limits of natural selection in drosophila yakuba and drosophila simulans. arXiv preprint arXiv:14050518. pmid:26176952
  20. 20. Rogers RL, Shao L, Sanjak JS, Andolfatto P, Thornton KR (2014) Revised annotations, sex-biased expression, and lineage-specific genes in the drosophila melanogaster group. G3: Genes— Genomes— Genetics: g3–114. pmid:25273863
  21. 21. Zhou J, Lemos B, Dopman EB, Hartl DL (2011) Copy-number variation: the balance between gene dosage and expression in Drosophila melanogaster. Genome Biol Evol 3: 1014–1024. pmid:21979154
  22. 22. Birchler JA, Veitia RA (2012) Gene balance hypothesis: connecting issues of dosage sensitivity across biological disciplines. Proc Natl Acad Sci USA 109: 14746–14753. pmid:22908297
  23. 23. Trapnell C, Roberts A, Goff L, Pertea G, Kim D, et al. (2012) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 7: 562–578. pmid:22383036
  24. 24. Arkhipova IR (1995) Promoter elements in drosophila melanogaster revealed by sequence analysis. Genetics 139: 1359–1369. pmid:7768444
  25. 25. Gilbert W (1978) Why genes in pieces? Nature 271: 501. pmid:622185
  26. 26. Kaessmann H (2010) Origins, evolution, and phenotypic impact of new genes. Genome Res 20: 1313–1326. pmid:20651121
  27. 27. Zhao L, Saelao P, Jones CD, Begun DJ (2014) Origin and spread of de novo genes in Drosophila melanogaster populations. Science 343: 769–772. pmid:24457212
  28. 28. Carvunis AR, Rolland T, Wapinski I, Calderwood MA, Yildirim MA, et al. (2012) Proto-genes and de novo gene birth. Nature 487: 370–374. pmid:22722833
  29. 29. Stenberg P, Lundberg LE, Johansson AM, Rydén P, Svensson MJ, et al. (2009) Buffering of segmental and chromosomal aneuploidies in drosophila melanogaster. PLoS Genet 5: e1000465. pmid:19412336
  30. 30. Gibson G, van Helden S (1997) Is function of the drosophila homeotic gene ultrabithorax canalized? Genetics 147: 1155–1168. pmid:9383059
  31. 31. Schrider DR, Hahn MW, Begun DJ (2016) Parallel evolution of copy-number variation across continents in drosophila melanogaster. Molecular biology and evolution: msw014. pmid:26809315
  32. 32. Gout JF, Lynch M (2015) Maintenance and loss of duplicated genes by dosage subfunctionalization. Molecular biology and evolution: msv095. pmid:25908670
  33. 33. Woodwark C, Bateman A (2011) The characterisation of three types of genes that overlie copy number variable regions. PLoS ONE 6: e14814. pmid:21637334
  34. 34. Henrichsen CN, Vinckenbosch N, Zollner S, Chaignat E, Pradervand S, et al. (2009) Segmental copy number variation shapes tissue transcriptomes. Nat Genet 41: 424–429. pmid:19270705
  35. 35. Schuster-Bockler B, Conrad D, Bateman A (2010) Dosage sensitivity shapes the evolution of copy-number varied regions. PLoS ONE 5: e9474. pmid:20224824
  36. 36. Loehlin DW, Carroll SB (2016) Expression of tandem gene duplicates is often greater than twofold. Proceedings of the National Academy of Sciences 113: 5988–5992. pmid:27162370
  37. 37. Malone JH, Cho DY, Mattiuzzo NR, Artieri CG, Jiang L, et al. (2012) Mediation of Drosophila autosomal dosage effects and compensation by network interactions. Genome Biol 13: r28. pmid:22531030
  38. 38. McLysaght A, Makino T, Grayton HM, Tropeano M, Mitchell KJ, et al. (2014) Ohnologs are overrepresented in pathogenic copy number mutations. Proceedings of the National Academy of Sciences 111: 361–366. pmid:24368850
  39. 39. Hu X, Worton RG (1992) Partial gene duplication as a cause of human disease. Hum Mutat 1: 3–12. pmid:1301188
  40. 40. Cardoso-Moreira M, Arguello JR, Gottipati S, Harshman LG, Grenier JK, et al. (2016) Evidence for the fixation of gene duplications by positive selection in drosophila. Genome research 26: 787–798. pmid:27197209
  41. 41. Burow DA, Umeh-Garcia MC, True MB, Bakhaj CD, Ardell DH, et al. (2015) Dynamic regulation of mrna decay during neural development. Neural development 10: 1. pmid:25896902
  42. 42. Badis G, Saveanu C, Fromont-Racine M, Jacquier A (2004) Targeted mrna degradation by deadenylation-independent decapping. Molecular cell 15: 5–15. pmid:15225544
  43. 43. Patthy L (1999) Genome evolution and the evolution of exon-shuffling–a review. Gene 238: 103–114. pmid:10570989
  44. 44. Patthy L (2003) Modular assembly of genes and the evolution of new functions. Genetica 118: 217–231. pmid:12868611
  45. 45. Conant GC, Wolfe KH (2006) Functional partitioning of yeast co-expression networks after genome duplication. PLoS Biol 4: e109. pmid:16555924
  46. 46. Lemos B, Araripe LO, Fontanillas P, Hartl DL (2008) Dominance and the evolutionary accumulation of cis- and trans-effects on gene expression. Proc Natl Acad Sci USA 105: 14471–14476. pmid:18791071
  47. 47. Ayroles JF, Carbone MA, Stone EA, Jordan KW, Lyman RF, et al. (2009) Systems genetics of complex traits in Drosophila melanogaster. Nat Genet 41: 299–307. pmid:19234471
  48. 48. Wu X, Sharp PA (2013) Divergent transcription: a driving force for new gene origination? Cell 155: 990–996. pmid:24267885
  49. 49. Almada AE, Wu X, Kriz AJ, Burge CB, Sharp PA (2013) Promoter directionality is controlled by U1 snRNP and polyadenylation signals. Nature 499: 360–363. pmid:23792564
  50. 50. Chen S, Zhang YE, Long M (2010) New genes in Drosophila quickly become essential. Science 330: 1682–1685. pmid:21164016
  51. 51. Long M, Langley CH (1993) Natural selection and the origin of jingwei, a chimeric processed functional gene in Drosophila. Science 260: 91–95. pmid:7682012
  52. 52. Jones CD, Begun DJ (2005) Parallel evolution of chimeric fusion genes. Proceedings of the National Academy of Sciences, USA 102: 11373–11378. pmid:16076957
  53. 53. Jones CD, Custer AW, Begun DJ (2005) Origin and evolution of a chimeric fusion gene in Drosophila subobscura, D. madeirensis and D. guanche. Genetics 170: 207–219. pmid:15781692
  54. 54. Babushok DV, Ohshima K, Ostertag EM, Chen X, Wang Y, et al. (2007) A novel testis ubiquitin-binding protein gene arose by exon shuffling in hominoids. Genome Res 17: 1129–1138. pmid:17623810
  55. 55. Rogers RL (2015) Chromosomal rearrangements as barriers to genetic homogenization between archaic and modern humans. Molecular Biology and Evolution 32: 3064–3078. pmid:26399483
  56. 56. Ionita-Laza I, Rogers AJ, Lange C, Raby BA, Lee C (2009) Genetic association analysis of copy-number variation (CNV) in human disease pathogenesis. Genomics 93: 22–26. pmid:18822366
  57. 57. Inaki K, Liu ET (2012) Structural mutations in cancer: mechanistic and functional insights. Trends Genet 28: 550–559. pmid:22901976
  58. 58. Buchanan JA, Scherer SW (2008) Contemplating effects of genomic structural variation. Genet Med 10: 639–647. pmid:18978673
  59. 59. Emerson JJ, Cardoso-Moreira M, Borevitz JO, Long M (2008) Natural selection shapes genome-wide patterns of copy-number polymorphism in Drosophila melanogaster. Science 320: 1629–1631. pmid:18535209
  60. 60. Cardoso-Moreira M, Emerson JJ, Clark AG, Long M (2011) Drosophila duplication hotspots are associated with late-replicating regions of the genome. PLoS Genet 7: e1002340. pmid:22072977
  61. 61. Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, et al. (2010) Origins and functional impact of copy number variation in the human genome. Nature 464: 704–712. pmid:19812545
  62. 62. Pritchard JK (2001) Are rare variants responsible for susceptibility to complex diseases? Am J Hum Genet 69: 124–137. pmid:11404818
  63. 63. Svejstrup JQ (2002) Mechanisms of transcription-coupled DNA repair. Nat Rev Mol Cell Biol 3: 21–29. pmid:11823795
  64. 64. Hanawalt PC, Spivak G (2008) Transcription-coupled DNA repair: two decades of progress and surprises. Nat Rev Mol Cell Biol 9: 958–970. pmid:19023283
  65. 65. Keightley PD, Trivedi U, Thomson M, Oliver F, Kumar S, et al. (2009) Analysis of the genome sequences of three Drosophila melanogaster spontaneous mutation accumulation lines. Genome Res 19: 1195–1201. pmid:19439516
  66. 66. Sekelsky JJ, Brodsky MH, Burtis KC (2000) DNA repair in Drosophila: insights from the Drosophila genome sequence. J Cell Biol 150: F31–36. pmid:10908583
  67. 67. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, et al. (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14: R36. pmid:23618408
  68. 68. Bolstad BM, Irizarry RA, Åstrand M, Speed TP (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19: 185–193. pmid:12538238
  69. 69. Durbin R (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge university press.