Tandem duplications lead to novel expression patterns through exon shuffling in D. yakuba

One common hypothesis to explain the impacts of tandem duplications is that whole gene duplications commonly produce additive changes in gene expression due to copy number changes. Here, we use genome wide RNA-seq data from a population sample of Drosophila yakuba to test this ‘gene dosage’ hypothesis. We observe little evidence of expression changes in response to whole transcript duplication capturing 5ʹ and 3ʹ UTRs. Among whole gene duplications, we observe evidence that dosage sharing across copies is likely to be common. The lack of expression changes after whole gene duplication suggests that the majority of genes are subject to tight regulatory control and therefore not sensitive to changes in gene copy number. Rather, we observe changes in expression level due to both shuffling of regulatory elements and the creation of chimeric structures via tandem duplication. Additionally, we observe 30 de novo gene structures arising from tandem duplications, 23 of which form with expression in the testes. Thus, the value of tandem duplications is likely to be more intricate than simple changes in gene dosage. The common regulatory effects from chimeric gene formation after tandem duplication may explain their contribution to genome evolution. Author Summary The enclosed work shows that whole gene duplications rarely affect gene expression, in contrast to widely held views that the adaptive value of duplicate genes is related to additive changes in gene expression due to gene copy number. We further explain how tandem duplications that create shuffled gene structures can force upregulation of gene sequences, de novo gene creation, and multifold changes in transcript levels. These results show that tandem duplications can produce new genes that are a source of immediate novelty associated with more extreme expression changes than previously suggested by theory. Further, these gene expression changes are a potential source of both beneficial and pathogenic mutations, immediately relevant to clinical and medical genetics in humans and other metazoans.

Introduction the duplicated segment not affected by the deletion, indicating that this segment is not multi-copy to a level that would explain the observed expression change (SI Appendix, Figure S5). Tandem 191 duplications that do not respect gene boundaries can also create chimeric gene sequences via 192 exon-shuffling (28) (SI Appendix, Figure S6A). In contrast to whole gene duplications, chimeric 193 gene structures often result in expression changes. Among the 15 lines we identified 76 chimeric 194 genes arising from tandem duplication. Of these a total of 24 chimeras display increased expression 195 for 50% or more of exonic sequence within the duplicated gene segment (either 5 or 3 ). These  Table S8). Thus, we suggest that chimeric constructs and other

Recruitment of non-coding sequence and de novo gene origination
In addition to chimeric gene structures, duplicated gene fragments that capture the 5 portion of a 221 transcript have the potential to activate neighboring sequences that were previously untranscribed, 222 thereby creating the potential for de novo genes (SI Appendix, Figure S6B). We observe signs  Table 1). Parental genes for these cases of de novo gene formation include XX.

230
In the absence of information about genome structure these will appear to be de novo gene 231 creation, but with clearly defined boundaries of tandem duplications we can clarify that shuffling 232 of 5 segments of transcripts is one potential mechanism for activation of previously untranscribed 233 regions. Among these putative cases of de novo activation, 23 are identified in the testes (Table   234 1), consistent with the out-of-the-testes hypothesis observed for new genes (29, 15). The mean 235 size of these de novo expressed regions is 385 bp, with no evidence of significant size differences 236 across tissues (F = 0.798, df = 2 P = 0.458; Table S9). For single transcripts, however, there 237 can be variation in length across tissues, possibly reflecting isoform switching across tissues or 238 general imprecision (Table S9). Reference genome expression level for parental genes that contribute 239 to de novo gene formation are given in Table S10. These results offer one potential molecular 240 mechanism to explain previously observed de novo gene origination, which is expected to have  duplications that can contribute to genome evolution and population level variation.

336
One hypothesis to explain the evolution of network structure after whole gene duplication 337 involves loss of expression or interaction after polyploidy (46). However, we have found that 338 upregulation, not silencing, is a common result of tandem duplication, indicating that such results 339 reflect either major differences between polyploidy and gene expression or that present interaction de novo genes that has been proposed is antisense transcription from divergent promotors (49, 50).

355
These results offer a second mechanism that relies on canonical promoters, transcription start 356 signals, and translation start signals with genome shuffling to serve as drivers of new gene sequences.

357
These newly originated exons outside annotated gene sequences have a mean length of 385 bp.

358
These are slightly shorter than previous assays of de novo genes (30), although these numbers do 359 not include length of copied gene fragments. are captured by duplications in at least one sample strain, suggesting that genes that are originally 427 highly expressed are more likely to be associated with duplications ( Figure 5, Table S11). Even 428 limiting the genes surveyed to genes that are identified in only one or two strains, expression still 429 appears to be elevated above the genome wide background (Table S11). Thus, we suggest that genes

Methods
Identifying tandem duplications and gene expression changes 453 We identified tandem duplications using paired-end Illumina genomic sequencing, as previously 454 described (14). Briefly, tandem duplications were defined by three or more divergently oriented read 455 pairs that lie within 25 kb of one another. We excluded duplications indicated with divergent read 456 pairs in the reference strain, which are indicative of technical challenges or reference mis-assembly. 457 We also excluded duplicates which were present in D. erecta, resulting in a high quality data set  Sample preparation and RNA-sequencing 475 We gathered RNA-seq data for 15 samples and the reference genome (Table S13)  Reference replicates were grouped for differential expression testing in Cuffdiff. For each tissue the 541 total number of duplications displaying increases in expression for whole gene duplication and for 542 background rates were compared using a chi-squared test with 1 degree of freedom. the site in both genomic and RNA sequencing datasets were used for differential expression testing.

555
Sites which exhibited significant differential expression of SNPs in at least one strain that housed 556 a duplication were considered candidates for differential expression of duplicate copies. Similar 557 signals could be produced by allele specific expression even at unduplicated sites. We filtered out 558 all sites that displayed such allele specific expression in strains that did not contain the duplication 559 in question, as these are unlikely to reflect processes specific the duplication.

HMM for expression patterns 561
Coverage in mapped RNA-seq data per site for each strain was calculated using samtools depth.  (Table S14). This is 588 equally true for de novo genes.          TableS1.pdf - Table S1 673 TableS2.pdf - Table S2 674 TableS3.pdf - Table S3 675 TableS4.pdf - Table S4 676 TableS5.pdf - Table S5 677 TableS6.pdf - Table S6 678 TableS7.pdf - Table S7 679 TableS8.pdf - Table S8 680 TableS9.pdf - Table S9 681 TableS10.pdf - Table S10 682 TableS11.pdf - Table S11 683 TableS12.pdf - Table S12 684 TableS13.pdf - Table S13 685 TableS14.pdf -Table S14   A tandem duplication that does not respect gene boundaries unites the 5 end of GE18453 with the 3 end of GE18451 to produce a chimeric gene on chromosome 2L. Plot shows quantile normalized coverage in RNA seq data for sample (red) and reference (grey) with HMM output (blue) on chromosome 2L for female carcass. The chimera displays a change in transcript levels, while transcript levels for parental gene sequence are not altered. Sites with upregulated or downregulated sequence as defined by HMM output is shown in blue, using the right axis. HMM state calls for sites with unchanged expression are not shown. The region spanned by the tandem duplication is shaded in grey. The region spanned by the chimeric gene shows high-level upregulation. The whole gene duplication of GE18452 does not display a significant change in mRNA levels but rather falls within the bounds of expression profiles for reference replicates (Ref FPKM=19.9; Sample FPKM=24.5; uncorrected P = 0.52; corrected P = 1.0). Figure 3: Duplication followed by secondary deletion, as indicated by a total of 104 long-spanning read pairs, leads to an expression change in a gene fragment of GE21202 on chromosome 3L. Plot shows normalized coverage in RNA seq data for sample (red) and reference (grey) with HMM output (blue) on chromosome 3L. Only the sample strain with the deletion shows such upregulation. Transcript levels increase by greater than two-fold, beyond changes that would be produced by additive changes in gene dosage. Sites with upregulated or downregulated sequence as defined by HMM output is shown in blue, using the right axis. HMM state calls for sites with unchanged expression are not shown. HMM output for upregulated regions match well with the predicted gene structures formed by this complex mutation. The region spanned by the tandem duplication is shaded in grey. Figure 4: Tandem duplication creates a de novo gene on chromosome 3R. The 5 end of GE24349 is duplicated and placed adjacent to formerly untranscribed sequence, producing transcription and putative de novo gene creation. The reference strain does not show transcription in the region (grey) and no other sample strain exhibits upregulated sequence across the region. Sites with upregulated or downregulated sequence as defined by HMM output is shown in blue, using the right axis. HMM state calls for sites with unchanged expression are not shown. The region spanned by the tandem duplication is shaded in grey. The tandem duplication activates a previously untranscribed region from roughly 14703500 -14705000 bp. There is also upregulation in some exons for GE24349, possibly indicating a longer fusion transcript that reads through to the end of the nearest adjacent 3 UTR.      Table S5: SNPs in whole gene duplications with significantly asymmetric expression in male tissues        Figure S1: Mean fold change for chimeric genes in sample strains vs. reference for strains containing chimeras or whole gene duplicates (red) and unmutated sample strains for the same regions (grey). Chimeric genes are more likely to result in high mean fold change than unmutated counterparts in all tissues. Whole gene duplicates create multifold expression changes more rarely.  Figure S2: Mean fold change using FPKM normalized data for chimeric genes in sample strains vs. reference for strains containing chimeras or whole gene duplicates (red) and unmutated sample strains for the same regions (grey). Chimeric genes are more likely to result in high mean fold change than unmutated counterparts in all tissues. Whole gene duplicates create multifold expression changes more rarely. Figure S3: Expression change in a sample strain containing a whole gene duplication of GE26133 (reference FPKM=22.0725, sample FPKM=58.6217, uncorrected P = 0.00263417, corrected P = 0.0420917). The tandem duplication also captures the entire gene sequence of GE26134, as well as portions of GE26132 and GE24588. The duplicate exhibits greater than two-fold expression of GE26133 in the sample strain containing the duplication. It is unclear whether the expression change is a direct consequence of duplication, secondary mutation, environmental effects, or other stochastic variation in expression. Figure S4: HMM Performance in quantile normalized coverage data. Quantile normalized coverage in a single sample vs. the mean of quantile normalized coverage in the reference for sites with upregulated sequence are plotted in red, while that of down regulated sequence is shown in blue for 500,000 bp beginning at 6.5 Mb on chromosome 3L for sites with quantile normalized coverage ≤ 500. Sites with no expression change identified using the HMM are not shown. The case of equal expression is shown with the black solid line, while two-fold coverage increase in the sample are indicated with the dashed line. Even modest increases in expression can be identified with the HMM, suggesting that its ability to detect site level differences in high coverage RNA-seq data is high. Figure S5: Genomic DNA sequencing coverage in the sample (red) and resequenced reference (grey) (14) and RNA-seq HMM Expression output for a region experiencing a secondary deletion after duplication. The deleted segment is supported by a decrease in genome coverage as well as 104 long-spanning Illumina sequencing reads. Coverage increases two-fold to three-fold in the duplicated segment, and is not supportive of higher level copy number that might explain the increase in expression as defined by RNA-seq data. HMM output for the region with increased expression in RNA-seq data is shown in blue, for comparison. The region the gene segment with the expression change corresponds well with the region displaying elevated genomic sequencing coverage given the structure of ancestral gene models (see Figure 3).  RNA-seq data shows differentiation between intron and exon sequence and spans the entire length of the the transcript. q q q q q q q q q q q q q q q FPKM in Female Ovary by strain Figure S8: FPKM values in RNA-seq data in female tissues for 15 sample strains. Coverage varies across strains, but is generally high with thousands of reads for the most highly expressed genes. To reduce variability in coverage and generate more robust differential expression calls, we quantile normalized coverage inputs for the HMM. q q q q q q q q q q q q q q q FPKM in Male Testes by strain Figure S9: FPKM values in RNA-seq data in male tissues for 15 sample strains. Coverage varies across strains, but is generally high with thousands of reads for the most highly expressed genes. To reduce variability in coverage and generate more robust differential expression calls, we quantile normalized coverage inputs for the HMM.