
Chromosomal rearrangements as a source of new gene formation in Drosophila yakuba

  • Nicholas B. Stewart,

    Roles Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, North Carolina, United States of America, Department of Biological Sciences, Ft Hays State University, Ft Hays, Kansas, United States of America

  • Rebekah L. Rogers

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – review & editing

    rebekah.rogers@uncc.edu

    Affiliation Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, North Carolina, United States of America

Abstract

The origins of new genes are among the most fundamental questions in evolutionary biology. Our understanding of the ways that new genetic material appears and how that genetic material shapes population variation remains incomplete. De novo genes and duplicate genes are a key source of new genetic material on which selection acts. To better understand the origins of these new gene sequences, we explored the ways that structural variation might alter expression patterns and form novel transcripts. We provide evidence that chromosomal rearrangements are a source of novel genetic variation that facilitates the formation of de novo exons in Drosophila. We identify 51 cases of de novo exon formation created by chromosomal rearrangements in 14 strains of D. yakuba. These new genes inherit transcription start signals and open reading frames when the 5’ ends of existing genes are combined with previously untranscribed regions. Such new genes would appear with novel peptide sequences, without the necessity for secondary transitions from non-coding RNA to protein. This mechanism of new peptide formation contrasts with the canonical theory of de novo gene progression requiring non-coding intermediaries that must acquire new mutations prior to loss via pseudogenization. Hence, these mutations offer a means to de novo gene creation and protein sequence formation in a single mutational step, answering a long-standing open question concerning new gene formation. We further identify gene expression changes to 134 existing genes, indicating that these mutations can alter gene regulation. Population variability for chromosomal rearrangements is considerable, with 2368 rearrangements observed across 14 inbred lines. More rearrangements were identified on the X chromosome than any of the autosomes, suggesting the X is more susceptible to chromosome alterations. Together, these results suggest that chromosomal rearrangements are a source of variation in populations that is likely to be important to explain genetic and therefore phenotypic diversity.

Author summary

De novo genes and duplicate genes are a key source of new genetic material on which selection acts. To better understand the origins of these new gene sequences, we explored the ways that structural variation might alter expression patterns and form novel transcripts. These new genes can immediately form new proteins, without the need for multiple mutations during new gene formation. We find that chromosomal rearrangements can form new gene sequences when pieces of genes are copied and moved to new locations. These rearrangements are often found on the X chromosome suggesting rapid shuffling of sex chromosomes in Drosophila.

Introduction

Understanding the origins of new genes is essential for a complete description of evolutionary processes. Mutation generates the raw genetic material that can contribute to phenotypic diversity in natural populations. Without this new genetic material, selection cannot produce change. When new genes are required for adaptation to new and changing environments, where do they come from? How do they arise? Proposed sources of new genetic material include duplicate genes [1, 2], chimeric genes [3–5], de novo genes [6–9], and domesticated transposable elements [10]. Deep sequencing of genomes has made it trivial to identify single nucleotide polymorphisms (SNPs) in population genetic data [11, 12]. In contrast, structural variants and duplications remain understudied, in part because they are more difficult to identify in sequence data. With improvements in throughput and quality of next generation sequencing, we can begin to explore the full effects of these complex mutations in nature.

Chromosomal rearrangements contribute to genomic divergence across species. While organisms exhibit striking similarity in genome content, genome organization becomes scrambled over time, breaking syntenic blocks [13–15]. Gene movement due to chromosomal rearrangements is known to influence gene expression in primates [16]. Such natural variation from genomic neighborhood is similar to positional effects observed in transgenic constructs [17]. Yet, the implications of these mutations go far beyond quantitative changes in mRNA levels. Mutations that copy and shuffle pieces of DNA can produce new gene sequences. They have the potential to form whole gene duplications, chimeric genes or alternative gene constructs. The full spectrum of new gene creation from these mutations is not fully explored.

Recent work has identified new gene formation through de novo exon creation when duplicated segments do not respect gene boundaries [18]. These new genes may ascribe a genetic cause to patterns that mimic de novo gene creation. Similar cases of new gene formation were observed when inversions modified gene sequences at breakpoints [19]. However, whole genome studies of rearrangements and new gene formation in natural populations have been lacking. We hypothesize that chromosomal rearrangements may form similar new gene structures when they copy or move pieces of DNA around the genome.

Drosophila remain an excellent model system for genomic analysis. Their genomes are compact with little repetitive DNA content, and easily sequenced [20]. Among the Drosophila, D. yakuba houses an unusually large number of chromosomal rearrangements based on reference strain comparisons [20]. Yet, the complexity of population variation for chromosomal rearrangements that might give rise to the observed divergence remains unseen. Here, we use whole genome population resequencing data with paired-end Illumina reads [21] to identify genome structure changes that are segregating in natural populations of D. yakuba. Pairing these mutation scans with high throughput gene expression data [18], we identify regulatory changes that are produced via chromosomal rearrangements. Across 14 inbred lines of D. yakuba we used abnormal paired-end read mapping to identify chromosomal rearrangements between and within chromosomes. These mutations may be caused by ectopic recombination, TE movement, inversion formation, template switching during DNA synthesis, ectopic DNA repair, or retrogene formation. These mutations all have the unifying feature that they copy or move DNA from one location to another. We describe the number, types, and locations of rearrangements in population sequence data. Using RNA sequence data from these lines we identified incidences where rearrangements may be creating de novo exons and chimeric constructs that create new exons in genomic regions previously devoid of expression. These results suggest that chromosomal rearrangements are a key source of new gene creation that reshapes genome content and organization in nature.

Results

Abundance of chromosomal rearrangements

We used abnormally mapping Illumina paired-end sequence reads to survey chromosomal rearrangements in a previously sequenced population of Drosophila yakuba [21]. We remapped paired-end reads to the reference genome r.1.05 of Drosophila yakuba [20] and Wolbachia endoparasite sequence NC_002978.6. After removing PCR duplicates, the average depth of coverage per line ranged from 12X to 93X (Table 1). We identified regions that are supported by at least 4 independent read-pairs that map at least 1 Mb away from each other on the same chromosome or on separate chromosomes, similar to methods previously implemented in human genetics [22] (see Fig 1). We identified 2368 total rearrangements among the 14 lines of Drosophila yakuba: 1697 rearrangements between chromosomes and 671 within chromosomes. These rearrangements lie within 1 kb of 1202 genes.
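The calling rule described above (at least 4 independent read pairs that map to different chromosomes or at least 1 Mb apart) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the input tuple layout, and especially the fixed-size binning used to group nearby pairs (`BIN_SIZE`, set here to the 325 bp insert size) are assumptions made for illustration only.

```python
from collections import defaultdict

MIN_DISTANCE = 1_000_000  # same-chromosome pairs must map >= 1 Mb apart
MIN_SUPPORT = 4           # independent read pairs required per call
BIN_SIZE = 325            # hypothetical clustering bin (the library insert size)


def is_discordant(chrom1, pos1, chrom2, pos2):
    """True if a read pair maps to different chromosomes or >= 1 Mb apart."""
    if chrom1 != chrom2:
        return True
    return abs(pos1 - pos2) >= MIN_DISTANCE


def call_rearrangements(pairs):
    """Group discordant pairs by binned endpoints and keep supported groups.

    `pairs` is an iterable of (chrom1, pos1, chrom2, pos2) tuples that are
    assumed to be free of PCR duplicates already.
    Returns {(chromA, binA, chromB, binB): number_of_distinct_pairs}.
    """
    clusters = defaultdict(set)
    for chrom1, pos1, chrom2, pos2 in pairs:
        if not is_discordant(chrom1, pos1, chrom2, pos2):
            continue
        # Order the two ends so A->B and B->A fall in the same cluster.
        end_a, end_b = sorted([(chrom1, pos1), (chrom2, pos2)])
        key = (end_a[0], end_a[1] // BIN_SIZE, end_b[0], end_b[1] // BIN_SIZE)
        clusters[key].add((end_a, end_b))  # a set keeps only distinct pairs
    return {k: len(v) for k, v in clusters.items() if len(v) >= MIN_SUPPORT}
```

Fixed bins are a crude stand-in for proper clustering, since pairs that straddle a bin boundary are split into separate groups. A production caller would more likely merge pairs whose ends lie within one insert size of each other.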

Fig 1. Example of paired end reads mapped abnormally to the reference genome.

A) CY17C Chromosome sequenced with paired reads with 325 bp insert size. CY17C has an insertion of sequence from 2L into chromosome arm 3R. B) Each end was then aligned to the reference genome. Left reads that mapped around 3R:902100 had paired right reads mapped to regions near 2L:12261000 (red arrows). Additionally, right reads that mapped around 3R:902600 paired with left reads mapping around 2L:12261150 (blue arrows). This indicates that the region between 2L:12261000–12261150 has been inserted into 3R: 902000–902600 in line CY17C. Each rearrangement needs at least 4 abnormally mapping read pairs to be considered.

https://doi.org/10.1371/journal.pgen.1008314.g001

Table 1. Number of chromosomal rearrangements found in the 14 lines.

Our ability to identify rearrangements was associated with sequence coverage depth. The number of rearrangements for each line was then predicted at a coverage depth of 93.7X using a linear regression model.

https://doi.org/10.1371/journal.pgen.1008314.t001

Rearrangements as facilitators of new gene formation

Structural variants and duplications can form new exon sequences in regions that were previously untranscribed [18, 22]. These new transcripts appear when gene fragments carrying promoter sequences can drive transcription of new exons with new open reading frames. Formation of such chimeric constructs with de novo exons can offer a source of new transcripts within the genome. Like most cases of new gene formation, we expect many of these new variants to be transient rather than stably incorporated into the genome [23–25]. However, as a substrate of genetic novelty, they may occasionally contribute to adaptive changes and phenotypic variation in nature. New genes often appear with expression in the germline [7, 26, 27]. To explore such cases of new open reading frames in the tissues most likely to be affected, we compared rearrangement calls with previously published RNASeq data from testes and accessory glands, male carcasses, ovaries, and female carcasses [18].

To identify new chimeric transcripts formed at rearrangement breakpoints, we used Tophat fusion search [28] to find split reads and abnormally mapping read pairs in RNAseq data (Fig 2). These methods were developed to identify trans-spliced transcripts, but should also identify support for chimeric transcripts produced when genome sequences have been rearranged. We matched structure calls with Tophat fusion [28] calls that are within 1 kb of both sides of rearrangements called in genomic DNA sequences. We used RNASeq read depth to further infer structure of de novo transcripts created by chromosome rearrangements (Fig 3) (Fig 4) (S1 Fig) (S2 Fig).
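The matching step described above, pairing a genomic rearrangement call with a fusion call that lies within 1 kb of both of its sides, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the `((chrom, pos), (chrom, pos))` representation of a call are hypothetical, and each side is reduced to a single coordinate rather than an interval.

```python
WINDOW = 1_000  # fusion breakpoints must lie within 1 kb of the genomic call


def near(site_a, site_b, window=WINDOW):
    """True if two (chrom, pos) sites share a chromosome and lie within `window` bp."""
    return site_a[0] == site_b[0] and abs(site_a[1] - site_b[1]) <= window


def fusion_supports(rearrangement, fusion, window=WINDOW):
    """True if both fusion breakpoints fall near the two rearrangement sites.

    Each argument is a pair of (chrom, pos) sites; either orientation is accepted.
    """
    (r1, r2), (f1, f2) = rearrangement, fusion
    same = near(r1, f1, window) and near(r2, f2, window)
    swapped = near(r1, f2, window) and near(r2, f1, window)
    return same or swapped


def match_calls(rearrangements, fusions, window=WINDOW):
    """Return the rearrangements that have at least one supporting fusion call."""
    return [r for r in rearrangements
            if any(fusion_supports(r, f, window) for f in fusions)]
```

Requiring support on both sides is the design choice that matters here: a fusion call near only one breakpoint is not treated as evidence for a chimeric transcript.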

Fig 2. Example of paired end sequence reads mapping from RNASeq data.

A) CY17C has an insertion of sequence from 2L into chromosome arm 3R. This insert placed a previously untranscribed region within a previously transcribed gene. Paired end reads were generated from cDNA. B) The paired end reads of this RNA transcript will map to separate chromosomes on the reference sequence, and split read mapping may be seen at the breakpoints. Three total misaligned RNASeq pairs and/or split reads are needed for a rearrangement to be considered to have formed a new gene.

https://doi.org/10.1371/journal.pgen.1008314.g002

Fig 3. New gene formation through genome rearrangement on chromosome 3R and 2L.

Observing sequence depth of the RNA we can infer relative expression and identify newly transcribed regions in lines that have rearrangement calls. Relative RNA coverage depth was calculated from Tophat RNASeq alignments by dividing the read depth at each base by the total number of reads mapped. Two regions have 2 genomic rearrangement calls and Tophat fusion calls supporting the formation of a de novo gene. A) Diagram showing the predicted sequence movement based on the Trinity transcript BLAST. An insertion of the sequence from 2L:12260976 between 3R:902154 and 3R:902563 has moved a segment of previously untranscribed DNA to a region with active transcription on 3R. The RNA transcript assembled by Trinity confirms the observed coverage pattern in RNASeq data. The transcript starts near 3R:902000, the middle section maps between 2L:12260976–12261178, and the final section maps near 3R:902500. B) and C) The grey coverage lines are RNA sequence coverage from 3 reference RNASeq replicates which do not have this rearrangement. D) and E) RNA sequence coverage of line CY17C, which has the rearrangement present. F) and G) CY17C has two genomic rearrangement calls between 2L:12260976–12261229 matching with 3R:901825–902154 (red arrows) and 3R:902563–902607 (blue arrows). Grey boxes represent the Trinity transcript aligned to the reference genome.

https://doi.org/10.1371/journal.pgen.1008314.g003
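The normalization described in the Fig 3 caption (read depth at each base divided by the total number of mapped reads) is simple enough to state as code. The sketch below assumes the per-base depths and the mapped-read total have already been extracted from the alignments; the function name is hypothetical.

```python
def relative_coverage(depth_per_base, total_mapped_reads):
    """Scale per-base read depth by library size so lines can be compared.

    `depth_per_base` is a sequence of raw read depths for consecutive bases;
    `total_mapped_reads` is the number of reads mapped in the same library.
    """
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return [depth / total_mapped_reads for depth in depth_per_base]
```

Dividing by library size lets coverage tracks from a line carrying a rearrangement be compared with the reference replicates even when sequencing depth differs between libraries.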

Fig 4.

A) Diagram showing the predicted sequence movement based on the Trinity transcript BLAST. A rearrangement joining the sequence from 2L:5435633–5436212 and 2L:5435462–5436047 has moved a segment of previously untranscribed DNA to a region with active transcription on 2L. B) and C) The grey coverage lines are RNA sequence coverage from 3 reference RNASeq replicates which do not have this rearrangement. D) and E) CY28A4 has the rearrangement and increased transcription in region 2L:5435462–5436047. F) and G) CY28A4 has a rearrangement call between 2L:13246986–13247746 matching with 2L:5435633–5436212 (red arrows). Square boxes represent the Trinity transcript aligned to the reference genome. The black boxes represent exons of a preexisting gene (1.g484.t1).

https://doi.org/10.1371/journal.pgen.1008314.g004

In 14 inbred lines, we isolated 51 putative new genes created by rearrangements. A total of 43 genome structure calls match RNASeq fusion calls that indicate expression in testes. Among such new genes, 32 also show expression in male somatic tissue, while 10 of them are expressed exclusively in the testes. A total of 42 fusion genes were found in male somatic tissue, 33 of which were shared in testes and 9 expressed exclusively in the male somatic tissue. A total of 40 out of 51 transcripts incorporate the start codon of a pre-existing gene (S1 Text). These data suggest that the majority of new genes that form do so by copying or shuffling 5’ promoters and translation signals of genes to drive transcription in new regions. Some 37% (19/51) of the possible new genes identified are singletons (S3 Fig). This pattern is consistent with most of the rearrangements and new gene formation being relatively young and possibly detrimental on average. Additionally, 29% (15/51) of the rearrangements may be ancestral, whereas the reference had a rearrangement that modified a preexisting gene.

We observe more new RNA fusion calls at loci with structural rearrangements in testes and male carcasses than in ovaries and female carcasses (Fig 5). Testes and male carcass RNA were sequenced using paired-end sequencing while ovaries and female carcass RNA were sequenced using single end sequencing methods. The use of single end data from previously published work reduces the ability to identify fusion transcripts through split read mapping in females (S1 Text). We do not see differences between germline and somatic tissue in males (ANOVA, F(1,13) = 0.04, P>0.8). In females there is a difference between gametic and somatic tissues (ANOVA, F(1,13) = 4.379, P<0.05), though sample sizes are small. In total, we found new transcripts produced by 19 rearrangements expressed in females: 14 expressed in ovaries and 14 expressed in female carcass. These data do not allow us to identify every rearrangement in the population, largely due to limits in sequencing coverage. Estimates reported here are conservative, representing the minimum number of instances of new gene formation.

Fig 5.

Distribution of new genes per strain identified in testes, male carcass, ovaries, and female carcass based on 14 inbred lines in males (A) and females (B). A total of 51 new genes were identified across all 14 strains in all tissues. We do not see a difference in the number of new genes expressed between male gametic and somatic tissue (ANOVA, F(1,13) = 0.04, P>0.8). While there is a significant difference between ovaries and female carcass tissue (ANOVA, F(1,13) = 4.379, P<0.05), the values are low for each line (including being 0 for multiple samples). This suggests that the male comparison is more indicative of the ratio between somatic and gametic tissue.

https://doi.org/10.1371/journal.pgen.1008314.g005

We confirmed cases of new gene formation using the reference-free transcriptome assembly program Trinity v.2.4.0. These transcriptomes were assembled and then aligned to the D. yakuba reference using BLASTn. A total of 38 de novo transcripts (75%) were confirmed by Trinity reference-free transcriptome assembly [29]. However, 5 out of the 13 transcripts that could not be confirmed appear in regions that exhibit multiple rearrangements, making them difficult to confirm with Trinity. Also, Trinity confirmation rates may be reduced when very small exons fail to align to the reference genome at stringent thresholds (S2 Fig). Thus, 75% represents a minimum confirmation rate.

Of the 38 transcripts that were confirmed with Trinity, start codons were located before the breakpoint in 34 (89%) of the transcripts. This would suggest that most of the putative new genes identified are chimeric. Hence, most rearrangements appear to incorporate the 5’ UTR and start codon of a pre-existing transcript, thereby forming de novo exons. These chimeric constructs with de novo exons are a source of new transcript formation that can contribute to variation for gene content in natural populations. As with most new gene formation, we expect many of these new genes to be transient [23–25]. However, some small subset may form genetic variation that may be useful for adaptive change.

Regulatory changes and chromosomal rearrangements

Chromosomal rearrangements can cause expression changes even when exon sequences remain unmodified [16]. To explore such regulatory changes, we used Cuffdiff from the Tophat/Cufflinks gene expression testing suite [30, 31] to identify genes that have a significant change in expression compared to the reference strain. We identified 134 genes within 1 kb of a rearrangement that had significant expression differences in at least one tissue compared to the reference strain. These include 41 genes in the testes, 51 in male carcass, 50 in ovaries, and 36 in female carcass that show differential expression associated with rearrangements. Most changes in gene expression associated with chromosomal rearrangements produce decreased expression (S2 Table). Such gene expression changes have the potential to induce phenotypic changes in natural populations.

Population diversity for chromosomal rearrangements

The number of rearrangements identified per line varies from 96 to 455 total rearrangements in a single strain (Table 1). Low coverage PacBio long molecule data confirmed 80–97% of mutations per strain suggesting a low false positive rate. Sequencing coverage has a strong effect on false negative rates and confirmation rates (see S1 Text). Mutations were polarized against the ancestral state using a BLASTn against D. erecta. We identified 112 (4.7%) rearrangements that represent new mutations in the D. yakuba reference. A total of 54 out of 2368 rearrangements could not be polarized using the existing reference assembly (S1 Text). These are excluded from the site frequency spectrum below. The SFS corrected for false negatives shows that the majority of variants are singletons (Fig 6). This is expected if most of the rearrangements are young and/or have negative fitness.
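A site frequency spectrum of the kind summarized above can be tabulated from per-variant carrier lists. The sketch below is a minimal, uncorrected version: it assumes each rearrangement has already been polarized so that the listed lines carry the derived state, and it does not apply the false-negative correction used for Fig 6. The function name and input layout are hypothetical.

```python
def site_frequency_spectrum(carriers_by_variant, n_lines):
    """Count variants by the number of lines carrying the derived state.

    `carriers_by_variant` maps a variant identifier to the collection of
    line names that carry it. Returns a list in which index k-1 holds the
    number of variants observed in exactly k lines (k = 1..n_lines).
    """
    sfs = [0] * n_lines
    for variant, carriers in carriers_by_variant.items():
        k = len(set(carriers))
        if k == 0:
            continue  # variant not observed in any line
        if k > n_lines:
            raise ValueError(f"{variant} has more carriers than lines")
        sfs[k - 1] += 1
    return sfs
```

With 14 lines, the first entry of the returned list is the number of singletons, the class reported above as the majority of variants.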

Fig 6. Site frequency spectrum of rearrangements found in the 14 lines.

Most of the rearrangements are singletons. However, there is a slight increase in number of rearrangements found in at least 11 of the 14 lines.

https://doi.org/10.1371/journal.pgen.1008314.g006

If rearrangements create novel gene structures or alter gene expression, they may cause phenotypic effects that are subject to natural selection. We wondered whether signatures of selective sweeps might be observed at loci containing rearrangements. Sweep-like signals of negative Tajima’s D, representing a highly skewed SFS for the region, are not overrepresented among rearrangements (S1 Text). Some rearrangements showed Tajima’s D in the bottom 5% of all windows in spite of being singleton variants. These likely represent rearrangements that appeared after the sweep occurred. These low frequency variants are not candidates for adaptive changes. However, we observe 10 rearrangements found in at least 75% of lines that are also associated with sweep-like signals (S1 Text). Hence, rearrangements do not appear to be selectively favored as a class, though some individual rearrangements could be adaptive.
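For readers unfamiliar with the statistic, Tajima's D for a single window can be computed from the sample size, the number of segregating sites, and the mean pairwise diversity using the standard constants from Tajima (1989). The sketch below is a generic implementation, not the code used in this study; the function name is hypothetical, and the inputs are assumed to have been computed per window elsewhere.

```python
import math


def tajimas_d(n, seg_sites, pi):
    """Tajima's D for one window.

    n         : number of sampled sequences (needs n >= 4 for a nonzero variance)
    seg_sites : number of segregating sites S in the window (S >= 1)
    pi        : mean number of pairwise differences in the window
    """
    if n < 4 or seg_sites < 1:
        raise ValueError("need n >= 4 sequences and at least one segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    # Numerator: pairwise diversity minus Watterson's estimate S / a1.
    return (pi - seg_sites / a1) / math.sqrt(variance)
```

A window in which every segregating site is a singleton has low pairwise diversity relative to S and therefore yields a negative D, which is the sweep-like skew referred to above.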

Association with transposable elements

Transposable elements are known to facilitate chromosomal rearrangements in Drosophila [32]. They move DNA from one location to another, sometimes creating duplications. TEs can also facilitate ectopic recombination as repetitive sequence mis-pairs during meiosis or mitosis [32]. We compared our rearrangement calls (corrected for false negatives) with TE calls in these lines described previously [21, 33]. We found that 694 rearrangement calls have a TE within 1 kb of one of the sites of the rearrangement and 215 rearrangements have a TE within 1 kb of both sites of the rearrangement. Overall, 23.7% (1124/4736) of the rearrangement sites lie within 1 kb of a TE. We found that 349 (14.7%) rearrangements have reads that overlap directly with a TE. These rearrangements are confirmed at 86.5–100% in PacBio data, similar to the genome-wide average. Transposable element content in Drosophila is limited compared with other animals. Only 5.5% of the reference genome is composed of TEs [20], though TEs may accumulate in poorly assembled heterochromatic regions. Yet, these selfish genetic elements appear to contribute significantly to polymorphic changes in genome content and organization.
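The site-level and rearrangement-level percentages above follow from the reported counts if each rearrangement is treated as having two sites and the 694 calls are read as having a TE near exactly one site. That reading is an assumption, although it reproduces the 1124/4736 figure. A short check, with hypothetical variable and function names:

```python
TOTAL_REARRANGEMENTS = 2368   # rearrangements called across the 14 lines
TE_NEAR_ONE_SITE = 694        # TE within 1 kb of exactly one site (assumed reading)
TE_NEAR_BOTH_SITES = 215      # TE within 1 kb of both sites
TE_DIRECT_OVERLAP = 349       # supporting reads overlap a TE directly


def te_association_summary():
    """Recompute the TE association fractions from the reported counts."""
    total_sites = 2 * TOTAL_REARRANGEMENTS
    sites_near_te = TE_NEAR_ONE_SITE + 2 * TE_NEAR_BOTH_SITES
    rearrangements_near_te = TE_NEAR_ONE_SITE + TE_NEAR_BOTH_SITES
    return {
        "total_sites": total_sites,
        "sites_near_te": sites_near_te,
        "pct_sites_near_te": 100.0 * sites_near_te / total_sites,
        "rearrangements_near_te": rearrangements_near_te,
        "pct_rearrangements_near_te":
            100.0 * rearrangements_near_te / TOTAL_REARRANGEMENTS,
        "pct_direct_overlap": 100.0 * TE_DIRECT_OVERLAP / TOTAL_REARRANGEMENTS,
    }
```

The per-site figure (about 23.7%) and the per-rearrangement figure (about 38%) describe the same data with different denominators, which is why they differ.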

Genomic distribution of chromosome rearrangement breakpoints

Previous work has noted that the X chromosome is a source of newly transposed transcripts, and sex chromosomes are prone to rearrangements due to repetitive content. An excess of genome structure variants involving the sex chromosomes would leave signals of at least 1 breakpoint lying on the X. We identified the distribution of rearrangement breakpoints within each chromosome arm (Fig 7). We standardized the abundance of rearrangements by the length of chromosome arm. We excluded the 4th chromosome (Muller element F) from our analysis. For rearrangements within a chromosome we excluded abnormally mapping read-pairs less than 1 Mb apart. The chromosome arms have unequal abundance of rearrangement breakpoints per base pair (MANOVA, F(4, 52) = 12.35, P<10^-11) (Fig 7). Within-chromosome rearrangements account for 28% of rearrangements, roughly proportional to the amount of the genome housed in a major chromosome arm. These results suggest that the landing place for rearrangements is not biased towards or away from the same chromosome arm.

Fig 7. Number of rearrangement breakpoints per base pair on each chromosome arm for inbred lines of D. yakuba.

The total number of rearrangement sites on each chromosome varied (ANOVA, F(4,52) = 43.42, P<10^-15). This is mostly due to the fact that the X chromosome has significantly more rearrangement breakpoints than the autosomes (Tukey HSD for each comparison involving the X, P<10^-6). Chromosome 3R had significantly fewer rearrangements than the X, 2L and 2R (Tukey HSD, P < 0.05).

https://doi.org/10.1371/journal.pgen.1008314.g007

The X chromosome has significantly more rearrangement breakpoints per base pair than the 4 major autosomal arms (P<10^-6 for each pairwise comparison; S3 Table). The data reveal that 3R has a reduced number of rearrangement sites compared to the X (P<10^-7; S3 Table), 2L (P<0.05; S3 Table), and 2R (P<0.002; S3 Table). The excess of rearrangements on the X is consistent with previous findings of an abundance of tandem duplications located on the X in D. yakuba [21]. The X chromosome has more repetitive regions that are more susceptible to ectopic recombination (34, 35). When we distinguish rearrangements based on whether they move DNA across different chromosome arms or affect distant regions on single chromosome arms, the X is still overrepresented (S1 Text, S6 Fig, S4 and S5 Tables).

We identified 4 ‘hotspots’ of TE movement that had over 30 rearrangement breakpoints in a 5kb span across 14 lines (Fig 8, S7 Fig). One of these hotspots on 2R lies adjacent to a known inversion breakpoint that is expected to suppress recombination. These hotspots contain sequences matching TE families, consistent with TE proliferation (S1 Text). Most rearrangements at hotspots are singleton variants, and each line has fewer than 10 rearrangements. These results suggest recurrent, independent mutations affecting specific regions of the genome.
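The hotspot criterion above (more than 30 breakpoints within a 5 kb span, pooled across lines) can be expressed as a sliding window over sorted breakpoint positions. The sketch below is one plausible way to apply that threshold, not the authors' procedure; the function name, the input format, and the merging of overlapping windows are assumptions.

```python
from collections import defaultdict


def find_hotspots(breakpoints, span=5_000, min_count=31):
    """Find regions with at least `min_count` breakpoints within `span` bp.

    `breakpoints` is an iterable of (chrom, pos) pooled across all lines.
    Returns a list of (chrom, start, end, max_count_in_window) tuples,
    with overlapping qualifying windows merged into one region.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in breakpoints:
        by_chrom[chrom].append(pos)

    hotspots = []
    for chrom, positions in by_chrom.items():
        positions.sort()
        left = 0
        current = None  # [chrom, start, end, count] for the region being built
        for right, pos in enumerate(positions):
            # Shrink the window until it spans no more than `span` bp.
            while pos - positions[left] > span:
                left += 1
            count = right - left + 1
            if count < min_count:
                continue
            start = positions[left]
            if current is not None and start <= current[2]:
                current[2] = pos                       # extend the region
                current[3] = max(current[3], count)
            else:
                if current is not None:
                    hotspots.append(tuple(current))
                current = [chrom, start, pos, count]
        if current is not None:
            hotspots.append(tuple(current))
    return hotspots
```

Setting `min_count=31` encodes "over 30"; pooling breakpoints across lines means a hotspot can be driven by many singleton variants, consistent with the recurrent, independent mutations suggested above.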

Fig 8. Many rearrangements lie in the same region making it hard to fully elucidate the nature of a particular rearrangement.

For instance, line CY21B3 has 5 rearrangement calls (represented by the connecting lines of the two large sections of chromosome 2R) associated with two regions, 2R:7002500–7005500 and 2R:9895000–9902000. Four separate small regions that are separated by at least 1 sequencing insert size (325 bp) within 2R:7002500–7005500 have reads that pair with 3 separate small regions between 2R:9895000–9902000. All the lines show at least one of these rearrangements, but generally each line has 2–3 separate rearrangement calls between regions 2R:7002500–7005500 and 2R:9895000–9902000. The 9.9 Mb breakpoint lies close to the known inversion breakpoint on 2R where recombination is suppressed.

https://doi.org/10.1371/journal.pgen.1008314.g008

Complex variation

Many of our rearrangements are found in clustered pairs, most likely reflecting two breakpoints of an insertion. If the insertion is large enough our methods will separate them into two different rearrangement calls. In other cases where the insertion is small and roughly equal to the read length, our methods make only a single rearrangement call. Some rearrangements appear to be more complex than a simple rearrangement of one sequence transferring to a new location, a challenge for paired-end read mapping. Among the data, one example stands out as an unusually labile region. Chromosome 2R houses a 2.5 kb region (2R:7003000–7005000) that has up to 5 rearrangements with a 7 kb region 2 Mb upstream (2R:9895000–9902000) (Fig 8). All lines have at least one rearrangement in this region, and 13/14 of the lines have supporting RNASeq data. This region may have undergone a recent selective sweep (2R:7003000–7005000; Tajima’s D = -2.1364) (2R:9895000–9902000; Tajima’s D = -1.8177). Due to the multiple rearrangements affecting this single region, it is difficult to localize changes to transcripts and gene expression using Illumina data. This region was identified previously as containing an inversion [34]. The multiple rearrangement calls suggest that the inversion is possibly accompanied by multiple duplication events, which is also consistent with targeted analysis of this region [34]. Regions such as this one represent dynamic genome sequence with multiple changes in a short time. Further research on complex regions, especially with emerging long read technology, may allow for a better understanding of how genes are affected by multiple relatively recent changes [35]. Such future work may provide an even more complete account of the consequences of chromosomal rearrangements for gene expression and new gene formation.

Discussion

Chromosomal rearrangements are a source of standing variation

We used paired-end Illumina sequence reads to identify chromosomal rearrangements in 14 sample strains derived from natural populations of D. yakuba. We identified genes at these locations that the rearrangements might affect. We identified 2368 rearrangement events within these lines of D. yakuba, indicating that there is substantial standing variation segregating in populations that may provide genetic material for adaptation.

Standing variation is expected to play a considerable role in evolutionary change and adaptive evolution [36]. This variation provides the genetic diversity for a population to quickly adapt to new niches. We further provide evidence that there is significant variation in the presence and locations of rearrangements affecting the standing variation within populations. Also, it appears that the genetic variation from rearrangements is dynamic and complex. Some sites appear to have multiple rearrangement events, and copy number changes are observed at some rearrangement breakpoints (S1 Text, S7 Fig). Further sequencing with long read technology would help advance the understanding of complex locations that are subject to multiple structural changes [35].

The conservative nature of our study offers a lower bound on the number of rearrangements that are in the genome. We required that rearrangements be supported by at least 4 abnormally mapping read pairs. There may be other mutations with lesser support that did not meet these thresholds. We observed at least one case of a new gene being formed that did not meet the standards of our conservative approach, despite strong evidence in high coverage RNASeq data (S2 Fig). Hence, the full span of real biological variation is likely to be far richer than the limited portrait described here. Taken together, this suggests that new gene formation and regulatory changes are an underestimated source of variation in natural populations.

Chromosomal rearrangements are a source of new transcripts

Previous theory has struggled to explain the ways that de novo genes might derive new open reading frames. The canonical progression of new gene formation suggests that many new genes appear as non-coding RNAs due to spontaneous gain of promoters to facilitate transcription [37]. New transcripts would need to acquire translation signals to become fully formed new protein coding genes [9, 37, 38]. Alternative explanations have suggested that pre-existing ORFs in the genome may be primed for translation even prior to transcription [6, 8]. This mechanism raises the question of how translation signals might be recruited prior to transcription.

Here, we present evidence of de novo exons due to chromosomal rearrangements carrying promoters and translation start signals to new locations. New genes that result from such processes offer a clear genetic mechanism to explain new transcription. They also explain how translation signals can be acquired during de novo gene creation, changing expression and protein structure of new genes without multiple intermediary steps. The immediate progression to fully fledged coding sequences can explain how new genes form and how they might produce coding sequences without the need for secondary or tertiary mutations. With fewer mutational steps these genes are certain to form new proteins so long as translation start signals are captured. Hence, these mutations can explain the formation of new peptides without the possibility of loss through pseudogenization or deletion during protogene stages. We have identified 51 possible instances of the creation of de novo exons created from chromosomal rearrangements. Studies of tandem duplications uncovered over 100 combined new genes and 66 duplicated genes, suggesting that tandem duplications may affect gene novelty more than rearrangements in D. yakuba [21].

Genetic principles of rearrangements and new gene formation are likely to extend beyond the Drosophila model. At least one case of new exon formation through rearrangement has been documented in humans, where gene fragments drive expression of previously untranscribed regions [22]. Hence, understanding of these genetic changes in model organisms is likely to offer important information that can be used for future studies on humans. Chromosomal rearrangements in humans are associated with cancers and infertility [39–42], and associated changes in gene copy number or chimera formation can influence risk of disease or evolutionary potential [43]. Additionally, population diversity for new genetic content is essential to explain phenotypic variation within species in nature. Regulatory effects of gene relocation, new protein formation through chimeric genes, and de novo exon formation contribute to genetic changes across organisms. These genetic modifications, including new gene formation, serve as a substrate of genetic novelty that is likely to be important for adaptation to new environments. As environments fluctuate, emerging new genetic material may become essential to facilitate phenotypic change. Surveys of standing variation in genome structure and gene content will therefore lead to better understanding of natural variation, adaptation, and disease.

Chromosomal rearrangements are commonly associated with transposable elements

Chromosomal rearrangements can be the result of multiple mechanisms including ectopic recombination, ectopic DNA repair or gene conversion, template switching during DNA synthesis, and transposable element movement. Transposable elements are a major mechanism of the rearrangements identified. We find that 38% (909/2368) of rearrangements have at least one TE within 1kb. For comparison, less than 5% of the major chromosome arms within D. melanogaster are transposable elements [44]. Transposable elements have been hypothesized as major players in genetic novelty and catalysts in remodeling gene regulation networks [45]. They often contribute sequence homology that can facilitate ectopic recombination. Yet, only 23.5% (12/51) of rearrangements that may have formed a de novo exon are associated with TEs. This suggests that another mechanism such as gene conversion or ectopic recombination is responsible for the new genes formed. However, TEs and rearrangements could influence gene expression without changing the transcript. We found 134 genes that have significant differential expression from the reference within 1kb of identified rearrangements. Of these 134 gene-associated cases, 74 (55%) are also associated with TEs. This suggests that TEs could be catalysts for the changes in gene expression in the genes that have altered expression in association with rearrangements.

Genomic distribution

Sex chromosomes are subject to rapid rearrangement due to high repetitive content and selection to relocate gene content to autosomes. Consistent with these patterns, we observe an excess of rearrangements associated with the X chromosome in D. yakuba. We observe significantly more rearrangement sites on the X chromosome compared to the autosomes. This is consistent with previous findings that show the X chromosome has more structural variants in Drosophila [21, 46]. In D. melanogaster the X chromosome has more repetitive content [47–49], unique gene density [50], and smaller population size [49, 51]. The X chromosome has lower levels of background selection, and contains an excess of sex specific genes [52, 53] compared to autosomes. Among rearrangements creating new transcripts we do not find an overrepresentation of breakpoints associated with the X chromosome, in contradiction with the “out-of-the-X” hypothesis of new gene formation [7, 26]. Power may be limited to detect these effects with small numbers of new genes. Still, it is clear that X chromosome dynamics are unique, making it a prime resource to investigate the role of rearrangements in genome evolution.

In addition to the excess of mutations on the X chromosome, we identify 4 ‘hotspots’ of recurrent, independent mutation. Here, structural variants reshape variation at a single locus, with multiple low frequency variants segregating at the same region. The fact that single regions are mutated independently with unique breakpoints suggests either hypermutability or dynamics of selection on independent mutations similar to proposed ‘soft sweeps’. A similar set of ‘hotspots’ has previously been noted for TE insertions at the locus of klarsicht in Drosophila [54] and in the evolution of pesticide resistance [55–57]. Whether this locus represents a region subject to strong selection or is rather exceptionally labile remains to be determined. We observe that mutations rearranging sequences within chromosome arms, rather than across independent chromosomes, occur in proportion to the amount of DNA housed within the same chromosome arm. These results contrast with gene conversion data in mammals, showing that within-chromosome rearrangements are favored over cross-chromosome recombination during gene conversion [58].

Methods

Fly lines and genome sequencing

We used fastq sequences from previously published genomes (PRJNA215876, also available at https://drive.google.com/drive/u/0/folders/0Bxy-54SBqeekakFpeFBib3BXcVE) of 7 isofemale Drosophila yakuba lines from Nairobi, Kenya and 7 isofemale lines from Nguti, Cameroon (collected by P. Andolfatto 2002) [21]. The reference strain is UCSD stock center 14021–0261.01, and the genome sequence was previously described by the Drosophila Twelve Genomes Consortium (2007). Genome sequencing for the 14 isofemale lines was previously described in ref. 13. Briefly, the wild-caught strains and the D. yakuba reference stock were sequenced with three lanes of paired-end sequencing at the UC Irvine Genomics High Throughput Facility (http://dmaf.biochem.uci.edu).

Sequence alignment and the identification of chromosomal rearrangements

We mapped paired-end genomic reads to the reference genome of D. yakuba r1.5 [20] and the Wolbachia endoparasite sequence (NC_002978.6) with bwa v/0.7.12 [59], using permissive parameters to allow mapping in the face of high heterozygosity in Drosophila (bwa aln -l 16500 -n 0.01 -o 2). The resulting paired-ends were resolved using the “sampe” module of bwa to produce bam files. Each bam file was then sorted using samtools sort v/1.6 [59]. These Illumina paired-end sequences were made with PCR amplified libraries [21]. PCR duplicates can give false confidence in rearrangements through amplification of ligation products that do not represent independent DNA molecules. We used samtools rmdup to remove PCR duplicates. To identify genome structure changes, we used paired-end reads that were at least 1Mb away from each other or located on separate chromosomes (Fig 1). These abnormally mapped paired-end reads indicate possible rearrangements within or between chromosomes. Between 100kb and 1Mb there may be some rearrangements, but there are also inversions and moderately sized duplications (some with secondary deletions). A 1Mb threshold may exclude some variation, but allows greater clarity with respect to mutations and mechanisms that might generate mutations. We selected this stringent cutoff to reduce the possibility of inversions being identified rather than translocations. To be considered as a possible rearrangement, a minimum of 4 independent read pairs must show the same paired-end read pattern (Fig 1). To be clustered together, sets of paired-end reads must be mapped within a distance smaller than the insert size of the library (325 bp) to each other on both rearrangement points. Only rearrangements involving major chromosome arms were considered. All heterochromatic or unplaced chromosomes were excluded.
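The calling rule described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The `ReadPair` record, the function names, and the greedy clustering strategy are assumptions made for illustration, and the input is assumed to be already deduplicated and restricted to major chromosome arms.

```python
from collections import namedtuple

# Hypothetical record for one read pair: chromosome arm and position of each mate.
ReadPair = namedtuple("ReadPair", "chrom1 pos1 chrom2 pos2")

MIN_DISTANCE = 1_000_000  # same-arm mates must map at least 1 Mb apart
INSERT_SIZE = 325         # library insert size (bp), used as the clustering radius
MIN_SUPPORT = 4           # minimum read pairs required per candidate


def is_abnormal(pair):
    """True if mates map to different chromosomes or at least 1 Mb apart."""
    if pair.chrom1 != pair.chrom2:
        return True
    return abs(pair.pos1 - pair.pos2) >= MIN_DISTANCE


def normalize(pair):
    """Order the two ends so that swapped mates compare as the same junction."""
    a, b = (pair.chrom1, pair.pos1), (pair.chrom2, pair.pos2)
    if a > b:
        a, b = b, a
    return ReadPair(a[0], a[1], b[0], b[1])


def close(p, q):
    """Both ends of p lie within the insert size of the matching ends of q."""
    return (p.chrom1 == q.chrom1 and p.chrom2 == q.chrom2
            and abs(p.pos1 - q.pos1) < INSERT_SIZE
            and abs(p.pos2 - q.pos2) < INSERT_SIZE)


def call_rearrangements(pairs):
    """Greedily cluster abnormal pairs and keep clusters with enough support."""
    abnormal = sorted(normalize(p) for p in pairs if is_abnormal(p))
    clusters = []
    for p in abnormal:
        for cluster in clusters:
            if any(close(p, q) for q in cluster):
                cluster.append(p)
                break
        else:
            clusters.append([p])
    return [c for c in clusters if len(c) >= MIN_SUPPORT]
```

In this sketch a read pair joins the first cluster containing any member within 325 bp at both ends. A production implementation would more likely read coordinates from the sorted bam files and use an interval index rather than pairwise comparison.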

Sequencing coverage is a major factor in false negative rates [21] (S8 Fig). When 4 supporting read-pairs are required to call mutations, we observe a strong correlation between depth and number of rearrangements (R² = 0.8223, P < 4.8 × 10⁻⁶) (S4 Fig). When only 3 supporting read-pairs are required to call mutations, there is no correlation between depth and rearrangement calls (R² = 0.009632, P > 0.3) (S9 Fig). Four lines showed an unexpectedly large number of rearrangements when only 3 supporting read-pairs are used (S9 Fig). These sequence data were collected in early Illumina preparations before kit-based sequencing prep was available. Ligation of multiple inserts with high DNA concentration is likely to have produced this pattern. When 4 supporting read-pairs are required, the number of mutation calls fits the expected relative numbers among the lines, negating the effects of errant insert ligation.
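A depth-versus-calls relationship of this kind can be summarized by ordinary least squares, as in the sketch below. The function name and the data points are assumptions made up for illustration; they are not the study's values. The fitted line can also be evaluated at a chosen depth, such as the 93.7X projection described in the next section.

```python
def fit_line(x, y):
    """Ordinary least squares fit y = slope * x + intercept, with R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - residual sum of squares / total sum of squares
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Made-up sequencing depths and rearrangement counts, for illustration only.
depth = [10.0, 20.0, 30.0, 40.0]
calls = [25.0, 45.0, 65.0, 85.0]

slope, intercept, r_squared = fit_line(depth, calls)
projected = slope * 93.7 + intercept  # expected calls at 93.7X under this fit
```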

Estimating false negatives and false positives

The number of structure calls is strongly correlated with depth of coverage of each line (S4 Fig). We estimated the number of rearrangement calls for each line that would be expected at 93.7X coverage using a linear regression model between the number of calls and depth of sequencing. In low coverage data, paired-end reads may underestimate rearrangement numbers by as much as 50%. All flies sequenced were female. Hence, there should not be significant biases against identification of rearrangements involving the X chromosome compared to the autosomes. The lack of coverage in highly repetitive heterochromatic regions will limit the ability to identify rearrangements at those loci. However, our goal was to find rearrangements that change gene structure or expression, while heterochromatic regions are generally less gene dense. Requiring 4 supporting read-pairs may lead to many false negatives in low coverage data. To identify specific cases of false negatives, we surveyed each confirmed rearrangement in each line that did not have a positive call for that rearrangement. If these other sample lines had 1–3 reads supporting a rearrangement, we considered it a false negative in that sample strain. This is expected to identify many of the false negatives, but there may be false negatives without a single abnormally paired read supporting them, which we failed to identify.
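The false negative survey amounts to a three-way classification per line, sketched below. The function name, the status labels, and the example line names are hypothetical; only the thresholds (4 or more pairs for a call, 1–3 pairs for a false negative) come from the text.

```python
MIN_SUPPORT = 4  # read pairs needed for a positive call


def classify_support(read_counts):
    """Classify each line for one confirmed rearrangement.

    read_counts maps a line name to the number of abnormally mapping
    read pairs that support the rearrangement in that line.
    """
    status = {}
    for line, n in read_counts.items():
        if n >= MIN_SUPPORT:
            status[line] = "called"
        elif n >= 1:
            # 1-3 supporting pairs: present but below the calling threshold
            status[line] = "false_negative"
        else:
            # no support: truly absent, or missed entirely
            status[line] = "no_evidence"
    return status
```

For example, `classify_support({"lineA": 6, "lineB": 2, "lineC": 0})` labels the three hypothetical lines as called, false negative, and no evidence, respectively.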

False positive rates were determined by using previously published long read PacBio sequences [21]. PacBio sequencing was done for 4 lines: NY73, NY66, CY17C, and CY21B3. This sequencing experiment was done in the early stages of long read sequencing and thus coverage depth for each line is between 5X and 10X. We matched PacBio sequence reads to the D. yakuba reference using BLASTn with the repetitive DNA filter turned off and an E-value cutoff of 10⁻¹⁰. If a single molecule read matched in BLASTn within 2kb of both sides of the genomic rearrangement call, it was considered confirmed. The number of rearrangements that were not confirmed divided by the total number of rearrangements for that line provides us with an estimate of the false positive rate.
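The confirmation rule and the resulting rate can be sketched as follows. The data layout (tuples of chromosome, start, end for each BLASTn hit of a long read) and all function names are assumptions for illustration, not the authors' code.

```python
WINDOW = 2000  # a hit must fall within 2 kb of a breakpoint


def side_hit(hit, chrom, pos):
    """True if an alignment (chrom, start, end) lies within WINDOW of pos."""
    hit_chrom, start, end = hit
    return hit_chrom == chrom and start <= pos + WINDOW and end >= pos - WINDOW


def is_confirmed(call, read_hits):
    """A call (chromA, posA, chromB, posB) is confirmed when a single
    long read has hits near both sides of the rearrangement."""
    chrom_a, pos_a, chrom_b, pos_b = call
    for hits in read_hits.values():
        near_a = any(side_hit(h, chrom_a, pos_a) for h in hits)
        near_b = any(side_hit(h, chrom_b, pos_b) for h in hits)
        if near_a and near_b:
            return True
    return False


def false_positive_rate(calls, read_hits):
    """Unconfirmed calls divided by total calls for one line."""
    unconfirmed = sum(1 for c in calls if not is_confirmed(c, read_hits))
    return unconfirmed / len(calls)
```

Because long read coverage was only 5X to 10X, a rate computed this way treats every call lacking a spanning read as unconfirmed, so it is best read as an upper estimate.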

Polarization of the ancestral state

All the rearrangements identified are polymorphic in populations and are expected to be relatively new changes. However, each rearrangement was determined relative to the reference strain. Therefore, it is possible that the rearrangement identified could represent a new rearrangement or the ancestral state that has been rearranged in the reference strain. To polarize the rearrangements, we acquired sequences 1kb upstream and 1kb downstream of each rearrangement site. These sequences were then matched to the D. erecta reference genome using a blastn [20].

If the two sides of a rearrangement aligned within 2kb of each other on the same chromosome in D. erecta, it was determined that the rearrangement call is the ancestral allele and the reference has the derived allele. Rearrangements that are shared across species will accumulate nucleotide differences. Therefore, hits must have a minimum of 85% nucleotide identity and must span at least a segment of the rearrangement call breakpoints as defined by abnormally mapping Illumina reads. Rearrangements are commonly associated with transposable elements and repetitive elements, so if the two sides of a rearrangement map close to each other in more than 10 locations, the ancestral state could not be determined.
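The polarization logic can be expressed as a small decision function. This is a sketch under stated assumptions: each hit is reduced to a hypothetical tuple of (chromosome, position, percent identity) in D. erecta, the breakpoint-spanning requirement is assumed to have been applied upstream, and the return labels are invented for illustration.

```python
def polarize(left_hits, right_hits, max_gap=2000,
             min_identity=85.0, max_locations=10):
    """Decide whether the outgroup supports the sample arrangement.

    left_hits and right_hits are lists of (chrom, position, identity)
    for the sequences flanking each side of the rearrangement.
    """
    left = [h for h in left_hits if h[2] >= min_identity]
    right = [h for h in right_hits if h[2] >= min_identity]
    # Distinct places where the two sides sit within max_gap of each other.
    adjacent = {
        (l[0], l[1], r[1])
        for l in left
        for r in right
        if l[0] == r[0] and abs(l[1] - r[1]) <= max_gap
    }
    if len(adjacent) > max_locations:
        return "ambiguous"          # repetitive: ancestral state undetermined
    if adjacent:
        return "reference_derived"  # sample call matches the outgroup
    return "no_outgroup_support"    # no evidence that the call is ancestral
```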

Gene expression changes

We used previously published RNA sequences (13, 14) to identify gene expression changes and new gene formation associated with genome structure changes. Briefly, RNASeq samples were prepared from virgin flies collected within 2 hrs. of eclosion, then aged 2–5 days post eclosion before dissection. Available data include ovaries and headless carcass for adult females, and testes plus accessory glands (abbreviated hereafter as testes) and headless carcass for adult males. Sequence data are available in the NCBI SRA under PRJNA269314 and PRJNA196536.

We aligned RNASeq fastq data to the D. yakuba reference genome using Tophat v.2.1.0 and Bowtie2 v.2.2.9 [60]. We utilized the Tophat-fusion search algorithm [61] to identify transcripts that represent fusion gene products either between chromosomes or rearrangements within chromosomes. To confirm fusion events, RNASeq fastq data were assembled reference-free into a transcriptome using Trinity v.2.4.0 [29]. Each transcriptome was then matched to the D. yakuba reference using BLASTn with the repetitive DNA filter turned off and an E-value cutoff of 10⁻¹⁰. All genomic mutations are identified as differences between sample and reference strains. Hence, the RNAseq coverage in the reference serves as a ‘control’ to help identify new genes formed at rearrangement breakpoints.

Identifying fusion transcripts and gene expression changes

Genomic rearrangement calls were matched to fusion calls from Tophat fusion [61] for testes, ovaries, male carcass, and female carcass. If the two sides of a supported rearrangement were within 1kb of three Tophat fusion reads or read-pairs (Fig 2), the rearrangement was considered a candidate for de novo exon formation. Gene annotations in D. yakuba r1.5 within 1 kb of each location of the RNA-supported genomic rearrangement calls were identified. Rearrangements where one side is located near a gene and the other side is not were of particular interest for the creation of de novo exons.

Gene expression at each rearrangement was quantified using coverage depth divided by total mapped reads, analogous to FPKM correction. Each of the four tissues described above (testes, male carcass, ovaries, female carcass) was screened for sequence expression differences associated with the rearrangements. Regions that have unique expression patterns associated with rearrangement calls are considered new transcripts. When a rearrangement brings together a gene and a noncoding locus, new transcription in the noncoding region is indicative of new genes.
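The depth normalization can be written in a few lines. The sketch below assumes per-site depths over a region as input and multiplies by one million for readability; the scale factor and the function name are assumptions, not part of the described method.

```python
def normalized_depth(depths, total_mapped_reads, scale=1_000_000):
    """Mean per-site RNASeq depth in a region, divided by the total
    number of mapped reads in the library and multiplied by `scale`."""
    if not depths:
        return 0.0
    mean_depth = sum(depths) / len(depths)
    return mean_depth / total_mapped_reads * scale
```

Computing this value for the same region in a sample line and in the reference strain puts libraries of different sizes on a common footing before expression near a breakpoint is compared.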

These new genes were further confirmed using the reference-free transcript assembler, Trinity. Each transcript was compared to the D. yakuba reference using BLASTn with the repetitive DNA filter turned off and an E-value cutoff of 10⁻¹⁰. Transcripts that matched both ends of the rearrangement were considered confirmation.

We used previously published data [18] from the Cuffdiff program of the Cufflinks differential expression program [30] to search for regulatory changes in genes near chromosomal rearrangements. These data rely on previously published gene and transcript annotations from the same RNASeq data [62]. We compared gene expression of each gene versus the reference strain. Genes that were within 1kb of a chromosome rearrangement call and had significant change from the reference strain were identified.
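The proximity filter described here can be sketched as below. The gene records are hypothetical dictionaries, and the boolean `significant` field stands in for whatever significance flag the differential expression output provides; neither is taken from the published data files.

```python
def near(gene, chrom, pos, window=1000):
    """True if a breakpoint lies within `window` bp of the gene span."""
    return (gene["chrom"] == chrom
            and gene["start"] - window <= pos <= gene["end"] + window)


def flag_genes(genes, breakpoints, window=1000):
    """Names of significantly changed genes within `window` of any breakpoint.

    genes: list of dicts with keys name, chrom, start, end, significant.
    breakpoints: list of (chrom, position) tuples.
    """
    flagged = []
    for gene in genes:
        if gene["significant"] and any(
            near(gene, chrom, pos, window) for chrom, pos in breakpoints
        ):
            flagged.append(gene["name"])
    return flagged
```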

Gene ontology

Gene ontology was analyzed using DAVID GO analysis software (http://david.abcc.ncifcrf.gov) [63, 64]. We surveyed for overrepresentation of genes within differing functional pathways. Functional groups with an enrichment score greater than 2 were reported. Functional genetic data for D. yakuba remains sparse. To determine functional categories represented, we identified D. melanogaster orthologs as classified in FlyBase and used these as input for gene ontology analysis.

Differences between chromosomes

We analyzed differences among chromosomes using an ANOVA and Tukey's HSD tests with a random block design, using line as the treatment blocks. To tabulate rearrangement sites among the chromosomes, each rearrangement that was within a single chromosome arm counted as 2 sites on that chromosomal arm, while rearrangements between chromosomes counted as 1 site on each of the chromosomes involved. Differences between the chromosome arms for rearrangements within a chromosome arm and for rearrangements between chromosomes were identified individually using an ANOVA and Tukey's HSD tests with the same random block design.
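The site tabulation rule can be made concrete with a short sketch. Representing each rearrangement as a pair of chromosome arm names is an assumption for illustration.

```python
from collections import Counter


def count_sites(rearrangements):
    """Tally breakpoint sites per chromosome arm.

    Each rearrangement is an (arm_a, arm_b) pair. Adding one site per
    end gives 2 sites on the same arm for within-arm events and 1 site
    on each arm for between-arm events.
    """
    sites = Counter()
    for arm_a, arm_b in rearrangements:
        sites[arm_a] += 1
        sites[arm_b] += 1
    return sites
```

For instance, `count_sites([("2L", "2L"), ("2L", "X")])` yields 3 sites on 2L and 1 site on X.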

Population genetics

Estimates of θπ, θW, and Tajima’s D in 5kb windows for D. yakuba (https://github.com/ThorntonLab/DrosophilaPopGenData-Rogers2015) were previously described in ref 46. These estimates excluded sites with missing data, ambiguous sequence, or heterozygous sites. We report population genetic statistics for each window containing rearrangements and new genes in the data presented here.

Supporting information

S1 Fig. RNAseq depth at a rearrangement breakpoint suggesting new gene formation via rearrangement on chromosome 2L.

https://doi.org/10.1371/journal.pgen.1008314.s002

(PDF)

S2 Fig. RNAseq depth at a rearrangement breakpoint suggesting new gene formation via rearrangement on chromosomes 2L and 3L.

https://doi.org/10.1371/journal.pgen.1008314.s003

(PDF)

S3 Fig. Site frequency spectrum of rearrangements associated with fusion transcripts.

https://doi.org/10.1371/journal.pgen.1008314.s004

(PDF)

S4 Fig. Association between sequence coverage and number of rearrangements identified.

https://doi.org/10.1371/journal.pgen.1008314.s005

(PDF)

S5 Fig. Confirmation rates depend on sequence coverage.

https://doi.org/10.1371/journal.pgen.1008314.s006

(PDF)

S6 Fig. Number of rearrangements per bp by chromosome.

https://doi.org/10.1371/journal.pgen.1008314.s007

(PDF)

S7 Fig. Distribution of rearrangements by chromosome.

https://doi.org/10.1371/journal.pgen.1008314.s008

(PDF)

S8 Fig. False negatives vs sequencing coverage.

https://doi.org/10.1371/journal.pgen.1008314.s009

(PDF)

S9 Fig. Methods performance under less stringent read-pair support.

https://doi.org/10.1371/journal.pgen.1008314.s010

(PDF)

S1 Table. Number of new genes on each chromosome in each strain.

https://doi.org/10.1371/journal.pgen.1008314.s011

(PDF)

S2 Table. Number of genes adjacent to rearrangements.

https://doi.org/10.1371/journal.pgen.1008314.s012

(PDF)

S3 Table. Comparison of rearrangements per base pair by chromosome.

https://doi.org/10.1371/journal.pgen.1008314.s013

(PDF)

S4 Table. Comparison of rearrangements per base pair for rearrangements within chromosomes.

https://doi.org/10.1371/journal.pgen.1008314.s014

(PDF)

S5 Table. Comparison of rearrangements per base pair for rearrangements across chromosomes.

https://doi.org/10.1371/journal.pgen.1008314.s015

(PDF)

S6 Table. Total number of rearrangements with coverage increases.

https://doi.org/10.1371/journal.pgen.1008314.s016

(PDF)

S7 Table. Rearrangements with parallel read pairs.

https://doi.org/10.1371/journal.pgen.1008314.s017

(PDF)

S8 Table. Gene ontology terms for genes associated with rearrangements.

https://doi.org/10.1371/journal.pgen.1008314.s018

(PDF)

References

  1. Conant GC, Wolfe KH. Turning a hobby into a job: how duplicated genes find new functions. Nat Rev Genet. 2008;9(12):938–50. pmid:19015656.
  2. Ohno S. Evolution by gene duplication. Berlin, New York: Springer-Verlag; 1970. xv, 160 p.
  3. Long M, Langley CH. Natural selection and the origin of jingwei, a chimeric processed functional gene in Drosophila. Science. 1993;260(5104):91–5. pmid:7682012.
  4. Rogers RL, Hartl DL. Chimeric genes as a source of rapid evolution in Drosophila melanogaster. Mol Biol Evol. 2012;29(2):517–29. Epub 2011/07/21. pmid:21771717; PubMed Central PMCID: PMC3350314.
  5. Zhou Q, Zhang G, Zhang Y, Xu S, Zhao R, Zhan Z, et al. On the origin of new genes in Drosophila. Genome Res. 2008;18(9):1446–55. pmid:18550802; PubMed Central PMCID: PMC2527705.
  6. Begun DJ, Lindfors HA, Thompson ME, Holloway AK. Recently evolved genes identified from Drosophila yakuba and D. erecta accessory gland expressed sequence tags. Genetics. 2006;172(3):1675–81. pmid:16361246; PubMed Central PMCID: PMC1456303.
  7. Levine MT, Jones CD, Kern AD, Lindfors HA, Begun DJ. Novel genes derived from noncoding DNA in Drosophila melanogaster are frequently X-linked and exhibit testis-biased expression. Proc Natl Acad Sci U S A. 2006;103(26):9935–9. pmid:16777968; PubMed Central PMCID: PMC1502557.
  8. Zhao L, Saelao P, Jones CD, Begun DJ. Origin and spread of de novo genes in Drosophila melanogaster populations. Science. 2014;343(6172):769–72. pmid:24457212; PubMed Central PMCID: PMC4391638.
  9. Schlotterer C. Genes from scratch – the evolutionary fate of de novo genes. Trends Genet. 2015;31(4):215–9. pmid:25773713; PubMed Central PMCID: PMC4383367.
  10. Aminetzach YT, Macpherson JM, Petrov DA. Pesticide resistance via transposition-mediated adaptive gene truncation in Drosophila. Science. 2005;309(5735):764–7. pmid:16051794.
  11. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297–303. pmid:20644199; PubMed Central PMCID: PMC2928508.
  12. Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27(21):2987–93. pmid:21903627; PubMed Central PMCID: PMC3198575.
  13. Jaillon O, Aury JM, Brunet F, Petit JL, Stange-Thomann N, Mauceli E, et al. Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature. 2004;431(7011):946–57. pmid:15496914.
  14. Putnam NH, Butts T, Ferrier DE, Furlong RF, Hellsten U, Kawashima T, et al. The amphioxus genome and the evolution of the chordate karyotype. Nature. 2008;453(7198):1064–71. pmid:18563158.
  15. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409(6822):860–921. pmid:11237011.
  16. De S, Teichmann SA, Babu MM. The impact of genomic neighborhood on the evolution of human and chimpanzee transcriptome. Genome Res. 2009;19(5):785–94. pmid:19233772; PubMed Central PMCID: PMC2675967.
  17. Wilson C, Bellen HJ, Gehring WJ. Position effects on eukaryotic gene expression. Annu Rev Cell Biol. 1990;6:679–714. pmid:2275824.
  18. Rogers RL, Shao L, Thornton KR. Tandem duplications lead to novel expression patterns through exon shuffling in Drosophila yakuba. PLoS Genet. 2017;13(5):e1006795. Epub 2017/05/23. pmid:28531189; PubMed Central PMCID: PMC5460883.
  19. Guillen Y, Ruiz A. Gene alterations at Drosophila inversion breakpoints provide prima facie evidence for natural selection as an explanation for rapid chromosomal evolution. BMC Genomics. 2012;13:53. pmid:22296923; PubMed Central PMCID: PMC3355041.
  20. Consortium DG. Evolution of genes and genomes on the Drosophila phylogeny. Nature. 2007;450:203–18. pmid:17994087.
  21. Rogers RL, Cridland JM, Shao L, Hu TT, Andolfatto P, Thornton KR. Landscape of standing variation for tandem duplications in Drosophila yakuba and Drosophila simulans. Mol Biol Evol. 2014;31(7):1750–66. Epub 2014/04/09. pmid:24710518; PubMed Central PMCID: PMC4069613.
  22. Rogers RL. Chromosomal Rearrangements as Barriers to Genetic Homogenization between Archaic and Modern Humans. Mol Biol Evol. 2015;32(12):3064–78. pmid:26399483; PubMed Central PMCID: PMC5009956.
  23. Lynch M, Conery JS. The evolutionary fate and consequences of duplicate genes. Science. 2000;290(5494):1151–5. pmid:11073452.
  24. Hahn MW, Han MV, Han SG. Gene family evolution across 12 Drosophila genomes. PLoS Genet. 2007;3(11):e197. pmid:17997610; PubMed Central PMCID: PMC2065885.
  25. Rogers RL, Bedford T, Hartl DL. Formation and longevity of chimeric and duplicate genes in Drosophila melanogaster. Genetics. 2009;181(1):313–22. Epub 2008/11/19. pmid:19015547; PubMed Central PMCID: PMC2621179.
  26. Betran E, Thornton K, Long M. Retroposed new genes out of the X in Drosophila. Genome Res. 2002;12(12):1854–9. pmid:12466289; PubMed Central PMCID: PMC187566.
  27. Assis R, Bachtrog D. Neofunctionalization of young duplicate genes in Drosophila. Proc Natl Acad Sci U S A. 2013;110(43):17409–14. pmid:24101476; PubMed Central PMCID: PMC3808614.
  28. Kim D, Salzberg SL. TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biology. 2011;12.
  29. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature Biotechnology. 2011;29(7):644. pmid:21572440.
  30. Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012;7(3):562–78. pmid:22383036; PubMed Central PMCID: PMC3334321.
  31. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25(9):1105–11. pmid:19289445; PubMed Central PMCID: PMC2672628.
  32. Montgomery EA, Huang SM, Langley CH, Judd BH. Chromosome rearrangement by ectopic recombination in Drosophila melanogaster: genome structure and evolution. Genetics. 1991;129(4):1085–98. pmid:1783293; PubMed Central PMCID: PMC1204773.
  33. Cridland JM. Structural Variation in the Genomes of Drosophila: University of California, Irvine; 2012.
  34. Ranz JM, Maurin D, Chan YS, von Grotthuss M, Hillier LW, Roote J, et al. Principles of genome evolution in the Drosophila melanogaster species group. PLoS Biol. 2007;5(6):e152. pmid:17550304; PubMed Central PMCID: PMC1885836.
  35. Chakraborty M, VanKuren NW, Zhao R, Zhang X, Kalsow S, Emerson JJ. Hidden genetic variation shapes the structure of functional elements in Drosophila. Nat Genet. 2018;50(1):20–5. pmid:29255259; PubMed Central PMCID: PMC5742068.
  36. Barrett RD, Schluter D. Adaptation from standing genetic variation. Trends Ecol Evol. 2008;23(1):38–44. Epub 2007/11/17. pmid:18006185.
  37. Carvunis AR, Rolland T, Wapinski I, Calderwood MA, Yildirim MA, Simonis N, et al. Proto-genes and de novo gene birth. Nature. 2012;487(7407):370–4. pmid:22722833; PubMed Central PMCID: PMC3401362.
  38. Siepel A. Darwinian alchemy: Human genes from noncoding DNA. Genome Res. 2009;19(10):1693–5. pmid:19797681; PubMed Central PMCID: PMC2765273.
  39. Martin SA, Hewish M, Lord CJ, Ashworth A. Genomic instability and the selection of treatments for cancer. J Pathol. 2010;220(2):281–9. pmid:19890832.
  40. Inaki K, Liu ET. Structural mutations in cancer: mechanistic and functional insights. Trends Genet. 2012;28(11):550–9. pmid:22901976.
  41. De Braekeleer M, Dao TN. Cytogenetic studies in couples experiencing repeated pregnancy losses. Hum Reprod. 1990;5(5):519–28. pmid:2203803.
  42. Martin RH. Cytogenetic determinants of male fertility. Hum Reprod Update. 2008;14(4):379–90. pmid:18535003; PubMed Central PMCID: PMC2423221.
  43. Ionita-Laza I, Rogers AJ, Lange C, Raby BA, Lee C. Genetic association analysis of copy-number variation (CNV) in human disease pathogenesis. Genomics. 2009;93(1):22–6. pmid:18822366; PubMed Central PMCID: PMC2631358.
  44. Kaminker JS, Bergman CM, Kronmiller B, Carlson J, Svirskas R, Patel S, et al. The transposable elements of the Drosophila melanogaster euchromatin: a genomics perspective. Genome Biol. 2002;3(12):RESEARCH0084. pmid:12537573; PubMed Central PMCID: PMC151186.
  45. Chuong EB, Elde NC, Feschotte C. Regulatory activities of transposable elements: from conflicts to benefits. Nature Reviews Genetics. 2017;18(2):71. pmid:27867194.
  46. Cardoso-Moreira M, Emerson JJ, Clark AG, Long M. Drosophila duplication hotspots are associated with late-replicating regions of the genome. PLoS Genet. 2011;7(11):e1002340. Epub 2011/11/11. pmid:22072977; PubMed Central PMCID: PMC3207856.
  47. Bachtrog D, Weiss S, Zangerl B, Brem G, Schlötterer C. Distribution of dinucleotide microsatellites in the Drosophila melanogaster genome. Molecular Biology and Evolution. 1999;16(5):602–10. pmid:10335653.
  48. Mackay TF, Richards S, Stone EA, Barbadilla A, Ayroles JF, Zhu D, et al. The Drosophila melanogaster Genetic Reference Panel. Nature. 2012;482(7384):173–8. Epub 2012/02/10. pmid:22318601; PubMed Central PMCID: PMC3683990.
  49. Andolfatto P. Contrasting patterns of X-linked and autosomal nucleotide variation in Drosophila melanogaster and Drosophila simulans. Molecular Biology and Evolution. 2001;18(3):279–90. pmid:11230529.
  50. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, et al. The genome sequence of Drosophila melanogaster. Science. 2000;287(5461):2185–95. pmid:10731132.
  51. Wright S. Evolution in Mendelian populations. Genetics. 1931;16(2):97–159. pmid:17246615.
  52. Ranz JM, Castillo-Davis CI, Meiklejohn CD, Hartl DL. Sex-dependent gene expression and evolution of the Drosophila transcriptome. Science. 2003;300(5626):1742–5. pmid:12805547.
  53. Huylmans AK, Parsch J. Variation in the X:Autosome Distribution of Male-Biased Genes among Drosophila melanogaster Tissues and Its Relationship with Dosage Compensation. Genome Biol Evol. 2015;7(7):1960–71. Epub 2015/06/26. pmid:26108491; PubMed Central PMCID: PMC4524484.
  54. Cridland JM, Macdonald SJ, Long AD, Thornton KR. Abundance and distribution of transposable elements in two Drosophila QTL mapping resources. Mol Biol Evol. 2013;30(10):2311–27. pmid:23883524; PubMed Central PMCID: PMC3773372.
  55. Karasov T, Messer PW, Petrov DA. Evidence that adaptation in Drosophila is not limited by mutation at single sites. PLoS Genet. 2010;6(6):e1000924. pmid:20585551; PubMed Central PMCID: PMC2887467.
  56. Magwire MM, Bayer F, Webster CL, Cao C, Jiggins FM. Successive increases in the resistance of Drosophila to viral infection through a transposon insertion followed by a Duplication. PLoS Genet. 2011;7(10):e1002337. pmid:22028673; PubMed Central PMCID: PMC3197678.
  57. Schmidt JM, Good RT, Appleton B, Sherrard J, Raymant GC, Bogwitz MR, et al. Copy number variation and transposable elements feature in recent, ongoing adaptation at the Cyp6g1 locus. PLoS Genet. 2010;6(6):e1000998. pmid:20585622; PubMed Central PMCID: PMC2891717.
  58. Ezawa K, Oota S, Saitou N. Proceedings of the SMBE Tri-National Young Investigators' Workshop 2005. Genome-wide search of gene conversions in duplicated genes of mouse and rat. Mol Biol Evol. 2006;23(5):927–40. pmid:16407460.
  59. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754–60. Epub 2009/05/20. pmid:19451168; PubMed Central PMCID: PMC2705234.
  60. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9. pmid:22388286; PubMed Central PMCID: PMC3322381.
  61. Kim D, Salzberg SL. TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biol. 2011;12(8):R72. pmid:21835007; PubMed Central PMCID: PMC3245612.
  62. Rogers RL, Shao L, Sanjak JS, Andolfatto P, Thornton KR. Revised annotations, sex-biased expression, and lineage-specific genes in the Drosophila melanogaster group. G3 (Bethesda). 2014;4(12):2345–51. Epub 2014/10/03. pmid:25273863; PubMed Central PMCID: PMC4267930.
  63. Huang da W, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009;37(1):1–13. pmid:19033363; PubMed Central PMCID: PMC2615629.
  64. Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols. 2009;4(1):44–57. pmid:19131956.