Drosophila Functional Elements Are Embedded in Structurally Constrained Sequences

Modern functional genomics uncovered numerous functional elements in metazoan genomes. Nevertheless, only a small fraction of the typical non-exonic genome contains elements that code for function directly. On the other hand, a much larger fraction of the genome is associated with significant evolutionary constraints, suggesting that much of the non-exonic genome is weakly functional. Here we show that in flies, local (30–70 bp) conserved sequence elements that are associated with multiple regulatory functions serve as focal points to a pattern of punctuated regional increase in G/C nucleotide frequencies. We show that this pattern, which covers a region tenfold larger than the conserved elements themselves, is an evolutionary consequence of a shift in the balance between gain and loss of G/C nucleotides and that it is correlated with nucleosome occupancy across multiple classes of epigenetic state. Evidence for compensatory evolution and analysis of SNP allele frequencies show that the evolutionary regime underlying this balance shift is likely to be non-neutral. These data suggest that current gaps in our understanding of genome function and evolutionary dynamics are explicable by a model of sparse sequence elements directly encoding for function, embedded into structural sequences that help to define the local and global epigenomic context of such functional elements.


Introduction
The molecular function of metazoan genomes has been studied extensively in the last decades, using progressively more extensive and sensitive techniques for profiling genome activity, modeling epigenomic organization and perturbing genome sequences. Genomes have been found to encode regulatory information affecting diverse functions, including gene expression, chromatin structure, recombination and replication. Despite this progress, only a small percentage of the e.g., fly, or human genome is annotated specifically with a well-defined molecular role. Comparative genomics and population genetics studies, however, estimate that 10-15% [1] of the human non-exonic genome evolves under natural selection. The gap between the two estimates is intriguing; what is the function of dozens of megabases with significant fitness contribution that lay in-between?
The genome of the yeast Saccharomyces cerevisiae, which is 200-fold smaller than the human genome, is packed with genes (covering 73% of the genome) leaving almost all of the intergenic genome to be in close proximity (89% within 1 kb or less) to a transcription start site or 39 UTR element. Nevertheless, regulatory sequences with direct functional annotation (e.g. transcription factor binding sites) cover just a small minority of the implicated regulatory sequence. The remaining sequences have seemingly little encoding specificity and are typically viewed as wrappers for the ''real'' regulatory encoding or evolutionary leftovers of ancient functional sequences. Recent progress with the mapping and analysis of local chromatin structure next to transcription start sites showed that nucleosome packaging is correlated with transcription, and likely affecting gene expression and other biological processes. These data suggest that the functionality of regulatory sequences is modulated by the organization of nucleosomes around or over them [2,3]. Interestingly, chromatin organization was found to be associated with the underlying sequence composition, and in particular, GC rich sequences were shown to package nucleosomes more efficiently than AT rich sequences (in-vitro and in-vivo) [4][5][6][7][8][9]. It was suggested that the local GC content (scale of 20 bp) in yeast evolves under selective compensatory dynamics [10][11][12] that stabilize hotspots of high AT content which are also low in nucleosome occupancy. In yeast, a major portion of the nonexonic sequences that are not directly coding for regulatory interactions is involved with (and selected for) defining the structure (i.e. nucleosome organization) of the genome.
Larger genomes encode regulatory information in a more dispersed and sparse fashion. The Drosophila genome is 10-fold larger than the yeast and 20-fold smaller than the human genome. The fly genome was one of the first to be screened genetically and sequenced completely [13]. Recently the genome was thoroughly profiled epigenetically [14,15] and physically [16]. In spite of these efforts, the function of only a small portion of the genomic sequence has been described in detail, while a much larger fraction of more than 40-60% of the fly intergenic genome is estimated to evolve under selection [17][18][19][20]. While recent studies have suggested that fly nucleosome organization is also correlated with the local GC content [21], it is not clear to what extent fly genome sequences control and define the physical structure of the genome at the nucleosome level and beyond, and whether structural considerations can bridge the gap between evolutionary and epigenomic estimations of the genome's functional content [22].
In this study, we show that a substantial fraction of the fly genome is likely to evolve under ''structural'' constraints. We identify putative regulatory sequences by screening highly conserved elements using our recently published parameter rich probabilistic evolutionary model [23] and find that while conserved sequences are AT rich, they are surrounded by GC rich sequences showing high levels of nucleosome occupancy. Using a combination of analysis of divergence among fly species and population genetics in Drosophila melanogaster, we support the idea that about 25% of the fly non-exonic genome consists of sequences that embed highly conserved elements, and likely to evolve under weak selection that maintains localized GC content (i.e. in a scale of 20 bp). According to our model, a punctuated GC content increase around CEs is created and stabilized by the evolutionary dynamics of compensatory GC loss and GC gain substitutions. Considering structural and epigenomic constraints in models of genome evolution may contribute significantly to our understanding of genome function and how this function evolves.

Results
Non-uniform nucleotide composition and accelerated evolutionary substitution rate around fly promoters and DHS elements Genome composition is associated with the underlying genomic functional annotation across multiple taxa, from yeast to mammals [24][25][26]. In particular, the distribution of GC content around gene promoters is known to be significantly different from the genomic background [25]. In flies, promoters show a well-known asymmetric skew in GC content ( Figure 1A), with low levels upstream of the TSS (minima at 2190 bp) and an increase toward the gene body. In contrast, DNase I hypersensitive elements (DHSs), which are strongly correlated with functional regulatory elements [27][28][29][30] show a regional elevation in local GC content (scale of 20 bp) peaking in the 200-400 bp around the DHS element center [31,32] (Figure 1B, See Figure S1A-S1C for other classes of regulatory elements). To explore the evolutionary mechanisms underlying these well studied patterns we used a parameter-rich, context-aware evolutionary model [23], and estimated the substitution rates around DHS elements and promoters. As shown in Figure 1C, evolution within DHS elements is predictably slower than around them [33]. Perhaps more surprisingly, divergence rates around the DHS sites increase to peak levels around 500 bp from the center, before declining to background levels at ,1000 bp. A similar evolutionary acceleration is observed ,150 bp upstream to TSS ( Figure S1D). These simple observations raised some intriguing questions regarding the evolutionary origins and functional significance of the sequences surrounding functional elements in flies, prompting us to explore the evolutionary dynamics within functional elements and around them in more detail.

Conserved elements in flies are characterized by punctuated elevation in local GC content
The evolutionary model we used in order to characterize substitution rates in Drosophila genomes was designed to take full account of substitution rates variation among lineages and genomic contexts [23]. This algorithm learns evolutionary substitutions that are parameterized by flanking nucleotides and performs accurate inference of ancestral divergence within a phylogenetic tree encompassing 12 Drosophila species. This modeling and inference strategy ensures proper control for global biases in substitution rates at high or low GC content regions. We defined foci of genomic conservation as contiguous regions showing at least two-fold decrease in normalized divergence, identifying 67780 conserved elements (CEs) with an average size of 50 bp (and a standard deviation of 19.5, Figure 1D), covering in total ,3% of the D. melanogaster genome (full list is available at http://compgenomics.weizmann.ac.il/tanay/?page_id = 459). These conserved elements are enriched within enhancer associated transcription factors binding sites ( Figure S1E), DHS and other epigenetically primed elements [31] (Table S1). Additionally, the evolutionary model allowed us to define CEs at a single base pair resolution and in a tissue independent fashion, something not possible using current epigenomic data. Remarkably, the high spatial precision led to a refined model of nucleotide composition around functional elements. Regionally elevated GC frequencies were observed as before, but this trend was punctuated by elements spanning 20-30 bp with a very high AT content ( Figure 1E, Figure S1F). Furthermore, the symmetry of the compositions around CEs was skewed, with more A's and C's than G's and T's downstream the elements (see inset of Figure 1E). This observation significantly extends previous reports on associations between conserved sequences and local GC content [34][35][36], taking full advantage of the multi-species data and precision of the evolutionary model. In summary, by studying functional elements at a high spatial resolution, and using divergence statistics to define element boundaries rather than low-resolution epigenetic data, we show that the regional increase of local GC content around these elements is spatially anchored around foci of high AT content. This suggested that high GC content is a property of the genomic context of functional elements rather than a feature of the elements themselves, raising questions as to the possible evolutionary and functional mechanisms forming and using it.

Author Summary
A key challenge in functional genomics is to predict evolutionary dynamics from functional annotation of the genome and vice versa. Modern epigenomic studies helped assign function to numerous new sequence elements, but left most of the genome essentially uncharacterized. Evolutionary genomics, on the other hand, consistently suggests that a much larger fraction of the un-annotated genome evolves under selective pressure. We hypothesize that this function-selection gap can be attributed to sequences that facilitate the physical organization of functional elements, such as transcription factor binding sites, within chromosomes. We exemplify this by studying in detail the sequences embedding small conserved elements (CEs) in Drosophila. We show that, while CEs have typically high AT content, high GC content levels around them are maintained by a non-neutral evolutionary balance between gain and loss of GC nucleotides. This non-uniform pattern is highly correlated with nucleosome organization around CEs, potentially imposing an evolutionary constraint on as much as one quarter of the genome. We suggest this can at least partly explain the above function-selection gap. Weak evolutionary constraints on ''structural'' sequences (at scales ranging from one nucleosome to recently described multi-megabase topological domains) may affect genome evolution just like structural motifs shape protein evolution.
Epigenetic clustering shows that punctuated nucleotide profiles are common among multiple functional classes The conserved elements we characterized above represent a heterogeneous set of functional and epigenetic behaviors. We wished to test if the remarkable nucleotide preferences around CEs are specific to particular epigenetic classes or observed globally. To this end, we clustered CEs given the association of the underlying loci with a representative set of epigenetic profiles (DHS, H3K4Me1, H3K4Me3, H3K27Ac, ORC, Su(Hw), PH, H3K27Me3, and H3K9Me3 (see methods)). We derived classes of CEs that were associated with several previously characterized epigenetic behaviors, including classes associated with promoter marks (Figure 2A, cluster i), enhancer marks (cluster ii,iii) Polycomb repressive domains (cluster iv-viii), origin of replication (cluster ix), and Su(Hw) insulator sites (cluster x,xi). We also identified a class of CEs that lacked enrichment in any of the markers (cluster xii). Detailed analysis summarized in Table S2 shows that this class also lacked enrichment in a comprehensive set of epigenomic markers [15,27,37,38]. Importantly, regardless of the epigenetic context, and despite significant differences in the basal GC content (high at promoters and enhancers, low at uncharacterized elements), all classes of conserved elements exhibited the pattern of punctuated GC elevation that was demonstrated globally ( Figure 2B). To verify that the nucleotide patterns we detected are observed at individual loci rather than being a result of some averaging effect, we clustered sequences at a single base pair resolution directly by their spatial GC profile in and around the CE ( Figure S2). The clusters identified variations over the punctuated GC-elevation patterns (e.g. breaking symmetry to a particular side), but showed that only about 6% of the CE are lacking the punctuated GC elevation pattern. Importantly, analysis of nucleosome occupancy around CEs in the different classes unanimously demonstrated nucleosome depletion at the CE, with elevated nucleosome occupancy in the immediately surrounding regions ( Figure 2C, Figure S2). This observation is consistent with previous reports on nucleosome occupancy distributions at specific epigenetic contexts [4,21,22], and suggests that coupling between sequence composition and nucleosome occupancy is a general phenomenon around CEs. The origins and functional implications of this coupling are not fully understood, but its extent raises hypothesis that the punctuated GC elevation pattern affects the accessibility and chromosomal structure around functional elements.

Punctuated GC content at CEs is correlated with nonneutral GC dynamics
Analysis of substitution rates around CEs showed that as expected, the highly conserved CEs are flanked by sequences with increasingly more diverged substitution rates ( Figure S3A). We observed an increase in divergence toward a peak higher than the background level at around 400 bp from the CE, in a way analogous to the substitution dynamics around DHS elements and TSSs ( Figure 1C, Figure S1D). Regional decrease in substitution rates toward the CE is observed for both GC gaining and GC losing substitution types ( Figure 3A), but the ratio between the rates of these two processes recapitulated the punctuated nucleotide composition described above with an AT gaining regime at the CE, and a GC gaining regime peaking at 40 bp from the CE center ( Figure 3B). This indicates that GC patterns around CEs are a result of a dynamic evolutionary balance rather than static conservation. To test if these non-uniform substitution dynamics represent a neutral mutational process or a more complex scenario, we analyzed spatial coupling between GC gaining and losing in the Drosophila genomes. We reconfirmed that on a scale of 1 kb, substitutions of the same type tend to cluster Nucleotide frequency (y-axis) is depicted versus the distance to the nearest transcription start sites (TSS) (x-axis), obtained from UCSC genome browser. As previously reported, the frequency of A and T reaches a peak 200 bp upstream (left) to the TSS and decreases to the genomic average near the TSS. Also pronounced are AT asymmetry and GC asymmetry downstream to the TSS. B) Nucleotide composition around DHS. Similarly to A) but depicting nucleotide compositions relative to the center of the nearest DNase I hypersensitive sites (DHSs). DHS sites were defined as TSS distal (distance.1 kb) loci which are densely covered by DNase cutting sites (top 0.992 percentile, n = 2652) in at least one of four embryonic stages (stage number 5, 10, 11 and 14). C) Divergence around DHS sites. Shown are inferred (solid black line) and expected (given our substitution model, gray dashed line) substitution rates as a function of the distance to the nearest DHS sites. Rates were estimated from multiple alignments of 12 drosophila genomes (methods) using an evolutionary model that takes into account nucleotide composition biases and other context effects. The minor increase in expected substitution rate on DHSs (a consequence of higher GC content over DHSs), stands in marked contrast to the observed conservation pattern. D) Size distribution of conserved elements. CEs were identified at single base pair resolution as regions showing at least two-fold reduction in divergence compared to the expected rate. The size distribution of the inferred element is depicted, showing that most elements are smaller than 50 bp. E) Nucleotide compositions around conserved elements (CEs) are plotted over two distance scales (1 kb in top figure, 100 bp in inset). Regional (2300 to 300 bp) increase in GC content is observed in the larger scale, but the pattern is punctuated with high AT content (220 to 20 bp) in the smaller scale. doi:10.1371/journal.pgen.1003512.g001 together ( Figure S3B) [39][40][41][42], generating hotspots that increase their GC content during evolution on a given lineage. On a much finer scale (40 bp or less), the data suggest that this regional background regime is dominated by a stronger coupling between opposing substitution types ( Figure 3C and inset of Figure S3B), such that G/C losing substitution are frequently matched by a nearby G/C gaining substitutions (and vice-versa). Coupling of opposing substitutions is predicted by theoretical models of compensatory evolution in which weak selection pushes forward substitutions that compensate for prior fitness reduction caused by proximal substitutions [12,43,44]. We conclude that regions flanking CEs evolve under non neutral evolutionary dynamics that are potentially stabilizing variable local GC content (on a scale of ,20 bp) through compensatory coupling between GC gaining and losing substitutions. The non-neutrality of this regime (reminiscent of the compensatory dynamics in yeast promoters [12]), raises the possibility that sequences embedding fly CEs are functionally constrained to preserve some structural genomic role.
Local GC content is associated with allele frequency of G/C gaining and G/C losing SNPs Within-species quantification of frequency of rare alleles can provide a robust indication for a non-neutral evolutionary regime [45], less affected by variability in mutation rate than quantification of divergence rate between species. To further examine the evolutionary dynamics at loci with high or low local GC content (and in particular in sequences embedding CEs), we analyzed allele frequencies at 1.3 million D. melanogaster SNPs [20,46]. We classified SNPs according to their mutation type (GC gaining/ losing, transition/transversion), and stratified the analysis given the local (20 bp) GC content around the SNP locus. As can be seen in Figure 3D, the frequency of rare alleles in D. melanogaster is strongly correlated with the local GC content in a mutation-type dependent fashion. High frequency of rare alleles is observed when losing a G/C in a GC rich element, or when losing an A/T in an A/T rich element. Since previous studies had suggested that GC polymorphisms are associated with GC content on larger scales than those analyzed here [47] we stratified the allele frequency analysis using GC content estimation at the 200 bp scale ( Figure S3C). We observed that localized GC content correlates with allele frequencies, even when conditioning for regional variation in GC content, pointing toward effects that work on a scale of few dozen bp, such as those characterized around CEs ( Figure 1E). Taken together, divergence and polymorphism data are both consistent with a model of selection on local GC content in flies, which is particularly pronounced in the immediate context of CEs, and is correlated with chromosomal structure but not with any other known coding or TF specificities.

Microsatellites are uniformly enriched near conserved elements
The analysis discussed above focused on evolutionary dynamics at the nucleotide composition level. This approach provides the Figure 2. Epigenetic clustering of conserved elements. A) Classifying CEs according to their epigenetic context. Average enrichments (color coded, white-low, red-high) of selected epigenetic marks (rows) within CEs (columns) are depicted for groups of CEs clustered according to their epigenomic profiles. Due to size limitations, only 100 randomly chosen CEs are shown for each group. B) Average nucleotide compositions around CE clusters. While the basal GC content is variable between the epigenomic clusters, the punctuated GC elevation near conserved elements is pronounced in all of them. C) Nucleosome enrichment around CEs. Average nucleosome occupancy percentile (y-axis) around the CE centers is depicted for each cluster. A consistent depletion in occupancy over the CE center is observed for MNase-seq data of Weber et al. [69] in S2 cells. doi:10.1371/journal.pgen.1003512.g002 simplest description of the structure and evolution of sequences embedding CEs in flies, but it may be biased or incomplete if more complex sequence elements are evolving non-neutrally around CEs. Indeed, several frequent sequence motifs, most notably poly-A/T tracts and CA repeats were suggested before to modulate genome structure through regulation of epigenomic organization [48][49][50][51][52]. In some cases, (e.g., yeast poly-A element), this idea is supported with perturbation experiments [2], but in other cases the role of low-complexity sequence elements is difficult to support directly. We performed exhaustive screening of the k-mer spectrum (Table S3) around CEs, aiming at the identification of sequence patterns that are distributed non-uniformly around the elements center. Although evolutionary unstable ( Figure S4A-S4D), some microsatellite elements like CA repeats are abundant near conserved elements (Figure 4, black curves). Other short repeats (e.g (TA) n , poly A/T tracts) were not found to be frequent in these loci. Despite the seemingly specific enrichment of CA repeats in the context of CEs, normalization of microsatellite frequency given the observed nucleotide compositions around CEs (Figure 4, red curves) indicates that short repeats ((CA) n , (TA) n , (CAA) n , (A/T) n ) are in fact enriched around CEs at remarkably similar levels. This suggests that enrichment of simple repeats is a general property of the sequences flanking CEs, and is unrelated to the properties of particular dinucleotides. Several mechanisms may underlie this phenomenon, including higher rates of short insertion and deletion mutations in regions with an active chromatin structure [53], or alternatively, reduction of the efficiency of purifying selection next to functional element sequences (genetic draft) which is leading to higher sensitivity to microsatellite infiltration [54].

Local GC content correlates with allele frequencies in mammalian genomes
Finally, we surveyed the association between local GC content and allele frequencies in human and mouse, aiming to test if similar trends as those defining the embedding of CEs in flies may affect mammalian genome structure. In mammalian genomes functional elements are sparser, and the estimated overall evolutionary constraint is considerable [1,55] but not as global as it is in flies. Moreover, selection-like effects, which are frequently associated with biased gene conversion [41] are affecting the evolutionary dynamics of large scale (.1 kb) GC content distributions. As shown in Figure S5, the distributions of rare allele frequencies in human (4.05 million SNPs, Figure S5A) and mouse (63.8 million SNPs, Figure S5B), are highly dependent on both regional and local G/C content. Similarly to fly, GC losing minor alleles are rarer at elements with high local (20 bp) GC content, independently of the regional (200 bp) GC content  Figure S3A). B) GC gain and loss balance. Plotted is the ratio of GC gain rate (combining all substitution types A/TRG/C) and total GC gain and loss rate at varying distance from CEs. C) Coupling between GC gain and loss. Using pairwise alignment of D. melanogaster and D. yakuba, we identified for each genomic distance (X axis) all pairs of loci within this distance that are both diverged between the two species. We then computed the fraction of such pairs that compensate an excess of GC in D. yakuba in one locus with an excess of GC in D. melanogaster in the other locus, out of all pairs of diverged loci. This non-parametric test shows that pairs of opposite GC changing substitutions are spatially more coupled as the distance between them decreases. Gray polygon represents binomial 95% confidence interval. For a parametric version of this test see Figure S3B. D) Frequency of rare alleles is correlated with GC content on a scale of 20 bp. We grouped SNPs according to mutation type and plotted the average frequency of rare allele p(0.01,minor allele frequency,0.05) for different levels of local GC content (20 bp). Binomial 95% confidence intervals are depicted as gray curves. See Figure S3D for analysis that is stratified for larger scale (200 bp) GC content. doi:10.1371/journal.pgen.1003512.g003 around them. The opposite trend is observed for GC gaining SNPs. One can hypothesize that structural constraints such as those suggested by the Drosophila data are also active around regulatory elements in mammalian genomes, although their intensity may be decreased due to the sparser genomic structure. Indeed, recent reports on the evolutionary dynamics near mammalian nucleosomes [56] support this idea.

Discussion
Base pair resolution analysis of evolutionary conservation and nucleotide composition in D. melanogaster We have combined inference of context-dependent substitution rates in Drosophila species, with epigenomic data and sequence analysis to show how highly localized, 30-70 bp conserved elements (CEs) in the non-exonic fly genomes are embedded into the genomic sequence in an organized and structured fashion. We showed that CEs are typically located within a few hundred bp of elevated GC content regions, but that the CE itself is punctuating this pattern with a sharp increase in AT content. The substitution patterns supporting this highly non-uniform organization are modulating the processes of GC gain and GC loss such that the immediate flanking of the conserved elements has GC gaining substitutions in relative excess. We discovered local compensatory coupling of GC gaining and GC losing substitution on a scale of up to 20-40 bp (and in contrast to the previously reported regional clustering of substitution types that we observed on a scale of about 100-1000 bp). Furthermore, analysis of polymorphic sites identified a remarkable association between highly local (20 bp) GC content and the allele frequencies of GC gaining and GC losing mutations, a phenomenon that is recapitulated in analysis of human and mouse data.
GC content is amongst the most thoroughly explored and modeled genomic phenomena. It is known to be correlated with recombination rates [57], replication time [58], and a plethora of other proven and hypothesized effects. Importantly, our observations here, which target the highly localized organization of regulatory factors and nucleosome within a scale of few dozen base pairs, are significant even when considering the larger scale GC content trends studied before. In conclusion, short non-exonic conserved sequence elements in flies are focal points of nonrandom and non-uniform genomic organizational units that may cover as much as one quarter of the fly genome ( Figure S5C). The evolutionary process that gives rise to these organizational features is unlikely to be neutral, implicating a large fraction of the genome with natural selection and weak, but significant functionality.

Neutral and non-neutral dynamics near conserved elements: The case of microsatellites
While multiple lines of evidence (divergence patterns, allele frequencies, compensatory coupling) suggested that the evolution of sequences surrounding functional elements in flies is affected by selection, complete characterization of the evolutionary dynamics in these regions is still challenging. For example, powerful selection on the functional elements themselves may result in recurrent selective sweeps or other effects originating from the linkage between the elements' cores and their sequence contexts. These may alter the divergence and polymorphism patterns around the cores of functional elements. Furthermore, DNA replication and repair processes may be affected by the element's polarized chromatin structure, resulting in higher or lower rate of mutation, or in variation of the mutation spectrum. Our data suggest that these considerations play a major role in at least one aspect of the sequence composition around functional elements, namely, the documented enrichment of (CA) n and other microsatellite repeats [48][49][50][51][52]. We showed that the specificity of microsatellite abundance around CEs (e.g., many (CA) n , but few (TA) n ) evens out when considering the nucleotide composition in these regions. Instead of enrichment for specific repeats, that could have suggested these elements have specific functional roles, we observed consistent increase in the frequency of all microsatellites families. In contrast to the patterns of punctuated GC content increase discussed above, CE-linked repeat elements could not be associated with signatures of non-neutral evolution. We therefore hypothesize that the accumulation of microsatellites around CEs is explained by either an increased mutation rate, or decrease in the efficiency of purifying selection due to the powerful evolutionary dynamics and epigenomic characteristics of functional elements.

Functional and structural organization of metazoan genomes
Genomic sequences are the outcome of a multi-faceted set of requirements, molded into form through an indirect and complex evolutionary process. Sequences contain codes that facilitate their functions, but in order to transform these codes into physical molecules, sequences must also contain structural elements that allow the codes to be interpreted [51,59,60]. For example, complementary constraints on function and structure are well known to affect the form of protein coding sequences [61], and these can consequentially be decomposed into structural motifs and functional hotspots. In the study of genome structure and function such considerations are still largely unexplored. Nevertheless, the accumulating evidence on local (e.g. nucleosome organization, [4,21,62]) and global (e.g. chromosomal conformation, [16,63]) structural features of metazoan genomes are showing that while only a small fraction of the genome is directly coding for classical functions (protein coding, regulatory elements), the large remaining fraction of the genome is crucially important for proper function. As shown here, this view is now supported by the sequence composition and evolutionary dynamics detected at sequences embedding functional elements in Drosophila. We hypothesize that classical enhancer elements, which encompass several hundred base pairs and are capable of driving tissue specific transcription or repression given appropriate trans-factor activity, may contain one or few small (order of 10 bp) code words capable of recruiting such factors, embedded into larger and more diffused sequences that support the epigenetic organization of the enhancer region. Such epigenetic organization may involve local nucleosome occupancy and stability, as well as more complex interaction of the enhancer with remote elements (as in promoterenhancer interaction [64,65], or contacts between Polycomb response elements [66]). The resulting model for genome structure and function ( Figure 5) can explain many enigmas of modern genomics, including the discrepancy between the estimated intensity of selection on large genomes and the fraction of properly annotated sequences, and the highly non-uniform sequence content distributions (e.g. GC content, simple repeats) in different genomic regions. We hypothesize that by viewing the genome as a physical molecule, not only the container of abstract information, we will be able to understand and interpret normal or aberrant gene function at new levels, especially when combining sequence analysis with extensive linear and three-dimensional epigenomic data.

Sequence and epigenomic data
Multiple alignments (dm3 version), gene annotation data and annotation of repetitive elements were downloaded from the UCSC genome browser [67]. DHS data of five embryonic stages [27] were downloaded from SRA (SRA012889). Replicates were merged and mapped to dm3 using bowtie [68]. modEncode [15] Chip-Seq tracks were downloaded from modEncode website. Nucleosome data [69] were downloaded from GEO (GSM539582). We also used polycomb group chip-chip data of Schuettengruber et al. [37]. Experimentally validated transcription factor binding sites were downloaded from the REDfly website [70]. Genomic loci which are annotated as exonic or repetitive (defined by UCSC genome browser as LTR, LINE, RNA, DNA, satellite or unknown) were excluded from further analysis to minimize biases associated with genome alignments and short read mapping.

Normalization of epigenomic marks
The epigenomic data was smoothed by averaging the data values in each non-overlapping adjacent twenty base pairs. DHS data were then smoothed in 200 bp windows (with overlaps). All data were then normalized using the following non parametric transformation: for Weber et al: ð Þ x norm~{ log 2 max 1{ rank(x raw ) max(rank(x raw )) ,0:0001 for DHS, modEncode and PcG data ð Þ

Peak definitions
DHS sites were defined as TSS distal non exonic loci (TSS distance.1000) of DNase I levels higher than the 0.992 percentile in at least one of the embryonic stages (stage number 5, 10, 11 or 14). Note that this set was defined to represent the distribution of active regulatory regions, but is not intended to be generating an exhaustive set of elements. We used only TSS distal to DHS solely for the purpose of comparison between DHS and TSS patterns. dRing peaks were defined as non-exonic loci with dRing levels higher than the 0.995 percentile in S2 or BG3 cell lines. H3K27Ac peaks were defined as non-exonic loci with H3K27Ac levels above the 0.995 percentile in one or more embryonic stage (E0-4 h, E4-8 h, E8-12 h, E12-16 h, E16-20 h, E20-24 h).

Screening of conserved elements
We used our recently described context-dependent substitution model to compute conservation scores at one bp resolution across the Drosophila clade (http://compgenomics.weizmann.ac.il/tanay/ ?page_id = 169). As described in Chachick et al [23]  ð Þ ) and lineage (parent(i)Ri). The algorithm then performs ancestral inference and computes accurate posterior probabilities for single nucleotide and triplets of nucleotide at each genomic position and for combinations of ancestral triplets in parent and child species. Based on these posterior probabilities, we computed the number of substitutions that occurred at each genomic locus (integrating over all of the Drosophila lineages) and the number of substitution expected given the probabilistic model. By comparing observed and expected number of substitutions in overlapping windows of 20 bp we estimated the degree of evolutionary conservation while controlling for biases associated with variable divergence rates in different nucleotide contexts.
Conserved hotspots were defined as non-exonic loci where the estimated number of observed substitutions is two-fold smaller from the expected number of substitutions. Following expansion of each hotspot by 10 bp to each direction we merged resulting overlapping intervals. Conserved elements (CEs) were then defined as unified genomic intervals spanning over 30 bp. Average profile of observed and expected divergence scores around regulatory elements are shown in Figure 1C and Figure  S1D.

Plotting sequence profiles around DHS, CEs, and other peaks
Given a loci set L we define, for each genomic location i its minimal distance from L as: Nucleotide composition profiles of Figure 1A, 1B, 1E; Figure 2B; Figure S1A-S1C, S1F; and Figure S2 were generated by measuring the nucleotide frequencies of nonexonic sequences as a function of d(I,L), where L represented the centers of loci of each sets (DHSs, CEs, or other peaks). Note that this procedure retains the chromosomal polarization of these loci and assures that each nucleotide will be counted once.

Clustering by sequence data
To investigate the distribution of nucleotide compositions around CEs at high resolution, it was first important to neutralize the effect of the highly abundant (WS) 2 microsatellites (e.g. (CA) n , (GA) n ). These sequence elements generate strong signatures that were studied separately (see Figure 4). After extracting sequences within a distance of 100 bp from centers of all CEs, each two basepairs were transformed to {0,1,2} by the number of G and C, thereby averaging out the signature of repeats with a 2 bp periodicity. This procedure yielded 67780 vectors of 100 numbers between 0 and 2, which we clustered into 20 groups using a standard k-means algorithm with a Euclidean metric. The clustering did not yield sets which were enriched with 3-bp repeats, and thus it was not needed to further smooth the sequences.

Compute spatial trends of GC gain and GC loss
We used our context aware and lineage specific evolutionary probabilistic model to estimate posterior probabilities for individual nucleotides p S l parent i ð Þ and pairs of parent-child nucleotides at each genomic locus p S l i ,S l parent i ð Þ [71]. For a given set of loci L, (e.g. all loci of a certain distance from CE), we define: Depending on the analysis, loci were further grouped to compute the evolutionary trends around them.

A non-parametric test for spatial association between substitutions
Aligned sequences of D. melanogaster and D. yakuba were extracted from the fly 12 species multiple alignment, omitting loci of insertion and deletion between these species. We computed the GC content difference at each locus (aligning two nucleotides) as follows: We then selected a distance parameter d, excluded loci of NA values, and computed the coupling score for d as: Pr d~PlEnon{exonic D l {D lzd P lEnon{exonic D l j jz D lzd j j using only loci for which the range [l,l+d] had at most four bp gaps and further restricting loci to e.g., sequences flanking CEs. We estimated confidence intervals for observed coupling scores ( Figure  3C) assuming a binomial distribution (n~P lEnon{exonic D l z D lzd À Á =2, a = .05).

A parameter rich test for compensation between substitutions
To validate the spatial association between GC gain and GC loss substitutions, we used posterior substitution probabilities, computed as outlined above, and focused on the dynamics in the D. yakuba lineage. Non exonic genomic sequences were screened for loci of GC gain and GC loss on the D. yakuba lineage by the following criteria:

Fly SNPs analysis
Fly SNP data were downloaded from DGRP website (http://www.hgsc.bcm.tmc.edu/projects/dgrp/ freeze1_July_2010/). This dataset [20] consist of genomic sequences of 162 inbred D. melanogaster lines. SNPs of more than two distinct alleles were excluded. To minimize the contribution of sequencing errors, we also excluded very rare SNPs (p,0.01) and SNPs with low average coverage (,10). We defined the major and minor alleles of each locus by their relative abundance in the population. Non exonic and nonrepetitive SNPs were divided into 20 bins by the local GC content (20 bp) of their genomic location according to the reference genome.
Frequency of rare allele of NRN9 SNP is defined as (N, N9 may be any non-identical nucleotides):

Mammalian SNPs analysis
Human SNP data [72] were downloaded from http://hapmap. ncbi.nlm.nih.gov/downloads/genotypes. Mouse SNP [73] data were downloaded from http://www.sanger.ac.uk/resources/ mouse/genomes/. SNPs that fell within repetitive elements (as defined by UCSC genome browser) were excluded. We defined the major and minor alleles of each locus by their relative abundance in the population. Non-exonic and non-repetitive SNPs were divided into cubic bins by the regional GC content of two scales: 20 bp (5 bins)61 kb (5bins).
Frequency of rare allele of a NRN9 SNP is defined as: T BinGC1kb k k Figure S1 Fly regulatory elements are characterized by increased GC content and non-uniform evolutionary dynamics. A-C) Nucleotide composition around regulatory elements is shown versus the distance (x-axis) to the center of the nearest peak of A) H3K4Me1, B) H3K27Ac and C) dRing. The data showed that GC content is enriched over a region spanning ,0.5 kb around regulatory elements of different classes. D) Increased substitution rates in fly promoters. Shown is substitution probability inferred from analysis of 12 drosophila species as a function of distance from the nearest TSS. While the divergence is lower near TSSs (more conserved), the 1 kb region upstream to TSS is highly diverged comparing to the background rate of evolution, reaching the maximum rate ,150 bp upstream to TSS. E) Enrichment of transcription factor binding sites in CEs. Enrichment of REDfly experimentally validated transcription factor binding sites [70] is shown versus the distance from CE centers. Enrichment score is defined as the log2 ratio of the number of observed REDfly sites within a given distance range and the expected number of sites under uniform genomic distribution. F) Punctuated GC pattern in DHSs with weaker conservation. Shown are nucleotide frequencies versus the distance (x-axis) from the loci of minimum divergence within 1587/2652 DHSs that lack a CE given our current thresholds. To compute the pattern, we centered all elements on the positions with highest conservation score. Although these loci were not classified as CEs according to a stringent threshold, their punctuated GC pattern is significant. (TIF) Figure S2 Classes of CE sequence compositions support the universality of the punctuated GC elevation pattern. CEs were clustered using k-means given their nucleotide composition patterns (Methods). Shown are nucleotide compositions and nucleosome occupancy profiles (y-axis) versus distance to the center of the nearest CE (x-axis) for 20 clusters inferred. We note that the clusters' nucleotide composition profile is varying in basal GC content and in symmetry, but that with the exception of one cluster (XI) all groups of CEs showed GC elevation in their proximity. (TIF) Figure S3 Evolutionary dynamics are associated with local GC content. A) Inferred substitution rates around CEs. Shown are substitution rates of all mutation types versus the distance from the nearest center of conserved element averaged over the melanogaster group lineages. Solid lines are color coded according to the substitution types ARC (black), ARG (red), ART (green), CRA (blue), CRG (cyan), CRT (magenta). Dotted lines represent the complementing mutations. B) Association between GC gain and loss. We quantified the spatial coupling between GC gain and loss substitutions over the D. yakuba lineage, by estimating the rate of one type of substitution (GC gain in black, GC loss in red), conditioned on the existence of another type of substitution at a certain distance from it (X axis, Methods). The ratio between the two types of conditionings provides indication to the extent of coupling showing opposite trends for GC losing and gaining substitutions. Strong compensatory coupling was observed locally (at distances of less than 20 bp, see inset), while on the long range (e.g. .2 kb) we observed coupling of GC losing events (i.e. red curve is below 1). C) Stratification on regional GC content shows that fly allele frequency is strongly associated with local GC content. Shown is frequency of rare alleles (y-axis, see methods) against the local GC content (20 bp, x-axis). SNPs were divided into 5 groups according to their regional 200 bp GC content (columns) and by the mutation type (row). Gray dashed curves represent the 95% confidence intervals. (TIF) Figure S4 Divergence rates around microsatellites. Observed divergence rates (solid curves) and expected divergence rates (dashed curves, based on global genomic rates) versus the distance (x-axis) to the nearest A) (A) 6 B) (AT) 3 or (TA) 3 C) (CA) 3 or (AC) 3 D) (GA )3 or (AG) 3 . Due to their strong enrichment around CEs, the sequence flanking e.g., CA repeats are highly implicated in conservation. As shown in Figure 4, proper normalization suggests that despite this, CA repeats are not more strongly enriched around CEs than other classes of tandem repeats. (TIF) Figure S5 A-B) Polymorphism patterns in human (A) and mouse (B) support non neutral evolution of GC content in mammals.

Supporting Information
Shown is the frequency of rare alleles p(minor allele,0.082) of different mutation types (rows) versus local (20 bp) GC content (Xaxis). Analysis is stratified over regional (1 kb) GC content (columns of panels). Gray curves represent 95% binomial confidence intervals. While biased gene conversion or recombination intensities are known to be correlated with GC content and can thereby indirectly contribute to the association between GC content and the frequency of rare alleles, the data here suggest that small scale GC content is significantly linked with an increase in rare allele frequency independently of these effects. C) Large fraction of the genome is affected by conserved elements and associated structural sequences. Table S1 Enrichment of different epigenetic marks over CEs. Log2 ratios of averages of 315 analyzed epigenomic marks (column A) between CEs and randomized intervals of the same lengths (column B) and comparing to the regional levels (1 kb shift of the intervals) of the same marks (column C). (XLS)