
Analyzed the data: EK AB ES AT. Wrote the paper: EK AB ES AT.

The authors have declared that no competing interests exist.

Evolution maintains organismal fitness by preserving genomic information. This is widely assumed to involve conservation of specific genomic loci among species. Many genomic encodings are now recognized to integrate small contributions from multiple genomic positions into quantitative dispersed codes, but the evolutionary dynamics of such codes are still poorly understood. Here we show that in yeast, sequences that quantitatively affect nucleosome occupancy evolve under compensatory dynamics that maintain heterogeneous levels of A+T content through spatially coupled A/T-losing and A/T-gaining substitutions. Evolutionary modeling combined with data on yeast polymorphisms supports the idea that these substitution dynamics are a consequence of weak selection. This shows that compensatory evolution, so far believed to affect specific groups of epistatically linked loci like paired RNA bases, is a widespread phenomenon in the yeast genome, affecting the majority of intergenic sequences in it. The model thus derived suggests that compensation is inevitable when evolution conserves quantitative and dispersed genomic functions.

Purifying selection is a major force in conserving genomic features. It pushes deleterious mutations to extinction while conserving the specific DNA sequence. Here we show that a large proportion of the yeast genome evolves under compensatory dynamics that conserve genomic properties while modifying the genomic sequence. Such compensatory evolution conserves the local G+C content of the genome, which influences nucleosome organization. Since purifying selection is too weak to eliminate every weakly deleterious mutation in nucleosome bound or unbound sequences, the local G+C content is frequently stabilized by compensatory G+C gaining and G+C losing mutations in proximal loci. Theoretical analysis shows that compensatory evolution is inevitable when natural selection is weak and the genomic feature is distributed over many loci. These results imply that sequence conservation may not always be equated with overall selection. They demonstrate that cycles of weakly deleterious substitutions followed by positive selection for corrective mutations, which were so far studied mostly in RNA coding genes, are observed broadly and profoundly affect genome evolution.

With the complete sequencing of a large number of genomes, and with the rapid progress in the development and application of methodologies for functional annotation of whole genomes

A simple experimentally characterized example of a dispersed genomic encoding involves the effect of DNA sequence on nucleosome organization

Here we analyze patterns of divergence and polymorphisms in yeast intergenic sequences to substantiate an extended model of selection on a dispersed genomic encoding. The analysis shows that yeast low nucleosome occupancy sequences have maintained a high A+T content throughout the evolution of the

The global G+C content of the yeast intergenic genome is about 35% (

Shown is the distribution of G+C content in small (20 bp) bins across intergenic sequences in the

To study the evolutionary dynamics that underlie G+C content heterogeneity and nucleosome occupancy in the yeast genome, we inferred substitution rates and ancestral sequences in the

Shown are the inferred C to T substitution rates for the

Shown is a comparison of the rate of A/T gaining substitutions near inferred sites of A/T-losing (black) and A/T-gaining (red) substitution (

We first studied

An important assumption underlying our evolutionary analysis above is that the evolutionary regime operating in regions that are occupied (or unoccupied) by nucleosomes in the extant

One intriguing possibility that may explain the asymmetry between the rates of A/T-losing and A/T-gaining substitutions in low occupancy sequences is that while A/T-losing mutations are selected against, some can be sustained in the population. Consequently, positive selection is able to push to fixation corrective A/T-gaining mutations (possibly at different genomic positions). If this hypothesis is correct, we can predict that loci near sites of A/T-losing substitutions will be enriched with A/T-gaining substitutions and vice versa. Remarkably, the yeast divergence patterns confirm this prediction. The data reveal that rates of A/T-gaining substitution are accelerated next to sites of observed A/T loss (compared to rates near conserved loci,

The trinucleotide distributions of low occupancy TSS-distal sequences (over 200 bp from an annotated TSS) are generally similar to those in TSS-proximal loci, but some important differences are notable (

The data therefore support a compensatory substitution process that drives G+C content conservation in most TSS-distal loci, in a way analogous to the dynamics at TSS-proximal loci. This is demonstrated by the asymmetric rates of A/T gain and A/T loss, the conservation of G/C content and the compensatory substitution coupling at most ranges of nucleosome occupancy. An exception to this general trend is observed at some of the TSS-distal low occupancy loci. We hypothesize that during the evolution of the

To study the hypothesis that selection on dispersed nucleosome encodings drives asymmetric substitution patterns in yeasts, we devised a simple theoretical model (

We simulated the evolution of a fixed size population of small (20 bp) “genomes” in simple fitness landscapes that depend only on the G+C content of the sequence (

Our evolutionary analysis above supports the idea that high and low nucleosome occupancy sequences in yeast evolve under a selective pressure to maintain their G+C content, or a refined nucleosome sequence potential that is approximated by the average G+C content. According to this scenario, in low occupancy sequences, which are generally A+T-rich, A/T-losing substitutions are weakly selected against, while A/T-gaining substitutions are frequently pushed to fixation by an adaptive force. According to our simulations and to the standard population genetics theory, such selection on A/T-gaining and A/T-losing mutations should affect the distribution of allele frequencies in the population. In low occupancy loci, A/T-losing single nucleotide polymorphisms (SNPs) are expected to have lower allele frequencies than A/T-neutral SNPs, while A/T-gaining SNPs should have higher allele frequencies. Analysis of polymorphic sites in a sample of 39

Data

We classified yeast intergenic regions according to their nucleosome occupancy, and used evolutionary analysis of context-dependent substitution rates to reveal remarkable variability in the evolutionary dynamics of sequences bound and unbound by nucleosomes. Our analysis shows that low occupancy sequences lose A/T nucleotides slowly compared to high occupancy sequences, but gain A/T nucleotides at similar rates. We also observe spatial coupling between substitutions that gain A/Ts and substitutions that lose them, which suggests that a compensatory process preserves G+C content at both high and low occupancy loci. These observations are compatible with a model in which the local G+C content in yeast is conserved through weak quantitative selection. Such weak selection allows occasional fixation of substitutions that disrupt the optimal G+C content of the region, and then responds through adaptive evolution of corrective mutations at the mutated locus or at any of the surrounding genomic positions. Data on allele frequencies of yeast SNPs independently confirm the predictions of such a model. This set of observations proves that the G+C heterogeneity of yeast intergenic sequences is not a consequence of a neutral process and suggests that nucleosome organization may play a major role in this lack of neutrality.

The role of DNA encoded nucleosome occupancy in regulating gene expression is difficult to isolate experimentally, mostly due to the challenge of separating cause and effect inside the complex system involving nucleosomes, remodeling factors and TFs. Previous analysis identified an anti-correlation between nucleosome occupancy and genomic conservation in yeast

One source of evolutionary constraint on yeast intergenic sequences is their interaction with transcription factors. TF binding sites are known to be conserved among yeast species

Here we studied a model in which evolution manipulates sequences in a complex fitness landscape that combines contributions from multiple coupled loci into a single

Multiple alignments of the

Our analysis focused on intergenic genome sequences which are defined based on the SGD gene annotations. Each intergenic locus was defined as

As described in the text, a refined context-dependent substitution model is essential for correctly estimating the different evolutionary dynamics in low G+C content, low occupancy loci and high G+C content, high occupancy loci. We therefore applied a flexible substitution model to perform ancestral inference and learn evolutionary parameters from alignment data (details available upon request). The model included parameters for the substitution rates at each of 16 possible contexts, parameterized by the identities of the 3′ and 5′ flanking nucleotides. Independent substitution rates were assumed for each lineage in a phylogenetic tree, which was fixed throughout the process. We note that the model does not assume parametric constraints on different substitution rates, and infers substitution rates on lineages rather than a global substitution rate matrix and branch lengths. This approach proved more robust, given that a sufficient number of loci was available to robustly learn the parameters for each lineage, and given that the substitution process in the different lineages indicated gradual changes in dynamics that a model using a universal rate matrix could not have accounted for (for example, the extant G+C content of the species we used shows some variability).
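The idea of a context-dependent rate can be illustrated with a minimal Python sketch. It is a deliberate simplification and not the inference procedure used in this work: it assumes that a single ancestral sequence and a single descendant sequence are already known without gaps, and it uses hard counts instead of posterior probabilities. The function name `context_rates` and the input format are hypothetical.

```python
from collections import Counter


def context_rates(ancestor: str, descendant: str) -> dict:
    """Estimate substitution rates per flanking context from one sequence pair.

    Keys are (five_prime, source, target, three_prime). The flanking
    nucleotides are read from the ancestor (a simplifying assumption).
    """
    if len(ancestor) != len(descendant):
        raise ValueError("sequences must be aligned and of equal length")

    context_counts = Counter()   # occurrences of each (5', source, 3') triplet
    subst_counts = Counter()     # observed changes within each triplet

    # The first and last positions lack one flank and are skipped.
    for j in range(1, len(ancestor) - 1):
        five, src, three = ancestor[j - 1], ancestor[j], ancestor[j + 1]
        tgt = descendant[j]
        if any(base not in "ACGT" for base in (five, src, three, tgt)):
            continue  # ignore gaps and ambiguous bases
        context_counts[(five, src, three)] += 1
        if tgt != src:
            subst_counts[(five, src, tgt, three)] += 1

    # Rate = substitutions observed in a context / occurrences of that context.
    return {
        key: count / context_counts[(key[0], key[1], key[3])]
        for key, count in subst_counts.items()
    }


if __name__ == "__main__":
    # The context C-A-T occurs twice in the ancestor and changes once to C-C-T.
    print(context_rates("ACATGCATT", "ACCTGCATT"))
```

For each source nucleotide there are 4 x 4 = 16 flanking contexts, which matches the 16 contexts described above; applying such a count separately to every branch would correspond to the lineage-specific rates.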

To perform ancestral inference, we used a customized loopy belief propagation algorithm on a factor graph approximation of the model

For analysis of the resulting model parameters, each context-dependent substitution rate was averaged with its reverse complement. For example, CAT->CCT is averaged with ATG->AGG. The averaged conditional probabilities are presented in
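The reverse-complement pairing can be made concrete with a short Python sketch. It assumes that rates are stored in a dictionary keyed by (5′ flank, source, target, 3′ flank), that the average is unweighted, and that a missing partner counts as a rate of zero; none of these details is specified in the text, and the helper names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def strand_partner(key):
    """Return the same substitution as read on the opposite strand.

    On the opposite strand every base is complemented and the two flanks
    swap sides, so CAT->CCT becomes ATG->AGG.
    """
    five, src, tgt, three = key
    return (COMPLEMENT[three], COMPLEMENT[src], COMPLEMENT[tgt], COMPLEMENT[five])


def average_with_reverse_complement(rates):
    """Average each context-dependent rate with its reverse-complement partner."""
    keys = set(rates) | {strand_partner(k) for k in rates}
    return {
        k: 0.5 * (rates.get(k, 0.0) + rates.get(strand_partner(k), 0.0))
        for k in keys
    }


if __name__ == "__main__":
    example = {("C", "A", "C", "T"): 0.02, ("A", "T", "G", "G"): 0.04}
    print(average_with_reverse_complement(example))
```

After averaging, both members of a pair carry the same value, so only one of them needs to be reported.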

In order to estimate the theoretical regional G+C content of

To estimate the coupling between A/T gaining and A/T losing substitutions in the yeast genome, we used our probabilistic model to infer at each genomic position j the posterior probability of each type of substitution in the lineage leading to species i from its ancestor (pai):

where s^{j}_{i} denotes the nucleotide at the j'th genomic position of the i'th species in the phylogeny, and s^{j}_{pai} denotes the nucleotide of the ancestor of this species at the same genomic position.

Given the posterior probabilities we computed for each genomic position j the expected numbers of A/T loss and A/T gain events in the sequence preceding it. This was done using a

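where the gain and loss indicators at position k are defined by the nucleotide of the ancestor (s^{k}_{pai}) and the nucleotide of the species (s^{k}_{i}) as follows:

| s^{k}_{pai} | s^{k}_{i} | loss | gain |
|---|---|---|---|
| A/T | C/G | 1 | 0 |
| C/G | A/T | 0 | 1 |
| A/T | A/T | 0 | 0 |
| C/G | C/G | 0 | 0 |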

We then identified all positions with A/T divergence < -0.9 (A/T losing contexts), with A/T divergence > 0.9 (A/T gaining contexts) and with conserved A/T content (background). For each such set we computed the probability of A/T gain and A/T loss substitutions using the same inferred posterior probabilities. By using this approach (conditional probability given the events in the preceding 5 bp) we ensured that each substitution is counted precisely once. By computing the probabilities for similar events (e.g. A/T gain) given different contexts (A/T losing, A/T gaining, or background), we could robustly assess compensation patterns while controlling for the different basal rates of A/T gain and A/T loss and for the general clustering of substitutions in the genome.
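The classification into contexts and the conditional rates can be sketched in Python as follows. The sketch assumes that A/T divergence is the expected number of A/T gains minus the expected number of A/T losses in the preceding window, and that "background" means a window with almost no expected events (the `quiet` cutoff). Both readings, as well as the function names and the `mixed` label, are assumptions made for illustration.

```python
def window_expectations(p_gain, p_loss, horizon=5):
    """Expected A/T gains and losses in the `horizon` positions before each site."""
    gains, losses = [], []
    for j in range(len(p_gain)):
        start = max(0, j - horizon)
        gains.append(sum(p_gain[start:j]))
        losses.append(sum(p_loss[start:j]))
    return gains, losses


def classify_contexts(p_gain, p_loss, horizon=5, threshold=0.9, quiet=0.1):
    """Label each position by the A/T divergence of its preceding window."""
    gains, losses = window_expectations(p_gain, p_loss, horizon)
    labels = []
    for g, l in zip(gains, losses):
        divergence = g - l
        if divergence < -threshold:
            labels.append("losing")
        elif divergence > threshold:
            labels.append("gaining")
        elif g + l < quiet:          # assumed definition of a conserved window
            labels.append("background")
        else:
            labels.append("mixed")   # windows that fit none of the three sets
    return labels


def conditional_rate(p_event, p_eligible, labels, label):
    """Expected events per eligible site among positions with the given label.

    For A/T gain, `p_eligible` is the probability that the ancestral
    nucleotide is G/C; for A/T loss, that it is A/T.
    """
    num = sum(p for p, c in zip(p_event, labels) if c == label)
    den = sum(e for e, c in zip(p_eligible, labels) if c == label)
    return num / den if den > 0 else float("nan")


if __name__ == "__main__":
    p_loss = [0, 1, 0, 0, 0, 0, 0, 0]   # toy posteriors: one loss at index 1
    p_gain = [0, 0, 0, 1, 0, 0, 0, 0]   # and one gain at index 3
    labels = classify_contexts(p_gain, p_loss)
    print(labels)
    print(conditional_rate(p_gain, [1, 0, 1, 1, 0, 1, 1, 1], labels, "losing"))
```

Because each position is conditioned only on the window that precedes it, every substitution enters exactly one conditional count, in line with the argument above.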

To statistically assess the coupling between A/T divergence context and A/T losing/gaining substitutions in the

In addition we counted the numbers of A/T and C/G occurrences in these contexts:

We wished to test whether the spatial compensation effect is significant even given the general clustering of substitutions. Our null hypothesis was therefore:

We tested it using bootstrapping with 100,000 resamples. At each resample, a set of

Analysis of the robustness of the observed compensation patterns for different values of the horizon parameter is shown in

To study the hypothesis that selection on dispersed nucleosome encodings drives asymmetric substitution patterns in yeasts, we devised a simple theoretical model. For clarity we describe here the version of the model for low occupancy sequences. For nucleosome DNA the model is the same apart from the fitness function.

First we used Wright-Fisher dynamics on a population of

L – Genome size (20)

We note that the expected population θ parameter may be estimated from the above parameters (

The simulation was based on the following procedure:

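Initialized the counters N_{A}, N_{G}, N_{A->G} and N_{G->A}, such that the rates will be estimated as N_{A->G}/N_{A} and N_{G->A}/N_{G}.

Updated the counts N_{A} and N_{G}.

Incremented N_{A->G} or N_{G->A} (after the burn-in period) and updated the sequence R.

We end up with counts of A's (N_{A}), counts of G's (N_{G}), and counts of the two substitution types (N_{A->G} and N_{G->A}).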
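A simulation of this kind can be sketched in Python. Only the genome size (20) and the counter names come from the text. The population size, mutation rate, selection intensity, optimum, the exponential penalty on the distance from the optimal G count, and the reading of R as the population consensus sequence are all assumptions chosen for illustration.

```python
import math
import random


def simulate(pop_size=50, genome_len=20, mu=1e-3, s=0.05, gc_opt=4,
             generations=2000, burn_in=500, seed=0):
    """Wright-Fisher simulation over a two-letter alphabet (A = A/T, G = G/C)."""
    rng = random.Random(seed)
    founder = ["G"] * gc_opt + ["A"] * (genome_len - gc_opt)
    pop = [founder[:] for _ in range(pop_size)]
    ref = founder[:]  # tracked sequence R (assumed: the population consensus)
    counts = {"N_A": 0, "N_G": 0, "N_A->G": 0, "N_G->A": 0}

    for gen in range(generations):
        # Mutation: every site flips independently with probability mu.
        for genome in pop:
            for j in range(genome_len):
                if rng.random() < mu:
                    genome[j] = "A" if genome[j] == "G" else "G"

        # Selection and drift: resample parents in proportion to fitness,
        # which depends only on the G count (assumed functional form).
        weights = [math.exp(-s * abs(g.count("G") - gc_opt)) for g in pop]
        pop = [g[:] for g in rng.choices(pop, weights=weights, k=pop_size)]

        # Consensus: a site changes only when a strict majority disagrees with R.
        consensus = []
        for j in range(genome_len):
            n_g = sum(g[j] == "G" for g in pop)
            if 2 * n_g > pop_size:
                consensus.append("G")
            elif 2 * n_g < pop_size:
                consensus.append("A")
            else:
                consensus.append(ref[j])

        if gen >= burn_in:
            counts["N_A"] += ref.count("A")
            counts["N_G"] += ref.count("G")
            for old, new in zip(ref, consensus):
                if old != new:
                    counts["N_A->G" if old == "A" else "N_G->A"] += 1
        ref = consensus

    return counts


if __name__ == "__main__":
    print(simulate())
```

The two rates then follow as N_{A->G}/N_{A} and N_{G->A}/N_{G}, provided the denominators are non-zero.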

These rates are shown in

The _{GC} and the selection intensity

The

Next, we studied the above model analytically in the regime of low mutation rates. In this regime, drift is the dominating mechanism and we can model the process by assuming the population is represented by a single genome (or GC content). Given the definitions above, the rate at which mutations that increase the GC content enter the population is

While the rate of mutations that decrease the GC content is

In such a drift-dominated regime, the fixation probability of a new mutation is:

Where

Thus the set of equations for the dynamics of

Solving this for the steady state

Where

As can be seen in
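The low mutation rate regime can be explored numerically with a short Python sketch. It is not a reproduction of the equations of this section: it assumes a haploid population, Kimura's diffusion approximation for the fixation probability, and a log-fitness that decreases linearly with the distance of the G count from an optimum. All parameter values and function names are hypothetical.

```python
import math


def fixation_probability(s, pop_size):
    """Kimura's diffusion approximation for a single new mutant (haploid)."""
    if abs(s) < 1e-12:
        return 1.0 / pop_size  # neutral limit
    return (1.0 - math.exp(-2.0 * s)) / (1.0 - math.exp(-2.0 * pop_size * s))


def steady_state(genome_len=20, pop_size=100, mu=1e-6, s=0.01, gc_opt=4):
    """Stationary G count and per-site substitution rates under sequential fixation."""
    def log_w(k):
        return -s * abs(k - gc_opt)  # assumed fitness landscape

    # Rates at which the monomorphic population moves from k to k+1 or k-1 G's:
    # (new mutants per generation) x (their fixation probability).
    up = [pop_size * mu * (genome_len - k)
          * fixation_probability(log_w(k + 1) - log_w(k), pop_size)
          if k < genome_len else 0.0
          for k in range(genome_len + 1)]
    down = [pop_size * mu * k
            * fixation_probability(log_w(k - 1) - log_w(k), pop_size)
            if k > 0 else 0.0
            for k in range(genome_len + 1)]

    # A birth-death chain satisfies detailed balance: pi[k] up[k] = pi[k+1] down[k+1].
    pi = [1.0]
    for k in range(genome_len):
        pi.append(pi[-1] * up[k] / down[k + 1])
    total = sum(pi)
    pi = [p / total for p in pi]

    mean_g = sum(k * p for k, p in enumerate(pi))
    return {
        "mean_gc": mean_g,
        # Substitutions per generation divided by the expected number of source sites.
        "rate_A->G": sum(p * u for p, u in zip(pi, up)) / (genome_len - mean_g),
        "rate_G->A": sum(p * d for p, d in zip(pi, down)) / mean_g,
    }


if __name__ == "__main__":
    print(steady_state())
```

At steady state the total numbers of A->G and G->A substitutions balance, so in an A/T-rich sequence the per-site A/T-gaining rate exceeds the per-site A/T-losing rate, which is the kind of asymmetry discussed above.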

We used DNA sequences of 39

Heterogeneous G+C content. A) Shown is the probability density function of the regional G+C content (20 bp windows) over the intergenic S. cerevisiae sequence (black), over simulated intergenic genomes (red, see

(0.32 MB EPS)

Heterogeneous trinucleotide distribution over low and high nucleosome occupancy sequences. A-B) Shown are log ratios of trinucleotide frequencies in low and high occupancy sequences (Y axis) against trinucleotide frequencies in high occupancy sequences (X axis) over TSS proximal sequences (A) and TSS distal sequences (B). Each trinucleotide is depicted by three adjacent color coded squares. Pairs of reverse complemented trinucleotides are averaged and depicted together. In addition to the clear preference of A/T trinucleotides for low occupancy sequences (notice the abundant AAA), we note the differences in G/C trinucleotide preferences between the occupancy groups. C-D) Shown are the log ratios of trinucleotide frequencies (same as A,B) over TSS proximal sequences (C) and TSS distal sequences (D).

(0.39 MB EPS)

Yeast substitution rates are robustly correlated with the flanking nucleotides for all substitution types. Shown are the inferred substitution rates in TSS distal low occupancy sequences for the S. cerevisiae lineage (the gray lineage, X axis), and other sensu stricto lineages (color coded, Y axis), for 16 different flanking nucleotide contexts. The slope of the linear fit (dashed line) for each lineage is roughly proportional to its branch length, but the model allows for differences in the substitution rates among lineages. A) A->C, T->G substitutions B) A->G, T->C substitutions C) A->T, T->A substitutions D) C->A, G->T substitutions E) C->G, G->C substitutions F) C->T, G->A substitutions.

(0.90 MB EPS)

A/T gain and loss substitution rates at low and high occupancy loci. Shown are ratios of all substitution rates in low vs. high occupancy loci (Y axis) plotted against the substitution rates at high occupancy loci (X axis) over TSS proximal (A) and distal sequences (B). Each point represents the rate of one substitution (color coded) in loci flanked by the 3′ and 5′ nucleotide depicted above the data point. C,D) Substitution rates by their A/T dynamics in TSS proximal (C) and distal (D) loci. Error bars depict the standard deviation. The trends are identical over transitions and transversions.

(0.66 MB EPS)

A/T gain and loss dynamics in different lineages of the sensu stricto clade. A-F) A/T loss and A/T gain rates over TSS distal (bars) and proximal (gray ticks) for the lineages leading to the following species: S. cerevisiae (A), S. paradoxus (B), S. mikatae (C), S. kudriavzevii (D), the common ancestor of S. cerevisiae & S. paradoxus (E), and the common ancestor of S. cerevisiae & S. mikatae (F). G-L) Shown is the average G+C content of the following extant species and inferred ancestors, depicted for 10 levels of S. cerevisiae nucleosome occupancy (

(0.40 MB EPS)

G/C trinucleotides in TSS proximal low occupancy loci are more likely to be bound by a transcription factor. Shown is the fraction of G/C trinucleotides that are bound by one of the following transcription factors: REB1, UME6, MSN2, MBP1 within TSS distal high occupancy loci (-H), TSS distal low occupancy loci (-L), TSS proximal high occupancy loci (+H), and TSS proximal low occupancy loci (+L).

(0.25 MB EPS)

Coupling of A/T gaining and A/T losing substitutions at TSS-distal sequences. A) Shown is a comparison of the rate of A/T gaining substitutions near inferred sites of A/T losing (black) and A/T gaining (red) substitution, plotted for different ranges of nucleosome occupancy (X axis). B) Similar analysis of A/T loss substitution rates around inferred A/T gain and A/T loss events.

(0.40 MB EPS)

Theoretical evolutionary model. A-H) Evolutionary simulation in high G+C fitness landscape. Shown are results of a simulation identical to the one described in

(0.40 MB EPS)

Differences in allele frequency between A/T gain and A/T loss SNPs are robust to the rare allele threshold. A–D) Minor allele frequency of non G/C context A/T loss, A/T gain and A/T neutral SNPs across low and high occupancy loci. Shown are the fraction of minor alleles at low occupancy loci with frequencies smaller than 0.14 (A), the fraction of minor alleles at high occupancy loci with frequencies smaller than 0.14 (B), the fraction of minor alleles at low occupancy loci with frequencies smaller than 0.3 (C), and the fraction of minor alleles at high occupancy loci with frequencies smaller than 0.3 (D). E–F) Cumulative distribution function of the minor allele frequency of non G/C context A/T loss, A/T gain and A/T neutral SNPs at low occupancy loci (E) and high occupancy loci (F).

(0.37 MB EPS)

Parsimonious inference validates substitution rate heterogeneity and spatial coupling of A/T gain and loss events. A–B) Shown are A/T gain (blue) and A/T loss (red) substitution rates of the S. cerevisiae lineage inferred using parsimony (flanking context independent). Data are shown for TSS distal (A) and proximal (B) DNA sequences of S. cerevisiae, S. paradoxus and S. mikatae. The A/T losing rate is ∼50% lower in low occupancy compared to high occupancy loci. C–D) Rates of A/T gain and loss events are spatially coupled. Shown is a comparison of the rate of A/T gaining substitution near parsimoniously inferred sites of A/T losing (black) and A/T gaining (red) substitution, plotted for different ranges of nucleosome occupancy (X axis) across TSS-distal (C) and TSS-proximal (D) loci. This analysis is consistent with the context-dependent analysis. E–F) Similar analysis of A/T loss substitution rates around inferred A/T gain and A/T loss events across TSS-distal (E) and TSS-proximal (F) loci.

(0.39 MB EPS)

Spatial coupling between A/T gain and A/T loss (horizon of 1 bp). Shown are the results of an analysis similar to the one shown in

(0.42 MB EPS)

Spatial coupling between A/T gain and A/T loss (horizon of 3 bp). Shown are the results of an analysis similar to the one shown in

(0.39 MB EPS)

Spatial coupling between A/T gain and A/T loss (horizon of 10 bp). Shown are the results of an analysis similar to the one shown in

(0.46 MB EPS)

We thank the members of the Segal and Tanay labs for useful discussions and comments on the manuscript.