Biophysical Fitness Landscapes for Transcription Factor Binding Sites

Phenotypic states and evolutionary trajectories available to cell populations are ultimately dictated by complex interactions among DNA, RNA, proteins, and other molecular species. Here we study how evolution of gene regulation in a single-cell eukaryote S. cerevisiae is affected by interactions between transcription factors (TFs) and their cognate DNA sites. Our study is informed by a comprehensive collection of genomic binding sites and high-throughput in vitro measurements of TF-DNA binding interactions. Using an evolutionary model for monomorphic populations evolving on a fitness landscape, we infer fitness as a function of TF-DNA binding to show that the shape of the inferred fitness functions is in broad agreement with a simple functional form inspired by a thermodynamic model of two-state TF-DNA binding. However, the effective parameters of the model are not always consistent with physical values, indicating selection pressures beyond the biophysical constraints imposed by TF-DNA interactions. We find little statistical support for the fitness landscape in which each position in the binding site evolves independently, indicating that epistasis is common in the evolution of gene regulation. Finally, by correlating TF-DNA binding energies with biological properties of the sites or the genes they regulate, we are able to rule out several scenarios of site-specific selection, under which binding sites of the same TF would experience different selection pressures depending on their position in the genome. These findings support the existence of universal fitness landscapes which shape evolution of all sites for a given TF, and whose properties are determined in part by the physics of protein-DNA interactions.


Introduction
A powerful concept in evolution is the fitness landscape: for every possible genotype there is a number, known as the genotypic fitness, that characterizes the evolutionary success of that genotype [1]. Evolutionary success is typically quantified as the probability of surviving to reproduce, number of offspring, growth rate, or a related proxy [2,3]. The structure of the fitness landscape is key to understanding the evolutionary fates of populations.
Most traditional studies of molecular evolution rely on simplified models of fitness landscapes [3][4][5][6] or empirical reconstructions of landscapes based on limited experimental data [3,[7][8][9][10]. However, fitness landscapes are fundamentally shaped by an intricate network of interactions involving DNA, RNA, proteins, and other molecular species present in the cell. Thus we should be able to cast these landscapes in terms of biophysical properties such as binding affinities, molecular stabilites, and degradation rates. The increasing availability of quantitative highthroughput data on in vitro and in vivo molecular interactions has led to growing efforts aimed at developing models of evolution that explicitly incorporate the underlying biophysics [11][12][13][14][15][16][17][18][19][20][21][22][23][24][25]. These models combine evolutionary theory with physical models of molecular systems, for example focusing on how protein folding stability or specificity of intermolecular interactions shapes the ensemble of accessible evolutionary pathways and steady-state distributions of biophysical phenotypes.
Evolution of gene regulation is particularly well-suited to this type of analysis. Gene activation and repression are mediated by binding of transcription factors (TFs) to their cognate genomic sites. These binding sites are short nucleotide sequences, typically 5-25 bp in length, in gene promoters that interact specifically with TF DNA-binding domains [26]. In eukaryotes, a given TF can have numerous binding sites in the genome, and many genes are regulated by several TFs [26,27]. Understanding TF-mediated regulation is key to understanding complex regulatory networks within eukaryotic cells 2 one of the main challenges facing molecular biology. Moreover, the availability of high-throughput data on the genomic locations of TF binding sites [28][29][30][31] and on TF-DNA energetics [32][33][34][35] make it possible to develop biophysical models of evolution of gene regulation.
Here we consider evolution of TF binding sites in the yeast Saccharomyces cerevisiae. We study how energetics of protein-DNA interactions affect the structure of the binding site fitness landscape. In a significant extension of previous work which analyzed a single yeast TF [22], we consider a collection of 25 S. cerevisiae TFs for which models of TF binding energetics were built using high-throughput in vitro measurements of TF-DNA interactions [35]. We focus on 12 TFs for which sufficient data on genomic sites [31] are also available. We use a model of monomorphic populations undergoing consecutive substitutions [19,[36][37][38] to infer fitness landscapes, as functions of TF-DNA binding energy, from observed distributions of TF binding sites in the yeast genome [22]. In contrast to the previous work [22], we rationalize these fitness landscapes in terms of a simple parametric model based on thermodynamics of TF-DNA binding, obtaining explicit values of effective evolutionary parameters. Our analysis sheds light on the genome-wide importance of TF-DNA interactions in regulatory site evolution.
Moreover, we investigate the hypothesis that universal biophysical constraints, rather than site-specific selective pressures, dominate evolution of regulatory sites. We test the relationship between TF binding energies and various biological properties, such as the essentiality of the corresponding gene [39]. We find no clear relationship between physical and biological properties of TF sites, which indicates that evolution of site energetics is largely insensitive to site-specific biological functions and is therefore driven by global biophysical constraints.

Results
1 Biophysical model of TF binding site evolution 1.1 Energetics of TF-DNA binding. The probability of a binding site to be bound by a TF is given by the Fermi-Dirac function of the free energy E of TF-DNA interaction [40]: where b phys is the physical inverse temperature (&1:69 (kcal=mol) {1 at room temperature), and m phys is the physical chemical potential, a function of the TF concentration. if E&m phys . In the mean-field approximation, each nucleotide makes an additive contribution to the total energy of the site [32]. These contributions are parameterized by an energy matrix, whose entries E s i i give the contribution to the total energy from the nucleotide s i at position i: Energy matrices can be readily generalized to more complex models of sequence-dependent energetics, such as those with contributions from dinucleotides, although here for simplicity we focus on the additive model. 1.2 Evolutionary model. We consider a population with a locus in the monomorphic limit: mutations in the locus are infrequent enough that each new mutation either fixes or goes extinct before a second mutation arises [36]. This approximation is valid in the limit u%(LN log N) {1 , where u is the mutation rate (probability of mutation per base per generation), L is the number of bases in the locus, and N is an effective population size [41]. Indeed, in the monomorphic limit the expected time between new mutations, (LNu) {1 , must be longer than the expected time over which fixation occurs, which is O(N) generations with probability 1=N for mutants that fix, and O(log N) with probability (N{1)=N for mutants that go extinct. Thus the total expected time before the mutant either fixes or goes extinct is O(log N) generations for N&1 [42]. Thus we must have (LNu) {1 &log N or, equivalently, u%(LN log N) {1 . We also assume that the locus is unlinked to the rest of the genome by recombination, and thus we can consider its evolution independently. In evolutionary steady state, the probability that the population has genotype s at the locus is given by [19,37,38] where F (s) is the multiplicative fitness (defined so that the total fitness of a set of independently evolving loci is a product of fitnesses for each one), p 0 (s) is the neutral distribution of sequences (steady state under no selection), and Z is a normalization constant. The exponent n is a ''scaling'' effective population size which is closely related to the standard variance effective population size [38]. For example, n~2(N{1) in the Wright-Fisher model and n~N{1 in the Moran model of population genetics [43]. Conceptually, both n and N measure the strength of genetic drift [36]. The distribution in Eq. 3 is valid for a wide class of population models [38] (see Methods for details). An analogy with statistical mechanics is suggested by rewriting Eq. 3 as a Boltzmann distribution [37]: Here the logarithm of fitness plays the role of a negative Hamiltonian, and the neutral distribution p 0 (s) plays the role of entropy. Typically we expect relatively few sequences with high fitness and many with low fitness; thus mutations will drive the population toward lower fitness, while selection will favor higher fitness. The balance between these two competing forces depends on the effective population size n, which controls the strength of random fluctuations and is analogous to inverse temperature in the Boltzmann distribution.
1.3 Biophysical model of binding site evolution. Since we are primarily interested in the biophysical aspects of binding site evolution, it is more convenient to consider evolution in the space of binding energies by projecting Eq. 3 via the sequence-energy mapping of Eq. 2: Here the binding site fitness F (E) depends only on the binding energy E. As a general ansatz, we will assume that the fitness

Author Summary
Specialized proteins called transcription factors turn genes on and off by binding to short stretches of DNA in their regulatory regions. Precise gene regulation is essential for cellular survival and proliferation, and its evolution and maintenance under mutational pressure are central issues in biology. Here we discuss how evolution of gene regulation is shaped by the need to maintain favorable binding energies between transcription factors and their genomic binding sites. We show that, surprisingly, transcription factor binding is not affected by many biological properties, such as the essentiality of the gene it regulates. Rather, all sites for a given factor appear to evolve under a universal set of constraints, which can be rationalized in terms of a simple model inspired by transcription factor -DNA binding thermodynamics.
depends on the binding energy through the physical binding probability p bound : F (E)~F (p bound (E)). Further, we assume that an organism with an always-bound site (p bound~1 , E?{?) has fitness 1, while an organism with a site that never binds (p bound~0 , E?z?) has fitness f 0 v1. Since real sites are somewhere in between these extremes, a simple hypothesis for the fitness function is an average of these two fitness values weighted by the thermodynamic probabilities of the site being bound or unbound: Equation 6 assumes that the fitness function depends linearly on the TF binding probability p bound , which equals the site's average occupancy. However, this linear dependence may be too restrictive. For example, it does not account for the scenario in which a cell only requires p bound to be above some minimum threshold p min , such that the fitness equals 1 when p bound wp min , and 0 otherwise. To include a wider range of fitness functions, we extend our model in Eq. 6 by treating b and m as effective fitting parameters (b eff and m eff ) that may deviate from their physical counterparts: When b eff~bphys and m eff~mphys , Eq. 7 is equivalent to Eq. 6 and fitness is linearly proportional to p bound , but deviations between these effective and physical parameters introduce nonlinear dependence of fitness on p bound . For example, the case in which p bound must only exceed a minimum threshold p min is equivalent to Eq. 7 with f 0~0 , b eff ??, and m eff~mphys zb {1 phys log((1{p min )=p min ). For the remainder of the paper, we will focus on the effective fitness function of Eq. 7 and infer its parameters from data. Thus for simplicity we will drop the explicit ''eff'' labels on b and m. An important feature of Eq. 3 is that we may invert it to obtain the fitness function in terms of the observed steady-state distributions p(s) and p 0 (s), or p(E) and p 0 (E) in energy space [19]: Thus, given a distribution of evolved binding site sequences p and a neutral distribution p 0 , we can use Eq. 8 to infer the logarithm of the fitness landscape up to an overall scale and shift. This can be done without any a priori knowledge of the shape of the fitness function. Moreover, given a specific functional form of F (E), such as the effective Fermi-Dirac fitness in Eq. 7, we can perform a maximum likelihood fit of the observed sequence distribution to infer values of parameters b, m, n, and f 0 . The resulting fitted function can be evaluated by comparison to the general inference in Eq. 8. When 1{f 0 %1, F (E) n contains an approximate degeneracy in terms of n(1{f 0 ):c, i.e., all fitness functions with constant c are approximately equivalent. Indeed, the steady-state distribution in Eq. 5 depends on the quantity F (E) n , which can be written as if c(1ze {b(E{m) ) {1 %n or, since 0ƒ(1ze {b(E{m) ) {1 ƒ1, if 1{f 0 %1. Therefore in this limit, the steady-state distribution p(s) depends only on the parameter c and not on f 0 and n separately. This degeneracy in the steady-state distribution is not surprising in light of the underlying population genetics, which also provides an interpretation of c. The quantity 1{f 0 is the selection coefficient s between the two phenotypes of the system, e.g., the bound and unbound states of the TF binding site. As discussed above, the quantity n is an effective population size, which sets the strength 1=n of genetic drift. When s%1 and n&1, steady-state properties of the population (e.g., allele frequency distribution, fixation probability) are described by the diffusion limit and mathematically depend only on Ns, or in our model, n(1{f 0 )~c [43,44], which quantifies the strength of selection relative to the strength of drift. When cw1, selection is strong relative to drift, while cv1 indicates that selection is relatively weak. Note that only the absolute magnitude of the selection coefficient s~1{f 0 is required to be small for this degeneracy to hold; the selection strength relative to drift, quantified by c, may still be large.
Two regions of parameter space also exhibit a degeneracy between m and c. If m&E for all site energies E, all of the observed sites are predicted to be highly occupied and p bound (E)& 1{e b(E{m) . We may thus approximate and thus all fitness functions with constant A 1~c e {bm are approximately equivalent. One can thus make m arbitrarily large (while holding A 1 fixed by varying c) without breaking the degeneracy. If m is decreased the degeneracy will eventually break as m&E is violated. A similar degeneracy appears when m%E, as then p bound (E)&e {b(E{m) ; if additionally 1{f 0 %1, then (We can remove an overall factor of f n 0 because the distribution p(E) in Eq. 5 is invariant under an overall rescaling of fitness.) Therefore all fitness functions with A 2~c e bm are approximately equivalent in this case. Here, m can be made arbitrarily negative without breaking the degeneracy.
Thus, parameter fits fall into three cases for different TFs: If m%E, TF-DNA binding energies fit to the right (exponential) end of the Fermi-Dirac function, and we cannot infer a unique m. Similarly, if m&E, TF-DNA binding energies fit to the left (high occupancy) side of the Fermi-Dirac function, and we again cannot infer m precisely. However, if m&E, neither degeneracy holds and a unique m can be inferred. Despite the fact that m cannot always be predicted, we can unambiguously classify each fit into one of these three cases.

Selection strength and its dependence on biophysical
parameters. We now consider how changes to biophysical parameters of the model affect the strength of selection on binding sites. The selection coefficient for a mutation with small change in energy DE is Therefore we can characterize local variations in the strength of selection by considerings s(E)~Dd log F =dED, the per-unit-energy local selection coefficient. For the Fermi-Dirac landscape, we obtaiñ We use the absolute value here since the sign of the selection coefficient is always unambiguous, as the Fermi-Dirac function decreases monotonically with energy. We can also ask how variations in b affect the local strength of selection. Variation ofs s(E) with b depends qualitatively on both E{m and whether f 0 is zero or nonzero. In Fig. 1 we show log F (E),s s(E), and the derivative Ls s Lb~z For f 0~0 ( This situation changes qualitatively in the regime E{mw0 when f 0 =0 (Fig. 1D-F). In this case, for sufficiently large b, increasing b weakens selection. This is different in the case of nonzero f 0 because on the high-energy tail, the fitness is converging to a nonzero number f 0 , and thus selection becomes asymptotically neutral. Hence, when f 0 =0, increasing b only strengthens selection very close to E{m~0. Using Eq. 14, the boundaries in Fig. 1F are given by the solutions of (f 0 z 2 {1)log z~(1zz)(1zf 0 z). This equation can be solved numerically to obtain two solutions, z Ã 1 v1 and z Ã 2 w1. The boundaries in Fig. 1F are thus given by the curves b(E{m)~log z Ã 1 for E{mv0 and b(E{m)~log z Ã 2 for E{mw0. 1.5 Assessment of model assumptions. Two main assumptions inherent in our evolutionary model are monomorphism and steady state. Here, we assess how violating these assumptions affects inference of evolutionary parameters b, m, n, and f 0 . To test this, we generate simulated data sets of binding site sequences evolving under a haploid asexual Wright-Fisher model with the Fermi-Dirac fitness function (Eq. 7; see Methods for details).
Deviations from the monomorphic limit. To test the effects of polymorphism on the accuracy of our predictions, we perform a set of simulations for a range of mutation rates u. Each simulation in the set follows the Wright-Fisher process to the steady state. We construct the observed distribution p obs by randomly choosing a single sequence from the final population of each simulation, which may not be monomorphic for larger u ( Fig. 2A). From p obs , we carry out maximum-likelihood inference of the fitness landscape as a function of energy using Eq. 5 (Fig. 2B), as described in Methods.
Additionally, for each u we record the average number of unique sequences present in the population in steady state and compute the total variation distance (TVD; Eq. 25 in Methods) between p obs (E) and the monomorphic prediction p(E) using Eq. 5 (Fig. 2C). The TVD ranges from zero for identical distributions to unity for completely non-overlapping dis tributions. As expected, at low mutation rates the steady-state distribution and the fitness function match monomorphic predictions well. At higher mutation rates, the TVD starts to increase and Eq. 3 overestimates the fitness of low-affinity sites. The population becomes polymorphic in this limit. With very high mutation rates, p obs approaches the neutral distribution p 0 since the population is largely composed of newly generated mutants which have not experienced selection. A condition for monomorphism in a neutrally evolving population is u%(LN log N) {1 [41]. Using N~1000 and L~10 as in our simulations yields u%1:4|10 {5 in the monomorphic limit, consistent with the results in Fig. 2C.
We also infer parameters b, m, and c with a maximum likelihood fit. As expected, all parameters converge to the exact values in the monomorphic limit ( Fig. 3A-C). When the population is not truly monomorphic, m and b tend to be underestimated on average, with larger variation in inferred values (larger error bars in Fig. 3A,B). For c, polymorphism has no clear bias on the average inferred value, although it also appears to increase the variation.
Deviations from evolutionary steady state. We perform another set of simulations to test the accuracy of our predictions in a population that has not yet reached steady state. We use the same fitness landscape and population size, but fix u to 10 {6 , within the monomorphic limit. At each point in time (measured as the number of generations), we construct p obs as described in Methods (Fig. 2D), and infer the fitness function (Fig. 2E). We also compute the TVD between the observed distribution p obs and the steadystate prediction (Fig. 2F). Over time p obs converges to the steady state (Eq. 3) and the TVD decays to zero, enabling accurate reconstruction of the fitness function in the region E{m&0 (although it still diverges from the exact function in the highenergy tail, where few sequences are available at steady state). The relaxation time is expected to be proportional to u {1 , or 10 6 generations, which is in agreement with Fig. 2F. As the population reaches steady state, accurate inference of the fitness function parameters becomes possible ( Fig. 3D-F). We see that parameters inferred from a population out of steady state tend to underestimate m and c, and overestimate b.

Transcription factor binding sites in yeast
We now turn to considering the evolution of TF binding sites in S. cerevisiae. How well does S. cerevisiae satisfy the assumptions of our evolutionary model? First of all, S. cerevisiae is not a purely haploid organism but rather goes through haploid and diploid stages. In S. paradoxus, most of the reproduction is haploid and asexual with 1000 generations spent in the haploid stage for each generation in the diploid stage, and heterozygosity is low [45]. Based on the analysis of yeast genomes, wild yeast populations show limited outcrossing and recombination and are geographically distinct [46]. Thus, S. cerevisiae may be regarded as haploid to a reasonable approximation, with sufficient recombination during the diploid stages to unlink TF binding sites. This is consistent with our model, which assumes a haploid population and independent evolution of binding sites.
We next consider whether natural populations of S. cerevisiae are within the mutation rate limits required for monomorphism. The  mutation rate for S. cerevisiae has been estimated to be 0:22|10 {9 mutations per bp per cell division [45]. Assuming binding site loci of length L~10, the bound on the effective population size N is 2:7|10 7 , below which the population will be monomorphic. This is close to the estimated effective population size of S. cerevisiae of &10 7 individuals [45], based on the analysis of neutral regions in the yeast genome. Thus it is plausible that S. cerevisiae population sizes are below or near the limit for monomorphism, justifying the use of Eq. 3. Furthermore, in S. cerevisiae and S. paradoxus the proportion of polymorphic sites in a population has been found to be about 0.001 [45,47,48], generally with no more than two alleles segregating at any one site [45]. According to this estimate, we expect about 1% of binding sites of length 10 bp to be polymorphic, corresponding to an average polymorphism of 1.01 in Fig. 2C.
For S. cerevisiae, the equilibration time estimate is u {1 &5|10 9 generations, or about 2|10 6 years with 8 generations per day [49]. This is several times less than the 5-10 million years of divergence time for the most recent speciation event with S. paradoxus [50]. Thus steady state may plausibly be reached over evolutionary times scales for a fast-reproducing organism like S. cerevisiae.
2.1 Site-specific selection. We obtain curated binding site locations in S. cerevisiae from Ref. [31] and energy matrices (EMs) from Ref. [35], as described in Methods. Besides the assumptions of monomorphism and steady state, we also require an ensemble of binding sites evolving under universal selection constraints if we are to infer the fitness landscape using Eq. 5. A collection of sites binding to the same TF is an obvious candidate, since these sites all experience the same physical interactions with the TF. However, it is possible that selection is largely site-specific: rather than evolving on the same fitness landscape, different sites for the same TF may be under different selection pressures depending on which genes they regulate, their position on the chromosome, etc. For example, genes under strong selection might require very reliable regulation, so that their upstream binding sites are selected for tight binding to TFs. In less essential genes, the requirement of high-affinity binding might be relaxed. Before directly applying the evolutionary model, we investigate several of these site-specific scenarios to determine if any are supported by the available data. We perform several direct tests of site-specific selection by searching for correlations between site TF-binding energies and other properties of the site or the gene it regulates.
We classify fitness effects of genes using knockout lethality, which is available in the Yeast Deletion Database [39,51]. This database classifies genes as either essential or nonessential based on the effects of gene knockout, and provides growth rates for nonessential gene knockouts under a variety of experimental conditions. We divide binding sites of each TF in our data set into two groups: those regulating essential genes and those regulating nonessential genes.
In Fig. 4A we compare mean binding energies of sites regulating essential genes with those regulating nonessential genes for each TF. Using a null model as described in Methods, we find no significant difference (at p~0:05 level) between the two groups of sites for any TF except RPN4 (p~0:048). The mean p-value of the null model over all TFs is 0.530; a small number of individuallysignificant p-values is expected as a consequence of multiple testing. In Fig. 4B we compare the variance of the energy of the sites regulating essential and nonessential genes; sites regulating essential genes may be selected for more specific values of binding energy if precise regulation is required. We find no overall trend: for some TFs sites regulating essential genes have more energy variation than those regulating nonessential genes, but for other TFs the situation is reversed.
For the sites regulating nonessential genes, we also correlate the site binding energy with the growth rate of a strain in which the regulated gene was knocked out (Table S1, column B). The Spearman rank correlation between each site's binding energy and the regulated gene's effect on growth rate produces a mean p-value of 0.562. We find no significant correlation for any TF at p~0:05 level except MSN2, with p~0:046. It is also possible that regulation of highly-expressed genes may be more tightly controlled. Indeed, gene expression level is weakly, though significantly, correlated with gene essentiality [52]. We compare the binding energy of sites to the overall expression level of their regulated genes measured in mid-log phase yeast cells cultured in YPD [52] (Table S1, column C), and again find no correlation using the Spearman rank correlation except for DAL80 (p~0:030), with mean p-value of 0.537.
Another measure of the selection pressures on genes is their rate of evolution as measured by K A =K S , the ratio of nonsynonymous to synonymous mutations in a given gene between species. According to the neutral theory of evolution, genes which evolve slowly must be under higher selective pressure, and therefore the sites regulating them might likewise experience stronger selective pressures. As described in Methods, we measure the K A =K S ratio between S. cerevisiae and S. paradoxus protein coding sequences, and compare it to the binding energy of the sites regulating those genes (Table S1, column D). We find very weak Spearman rank correlations for RPN4, GAT1, CAD1, and ATF2, all roughly with p~0:020. We find no other significant correlation at the p~0:05 level, with a mean p-value of 0.404.
Similarly, one might expect sites regulating essential genes to be more conserved. However, we find that the average Hamming distance between corresponding binding sites in S. cerevisiae and S. paradoxus [31] is no different for sites regulating essential genes than for those regulating nonessential genes, as shown in Fig. 4C. Using the null model described in Methods, most TFs are above p~0:05 with the exceptions of PDR3 (p~0:033), with an average p-value of 0.651. Similarly, there is no significant difference in the binding energies of these orthologous sites as determined from the EMs, as shown in Fig. 4D, except for PDR3 (p~0:014), with mean p-value of 0.691.
We can also consider how the essentiality of the TFs themselves affects the sequences of their binding sites; for example, essential TFs may constrain their binding sites to a more conserved sequence motif. We divide 125 TFs from Ref. [31] which had 10 or more sequences and for which essentiality information was available into 16 essential and 109 nonessential TFs using the Yeast Deletion Database [39,51], and calculate the sequence entropy of binding sites for each TF. The distribution of sequence entropies in Fig. 4E shows no significant difference between essential and nonessential TFs (p~0:9 for the null model).
Finally, it is possible that sites experience different selection pressures depending on their distance to the transcription start site (TSS). Again, we find no significant correlations between binding energy and distance to the TSS: Spearman rank correlation yields mean p-value of 0.560 and all p-values above 0.05 except RPN4 (p~0:032) (Table S1, column E). Overall, we find no systematic evidence that site-specific properties of binding sites determine their binding energies. These findings are in broad agreement with a previous report [22], which suggested that site-specific selection can be ruled out because of the significant variation in binding Figure 4. Tests of site-specific selection. We divide binding sites for each TF into two groups: those regulating essential genes and those regulating nonessential genes. (A) Comparison of mean binding energies of sites regulating essential (SET essential ) and nonessential genes (SET nonessential ) for each TF in the data set. (B) Comparison of variance in binding energies for sites regulating essential (V essential ) and nonessential (V nonessential ) genes. (C) Mean Hamming distance between corresponding sites in S. cerevisiae and S. paradoxus for sites regulating essential (SdT essential ) versus nonessential genes (SdT nonessential ). (D) Mean squared difference in binding energy between corresponding sites in S. cerevisiae and S. paradoxus for sites regulating essential (SDE 2 T essential ) versus nonessential genes (SDE 2 T nonessential ). In affinity between orthologous sites of different species, which was found to be consistent with the variance predicted by a model including only drift and site-independent selection.
2.2 Inference of biophysical fitness landscapes. The above analysis indicates that the evolution of binding site energies does not depend significantly on site-specific effects, suggesting that more universal principles govern the observed distribution of sites binding a given TF. Thus, we will fit a single fitness function to a collection of TF-bound sites via Eqs. 3 and 8. Of the 25 TFs considered in the previous section, here we focus on 12 TFs with §13 unique binding site sequences.
First we derive the neutral distribution p 0 (E) of site energies based on mono-and dinucleotide frequencies obtained from intergenic regions of the S. cerevisiae genome, as described in Methods. It has been suggested that L-mers not functioning as regulatory sites (e.g., located outside promoters) may be under evolutionary pressure not to bind TFs [53]; however, consistent with previous reports [22,54], we find that sequences sampled from the intergenic regions of the genome are close to the neutral distribution expected from mono-and dinucleotide frequencies, except for the expected enrichment at low energies due to functional binding sites. This distribution is shown in Fig. 5A for REB1 and in Table S2, column B for all other TFs.
Assuming the observed set of binding site energies for a TF adequately samples the distribution p(E), we can use our estimate of the neutral distribution p 0 (E) in Eq. 8 to reconstruct the fitness landscapes as a function of TF binding energy up to an overall scale and shift (Fig. 6). Although the fitness functions may be noisy due to imperfect sampling of p(E), they nevertheless provide important qualitative insights. In particular, in all cases fitness decreases monotonically as binding energy increases, indicating that stronger-binding sites are more fit. That is, we observe no fitness penalty for binding too strongly, at least within the range of energies spanned by p(E).
Fitted Fermi-Dirac landscapes. For each TF we perform a maximum-likelihood fit of the binding site data to the distribution in Eq. 3 with the Fermi-Dirac landscape of Eq. 7 (Fig. 5 for the REB1 example, Table S2 for all other TFs; see Methods for details). The model of Eq. 7 has four fitting parameters: b, m, n, and f 0 . However, as shown in Sec. 1.3, in the 1{f 0 %1 limit the fitness function depends on c~n(1{f 0 ) rather than f 0 and n separately. Thus we also carry out constrained ''non-lethal'' Fermi-Dirac fits in which f 0 is fixed at 0.99. Note that due to the cm degeneracy, in some cases m effectively fits to the limiting cases m?? or m?{? rather than a specific value. Because we only fit in the range {20vmv0 (see Methods), a value of m&{20 shows that the fit is subject to the m%E degeneracy, while m&0 shows that it is subject to the m&E degeneracy. As mentioned above, the input to each fit is a collection of genomic TF binding sites fsg [31] and the EM predicted on the basis of high-throughput in vitro TF-DNA binding assays [35]. The EM allows us to assign a binding energy E(s) to each site.
A summary of maximum-likelihood parameter values for all TFs is shown in Tables 1 and S2, column D. The variation of loglikelihood with fitting parameters is shown in Table S2, columns G and H. Since for many TFs relatively few binding sites are available, in Table S3 we evaluate the goodness of fit using randomly chosen subsets of binding sites and Hessian analysis. Six of the TFs (REB1, ROX1, MET32, PDR3, CUP9, and MCM1) are in the 1{f 0 %1 regime where only c can be inferred unambiguously. Indeed, nonlethal Fermi-Dirac fits with f 0~0 :99 yield very similar values of loglikelihood and c (Table S2, column D). In all of these cases, c is considerably greater than 1, implying that selection is strong compared to drift and the effective population size is large (the s%1, Ns&1 regime in population genetics).
Five TFs (RPN4, MET31, YAP7, BAS1, and AFT1) have very small values of f 0 ( Table 1), indicating that on average, removing their binding sites is strongly deleterious to the cell. In these cases, the global maximum occurs in the vicinity of f 0~0 , away from the degenerate region of parameter space (Table S2, column H, insets). Note however that the likelihood surface is always degenerate in the region of parameter space with 1{f 0 %1 and c~constant; this is true even when the global maximum likelihood does not occur in that region, as observed for these five TFs. Since 1{f 0 &1, n&c, which is a small value in all five cases (Table 1). Given the strength of selection, small effective population sizes (which indicate that genetic drift is strong) are necessary to reproduce the observed variation in binding site sequences. Finally, sites for STB5 have an intermediate value of f 0~0 :401, which means they are under strong selection but are not necessarily essential.
The fits to the Fermi-Dirac fitness landscapes also provide estimates of the effective inverse temperature b ( Table 1). The inferred values of b can be compared to the physical value at room temperature, b phys &1:69 (kcal=mol) {1 : Ten of the TFs (REB1, ROX1, MET32, MET31, PDR3, YAP7, BAS1, STB5, CUP9, MCM1) have b's lower than the physical value, while in the other two (RPN4, AFT1) bwb phys : In most TFs the fitted inverse temperature b is far from its physical counterpart, although in several cases the likelihood function is fairly flat in the vicinity of the peak, indicating that a wider range of b values is admissible (Table S2, column G).
The inferred value of m relative to the distribution of energies E of the binding sites tells us in which qualitative regime of the Fermi-Dirac fitness landscape the sites lie. For five TFs (ROX1, MET32, PDR3, CUP9, MCM1) E{mw0, so the sites reside on the exponential tail of the landscape; since 1{f 0 %1 as well, they are subject to the m%E degeneracy. For a group of six TFs (REB1, RPN4, MET31, YAP7, BAS1, AFT1), E{m&0, so that the sites lie on the bound-unbound threshold. In this regime, changing the energy of the site through mutations may lead to a large change in fitness. Finally, E{mv0 for STB5, so the sites lie on the highfitness plateau and are subject to the m&E degeneracy. The degeneracies in m are also illustrated in Table S2, column G.
What does b=b phys say about the nature and strength of selection? We address this question using the local selection coefficient,s s(E)~Dd log F =dED (Eq. 13). The magnitude of the selection coefficient depends qualitatively on both E{m and whether f 0 is zero or nonzero (Fig. 1). For five of the TFs (ROX1, MET32, PDR3, CUP9, MCM1), f 0 =0, bvb phys , and E{mw0. Thus these TFs are in a regime where decreasing b strengthens selection (Fig. 1F). In other words, selection is stronger for these binding sites than expected from purely biophysical considerations. For RPN4 and AFT1, f 0 &0, bwb phys , and E&m. Hence Ls s=Lbw0, and selection is again stronger than expected. STB5 exhibits bvb phys and lies on the high fitness plateau (E{mv0), and thus selection is also stronger than expected. In contrast, REB1, MET31, YAP7, and BAS1 exhibit bvb phys and lie on the threshold E{m&0, and hence selection is weaker than expected in these four cases.
Fitness landscape model selection. Since the constrained Fermi-Dirac fits have one fewer adjustable parameter than the unconstrained fits, it is more consistent to do model selection on the basis of the Akaike information criterion (adjusted for finite-size samples) [55] rather than log-likelihoods: where k is the number of fitting parameters, L is the likelihood, and n is the number of data points. For each model we can calculate the AIC, which accounts for both the benefits of higher log-likelihood and the costs of additional parameters. The model that provides the best fit for the fewest parameters will have the lowest AIC value. Table 2 shows the AIC differences between the unconstrained Fermi-Dirac fits (UFD, k~4) and the constrained Fermi-Dirac fits with f 0~0 :99 (CFD, k~3) for each TF. Positive AIC differences indicate that UFD is more favorable. We also calculate the Akaike weights w!e {AIC=2 , which give the relative likelihood that a given model is the best [55].
For five of the six TFs in the 1{f 0 %1 regime, the constrained Fermi-Dirac fits perform somewhat but not drastically better than the unconstrained Fermi-Dirac fits (Table 2). Indeed, the Akaike weights for the constrained Fermi-Dirac fits exceed the full fits for these TFs consistently by about a factor of e&2:7, since their raw likelihoods are essentially equivalent and they only differ in the number of fitted parameters k. Out of the five TFs for which f 0 &0, YAP7, BAS1, and AFT1 fit slightly better to the constrained Fermi-Dirac, suggesting that their small fitted values of f 0 are not significant. For RPN4 the AIC analysis shows preference for the fits with low f 0 ; however RPN4 is listed as nonessential in the Yeast Deletion Database [39,51], suggesting either an inconsistency in our analysis or that growth media tested in Refs. [39,51] do not reveal essentiality of this TF.
We may also consider a purely exponential fitness landscape of the form F (E)~e aE . The reasons for including this case are threefold. First, exponential fitness emerges in the limit E{m&0 of the Fermi-Dirac landscape, the regime into which many of the TF binding sites fall. Second, the fitness landscapes in Fig. 6 appear close to linear on the logarithmic scale, implying that to a good approximation fitness depends exponentially on energy. Third, the model has just one fitting parameter, making it a useful null case for AIC evaluation.
The steady-state distribution p(s) with exponential fitness is given by where E(s) is given by Eq. 2, p 0 (s) is the neutral probability of sequence s, p i 0 (s i ) is the background probability of nucleotide s i at position i, and Z i is a single-site partition function: Here we assumed that the background probability of a sequence is a product of probabilities of its constituent nucleotides. In this case, positions in the binding site decouple and the distribution of sites p(s) completely factorizes.
The assumption of factorization underlies the common practice of inferring energy matrices from log-odds scores of observed genomic binding sites [32]. The log-odds score of a nucleotide s i is defined as where p  Columns show maximum-likelihood value of f 0 , c~n(1{f 0 ), and b. The last column shows whether most binding site energies E are lower than the inferred chemical potential m, near it, or above it (see Table S2 for details shows that the log-odds score, which is computed using observed nucleotide probabilities, is equivalent to E s i i (up to an overall scale and shift) under the assumption of site independence.
We can quantitatively compare the exponential fitness landscape with the unconstrained and constrained Fermi-Dirac landscapes using the Akaike information criterion, Eq. 15. The AIC analysis shows that the exponential landscape is significantly poorer than the Fermi-Dirac landscape in all cases except MET31 (Table 2), where the exponential fit is marginally better than the Fermi-Dirac fits, and STB5, where the exponential landscape performs much better than the Fermi-Dirac models. This observation provides statistical support for the fitness landscapes of Fermi-Dirac type and for the non-lethality of deleting most TFs (the exponential fitness decays to zero rather than a nonzero f 0 found in most of our Fermi-Dirac fits).

Discussion
In this work, we have considered how fitness of a single-cell eukaryote S. cerevisiae is affected by interactions between TFs and their cognate genomic sites. Changing the energy of a site or creating new sites in gene promoters may change how genes are activated and repressed, which in turn alters the cell's chances of survival. Under the assumptions of a haploid monomorphic population in which the evolution of binding sites has reached steady state, the fitness landscape as a function of TF binding energy can be inferred from the distribution of TF binding sites observed in the genome, using a biophysical model which assigns binding energies to sites. We use a simple energy matrix model of TF-DNA energetics in which the energy contribution of each position in the site is independent of all the other positions. The energy matrix parameters are inferred from a high-throughput data set in which TF-DNA interactions were studied in vitro using a microfluidics device [35]. We consider two types of fitness functions: Fermi-Dirac, which appears naturally from considering TF binding as a two-state process (Eq. 1), and exponential, which is motivated by the observation that for many TFs, the logarithm of fitness appears to decrease linearly as energy increases.
A single fitness landscape for all genomic binding sites of a given TF can only exist in the absence of site-specific selection. Indeed, it is possible that TF sites experience different selection pressures depending on the genes they regulate: for example, sites in promoters of essential genes may be penalized more for deviating from the consensus sequence. In this case, the fitness function is an average over all sites which evolve under different selection constraints: as an extreme example, consider the case where each site i has a Fermi-Dirac fitness function (Eq. 7) with different parameters m i , b i , and f i 0 . The resulting observed distribution of energies would then be the average of the distributions predicted by Eq. 5: which defines the ''average'' fitness function with effective parameters m m, b b, f f 0 , n n. Thus the fit can be carried out even in the presence of site-dependent selection, but the fitted parameters correspond to fitness functions of individual sites only in an average sense.
To gauge the importance of site-specific selection in TF binding site evolution, we have performed several statistical tests aimed at discovering correlations between binding site energies and biological properties of the sites and the genes they regulate. These tests considered gene essentiality, growth rates of strains with nonessential genes knocked out, gene expression levels, K A =K S ratios based on alignments with S. paradoxus, and the distance of the site to the TSS. We find no consistent correlations among these properties, indicating that for a given TF, the evolution of regulatory sites is largely independent of the properties of regulated genes.
Previously, low correlations have been observed between essentiality and conservation of protein and coding sequences [56][57][58][59][60][61][62], which has fueled considerable speculation as it contradicts the prediction of the neutral theory of evolution that higher selection pressures lead to lower evolutionary rates. It has also been found that the growth rates of strains with nonessential genes knocked out are significantly (though weakly) correlated with conservation of those genes [63]. It has therefore been suggested that selection pressures are so strong that only the most nonessential genes experience significant genetic drift [56]. Previous studies have also found that gene expression levels are a more reliable (though still very weak) predictor of selection pressures than essentiality [60], but we do not find this to be the case for TF binding sites, nor do we observe a consistently significant correlation between gene expression levels and TF binding energies. Available data does not rule out the possibility of timedependent selection in combination with forms of site-dependent selection for which we have not accounted. In this scenario, the variation in site binding affinity is not due to genetic drift, but to variable selection pressures across sites and over time, such that the sites are strongly tuned to particular binding energies which change from locus to locus. Indeed, there is evidence that there is frequent gain and loss of TF binding sites and that the gene regulatory network is highly dynamic [64][65][66][67][68][69][70]. However, it is possible that rapid turnover of binding sites in eukaryotes may be due to evolution acting on whole promoters rather than individual binding sites. Many promoters contain multiple binding sites for a single TF, and it may be that while individual binding sites are lost and gained frequently, the overall binding affinity of a promoter to a TF may be held constant [71][72][73]. Our evolutionary model can account for this scenario using a promoter-level fitness function, which we intend to consider in future work.
Out of 12 TFs with sufficient binding site data, five have f 0 &0, indicating a large fitness penalty for deleting such sites. However, this conclusion is strongly supported by the AIC differences between unconstrained and non-lethal Fermi-Dirac fits for only one TF, RPN4 (Table 2). RPN4 is classified as nonessential in the Yeast Deletion Database. It may be that this misclassification is due to a mismatch between genomic sites, in which the core GCCACC motif is preceded by TTT, and the energy matrix in which the binding energies upstream of the core motif are nonspecific (Table S2). We also classify REB1 and MCM1 binding sites as nonessential, although knocking out these TFs is lethal in yeast. This discrepancy may be due to a minority of essential sites being averaged with the majority of nonessential sites to produce a single fitness function, as described above. In addition, although a penalty for deleting any single site may be small, the cumulative penalty for deleting all sites (or, equivalently, deleting the TF) may be lethal.
We find that in 10 out of 12 cases, fitting an exponential fitness function is less supported by the data than fitting a Fermi-Dirac function ( Table 2). This is interesting since constructing a positionspecific weight matrix by aligning genomic sites is a common practice which implicitly assumes factorization of exponential fitness and independence of each position in the binding site. Our results show the limitations of this approximation. It is important to note that a key difference between the Fermi-Dirac fitness landscape and the exponential landscape is that the former contains magnitude epistasis [8,25] (i.e., the magnitude of a mutation's fitness effect depends on the rest of the site sequence), while the latter is non-epistatic. Thus, our results indicate that epistasis is widespread in the evolution of TF binding sites [22].
Finally, we find that depending on the TF, the distribution of TF binding energies may fall on the exponential tail, across the threshold region, or on the saturated plateau where the sites are always occupied (Table 1). In the first two categories, variation of TF concentration in the cell will lead to graded responses, which may be necessary to achieve precise and coordinated gene regulation. In the third regime, TF binding is robust and not dynamic. We also find that the fitted inverse temperature b is typically not close to the value based on room temperature (Table 1). In particular, our analysis of the variation of selection strength with b indicates that selection appears to be stronger for most TFs than expected from pure biophysical considerations, suggesting the presence of additional selection pressures beyond those dictated by the energetics of TF-DNA binding.

Distribution of monomorphic population genotypes in evolutionary steady state
In the limit u%(LN log N) {1 , where u is the mutation rate per nucleotide, L is the number of nucleotides in a locus, and N is an effective population size, mutations are sufficiently rare that each new mutation either fixes or goes extinct before the next one arrives [41]. Thus populations evolve by sequential substitutions of new mutations at a locus, which consist of a single new mutant arising and then fixing. The rate at which a given substitution occurs is thus given by the rate of producing a single mutant times the probability that the mutation fixes [36]: where u(s'Ds) is the mutation rate from genotype s to s' and w(s'Ds) is the probability that a single s' mutant fixes in a population of wild-type s. We will assume that u is nonzero only for sequences s and s' differing by a single nucleotide.
Given an ensemble of populations evolving with these rates, we can define p(s,t) to be the probability that a population has genotype s at time t. This probability evolves over time according to the master equation For population models obeying time reversibility, we can show that the steady-state distribution p(s) must have the form in Eq. 3 [38]. We assume the fixation probability w depends only on the ratio of mutant to wild-type fitnesses: w(s'Ds)~w(F (s')=F (s)). This occurs in most standard population models and is expected whenever only relative fitness matters (e.g., when the total population size is constant). If the population dynamics are time reversible, the substitution rates and steady state must obey the detailed balance relation W (s'Ds)p(s)~W (sDs')p(s').
Assuming the neutral dynamics also obey detailed balance, u(s'Ds)p 0 (s)~u(sDs')p 0 (s'), we can show that where y(r)~w(r)=w(1=r). Equation 22 implies that y(r)y(r')~y(rr'), leading to y(r)~r n for some exponent n. It can be shown that n must be proportional to the effective population size [38]; for the Wright-Fisher model, n~2(N{1).

Now Eq. 3 follows from
This form of the steady state assumes only time reversibility and dependence on fitness ratios; otherwise, any form of the fixation probability must satisfy it. While many population models do not obey time reversibility exactly, it can be shown that even these irreversible models satisfy Eq. 3 to a very good approximation [38].

Maximum-likelihood fits of fitness function parameters
For a given TF, let S~fsg be the set of binding site sequences and h~(b,m,f 0 ,n) the parameters of the fitness function (Eq. 7). The log-likelihood is given by where F is the fitness function, and Z(h)~P s p 0 (s)(F (sDh)) n is the normalization. ) generated from f 0~( 1ztanh n)=2 for n running from {5 to 5 in steps of 0:1. Our predicted maximum is the maximum likelihood point in the mesh, which is sufficiently fine to estimate all fitting parameters. We have made the code for this procedure and for the analysis of site-specific selection available at www.physics. rutgers.edu/,morozov/publications.html.

Simulations
We consider a haploid asexual Wright-Fisher process [43]. The population consists of N~1000 organisms, each with a single locus of L nucleotides. The new generation is created by means of a selection step and a mutation step. In the selection step, sequences from the current population are sampled with replacement, weighted by their fitness, to construct a new population of size N. In the mutation step, each position in all sequences is mutated with probability u. For simplicity, the mutation rates between all pairs of nucleotides are the same.
We characterize the difference between the distribution expected by our model, p exp (Eq. 3), and the distribution observed in simulations, p obs , using the total variation distance (TVD): The TVD ranges from zero for identical distributions to unity for completely non-overlapping distributions. We calculate the TVD for the distributions in energy space, where the sum in Eq. 25 is over discrete energy bins (we bin the observed sequences by energy by dividing the range from the minimum to the maximum sequence energy for a particular energy matrix into 100 bins of equal size). We begin by randomly generating the energy matrix parameters E si i : Each E si i in the energy matrix is sampled from a uniform distribution and then rescaled such that the distribution of all sequence energies has standard deviation of 1.0. This is achieved by dividing all entries in the energy matrix by a factor x: where E a i is the energy matrix element for base a at position i, L~10 is the binding site length, E E i~P a[fA,C,G,Tg E a i is the average energy contribution at position i, and p 0 (a) is the background probability of nucleotide a (p 0 (a)~0:25 for all a in our simulations). It can be shown that x is the standard deviation of the random sequence energy distributution, which is approximately Gaussian [11]. We generate the energy matrix once and use it in all subsequent simulations and maximum likelihood fits.
We perform the Wright-Fisher simulations in a range of mutation rates from u~10 {6 to u~10 {1 with a ''non-lethal'' Fermi-Dirac fitness function (Eq. 7 with f 0~0 :99, b~1:69 (kcal=mol) {1 , and m~{2 kcal=mol). We run 10 5 simulations for each mutation rate for 100=uz1000 steps, enough to reach steady state. Each simulation starts from a monomorphic population with a randomly chosen sequence. We construct the steady state distribution for each mutation rate by randomly choosing a single sequence from the final population of each simulation. Collected across all simulations, these are used to construct a distribution of sequences at each mutation rate. Additionally, we record the average final number of unique sequences at each mutation rate.
We perform another set of Wright-Fisher simulations with the same fitness function and energy matrix as above, and u~10 {6 . We run 10 5 simulations, each starting from the same monomorphic population with a specific sequence of E&0. At regular intervals in each simulation, we record a randomly chosen sequence from the population. Collected across all simulations, these are used to construct a distribution of sequences at each point in time.

Binding site and energy matrix data
We obtain curated binding site locations for 125 TFs from Ref. [31], which provides a posterior probability that each site is functional based on cross-species analysis. We only consider sites with a posterior probability above 0.9. For this analysis, we use the Saccharomyces Genome Database R53-1-1 (April 2006) build of the S. cerevisiae genome.
We obtain position-specific affinity matrices (PSAMs) for a set of 26 TFs from an in vitro microfluidics analysis of TF-DNA interactions [35]. This study provides PSAMs for each TF determined using the MatrixREDUCE package [34]. We convert the elements of the PSAM w a i to energy matrix elements using E a i~{ log(w a i )=b, where b~1:69 (kcal=mol) {1 at room temperature. For each of these 26 TFs, genomic sites are available in Ref. [31], although we neglect PHO4 since it does not have any binding sites above the 0.9 threshold of Ref. [31], leaving us with 25 TFs for which both an energy matrix and a set of genomic binding sites are available. We align the binding site sequences from Ref. [31] to the corresponding energy matrices, choosing the alignment that produces the lowest average binding energy for the sites.

Essentiality data
The Yeast Deletion Database classifies genes as essential, tested (nonessential), and unavailable, which number 1156, 6343, and 529, respectively [39,51]. For each essential or tested gene, we determine all TF binding sites less than 700 bp upstream of the gene's transcription start site (on either strand), which we designate as the sites regulating that gene. Growth rates for nonessential knockout strains are provided under YPD, YPDGE, YPG, YPE, and YPL conditions, relative to wild-type. We choose the lowest of these growth rates to represent the fitness effect of the knockout.
To measure the rate of nonsynonymous substitutions, we align the non-mitochondrial, non-retrotransposon ORFs taken from the Saccharomyces Genome Database R64-1-1 (February 2011) build [75] of S. cerevisiae to those of S. paradoxus using ClustalW [76]. We measure the rate of nonsynonymous mutations using PAML [77]. We ran PAML with a runMode of 22 (pairwise comparisons) and the CodonFreq parameter (background codon frequency) set to 2; we also tested CodonFreq set to zero and obtained very similar results. We find the rate of nonsynonymous substitutions to be 0.04, and a Spearman rank correlation of {0:16 (p~10 {27 ) between growth rate of knockouts and the nonsynonymous substitution rate of the knocked-out gene. This is consistent with the results of Ref. [58], which found the rate of substitutions to be 0.04 and the rank correlation between growth rate and substitution rate to be {0:19 (p~10 {35 ).
To compare binding energy to evolutionary conservation, we calculate the mean Hamming distance between S. cerevisiae sites and corresponding sites in S. paradoxus [31]. To test for significance in the difference of mean energies and Hamming distances of sites regulating essential and nonessential genes, we use a null model which assumes that the sites were randomly categorized into essential and nonessential. We randomly choose a subset of the sites in our dataset to be ''nonessential,'' equal in size to the number of sites regulating nonessential genes as classified by the Yeast Deletion Database. By repeating this procedure 10 6 times, we build a probability distribution for the difference in the means of the nonessential and essential groups. The p-value is the probability of obtaining a difference in the means greater than or equal in magnitude to the empirically measured value.

Neutral binding site energy distributions
We construct the neutral probability p 0 (s) of a sequence s with length L as p 0 (s)~p 0 (s 1 ) P L i~2 p 0 (s i{1 ,s i ), where p 0 (s i ) is the background probability of a nucleotide s i , and p 0 (s i{1 ,s i ) is the background probability of a dinucleotide s i{1 s i . These probabilities are determined from mono-and dinucleotide frequencies in the intergenic regions of the S. cerevisiae genome (build R61-1-1, June 2008). We project p 0 (s) into energy space using Eq. 2 to obtain p 0 (E), the neutral distribution of binding energies for sequences of length L. If intergenic sequences evolve under no selection with respect to their TF-binding energy, the neutral distribution of site energies should closely match the actual distribution of L-mer sequences obtained from intergenic regions. Table S2, column B shows that these two distributions match very well except at the low-energy tail, which is enriched in functional binding sites. Note that accounting for dinucleotide frequencies is important; mononucleotide frequencies alone are insufficient to reproduce the observed distribution [54].

Supporting Information
Table S1 Full summary of tests for site-specific selection. For 25 TFs we compute TF-DNA interaction energies (in kcal/mol) for each site. Columns from left to right: (A) Essentiality of the TF according to the Yeast Deletion Database; total number of binding sites for each TF; total number of sites with unique sequences. The table lists how many essential and nonessential genes are regulated by each TF, and how many of these genes have gene expression and S. paradoxus K A =K S ratio data. We also report the mean energy SET and the variance V of sites regulating both essential and nonessential genes, and mean squared energy difference SDE 2 T and mean Hamming distance SdT between S. cerevisiae and S. paradoxus sites regulating essential and nonessential genes. We show p-values for the significance of the difference between these two classes of sites (see Methods). (B) Growth rate in strains with nonessential gene knockouts versus energy of TF binding sites regulating the knockout genes.   Table 1, we analyze the quality of fit. Columns, from left to right: (A) Eigenvalues and eigenvectors of the Hessian of the likelihood function around the fit maxima. Eigenvectors of the Hessian represent principal directions and the corresponding eigenvalues represent the curvature in those directions, which should be negative at a local maximum. Positive eigenvalues occur if the maximizer did not reach a maximum. Here, the degeneracy represented by c~n(1{f 0 ) is apparent as many fits have an eigenvalue close to zero (flat) or even slightly positive in the direction (n,f 0 ). For fits subject to the m-c degeneracy, one can see a second low eigenvalue corresponding to the m direction. For computational reasons the Hessian is evaluated using transformed variables log n, m, log b, and T (f 0 )~atanh(2f 0 {1). (B) For each TF, 64 subsets of the full data set were generated by randomly selecting half of the binding sites in the full data set. Maximum likelihood fits were carried out as for the full data set, except that to reduce computation time the grid spacing in the initial four dimensional parameter search was doubled. Shown here are histograms of the resulting parameters. Red dashed lines indicate the maximum likelihood value of each parameter obtained from the full data set. (PDF)