Redundancy and the Evolution of Cis-Regulatory Element Multiplicity

The promoter regions of many genes contain multiple binding sites for the same transcription factor (TF). One possibility is that this multiplicity evolved through transitional forms showing redundant cis-regulation. To evaluate this hypothesis, we must disentangle the relative contributions of different evolutionary mechanisms to the evolution of binding site multiplicity. Here, we attempt to do this using a model of binding site evolution. Our model considers binding sequences and their interactions with TFs explicitly, and allows us to cast the evolution of gene networks into a neutral network framework. We then test some of the model's predictions using data from yeast. Analysis of the model suggested three candidate nonadaptive processes favoring the evolution of cis-regulatory element redundancy and multiplicity: neutral evolution in long promoters, recombination and TF promiscuity. We find that recombination rate is positively associated with binding site multiplicity in yeast. Our model also indicated that weak direct selection for multiplicity (partial redundancy) can play a major role in organisms with large populations. Our data suggest that selection for changes in gene expression level may have contributed to the evolution of multiple binding sites in yeast. We conclude that the evolution of cis-regulatory element redundancy and multiplicity is impacted by many aspects of the biology of an organism: both adaptive and nonadaptive processes, both changes in cis to binding sites and in trans to the TFs that interact with them, both the functional setting of the promoter and the population genetic context of the individuals carrying them.

The evolutionary dynamics of an infinite sexual population is then described by: Yeast data TF binding site models We used 428 position weight matrices (PWMs) summarizing the binding specificities of 190 putative yeast TFs reported in four studies: two analyses of a single genome-wide chromatin immunoprecipitation data set [1,2] and two independent protein binding microarray studies [3,4] (Tables S1 and S2). Each PWM was simplified in two steps: 1. Sequential deletion of all terminal (both 5' and 3') positions with a total information content under 0.125.

Deletion of all but the 8 positions with the highest information content and intervening positions
We then compared all PWMs of the same length to each other by calculating the following distance: where, p i j is the frequency of nucleotide i at position j of the first PWM, and p i j is the corresponding position in the second PWM. The distance between the first PWM and the reverse complement of the second PWM was also calculated and the smallest value of d was used. Any pairs of PWMs showing d < 0.05 were collapsed. Pairs of PWMs with equivalent consensus sequences at 100% or 95% of the maximum PWM score were also collapsed ( Table S2). The final PWM data set consisted of 326 PWMs for 179 TFs (Table S1).

Mismatches
Scanning PWMs at 95% stringency means that certain low information positions allow mismatches.
The maximum number of mismatches allowed by a PWM decreased with the mean information content

Genomic features
We calculated the following quantities for each intergenic region: 1. Sequence length.

A measure of the frequency of meiotic double-strand breaks (DSBs) of a mutant (dmc1∆) defective
in DSB-repair obtained using microarray hybridization [8]; we used the mean of log-transformed unsmoothed average ratios of background-normalized fluorescence.
7. Total number of crossover events identified from examining all four products of 56 yeast meioses [10].
We also calculated the following quantities for each gene downstream of these promoters: 1. Three measures of robustness to trans-perturbations [11], derived from measurements of the variance (corrected for the mean) in levels of gene expression across: 167 viable knockout mutations (genetic), 30 wild isolates (genetic background), and 35 environments (environmental robustness).
The robustness metrics are inversely related to the variance in gene expression.
2. Essentiality-whether a homozygous knock-out of the gene is lethal). We merged the lists of essential genes in http://www-sequence.stanford.edu/group/yeast deletion project/Essential ORFs.txt [12,13] and the Comprehensive Yeast Genome Database (CYGD) [14]. We merged the lists of nonessential genes in CYGD and three other studies [15][16][17]. Overlaps between the resulting lists of essential and nonessential genes were reclassified as nonessential.
3. Whether the gene has a duplicate elsewhere in the genome [18].
5. Degree centrality-total number of interactions between that gene and other genes, including transcription regulatory and protein-protein interactions [19].
6. Protein expression noise, defined as the average of the log-transformed coefficient of variation in protein expression in two environments [20]; inversely correlated with protein and mRNA abundance [20] (Fig. 5B).
8. Protein abundance [22]. Estimates of mean protein abundance in this study are strongly correlated with those in [22].

Software
PWM scans were done using the Bioconductor package 'Biostrings'. Effect size estimates and metaanalyses were done using the 'metafor' package in R [23]. Cluster analyses were done using the 'cluster' package in R.