
Conceived and designed the experiments: JK XH SS. Performed the experiments: JK. Analyzed the data: JK XH SS. Contributed reagents/materials/analysis tools: JK XH SS. Wrote the paper: JK XH SS.

The authors have declared that no competing interests exist.

Characterization of the evolutionary constraints acting on

The spatial–temporal expression pattern of a gene, which is crucial to its function, is controlled by

Gene regulation is well recognized as a major determinant of how an organism functions

Despite significant recent efforts

Transcription factor binding sites are commonly predicted based on the assumption of their evolutionary conservation

Even the most fundamental assumption of regulatory comparative genomics, that binding sites are evolutionarily conserved, has been challenged–Emberly et al.

Cameron et al.

Earlier attempts to characterize the evolutionary patterns of regulatory sequences used a few well-studied CRM sequences. These studies were limited in their scope

We begin with our findings on the evolutionary behavior of transcription factor binding sites. We collected 68

We annotated binding sites for each transcription factor, in the subset of

Binding sites from different species that overlap each other in the multiple alignment are collectively referred to as an “orthologous TFBS set”. Sites in such an orthologous set were re-aligned locally to correct any errors in their precise alignment. Graphic visualizations (

Different positions in binding sites have different contributions to the binding affinity of the TF. Positions that form the core regions for TF-DNA binding are more specific (less variation allowed) in the motif, and should be under stronger selective constraints. We thus expect different positions of TFBSs to have different degrees of evolutionary conservation. The specificity of a position can be expressed by the information content (IC) of the corresponding column in the PWM (position weight matrix), and the evolutionary rate by the number of substitutions in that position in orthologous binding sites (see

Factor | Number of TFBSs | Width of motif | Correlation coefficient | P-value |

bcd | 160 | 8 | −0.75 | |

cad | 175 | 9 | −0.48 | 0.0969 |

dstat | 129 | 9 | −0.83 | |

hb | 170 | 8 | −0.69 | |

kni | 85 | 12 | −0.82 | |

kr | 177 | 11 | −0.53 | |

tll | 185 | 10 | −0.38 | 0.1375 |

Spearman's correlation coefficient.
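The position-level comparison summarized in this table can be illustrated with a minimal sketch. It assumes a uniform background (0.25 per base) for the information content and uses a hand-written Spearman rank correlation; the toy PWM and the per-position substitution counts are invented for illustration and are not data from this study.

```python
import math

def information_content(column, background=0.25):
    """Information content (bits) of one PWM column, given as a dict base -> probability.
    Assumes a uniform background, which is an illustrative choice."""
    return sum(p * math.log2(p / background) for p in column.values() if p > 0)

def average_ranks(values):
    """Ranks starting at 1, with ties receiving the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical 4-column motif and hypothetical average substitution counts.
toy_pwm = [
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.40, "C": 0.20, "G": 0.20, "T": 0.20},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
toy_substitutions = [0.1, 0.4, 0.9, 1.3]

ic = [information_content(col) for col in toy_pwm]
rho = spearman(ic, toy_substitutions)
```

In this toy case the most specific column changes least, so the correlation is strongly negative, which is the direction reported for the real motifs.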

While substitution rates in a TFBS are position-specific, this does not imply that different positions evolve independently, although such an assumption is often made in existing evolutionary models

To study evolution at the level of binding sites, as opposed to nucleotides, we developed a simple mathematical model of binding site evolution, called the “Site-level Selection” (“SS”) model, that treats binding sites as single evolutionary units. Under this population genetics-based model, the fitness of a site can take two values, 1 if the binding affinity of this site is below some threshold, and

When

Note that

We tested how well this model fits the data on binding site evolution, and compared it to another model, called the “Halpern-Bruno” or “HB” model

(A) Distribution of energy difference between a predicted binding site in

Factor | Median SSE (HB model) | Median SSE (SS model) | P-value | Optimal free parameter (SS model) |

bcd | 0.19 | 0.10 | <2.20E-16 | 8 |

cad | 0.23 | 0.16 | <2.20E-16 | 8 |

dstat | 0.12 | 0.06 | <2.20E-16 | 11 |

hb | 0.10 | 0.07 | <2.20E-16 | 15 |

kni | 0.21 | 0.15 | <2.20E-16 | 19 |

kr | 0.19 | 0.15 | <2.20E-16 | 8 |

tll | 0.18 | 0.10 | <2.20E-16 | 17 |

Median values of sum of squared errors (SSE) from 100 different simulations with the model.

P-value from paired Wilcoxon signed-rank test.

Optimal value of the free parameter of SS model.
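The SSE comparison can be sketched as the squared difference between two normalized histograms of energy differences, one observed and one simulated. The bin edges, the normalization to fractions, and the toy values below are assumptions made for illustration; the binning actually used is not specified in the text above.

```python
def normalized_histogram(values, bin_edges):
    """Fraction of values falling in each bin [edge_i, edge_{i+1}); the last bin is closed."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if bin_edges[i] <= v < bin_edges[i + 1] or (is_last and v == bin_edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

def sse(observed_hist, simulated_hist):
    """Sum of squared errors between two histograms with identical bins."""
    return sum((o - s) ** 2 for o, s in zip(observed_hist, simulated_hist))

# Hypothetical energy differences (arbitrary units).
edges = [0, 1, 2, 3]
observed = normalized_histogram([0.2, 0.5, 1.5, 2.5], edges)
simulated = normalized_histogram([0.1, 1.1, 1.2, 2.9], edges)
error = sse(observed, simulated)
```

Repeating the simulation and taking the median of the resulting errors would give one entry of the kind shown in the table.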

However, in absolute terms, neither model explains the data very well (^{−12}) (

Even though TFBS loss and gain (henceforth called “turnover”) have been commonly observed, it is not clear whether these changes are adaptive ^{2}^{2}

Factor | R^{2} (raw data) | Adjusted R^{2} (corrected data) | FP |

bcd | 0.9813 | 0.9631 | 0.14 |

cad | 0.9857 | 0.9693 | 0.29 |

dstat | 0.9913 | 0.9831 | 0.26 |

hb | 0.9114 | 0.9180 | 0.24 |

kni | 0.9642 | 0.9883 | 0.31 |

kr | 0.9698 | 0.9097 | 0.27 |

tll | 0.9894 | 0.9515 | 0.32 |

R^{2} from raw data without correcting for the false positive rate.

Adjusted R^{2} from data corrected for the false positive rate.

Estimated false positive rate obtained by regression.
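For readers less familiar with these goodness-of-fit measures, the following sketch fits a straight line by ordinary least squares and computes R^{2} and the conventional adjusted R^{2}, where n is the number of data points and k the number of explanatory variables. Whether the adjusted values in the table follow exactly this textbook definition is an assumption, and the false positive correction is not modeled here.

```python
def linear_fit(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(x, y, slope, intercept):
    """Coefficient of determination of the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k=1):
    """Textbook adjusted R^2 for n data points and k explanatory variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical divergence times and fractions of conserved sites.
times = [0.0, 0.2, 0.4, 0.8]
fractions = [1.0, 0.9, 0.8, 0.6]
slope, intercept = linear_fit(times, fractions)
r2 = r_squared(times, fractions, slope, intercept)
adj = adjusted_r_squared(r2, len(times))
```

Because the adjusted values in the table are computed on corrected data, they need not be smaller than the raw R^{2} values in the same row.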

Factor | Loss rate | Random PWMs (mean) | Random PWMs (stdev) |

bcd | 0.1865 | 0.2530 | 0.0217 |

cad | 0.1969 | 0.2444 | 0.0213 |

dstat | 0.2471 | 0.2642 | 0.0172 |

hb | 0.1470 | 0.1937 | 0.0211 |

kni | 0.2315 | 0.2551 | 0.0170 |

kr | 0.1811 | 0.2666 | 0.0172 |

tll | 0.2147 | 0.2389 | 0.0191 |

These rates are without false positive correction.

Having characterized some general patterns of TFBS evolution, in this section we study what specific factors may influence the conservation and turnover of binding sites.

We defined the strength of a site as the degree of match of this site to the corresponding motif, as measured by a log-likelihood ratio (

Factor | Number of TFBS sets | Correlation coefficient | P-value | Random PWM |

bcd | 163 | −0.71 | | 0 |

cad | 168 | −0.30 | 0.0974 | 18 |

dstat | 129 | −0.46 | | 11 |

hb | 168 | −0.62 | | 0 |

kni | 86 | −0.58 | | 4 |

kr | 191 | −0.72 | | 0 |

tll | 188 | −0.86 | | 0 |

Spearman's correlation coefficient.

Number of random PWMs (out of 100 simulations) that show greater correlation than the real motif.
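Site strength, defined above as a log-likelihood ratio of the motif against background, can be sketched as follows. The uniform background, the natural logarithm, and the toy three-column PWM are illustrative assumptions; the PWM is assumed to contain no zero entries (for example, after adding pseudocounts).

```python
import math

def llr_score(site, pwm, background):
    """Log-likelihood ratio of a site under the motif versus the background.
    pwm is a list of columns (dict base -> probability), one per site position."""
    if len(site) != len(pwm):
        raise ValueError("site and PWM must have the same length")
    return sum(math.log(col[base] / background[base]) for base, col in zip(site, pwm))

# Hypothetical motif with consensus ACG and a uniform background.
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

strong = llr_score("ACG", toy_pwm, uniform)  # matches the consensus at every position
weak = llr_score("ACT", toy_pwm, uniform)    # one mismatch lowers the score
```

A site closer to the consensus receives a higher score, so "stronger" sites in the sense used here are those with larger log-likelihood ratios.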

Another potential determinant of turnover is the overall evolutionary constraint on the CRM to which the site belongs. We estimated the substitution rate for each CRM using the Paml software

A “homotypic TFBS cluster”

We next examined spatial proximity of homotypic binding sites at a finer granularity: if two adjacent sites of the same factor are closely located, there may be cooperative binding of the factor to these sites, leading to stronger selective pressure. Such cooperative binding by proximal sites is known for the

Factor | Number of TFBSs | Correlation coefficient | P-value |

bcd | 157 | 0.04 | 0.3969 |

cad | 162 | 0.38 | |

dstat | 112 | 0.00 | 0.5000 |

hb | 156 | 0.30 | |

kni | 82 | 0.24 | 0.1270 |

kr | 183 | 0.14 | 0.2212 |

tll | 178 | 0.30 |

Spearman's correlation coefficient.

Binding sites often “interact” with sites of other factors in their neighborhood. Such interactions may include, for example, cooperative binding to DNA or short-range repression. We next examined the effect of spatial context of “heterotypic” binding sites on evolutionary constraint. In a procedure similar to that of Hare et al. ^{−5} in ProbconsMorph based alignments, p-value 0.001 in Pecan alignments, Hypergeometric test).

We next tested the effect of the above spatial categories individually for each factor. Comparing the “proximal” and “distal” classes (

Factor | P vs D | O vs NO |

bcd | 0.9981 | 0.5910 |

cad | 0.5626 | |

dstat | 0.8981 | |

hb | 0.2141 | |

kni | 0.4784 | 0.2425 |

kr | 0.2071 | |

tll | 0.0806 |

Numbers are P-values from hypergeometric test.

P means proximal and D means distal.

O means overlap and NO means non-overlap.

The opposite p-value is 0.0124.
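A one-sided hypergeometric test of the kind used for these comparisons can be sketched as below. The mapping of the four counts onto the analysis (all sites, conserved sites, sites in one spatial class, conserved sites in that class) is an assumption about how such a test would be set up, and the numbers are invented.

```python
from math import comb

def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n items are drawn without replacement from N items,
    K of which are 'successes'."""
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)

def hypergeom_lower_tail(k, N, K, n):
    """P(X <= k), the test in the opposite direction."""
    return 1.0 - hypergeom_upper_tail(k + 1, N, K, n)

# Hypothetical counts: 10 sites in total, 5 conserved, 5 in the "proximal" class,
# all 5 proximal sites conserved.
p_enriched = hypergeom_upper_tail(5, 10, 5, 5)
p_depleted = hypergeom_lower_tail(5, 10, 5, 5)
```

The upper tail asks whether conserved sites are over-represented in the class, and the lower tail asks the opposite question.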

In a similar comparison of the “overlap” class with its complement (“proximal” or “distal”) (

In summary, five of the seven motifs showed a significant tendency to be conserved when they had a partner either overlapping with or proximal to them.

Finally, we analyzed insertions and deletions in known regulatory sequences, to study the extent of indel-purifying selection. Among 370 non-overlapping

Another related question is the indel pattern in the “spacer” region between CRMs and transcription start site (TSS) of the target genes. Transcriptional regulation depends on the communication between CRMs and promoter sequences

Our results show that indel-purifying selection exists on CRM sequences, but such selection acts most strongly on deletions. We did not find clear suppression of long indels, as has been observed before

The study of

There are several technical issues that were important to address in our analysis. Evolutionary comparison depends on the alignment of orthologous sequences, but in general, alignments cannot be perfectly determined and may be a source of biased conclusions

Another critical component of our analysis is the prediction of TFBSs. By using the same PWMs for all the genomes, we have made the assumption that the PWM of any TF is fully conserved across 12

Our model of binding site evolution, the “Site-level Selection” (SS) model, is a special case of the population genetic model proposed by Mustonen and Lassig

Our findings of a molecular clock extend earlier results on a small number of well characterized CRMs

Our tests indicate that stronger binding sites are conserved more often than weaker sites. This is consistent with an earlier study

An unexpected result of our analyses is that the degree of homotypic clustering does not affect turnover rate. This is contrary to the notion that more binding sites of the same type will lead to greater redundancy, easing the selective pressure on the individual sites. Instead, the number of binding sites seems to be important to CRM function. This observation is similar to one of the implications of our findings of site-level selection: that exact affinities of binding sites are functionally important. Both observations are consistent with the so-called “gradient threshold model”

We found that the presence of a binding site for a different factor, either overlapping or proximal to a binding site, can strongly affect the latter's evolution. Different mechanisms of local interactions between sites are known in developmental CRMs, e.g., cooperative binding between two factors

We did not find strong evidence of suppression of large indels within CRMs relative to their flanking sequences. Our results are different from an earlier study of indel patterns of CRMs in sea urchins, which reports that large indels (>20 bp in length) are virtually absent inside CRM sequences

12

For the analysis of TFBS evolution, we developed a new multiple alignment program, “ProbconsMorph”, by integrating Probcons

Pecan

Insertion and deletion annotations were done using our previously published Indelign program

The seven PWMs were used to scan

The

For each orthologous TFBS set containing binding sites in all species, a parsimony cost for each position was computed. The average of this parsimony cost, over all orthologous TFBS sets, was used as the evolutionary rate of the position. For this analysis, orthologous TFBS sets were obtained differently: Pecan alignments of five closely related species (
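The per-position evolutionary rate can be sketched with unweighted Fitch parsimony on a fixed binary species tree. The use of Fitch's algorithm, the nested-tuple tree, the species labels, and the toy sites are illustrative assumptions rather than the exact procedure used in this study.

```python
def fitch_cost(tree, states):
    """Return (candidate_states, cost) for a nested-tuple binary tree.
    Leaves are species names; states maps species -> nucleotide at one position."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_cost(tree[0], states)
    right_set, right_cost = fitch_cost(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # Disjoint child sets imply at least one substitution on this part of the tree.
    return left_set | right_set, left_cost + right_cost + 1

def position_rates(tree, tfbs_sets, width):
    """Average parsimony cost per motif position over all orthologous TFBS sets."""
    rates = []
    for pos in range(width):
        costs = [
            fitch_cost(tree, {species: site[pos] for species, site in tfbs.items()})[1]
            for tfbs in tfbs_sets
        ]
        rates.append(sum(costs) / len(costs))
    return rates

# Hypothetical four-species tree and two orthologous TFBS sets of width 4.
toy_tree = (("sp1", "sp2"), ("sp3", "sp4"))
toy_sets = [
    {"sp1": "ACGT", "sp2": "ACGT", "sp3": "ACGA", "sp4": "ACTA"},
    {"sp1": "ACGT", "sp2": "ACGT", "sp3": "ACGT", "sp4": "TCGT"},
]
rates = position_rates(toy_tree, toy_sets, 4)
```

Positions that never change across the toy sets receive a rate of zero, while positions with one inferred substitution in one of the two sets receive 0.5.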

To simulate the evolution of a binding site, we repeat two steps: (i) compute the rate of each substitution event at each position according to HB model

For each factor, the collection of sites was divided into two randomly chosen subsets: the first, called the training set, was used to learn a value of ^{6} times, to obtain a histogram of energy difference values. To obtain the SSE values shown in

Let ^{2}^{2})/(n−k−1)

We first constructed a phylogenetic tree for each orthologous TFBS set by labeling a leaf node as “1” if its corresponding species has the site and 0 otherwise; a subtree rooted at the least common ancestor of leaf nodes labeled 1 was then identified. The turnover rate was defined as the parsimony cost calculated in the subtree, divided by the sum of branch lengths of the subtree. The overall TFBS turnover rate across multiple orthologous TFBS sets was defined as the sum of the parsimony costs of the individual orthologous TFBS sets, divided by the sum of branch lengths (obtained as described above) (see
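The turnover rate defined above can be sketched as follows, assuming a binary tree, unweighted Fitch parsimony on presence (1) and absence (0) labels, and a site that is present in at least two species. The `Node` class, the tree, and the branch lengths are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    name: str = ""
    length: float = 0.0  # length of the branch leading to this node
    children: List["Node"] = field(default_factory=list)

def count_present(node: Node, present: Dict[str, bool]) -> int:
    """Number of leaves under node whose species has the site."""
    if not node.children:
        return 1 if present.get(node.name, False) else 0
    return sum(count_present(c, present) for c in node.children)

def lca_subtree(root: Node, present: Dict[str, bool]) -> Node:
    """Descend while a single child still contains every leaf labeled 1."""
    total = count_present(root, present)
    node = root
    while node.children:
        holders = [c for c in node.children if count_present(c, present) == total]
        if not holders:
            break
        node = holders[0]
    return node

def fitch(node: Node, present: Dict[str, bool]):
    """Fitch parsimony on 0/1 labels; returns (candidate_states, cost)."""
    if not node.children:
        return {1 if present.get(node.name, False) else 0}, 0
    (s1, c1), (s2, c2) = (fitch(c, present) for c in node.children)
    common = s1 & s2
    return (common, c1 + c2) if common else (s1 | s2, c1 + c2 + 1)

def branch_length_sum(node: Node) -> float:
    """Total length of all branches strictly below node."""
    return sum(c.length + branch_length_sum(c) for c in node.children)

def turnover_rate(root: Node, present: Dict[str, bool]) -> float:
    subtree = lca_subtree(root, present)
    total_length = branch_length_sum(subtree)
    if total_length == 0:
        raise ValueError("site must be present in at least two species")
    return fitch(subtree, present)[1] / total_length

# Hypothetical five-species tree; the site is present in C and D only.
tree = Node("root", 0.0, [
    Node("AB", 0.2, [Node("A", 0.1), Node("B", 0.1)]),
    Node("CDE", 0.1, [Node("C", 0.3),
                      Node("DE", 0.2, [Node("D", 0.1), Node("E", 0.1)])]),
])
present = {"C": True, "D": True}
rate = turnover_rate(tree, present)
```

In this toy case the subtree is rooted at the ancestor of C, D and E, one loss is inferred (in E), and the rate is that single event divided by the subtree's total branch length of 0.7.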

The evolutionary rate of a CRM was defined as the sum of branch lengths estimated by Paml

Random PWMs were constructed by starting with one of the seven original PWMs, randomly permuting its columns, and then randomly permuting the rows for A and T, and the rows for C and G. The resulting random PWM retains the information content and G/C content of the original.
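This randomization can be sketched as follows. One reading of the description is assumed here: the A/T rows and the C/G rows are each swapped for the whole matrix with probability 1/2. The toy PWM and the seed are illustrative.

```python
import random

def random_pwm(pwm, rng):
    """Shuffle the columns of a PWM, then optionally swap the A/T rows and the C/G rows.
    pwm is a list of columns, each a dict with keys A, C, G, T."""
    columns = [dict(col) for col in pwm]
    rng.shuffle(columns)
    swap_at = rng.random() < 0.5  # assumption: one decision for the whole matrix
    swap_cg = rng.random() < 0.5
    randomized = []
    for col in columns:
        new_col = dict(col)
        if swap_at:
            new_col["A"], new_col["T"] = col["T"], col["A"]
        if swap_cg:
            new_col["C"], new_col["G"] = col["G"], col["C"]
        randomized.append(new_col)
    return randomized

def gc_content(pwm):
    """Average G+C probability over all columns."""
    return sum(col["C"] + col["G"] for col in pwm) / len(pwm)

# Hypothetical three-column motif.
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.6, "G": 0.2, "T": 0.1},
    {"A": 0.2, "C": 0.2, "G": 0.5, "T": 0.1},
]
randomized = random_pwm(toy_pwm, random.Random(0))
```

Each column keeps the same set of probabilities, so its information content under a uniform background is unchanged, and swapping only within A/T and within C/G leaves the G/C content unchanged.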

Bins were defined by the values of the statistic of interest (TFBS strength, evolutionary rate of the CRM, or distance between adjacent TFBSs). For the samples in each bin, the overall TFBS turnover rate and the average of the statistic were calculated, and a correlation test was then performed between these two quantities.
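The binning step can be sketched as below. Equal-count bins over the sorted statistic are an assumption, since the text does not say how bin boundaries were chosen. Each sample is represented by a hypothetical triple of statistic value, parsimony cost, and subtree branch length.

```python
def binned_turnover(samples, n_bins):
    """Group samples into equal-count bins by statistic value.
    samples: list of (statistic, parsimony_cost, branch_length_sum).
    Returns one (mean statistic, overall turnover rate) pair per non-empty bin."""
    ordered = sorted(samples, key=lambda s: s[0])
    n = len(ordered)
    points = []
    for b in range(n_bins):
        chunk = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        mean_stat = sum(s[0] for s in chunk) / len(chunk)
        # Overall rate: summed parsimony costs divided by summed branch lengths.
        overall_rate = sum(s[1] for s in chunk) / sum(s[2] for s in chunk)
        points.append((mean_stat, overall_rate))
    return points

# Hypothetical samples: (site strength, parsimony cost, branch length sum).
toy_samples = [
    (1.0, 3, 2.0), (2.0, 2, 2.0),
    (5.0, 1, 2.0), (6.0, 1, 2.0),
    (9.0, 0, 2.0), (10.0, 1, 2.0),
]
points = binned_turnover(toy_samples, 3)
```

The resulting per-bin pairs are the inputs to the rank correlation test reported in the tables.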

An example of the graphic visualization of alignments of CRMs with binding site annotation.

(1.25 MB DOC)

Correlation between the specificity of a TFBS position and its evolutionary rate, with Pecan alignments.

(0.84 MB DOC)

Correlation between the specificity of a TFBS position and its evolutionary rate, with ProbconsMorph alignments.

(0.93 MB DOC)

Distributions of energy difference from observed binding sites (Observed), and those simulated by HB (HB) and Site-level Selection (SS) models, with ProbconsMorph alignments.

(0.83 MB DOC)

Distributions of energy difference from observed binding sites (Observed), and those simulated by HB (HB) and Site-level Selection (SS) models, with Pecan alignments.

(0.93 MB DOC)

Distributions of the number of substitutions from observed binding sites (Observed), and those simulated by HB (HB) and Site-level Selection (SS) models, with ProbconsMorph alignments.

(0.69 MB DOC)

Distributions of the number of substitutions from observed binding sites (Observed), and those simulated by HB (HB) and Site-level Selection (SS) models, with Pecan alignments.

(0.80 MB DOC)

The fraction of

(0.47 MB DOC)

The fraction of

(0.58 MB DOC)

Comparison of the two different indel length distributions, a single geometric distribution and a mixture of two geometric distributions.

(0.22 MB DOC)

An example of the calculation of TFBS turnover rate.

(0.05 MB DOC)

Correlation between the specificity of a TFBS position and its evolutionary rate, with ProbconsMorph alignments.

(0.03 MB DOC)

Comparison of HB and SS models, with Pecan alignments.

(0.03 MB DOC)

Goodness-of-fit of a linear model for the fraction of conserved binding sites over divergence time, with Pecan alignments.

(0.03 MB DOC)

Comparison of loss rates of binding sites using real and random motifs, with Pecan alignments.

(0.03 MB DOC)

Correlation between TFBS strength and TFBS turnover rate, with Pecan alignments.

(0.03 MB DOC)

Correlation between the distance between two adjacent homotypic sites and TFBS turnover rate, with Pecan alignments.

(0.03 MB DOC)

Binding site conservation and its spatial context, with Pecan alignments.

(0.03 MB DOC)

Correlation between evolutionary rate of CRM and TFBS turnover rate, with ProbconsMorph alignments.

(0.03 MB DOC)

Correlation between evolutionary rate of CRM and TFBS turnover rate, with Pecan alignments.

(0.03 MB DOC)

Supporting text.

(0.09 MB DOC)

We thank Michael Brodsky for pointing out the evidence for spatial proximity of caudal sites.