Spatial Promoter Recognition Signatures May Enhance Transcription Factor Specificity in Yeast

doi:10.1371/journal.pone.0053778

Figure 1.

Description of the model.

(A) One kilobase upstream of the transcription start site of YPL192C is depicted, with PWM-scores of putative Ste12 binding sites plotted in gray. The transcription start site is represented by an arrow. (B) A sample state configuration for the model is shown. Variables are represented as circles, with hatching added to variables considered to be ‘observed.’ As described in detail in Methods, the binary ‘regulation’ variable, in green, emits a series of ‘site’ variables (blue), each corresponding to and emitting a single nucleotide (red) in the promoter. The middle segment highlights how a background b₀ state transitions to a series of frequency matrix states, which in turn transition to a background b₁ state. This b₁ value is carried, as shown, to the end of the sequence, where it emits a final background nucleotide and the observed value of 1 for the ‘consistency’ state, in orange. This consistency state takes the value one if the final state variable takes a value of either b₁ or b_x, ensuring that the original ‘regulation’ variable specifies whether or not a binding site is emitted. The frequency matrix states shown here correspond to the position of one of the two highest-scoring matches to the Ste12 motif; here they emit the consensus TGAAACA sequence observed on the forward strand of the YPL192C promoter. (C) The probability of transition from a background state to a frequency state depends on the position of the nucleotide. Here we depict the final spatial model for Ste12, highlighting how the fitted parameters μ and ω specify the center and the width of the spatial distribution of emitted binding sites. The maximum height of the plateau corresponds to the parameter λ, which determines the rate at which binding sites are emitted. Not shown are the parameters ρ, which determines the probability that any site at all will be emitted, τ, which determines the extent of the strand bias of emitted sites, and η, a free parameter that determines the slope of the curve up to the plateau. (D) The model incorporates position weight matrix information (depicted in 1A) and spatial information (depicted in 1C) to arrive at a weight for each putative binding site. Here we plot, for each position, the expected value that the state variable corresponds to the beginning of a binding site.
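
The plateau in panel C can be illustrated with a small sketch. The caption names the parameters (μ, ω, λ, η) but does not give the functional form, so the trapezoid below, with linear ramps of slope η on either side of the plateau, is purely an assumption for illustration; the function name `site_emission_rate` is hypothetical as well.

```python
def site_emission_rate(position, mu, omega, lam, eta):
    """Illustrative position-dependent rate of emitting a binding site.

    ASSUMPTION: a trapezoid. The rate equals lam on the plateau
    [mu - omega, mu + omega] and falls off linearly with slope eta
    outside it, never dropping below zero. The caption does not state
    the model's actual curve; only the parameter roles are taken from it.
    """
    # Distance beyond the edge of the plateau (zero while on the plateau).
    distance_outside = max(0.0, abs(position - mu) - omega)
    return max(0.0, lam - eta * distance_outside)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to show the shape.
    for pos in (-10, -6, -4, 10):
        print(pos, site_emission_rate(pos, mu=-10, omega=4, lam=0.5, eta=0.125))
```

Under this assumed shape, a larger η gives steeper shoulders, while λ alone sets how many sites are expected inside the plateau.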

Figure 2.

Description of promoter signatures.

Promoter signatures for all transcription factors with more than twenty screened bound intergenic regions, excluding those with trainable signatures in unbound regions. Sequence logos depict the frequency matrices described in the main text. The blue region corresponds to the site-enriched plateau illustrated in figure 1C: it is centered at the location parameter μ and shows the range from μ−ω to μ+ω. If the region is gray, then either we were unable to find statistically significant support for training the parameters μ and ω (bottom four cases) or these trained parameters failed our shuffling test (top seven cases), indicating for these factors that promoter lengths alone are sufficient to explain their observed spatial restriction. The strand column depicts strand bias, from 100% reverse-strand bias (green) to 100% forward-strand bias (red). Circles in the count column depict the expected number of binding sites per promoter. Gray circles correspond to those sequences that better fit the monosite model, having strictly one site per promoter.

Figure 3.

Transcription factors exhibit a diversity of spatial preferences.

Score density is plotted against position. Score density is defined as the sum of positive log-two position weight matrix scores in a twenty base window, divided by the total number of possible binding site positions within that window of the training data. The black line is the simulated background score density; the gray area is the 95% confidence interval about that line. Confidence intervals are wide in windows far from the transcription start site due to the low number of intergenic regions in the training data reaching this distance. The green area is weighted by the model to be part of the promoter-signature distribution; the black area is weighted by the model to be part of the background distribution. Depicted factors are (a) Reb1, (b) Abf1, (c) Gcn4, and (d) Fhl1. No intergenic region used to train Fhl1’s spatial signature is as long as 1,000 base pairs, creating a blank area.
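
The score density definition above can be sketched as follows. The input layout (one list of log-two PWM scores per intergenic region, indexed by distance from the transcription start site) and the name `score_density` are assumptions made for this example; the simulated background and its confidence interval are not reproduced here.

```python
def score_density(score_tracks, window=20):
    """Sum of positive log2 PWM scores per window, divided by the number
    of candidate site positions that the training data has in that window.

    score_tracks[i][d] is assumed to hold the score of a site starting
    d bases from the transcription start site in intergenic region i.
    Regions may differ in length, so distant windows have fewer positions.
    """
    longest = max((len(track) for track in score_tracks), default=0)
    n_windows = (longest + window - 1) // window
    totals = [0.0] * n_windows
    counts = [0] * n_windows
    for track in score_tracks:
        for distance, score in enumerate(track):
            w = distance // window
            counts[w] += 1          # every position counts in the denominator
            if score > 0:
                totals[w] += score  # only positive scores enter the numerator
    # Windows that no region reaches have no defined density.
    return [t / c if c else None for t, c in zip(totals, counts)]


if __name__ == "__main__":
    tracks = [[1.0, -2.0, 3.0, 0.5], [2.0, -1.0]]
    print(score_density(tracks, window=2))
```

Because short regions contribute nothing to distant windows, the denominator shrinks with distance, which is consistent with the wide confidence intervals the caption describes far from the transcription start site.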

Figure 4.

Promoter signatures compensate for and increase the information available to weakly specified binding sites.

The specificity of a transcription factor is measured by the Kullback-Leibler divergence between the distribution of possible binding sites and a distribution of background sequences. Here we have calculated an analogous measure to quantify the specificity of our different spatial signature models: each model emits a distribution of possible promoter sequences, allowing us to approximate the Kullback-Leibler divergence, and hence the specificity, separating this promoter sequence distribution from a distribution of background sequences. This measure of information is plotted in orange here for each of five factors. For comparison, in blue, we plot the information carried by a simpler model. This model emits sequences with only a single site without strand bias or spatial restriction, and as such the information of the promoter distribution is solely the product of the information carried by the binding site. In every case, spacing, strand, and/or density substantially increase the information carried by the model. For these five factors, these restrictions allow them to share essentially the same promoter-level information content despite their diversity of binding site specificities.
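
For the binding-site-level quantity only, the Kullback-Leibler divergence can be computed exactly when the frequency matrix treats columns independently and the background is a single base composition. The sketch below assumes both, and the name `site_information` is hypothetical. It does not attempt the promoter-level approximation described in the caption.

```python
import math


def site_information(freq_matrix, background):
    """Kullback-Leibler divergence, in bits, between the sites emitted by
    a frequency matrix and an independent-base background.

    freq_matrix: list of columns, each a dict mapping base -> probability.
    background:  dict mapping base -> background probability.
    ASSUMPTION: columns are independent, so per-column terms simply add.
    """
    bits = 0.0
    for column in freq_matrix:
        for base, p in column.items():
            if p > 0:  # the limit of p * log(p) as p -> 0 is zero
                bits += p * math.log2(p / background[base])
    return bits


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    matrix = [
        {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},   # fully specified column
        {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0},   # two-fold degenerate column
    ]
    print(site_information(matrix, uniform))
```

A fully specified column contributes two bits against a uniform background and a two-fold degenerate column contributes one, so weakly specified motifs carry little site-level information on their own.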

Figure 5.

Spatial signature scores are correlated with expression change in transcription factor deletion mutants.

(A) Considering the top 50 target promoters predicted by each of four methods, ChIP-chip, our sequence signatures model, a simple thermodynamic model, and a single site model, we plot the number of cases in which that method’s predicted targets for a transcription factor exhibited a significantly different average expression upon deletion of the transcription factor. We shaded each bar by the number of times that, for a given method and transcription factor, the magnitude of this average expression difference ranked 1st (black), 2nd (dark gray), 3rd (light gray), or 4th (white) among the methods significantly associated with expression for that transcription factor. The targets predicted by the spatial signature model typically showed a greater magnitude of expression change upon factor deletion than did the targets predicted by the thermodynamic model (p = .0176, see Methods), which in turn typically exhibited a greater magnitude of expression change than those targets predicted by the single site model (p = .0005). (B) For each transcription factor where the top 50 predictions of ChIP-chip were associated with expression, we plot for each method the average expression change of its top 50 predicted targets upon that factor’s deletion. We derived 95% confidence intervals by resampling 5,000 sets of 50 putative targets chosen at random from the expression dataset and calculating the average expression change in each of these. Note that if the measured changes in expression are widespread, as is the case for the Nrg1 deletion, it is possible for the confidence interval derived from this null distribution to not contain zero.
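
The resampling in panel B can be sketched as below. The set size of 50 and the 5,000 resamples come from the caption; drawing each set without replacement, reading the interval off the 2.5th and 97.5th percentiles, and the name `null_interval` are assumptions made for this example.

```python
import random


def null_interval(expression_changes, set_size=50, n_resamples=5000, seed=0):
    """95% interval for the mean expression change of a random target set.

    ASSUMPTIONS: each set is drawn without replacement from the full list
    of per-gene expression changes, and the interval is taken from the
    empirical percentiles of the resampled means.
    """
    rng = random.Random(seed)
    means = sorted(
        sum(rng.sample(expression_changes, set_size)) / set_size
        for _ in range(n_resamples)
    )
    low_index = n_resamples * 25 // 1000
    high_index = max(low_index, n_resamples * 975 // 1000 - 1)
    return means[low_index], means[high_index]


if __name__ == "__main__":
    # Hypothetical data: every gene shifts downward after the deletion.
    changes = [-1.0 + 0.001 * i for i in range(200)]
    print(null_interval(changes, n_resamples=500))
```

With a dataset in which nearly every gene shifts in the same direction, the whole interval sits away from zero, which is the situation the caption notes for the Nrg1 deletion.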
