Conceived and designed the experiments: MA KL. Performed the experiments: MA KL. Analyzed the data: MA KL. Wrote the paper: MA KL. Contributed feedback and ideas: HL MN. Supervised the study: MN HL.
The authors have declared that no competing interests exist.
Protein binding microarrays (PBM) are a high throughput technology used to characterize protein-DNA binding. The arrays measure a protein's affinity toward thousands of double-stranded DNA sequences at once, producing a comprehensive binding specificity catalog. We present a linear model for predicting the binding affinity of a protein toward DNA sequences based on PBM data. Our model represents the measured intensity of an individual probe as a sum of the binding affinity contributions of the probe's subsequences. These subsequences characterize a DNA binding motif and can be used to predict the intensity of protein binding against arbitrary DNA sequences. Our method was the best performer in the Dialogue for Reverse Engineering Assessments and
DNA binding proteins form a diverse class of proteins that play crucial roles in many cellular processes. They replicate and repair the genome, transcribe genes, form the structure of chromatin, and mediate intracellular signals, among other activities
Methods for TF target identification can be divided into two broad classes: methods that observe binding sites directly, and methods that use TF binding specificity models to computationally identify putative binding sites. A traditional method for directly discovering the DNA binding sites of proteins has been to use protein-DNA crosslinking, followed by DNA fragmentation and chromatin immunoprecipitation (ChIP)
Protein binding microarrays (PBM)
PBM arrays are constructed by taking a normal oligonucleotide microarray and constructing a complementary strand for each probe using DNA polymerase. Probe-invariant flanking sequences are used as complementary targets for the polymerase primer. Antibody-labeled DNA binding protein is then allowed to bind to probes on the microarray slide, according to the protein's sequence binding preferences.
Binding motifs can be identified from ChIP, DIP or PBM experiments by computationally analyzing the sequence or microarray data produced by the measurement platform. Any discovered motifs are then represented in the form of a motif model for subsequent use in binding site prediction. Commonly used models include consensus sequences, position frequency matrices (PFM) and position weight matrices (PWM)
Representation of binding motifs as PFMs or PWMs makes the implicit assumption that all mononucleotides contribute independently to the binding affinity. Studies done on zinc finger proteins have challenged this assumption
The literature describes a number of different algorithms for inferring motif models from binding-enriched unaligned sequences. Lawrence et al. formulate the problem using a model-based approach and develop a Gibbs sampling technique for statistical inference
Motif models have been successfully applied in several biological contexts in the past. For example, Litvak et al. recently used PWM motif scanning to predict a feed-forward motif (consisting of NFkB, ATF3 and CEBPd) that was shown to shape the transcriptional response of TLR stimulated macrophages
While the PWM motif model has proven its usefulness in many applications, more general approaches can also be considered. One alternative is the full 8-mer model described by Chen et al. in their RankMotif++ paper. Chen et al. compare the performance of RankMotif++ against a full 8-mer model where the signal intensity of a PBM probe is predicted by taking the 8-mer subsequence with the highest median intensity across the probes containing it on the training PBM array, and using that median intensity as the prediction for the target array
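The full 8-mer predictor described above can be sketched in a few lines. This is only an illustrative reading of the description, not the implementation of Chen et al.: the function names, the `default` fallback for probes with no known K-mer, and the choice to ignore reverse complements are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import median


def kmers(seq, k):
    """Return all contiguous k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def median_kmer_intensities(probes, intensities, k=8):
    """Median training intensity over the probes that contain each k-mer."""
    by_kmer = defaultdict(list)
    for seq, value in zip(probes, intensities):
        for kmer in set(kmers(seq, k)):  # count each probe once per k-mer
            by_kmer[kmer].append(value)
    return {kmer: median(values) for kmer, values in by_kmer.items()}


def hmik_predict(seq, kmer_medians, k=8, default=0.0):
    """Predict a probe intensity as the highest median among its k-mers.

    `default` (an assumption) is returned when no k-mer of seq was seen
    on the training array.
    """
    scores = [kmer_medians[m] for m in kmers(seq, k) if m in kmer_medians]
    return max(scores) if scores else default
```

In use, `median_kmer_intensities` would be run once on the training array, and `hmik_predict` would then be applied to each probe sequence of the target array.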
We present a new linear motif model that represents a TF's binding affinity toward a DNA sequence as a linear combination of its affinities towards the variable-length K-mers that make up the DNA sequence. Here by “binding affinity” we refer to a quantity that measures the relative specificity of a protein towards a particular DNA sequence. It should be noted that these binding affinities do not directly correspond to dissociation constants or other physical quantities. Our motif model can be learned from any binding affinity data where a binding affinity score is associated with each interrogated DNA sequence. The model produces prediction results better than those produced using full 8-mer models, while having a more compact motif representation. We illustrate the power of our model by applying it to PBM data from the DREAM5 transcription factor/DNA motif recognition challenge
Since we use K-mers rather than mononucleotides, our model can capture full binding specificity information for short motifs (shorter than 9 bases). For longer motifs, our model assumes that binding affinity can be modeled as an additive effect of the component K-mers. Since the additivity assumption has been found to be a good approximation at the mononucleotide level
In addition to the prediction of binding affinity, an important related problem is the identification of unknown bound TFs. Such a problem can arise, for example, when a set of genes having similar gene expression profiles share a common regulator or when indirect binding sites are found in ChIP experiments. The increasing interest in differential transcriptional regulation between individuals also highlights the importance of TF identification
Our binding model represents the measured binding affinity of a protein towards a DNA sequence as the sum of the binding affinity contributions of the sequence's constituent subsequences. The subsequences are allowed to vary in length, so that the affinity contributions of all constituent 4–8-mers are included in the model. Without loss of generality, we can restrict our discussion to the case where the motif model is learned from PBM data, so that the differentially bound sequences come from dsDNA probes.
As the first step of our algorithm, the K-mers present in the probe sequences on a PBM array are represented as a design matrix
Both the flanking and interrogating sequences are considered when building the matrix. A column for a constant background component is also included in the matrix.
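As a minimal sketch of how such a design matrix might be assembled, the code below stores one sparse row per probe. It assumes that each entry is the number of times a K-mer occurs in the full probe sequence (flanking plus interrogating part), and that column 0 holds the constant background component; the function names and the dictionary-based sparse layout are hypothetical choices for the example.

```python
from itertools import product


def kmer_index(k_min=4, k_max=8, alphabet="ACGT"):
    """Map every K-mer with k_min <= K <= k_max to a column index.

    Column 0 is reserved for the constant background component,
    so K-mer columns start at 1.
    """
    index = {}
    for k in range(k_min, k_max + 1):
        for letters in product(alphabet, repeat=k):
            index["".join(letters)] = len(index) + 1
    return index


def design_rows(probes, index, k_min=4, k_max=8):
    """Build sparse rows {column: count}, one per probe sequence."""
    rows = []
    for seq in probes:
        row = {0: 1}  # background column is 1 for every probe
        for k in range(k_min, k_max + 1):
            for i in range(len(seq) - k + 1):
                col = index.get(seq[i:i + k])
                if col is not None:  # skip K-mers outside the model
                    row[col] = row.get(col, 0) + 1
        rows.append(row)
    return rows
```

With the default lengths, `kmer_index()` contains 4^4 + 4^5 + 4^6 + 4^7 + 4^8 = 87 296 K-mers, which matches the figure of roughly 90 000 unknowns for the full 4–8-mer model.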
If a probe sequence
Once a design matrix has been constructed, we solve the K-mer affinity contributions from the linear system
If all 4–8-mers are included in the model, the system is underdetermined, having roughly 90 000 unknowns. For this reason, we regularize the system by only including 7–8-mers with the highest median intensity across the probes that incorporated them. We also include all 4–6-mers, since they are critical for accurately predicting the intensities of low affinity probes. This regularization approach is based on the assumption that K-mers with the highest median intensity are the most informative in terms of protein binding.
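The selection step can be illustrated roughly as follows. This is a sketch under stated assumptions: the number of long K-mers kept (`n_top`) is left as a parameter, the 7-mers and 8-mers are ranked together in a single list, and only short K-mers actually observed on the array are returned (unobserved ones would contribute all-zero columns anyway). None of these details are confirmed by the text.

```python
from collections import defaultdict
from statistics import median


def select_kmers(probes, intensities, n_top, short=(4, 5, 6), long=(7, 8)):
    """Keep every observed short K-mer plus the n_top long K-mers
    with the highest median intensity across the probes containing them."""
    values = defaultdict(list)
    for seq, y in zip(probes, intensities):
        seen = set()
        for k in short + long:
            for i in range(len(seq) - k + 1):
                seen.add(seq[i:i + k])
        for kmer in seen:  # each probe contributes once per K-mer
            values[kmer].append(y)
    keep = {m for m in values if len(m) in short}
    ranked = sorted((m for m in values if len(m) in long),
                    key=lambda m: median(values[m]), reverse=True)
    return keep | set(ranked[:n_top])
```

The returned set would then define which K-mer columns are retained in the design matrix.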
We also considered regularizing the system by minimizing an
The sparse but large linear system is solved for the affinity vector by applying the conjugate gradient method to the normal equations
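A minimal sketch of conjugate gradient iteration on the normal equations is given below, written for a sparse matrix stored as one `{column: value}` dictionary per row. The symbols H (design matrix), y (probe intensities) and a (affinity vector) are labels chosen for this example, as are the function names. The variant shown never forms the product of the transposed matrix with the matrix explicitly, which is an implementation choice and not something stated in the text.

```python
def matvec(rows, x):
    """Compute H x for sparse rows given as {column: value} dicts."""
    return [sum(v * x[j] for j, v in row.items()) for row in rows]


def rmatvec(rows, r, n_cols):
    """Compute H^T r for the same sparse row representation."""
    out = [0.0] * n_cols
    for row, ri in zip(rows, r):
        for j, v in row.items():
            out[j] += v * ri
    return out


def cg_normal_equations(rows, y, n_cols, n_iter=10):
    """Run up to n_iter conjugate gradient steps on H^T H a = H^T y,
    starting from a = 0."""
    a = [0.0] * n_cols
    r = list(y)                     # residual y - H a (a is zero here)
    s = rmatvec(rows, r, n_cols)    # H^T r, the negative gradient
    p = list(s)                     # first search direction
    gamma = sum(v * v for v in s)
    for _ in range(n_iter):
        if gamma == 0.0:
            break                   # already at a least squares solution
        q = matvec(rows, p)
        alpha = gamma / sum(v * v for v in q)
        a = [ai + alpha * pi for ai, pi in zip(a, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        s = rmatvec(rows, r, n_cols)
        gamma_new = sum(v * v for v in s)
        p = [si + (gamma_new / gamma) * pi for si, pi in zip(s, p)]
        gamma = gamma_new
    return a
```

Stopping after a small, fixed number of iterations means the solution is generally not fully converged, which can itself act as a mild form of regularization.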
Once the affinity vector
If we are dealing with raw PBM array data, we have to preprocess and normalize the probe intensity profiles before solving the affinity vector
PBM samples are first preprocessed by removing dark outlier probes and performing spatial detrending. The samples are then quantile normalized before application of the linear model. In this example, binding intensities are predicted for probe sequences on another PBM array, but they could just as well be predicted for genomic sequences or any other DNA sequences.
In the first preprocessing step of
(a) A filter cutoff point is determined based on the intensity histogram. (b) Two examples of how low intensity filtering successfully removes dark edge artifacts in PBM samples. In both sample pairs, the original sample is on the left, and the filtered sample on the right. Red pixels indicate missing or discarded intensity values.
Next, spatial detrending is performed on the data by rescaling the intensity of each microarray spot by the ratio of the global median and the median calculated within a 7×7 window centered on the spot. This step compensates for the spatial trends (light or dark blotches) often seen in microarray data (
(a) A 7×7 median window is used to rescale probe intensities. (b) Two examples of how the spatial detrending step successfully removes large light and dark regions (spatial artifacts) in PBM samples. Original samples on the left, preprocessed samples on the right. Red pixels indicate missing or discarded intensity values.
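The detrending rule can be sketched as below for an array stored as a grid of spot intensities. Several details are assumptions made for the example and are not taken from the text: missing or discarded spots are encoded as `None` and ignored when computing medians, the window is simply truncated at the array borders, and spots whose local median is not positive are left as `None`.

```python
from statistics import median


def spatial_detrend(grid, window=7):
    """Rescale each spot by (global median / local window median).

    grid is a list of rows; None marks a missing or discarded spot.
    """
    half = window // 2
    n_rows, n_cols = len(grid), len(grid[0])
    global_med = median(v for row in grid for v in row if v is not None)
    out = [[None] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            if grid[i][j] is None:
                continue  # missing spots stay missing
            # Valid spots inside the window, truncated at the array edges.
            local = [grid[a][b]
                     for a in range(max(0, i - half), min(n_rows, i + half + 1))
                     for b in range(max(0, j - half), min(n_cols, j + half + 1))
                     if grid[a][b] is not None]
            local_med = median(local)
            if local_med > 0:
                out[i][j] = grid[i][j] * global_med / local_med
    return out
```

A spot in a uniformly dark blotch has a local median below the global median, so its intensity is scaled up; the opposite holds for light blotches.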
In the normalization step, the samples used in learning the motif models are quantile normalized. Quantile normalization assumes that the true intensity distributions (uncontaminated by experimental errors) of different transcription factors have roughly similar shapes. The validity of this assumption is subject to debate, but according to our tests, quantile normalization does improve the accuracy of our model's predictions. We suspect that this improvement is largely due to quantile normalization's ability to recover the high intensity tails in saturated PBM samples (
An example of how quantile normalization can recover the high intensity tails in saturated PBM samples. The figure shows how the log-intensity histogram of the Foxo3 PBM sample is changed by quantile normalization. The saturated probe intensities (highlighted in red) are recovered by fitting them to the consensus distribution.
It is critical that we do not simply discard the saturated probes as we did with dark probes, because whereas dark probes can be considered non-informative, high intensity probes are the most informative features in terms of binding affinity. Ideally, the saturated probes would be dealt with by improving the experimental setup and protocol. But in cases where this cannot be done, a computational method is needed.
It is worth noting that the saturation peaks in the intensity histograms are somewhat spread out from the absolute saturation ceiling, so that an ordering exists even for the saturated probes. This ordering may not actually contain any useful information, but if it does, then our quantile normalization step can effectively utilize this information by maintaining the intensity ordering while extrapolating probe intensities beyond the saturation ceiling.
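The rank-preserving behaviour described above can be seen in a minimal sketch of quantile normalization. The example assumes the standard construction in which the consensus distribution is the mean of the sorted intensities across samples, and that all samples contain the same number of probes; how the consensus is actually built, and how filtered probes are handled, is not specified in the text.

```python
def quantile_normalize(samples):
    """Map each sample onto a shared consensus distribution, keeping ranks.

    samples is a list of equal-length lists of probe intensities.
    """
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Consensus value for each rank: mean across samples (an assumption).
    consensus = [sum(col) / len(samples) for col in zip(*sorted_samples)]
    normalized = []
    for s in samples:
        # Probe indices from lowest to highest intensity; ties keep input order.
        order = sorted(range(n), key=lambda i: s[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = consensus[rank]
        normalized.append(out)
    return normalized
```

Because each probe receives the consensus value of its rank, probes piled up near a saturation ceiling keep their relative order but are spread out over the consensus tail.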
In solving our linear system using the ordinary least squares method, we implicitly assume
TFs were identified by deducing PWM motifs from the PBM data and comparing the PWMs to mammalian PWMs from TRANSFAC release 2010.2
We assessed the performance of our protein-DNA binding model using PBM data from the DREAM5 transcription factor/DNA motif recognition challenge. The dataset consisted of 86 paired PBM samples for a total of 82 murine transcription factors, each hybridized onto two different PBM platforms (HK and ME). Transcription factors
The goal of the challenge was to predict probe intensities on one array based on intensities measured on the other array. We applied our binding model to the problem by first learning the TF specific affinity vectors
In some samples, a relatively large number of probes were found to be saturated at high intensities (
Each figure shows a probe log-intensity histogram where saturated probes are highlighted in red.
The y-axis represents predicted probe intensities, while the x-axis represents true probe intensities on the reference array. The scatter plots clearly indicate the negative effect that reference sample saturation has on assessing the accuracy of model predictions.
Using preprocessing and quantile normalization for the training samples only, our model was capable of predicting probe intensities on the target array with average Pearson and Spearman correlations of 0.624 and 0.624, respectively, across the 86 paired PBM samples. This placed our method as the best performer in the DREAM5 challenge final ranking.
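For reference, the two correlation measures can be computed as in the following sketch. It uses the textbook definitions (Spearman as the Pearson correlation of average ranks) and is not the DREAM5 evaluation code; whether the challenge applied any transformation to the intensities beforehand is not addressed here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for i in order[start:end + 1]:
            ranks[i] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

Pearson correlation is sensitive to the intensity scale, and therefore to saturation in the reference sample, whereas Spearman correlation depends only on the ordering of the probes.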
Due to space constraints, results are only shown for the first 20 TFs. Results for all TFs are provided in supplementary
After the DREAM5 challenge we also tested the effect of applying preprocessing and quantile normalization to both the training samples and reference samples. The two groups of samples were normalized separately. The result was that the average Pearson and Spearman correlations increased to 0.670 and 0.670, respectively. Although we suspect that these reference-corrected correlations are probably more indicative of the model's true predictive power, we will hereafter only discuss correlations against uncorrected reference samples, analogous to the original DREAM5 performance evaluations.
Berger et al. report a Spearman correlation of 0.53 for their 8-mer E-scores for a single TF across two different PBM array designs. For two technical replicates from a single PBM array they observe a Spearman correlation of 0.91
In their paper, Chen et al. showed that the 8-mer based HMIK predictor performed better than the PWM motif models at predicting binding affinities. The PWMs in these comparisons were derived using a number of PWM discovery algorithms, including MatrixREDUCE, MDScan, PREGO, Seed and Wobble and RankMotif++
We also assessed the effect of the preprocessing steps on prediction accuracy: averaged correlations across all 86 PBM samples are shown in
| | Original signal | + Low filtering | + Spatial detrending | + Quantile normalization |
| | 0.603 | 0.603 | 0.607 | 0.624 |
| | 0.618 | 0.620 | 0.623 | 0.624 |
The accuracy of our model depends on the maximum length of the K-mers included in the design matrix. Although the additive model allows reasonably good predictions to be made using K-mers as short as 4 bases, the accuracy does consistently improve as the K-mer length approaches 8 bases. No significant improvement is seen from including K-mers longer than 8 bases (
To prevent the linear system from becoming underdetermined, the K-mers in this figure were regularized so that for 7–9-mers, only the 1000 most informative K-mers were included.
Since our model associates each K-mer with a TF specific binding affinity, we can better understand the binding specificity of a TF by studying its top affinity K-mers. These highest affinity K-mers can then be contrasted with K-mers selected according to median probe intensity. We performed this comparison, and noted that the top median intensity K-mer lists mostly contained 8-mers and hardly any shorter K-mers. We also observed a disproportionally high number of 8-mers containing guanine or cytosine repeats among the top median intensity K-mers. In contrast, among the top affinity K-mers we saw many short K-mers, and less enrichment for the G/C repeats. The top affinity K-mers were also in excellent agreement with the TF binding motifs found in JASPAR Core, even for gapped motifs (
Shown at the top of the figure are JASPAR Core sequence logos for four TFs. Visible below the sequence logos are the top five highest affinity K-mers from the linear model, for all four TFs. An arrow and the characters “RC” indicate reverse complement K-mers. All sequence logos are for Mus musculus, and were downloaded from JASPAR.
We further found that the 4-mer affinities learned by the model were significantly correlated across unrelated TFs, with an average Pearson correlation of 0.56. Correlations between technical replicates were even higher, typically in the neighborhood of 0.90. This result implies that even though 4-mer affinities do show variation between TFs, they also have a shared background that may reflect either a systematic artifact in PBM measurements or a common theme in TF binding.
We tested the probe noise model by fitting an affine curve to the
To better handle proteins with a binding specificity for gapped sequences, we experimented with incorporating gapped K-mers into our linear model. We extended our model with all 8-mers with a single nucleotide gap in the middle, and then regularized the gapped K-mers using the same median intensity approach we used for the contiguous K-mers. We found that including 500 gapped 8-mers in the model improved prediction results in a statistically significant manner (p =
Next we studied the effect of strand specificity on our model by constraining all reverse complement K-mers to have equal binding affinity contributions. We found that the loss of strand specificity induced by the constraint had a systematic negative effect on prediction accuracies. The average Pearson correlation across all 86 PBM samples dropped significantly from 0.62 to 0.56 (p<
The bonus round of the DREAM5 challenge involved identifying the unnamed transcription factors hybridized to the test PBM arrays. To achieve this, we ran the motif discovery tool MEME and compared the discovered motifs to known mammalian TF motifs in TRANSFAC and JASPAR. However, motif databases contain only a few motifs for each TF family, so the exact TF often cannot be reliably identified from the database match alone. Therefore, when Tomtom predicted the same TF name for several TFs, we used the literature to distinguish them. For example, TFs #13 and #51 in the DREAM5 dataset were both predicted to belong to the POU family of transcription factors. However,
At the top of the figure are shown the MEME-predicted sequence logos for Pou1f1 and Pou2f1. Below are shown the binding site consensus sequences from literature
We have presented a linear model for uncovering TF binding specificity based on PBM measurements. While we only tested our model using data and metrics from the DREAM5 transcription factor/DNA motif recognition challenge, our model can also be applied in less artificial contexts. One obvious application is to use our model for predicting genomic binding sites and their associated TF affinities. This can be achieved by first generating the K-mer affinity vector
Several extensions to our model can be considered. One idea is to enhance our algorithm's normalization step by learning the consensus distribution based on unsaturated samples only, and to then fit only saturated or otherwise aberrant samples to the consensus distribution. A shape-preserving normalization technique would then be applied to the unsaturated samples to bring the rest of the samples to the same intensity scale. Another idea is to learn motif models from ChIP-seq data by placing sequence windows on top of ChIP-seq peaks and using the sequences within those windows to build the design matrix and learn the affinity vector
Another potential extension to our model lies in defining a distance metric for evaluating the similarity of two motifs. Indeed, many metrics have been proposed in the literature for measuring the similarity of PWM motifs: one such example is the Tomtom algorithm by Gupta et al.
One issue with the current model is that the design matrix columns are not independent; for instance, the column for a 4-mer is often a linear combination of four 5-mer columns. This means that the K-mer affinity solutions are not unique. We tried to avoid this issue by first learning a 4-mer model, then learning a 5-mer model based on the residual, then a 6-mer model on the new residual and so forth, but found that our original model (with 10 conjugate gradient method iterations) systematically produced better results (p =
One downside of our linear model is that the motif model lacks an intuitive visual representation. Due to their mononucleotide-based nature, PWM models can be visualized as graphical sequence logos that are easily interpreted by humans. The same cannot be said of K-mer based models, where the motif can only be described as a set of K-mers toward which a protein has a high binding affinity. Interpretation is made particularly difficult by the lack of positional information for the motif's constituent K-mers. One way to visualize K-mer based models would be to convert the model into one or more PWM motif models that attempt to encode the same specificities as the K-mer model.
One interesting problem in motif modeling is the handling of proteins with gapped motifs, i.e. proteins whose DNA binding motifs contain positions where the nucleotide content does not matter. PWM models can handle such gaps by giving low weights to all mononucleotides inside the gaps. Our proposed K-mer model does not weigh individual nucleotides within K-mers, and hence does not model gapped motifs in this way. Instead, gapped motifs are modeled as a sum of the affinity contributions of the K-mers found before and after the gap. A potential advantage of modeling gapped motifs in this manner is that the length of the gap is not rigidly constrained. This allows the model to accommodate proteins that bind with motifs of variable gap size. On the other hand, this makes our model less powerful at handling proteins for which the gap size is rigidly constrained. It is also important to ensure that the sequences for which the design matrices are built are not too long, so that the K-mer constituents of a gapped motif are constrained to be reasonably close to one another in the sequence.
In conclusion, our linear K-mer based motif model represents a departure from traditional PWM based motif models, and was the best performing method in the DREAM5 transcription factor/DNA motif recognition challenge. Based on our own measurements, the model exhibits significantly higher performance than the full 8-mer model described by Chen et al., while producing more compact motif model representations. This suggests that K-mer based motif models may provide a practical and powerful alternative to mononucleotide models.
Examples of PBM samples with spatial artifacts. Red pixels indicate missing intensity values.
(TIF)
Effect of the regularized K-mer count on prediction accuracy.
(TIF)
Probe noise modeling. The figure shows a scatter plot of the relationship between average probe intensities and sample standard deviations, across three Zscan10 PBM replicate samples. Also shown is the least squares linear fit to the data.
(TIF)
Prediction accuracies for all 86 sample pairs (HK→ME). Legend: This table contains the full prediction accuracy assessments for all 86 PBM sample pairs and 7 different prediction models. Included among the 7 models are the highest median intensity K-mer (HMIK) predictor, and 6 versions of the linear prediction model with different preprocessing steps.
(XLS)
Prediction accuracies for all 86 sample pairs (ME→HK). Legend: This table contains the full prediction accuracy assessments for all 86 PBM sample pairs and 7 different prediction models. Included among the 7 models are the highest median intensity K-mer (HMIK) predictor, and 6 versions of the linear prediction model with different preprocessing steps.
(XLS)
Top 20 highest affinity K-mers for all 86 HK array samples. Legend: This table lists the top 20 highest affinity K-mers for each of the 86 HK array samples. The K-mers were learned using the linear model with low intensity probe filtering, spatial detrending and quantile normalization enabled.
(XLS)
Top 20 highest median intensity K-mers for all 86 HK array samples. Legend: This table lists the top 20 highest median intensity K-mers for each of the 86 HK array samples. By the median intensity of a K-mer we mean the median intensity across all probes that contained the K-mer. This table is provided for the purposes of comparing with
(XLS)
Comparison between strand specific and non-specific models. Legend: This table lists the prediction accuracies for both the strand specific and non-specific models, for all 86 paired PBM samples. The predictions were made in the HK→ME direction.
(XLS)