Single-Nucleotide Mutation Matrix: A New Model for Predicting the NF-κB DNA Binding Sites

In this study, we established a single nucleotide mutation matrix (SNMM) model based on the relative binding affinities of NF-κB p50 homodimer to a wild-type binding site (GGGACTTTCC) and its all single-nucleotide mutants detected with the double-stranded DNA microarray. We evaluated this model by scoring different groups of 10-bp DNA sequences with this model and analyzing the correlations between the scores and the relative binding affinities detected with three wet experiments, including the electrophoresis mobility shift assay (EMSA), the protein-binding microarray (PBM) and the systematic evolution of ligands by exponential enrichment-sequencing (SELEX-Seq). The results revealed that the SNMM scores were strongly correlated with the detected binding affinities. We also scored the DNA sequences with other three models, including the principal coordinate (PC) model, the position weight matrix scoring algorithm (PWMSA) model and the Match model, and analyzed the correlations between the scores and the detected binding affinities. In comparison with these models, the SNMM model achieved reliable results. We finally determined 0.747 as the optimal threshold for predicting the NF-κB DNA-binding sites with the SNMM model. The SNMM model thus provides a new alternative model for scoring the relative binding affinities of NF-κB to the 10-bp DNA sequences and predicting the NF-κB DNA-binding sites.


Introduction
NF-kB is a transcription factor (TF) identified in lymphocyte and then found to regulate the transcriptions of a large number of genes in most of cell types [1]. The NF-kB/Rel family consists of five members, including RelA/p65, c-Rel, RelB, NF-kB1/p50 and NF-kB2/p52, which can function as heterodimers or homodimers in the regulation of gene transcription [2]. The heterodimer formed by p50 and p65 is the most common functional NF-kB dimer in the mammalian cells, which regulates many important biological processes, such as immunity and inflammation [3,4]. In this dimer, both subunits contact DNA, but only p65 contains a transactivation domain [5]. Like this dimer, other dimers formed by different members of NF-kB family can also bind the DNAbinding sites (DBSs) in the genome. Therefore, NF-kB is a wellknown sequence-specific DNA-binding TF.
The consensus of the NF-kB DBSs was first identified as GGGRNNTYCC(R: G, A; Y: C, T; N: G, A, T, C) [6]. Subsequently, based on several in vitro assays, including electrophoresis mobility shift assay (EMSA), protein-binding microarray (PBM) and systematic evolution of ligands by exponential enrichment-sequencing (SELEX-Seq), this consensus was gradually expanded into GGRRNNYYCC [7][8][9] and RGGRNNHHYY (H: A, T, C) [10]. Furthermore, the analysis of the in vivo binding locations identified with the chromatin immunoprecipitation-sequencing (ChIP-Seq) revealed that NF-kB bound many of its target genes via these expanded consensuses [10,11]. These data suggest that the DBSs of this TF have high variability. Therefore, it is still a great challenge to identify its all DBSs in the whole genome with wet experiments. In this case, the bioinformatic models for predicting DBSs would be helpful. The predicted putative DBSs provide more confident targets to the wet experiments, which can facilitate the identification of the functional DBSs of this TF in various cells [10].
In recent years, many bioinformatic models have been developed for predicting the putative DBSs of various TFs, such as position weight matrix scoring algorithm (PWMSA) [12], Match [13], TFSEARCH (http://www.rwcp.or.jp/papia), Mapper [14], and Matinspector [15]. These models can be used to predict the relative binding affinities of NF-kB to various DNA sequences by using a position weight matrix (PWM) that represents the DNAbinding motif of this TF, such as PWM M00051 NF-kB p50 in the TRANSFAC database. However, these models were not developed as the NF-kB-exclusive tool. Fortunately, one NF-kB-specific model had been developed, that is the Principal Coordinate (PC) model [8]. This model was constructed by training on the data of the relative binding affinities of NF-kB p50 homodimer to 52 variant DNA sequences representing the consensus of GGRRNNYYCC [8]. Unfortunately, this model can only be used to predict the binding affinity of NF-kB p50 to 256 DNA sequences belonging to this consensus. This limitation prevents it from more wide application in the identification of all variant potential NF-kB DBSs in the mammalian genomes. Therefore, more compatible NF-kB-specific models are still in need for this most intensively studied TF.
We have ever measured the binding affinities of NF-kB p50 homodimer to a wild-type binding site (GGGACTTTCC) and its all single-nucleotide mutants with an unimolecular doublestranded DNA (dsDNA) microarray for finding the relative importance of each position to the interaction of NF-kB with its cognate DNA sites [16]. In this paper, we confirmed these binding affinities with a newly fabricated bimolecular dsDNA microarray and constructed a single nucleotide mutation matrix (SNMM) according to the relative binding affinities measured with this wet experiment. We then applied this model to the prediction of NF-kB p50 homodimer to various DNA sequences in order to evaluate its reliability. We also compared this model with other three models, including PC, PWMSA and Match.

Measurement of the DNA binding affinity with DNA microarray
Pairs of complementary oligonucleotides were chemically synthesized and one was modified with amino group at the 59 end. The paired complementary oligonucleotides were annealed in the TEN buffer (10 mM Tris-HCl, pH8.0, 100 mM NaCl, 1 mM EDTA) by incubating at 95uC for 5 min and then cooling to the room temperature gradually. The annealed double-stranded oligonucleotides were spotted (AD1500, BioDot) on the glass slides with aldehyde group (CapitalBio) to form the bimolecular dsDNA microarray. The bimolecular dsDNA microarray was incubated with the Cy3-labeled NF-kB p50 protein (Promega) by using a protocol as previously described [16]. The protein-bound dsDNA microarray was scanned with a scanner (LuxScan 10K, CapitalBio) and the fluorescent signal intensity was quantified with Image J.

Calculation of the correlation coefficient and p value
The Pearson correlation coefficient (Pearson's r) has been widely used as a measure of the strength of linear dependence between two variables [17,18]. Pearson's r ranges from -1 to +1. The negative and positive r values mean negative and positive correlation between two variables, respectively, and the zero r value means no correlation. The absolute r value between 0 and 0.2, 0.2 and 0.4, 0.4 and 0.6, 0.6 and 0.8, and 0.8 and 1.0 mean very weak, weak, medium, strong and very strong correlations, respectively. To demonstrate the statistical significance of the correlation coefficient, the confidence interval (p value) of the Pearson's r was calculated. The p values less than 0.05 and 0.01 mean that Pearson's r is significant and extremely significant, respectively. In this study, we calculated the Pearson's r and p value with MATLAB7.0 software.

Evaluation of the SNMM model with the experimental data
For evaluating the SNMM model, we used three resources of NF-kB binding affinity data. The NF-kB binding affinity data measured by EMSA were collected from the previous study performed by Udalova et al [8], which detected the binding affinity of NF-kB p50 homodimer to 52 DNA sequences with the radioactive EMSA and constructed a PC model for predicting the binding affinity of NF-kB to the DNA sequences belonging to the consensus of GGRRNNYYCC. The NF-kB binding affinity data measured by PBM were collected from the previous study performed by Siggers et al [19], which detected the binding of NF-kB p50 to various DNA sequences with PBM. The NF-kB binding affinity data measured by SELEX-Seq were collected from the study performed by our lab [20], and the data was deposited into the Gene Expression Omnibus (GEO) database under the accession number of GSE:48660.
The PWMSA model was developed by Stormo et al [12]. The Match model was developed by Kel et al. [13]. When predicting the binding affinities of a transcription factor to DNA sequences by using these two models, a known PWM of the interested transcription factor is needed. To predict the binding affinities of NF-kB to DNA sequences with these two models, we employed a known PWM of NF-kB p50 constructed with 18 SELEX-selected DNA sequences by Kunsch et al [21], which was collected in transcription factor database TRANSFAC (accession number: M00051) ( Table S1 in File S1). The TRANSFAC database provides an on-line Match prediction program. When predicting the binding affinity with the Match model, we input the DNA sequences to this on-line program. When predicting the binding affinity with the PWMSA model, we used a Perl script written by ourselves according to the formulas described in File S1.

Determination of the optimal SNMM threshold for predicting the NF-kB DBSs
The optimal threshold was determined as previously described [13,22]. To determine the optimal threshold, two groups of sequences, S1 and S2, were first selected. S1 consists of N sequences that are the NF-kB DBSs. S2 consists of M sequences that are not the NF-kB DBSs. The S1 and S2 sequences were predicted with the SNMM model. The threshold was taken from 0 to 1 at an interval of 0.001. At the threshold of x, if n S1 sequences were not predicted as the NF-kB DBSs, the false negative ratio (rFN) were [N-n]/N. Similarly, at the threshold of x, if m S2 sequences were predicted as the NF-kB DBSs, the false positive ratio (rFP) was m/M. The threshold that resulted in the lowest rFN and rFP was regarded as the optimal threshold.

Measurement of the DNA binding affinity with EMSA
This study also detected the binding affinity of NF-kB to several DNA sequences with EMSA. Fifty five nanograms of the NF-kB p50 protein (Promega) was incubated with 1 pmol of the biotinlabeled dsDNA in a 10-mL protein binding reaction at room temperature for 1 h. The reaction was detected with an infrared fluorescence EMSA (NIRF-EMSA) exactly as previously described [23].

Measurement of DNA-binding affinity with DNA microarray
The relative binding affinities of the NF-kB p50 homodimer to a wild-type binding site (GGGACTTTCC) [1,24] and its all possible single nucleotide mutants were detected with the bimolecular dsDNA microarray. A representative fluorescent image of the dsDNA microarray detections is shown in Figure 1. The signal intensities of fluorescence images were quantified with the Image J software and the results were shown in Table S2 in File S1. These relative binding affinity data were in agreement with those we previously measured with the unimolecular dsDNA microarrays [16]. The bimolecular dsDNA microarray differs from the unimolecular dsDNA microarray at the DNA probe structure as previously described [25,26].

Construction of the SNMM model
The single-nucleotide mutant matrix (SNMM) was constructed according to the method previously described by Veprintsev et al. [27]. All binding affinities were normalized by subtracting the signal intensity values (Table S2 in File S1) with that of the reference sequence (GGGACTTTCC). As a result, a 4610 matrix was obtained ( Table 1).
The SNMM score of a particular sequence was calculated according to the following equation: is the normalized binding affinity value of the reference sequence (0); S (i, n) is the score of base i at the position n in SNMM of a sequence; gDS (i, n) is the sum of scores of bases at each position in SNMM of a sequence; S min is the sum of the lowest scores at each position in SNMM; S max is the sum of the highest scores at each position in SNMM.

Evaluation of the SNMM model with the experimental data
Evaluating the SNMM model with the EMSA-measured data. Udalova et al. measured the relative binding affinity of the Table 1. The single-nucleotide mutant matrix (SNMM).   Table 2. We also collected the PC scores of these sequences from the published data [8] and included in Table 2. With these data, we analyzed the correlation between the EMSA values and the scores obtained by various models. The results are shown in Figure 2A.  [19]. The binding affinity was presented as z score. The high z score means the high binding affinity. In comparison with the PBM-detected sequences, only 41 EMSA-detected sequences ( Table 2) had the z scores (PBM in Table 2). We first analyzed the correlation between the PBM z scores and the EMSA values. The results reveal that the EMSA values are very strongly correlated with the PBM z scores (Pearson's r: 0.884) ( Figure 2B), indicating that two wet experiments obtained the coincident results. We then analyzed the correlation between the PBM z scores and the scores obtained with four models. The results reveal that the SNMM scores are strongly correlated with the PBM z scores (Pearson's r: 0.624) ( Figure 2B). The scores obtained with other models are also strongly correlated with the PBM z scores ( Figure 2B). Therefore, the SNMM scores are strongly correlated with both the EMSA-and PBM-measured data.
Evaluating the SNMM model with the SELEX-Seqmeasured data. Recently, we detected the relative binding affinities of the NF-kB p50 homodimer to all 10-mer DNA sequences with SELEX-Seq [20]. We obtained the SELEX-Seq fold enrichment value (name SELEX-Seq value hereafter) of the selected sequences after a four-round selection. The SELEX-Seq value reflects the relative binding affinity of the NF-kB p50 homodimer to a sequence. After a four-round selection, we obtained 7,282,890 10-mer reads that contained 242,957 variant 10-mer sequences [20].
To evaluate the SNMM model with the SELEX-Seq values, we scored all these 10-mer sequences with the SNMM model and found that there were 3660 sequences with the SNMM score $ 0.747 (a threshold for predicting NF-kB DBSs with SNMM, see below) (Table S3 in File S1). We then scored these sequences with the PWMSA and Match models (Table S3 in File S1) and analyzed the correlations between the SELEX-Seq values with the SNMM, PWMSA and Match scores, respectively. The results show that there are medium correlations between the SELEX-Seq values and the scores of three models; however, the SNMM scores are better correlated with the SELEX-Seq values than the PWMSA and Match scores ( Figure 2C).
Among the 3660 sequences, we found that 114 sequences had the PC scores (Table S4 in File S1). We analyzed the correlation between the SELEX-Seq values of these 114 sequences and their SNMM, PWMSA, Match and PC scores ( Figure 2C), respectively. The results reveal that the PC scores are very strongly correlated with the SELEX-Seq values. The SNMM and Match scores are  Among the 114 sequences, we also found that 25 sequences had the EMSA values (Table S5 in File S1). We analyzed the correlations between the SELEX-Seq values of these sequences and their EMSA values and SNMM, PWMSA, Match and PC scores ( Figure 2C), respectively. The results demonstrate that the SELEX-Seq values are highly correlated with the EMSA values, suggesting that two wet experiments obtained similarly reliable results. The Match and PC scores are very strongly correlated with the SELEX-Seq values. The SNMM and PWMSA scores are strongly correlated with the SELEX-Seq values.
Confirming the SNMM predictions with NIRF-EMSA. To further evaluate the SNMM model, we selected four DNA sequences with various SELEX-Seq values and detected them with NIRF-EMSA (N-EMSA). The results are shown in Figure 3. The signal intensity of the shifted bands resulted from the DNA/p50 complex were quantified and normalized to the highest signal intensity (Figure 3). We then scored these sequences with the SNMM, PWMSA and Match models, respectively. We also collected the PC scores and the EMSA values of these sequences because they had been previously scored by the PC model and detected by the radioactive EMSA (R-EMSA) (Figure 4) [8]. Finally, we analyzed the correlations between the N-and R-EMSA values and the SNMM, PWMSA, Match and PC scores, respectively. The results reveal that the SELEX-Seq values are very strongly correlated with both the N-EMSA and R-EMSA values, demonstrating that the SELEX-Seq value can reflect the relative binding affinity of NF-kB to a sequence. The results also demonstrate that the SNMM scores are very strongly correlated with both the N-EMSA and R-EMSA values, suggesting that the SNMM score can be used as an indicator of the relative binding affinity of NF-kB to a 10-mer DNA sequence.
Determining the threshold of SNMM for predicting the NF-kB DBSs. To predict the NF-kB DBSs with the SNMM model, an optimal threshold should be determined. For this end, we first set up two groups of sequences. From 52 EMSA-detected DNA sequences (Table 2), we selected 30 sequences with the highest EMSA values as S1 group (Table S6 in File S1). These sequences were regarded as the DBSs of NF-kB because most of them (93.3%) had the Match score over 0.75 that is the common threshold of this model. We also selected 30 sequences from random 10-mer sequences that had the Match score of zero as S2 group (Table S6 in File S1). These sequences were regarded as the non-DBSs of NF-kB. By using the methods described in Methods and Materials, we identified 0.747 as the optimal threshold of SNMM for predicting the NF-kB DBSs.

Discussion
In the evaluations of the SNMM model with the experimental data ( Figure 2), it can be seen that the PC scores are more strongly correlated with the experimental data than the scores of other models. The reason lies in two aspects. First, the PC model was trained on the binding affinity data of the NF-kB p50 homodimer to 52 variant DNA sequences representing the consensus of GGRRNNYYCC [8]. When analyzing the correlations between the PC scores and the experimental data, we used the sequences with the PC scores, including 52, 41 and 114 sequences detected by EMSA, PBM and SELEX-Seq, respectively. Of course, these sequences belong to the consensus of GGRRNNYYCC. Second, the PC model was constructed as a mathematical model based on algorithm [8], which takes account of the correlation between nucleotides at different positions in its construction. However, PWMSA and Match are the PWM-based models. The PWMbased models are based on the additivity assumption, which assumes that the nucleotides in a transcription factor binding site are independent. The SNMM model also is a matrix-based model,  which takes no account of the correlation between the nucleotides at the different positions in its construction. Therefore, in the evaluations of the SNMM model with the sequences belonging to GGRRNNYYCC, the matrix-based models are not as good as the PC model. However, the PC model can only be used to score 256 sequences belonging to GGRRNNYYCC, but the SNMM model is enabled to score any sequence. For example, we scored 242,957 variant SELEX-Seq-selected 10-mer sequences in this study and found 3660 sequences with the SNMM score $0.747.
In this study, the PWMSA and Match scoring used a PWM that was constructed with 18 SELEX-selected sequences [21] (Table S1 in File S1). Like 52 sequences ( Table 1) that were used to train the PC model, these sequences also belong to the consensus of GGRRNNYYCC, which is identified as the canonical NF-kB DNA-binding motif [7][8][9]. However, most sequences used to construct the SNMM model do not belong to this consensus, such as AGGACTTTCC and GGGACTTTCA. Therefore, in the evaluations of the SNMM models with 52 EMSA-detected and 25 SELEX-Seq detected sequences belonging to the consensus of GGRRNNYYCC, the PWMSA and Match scores are more strongly correlated with the experimental data than the SNMM scores. However, in the evaluations of the SNMM models with 41 PBM-detected and 114 SELEX-Seq-detected sequences belonging to the consensus of GGRRNNYYCC, the SNMM scores showed the similar correlations with the experimental data as the PWMSA and Match scores. Furthermore, in the evaluations of the SNMM model with 3660 SELEX-Seq detected sequences, the SNMM scores are more strongly correlated with the experimental data than the PWMSA and Match scores. It should be noted that most of these 3660 sequences are not the canonical NF-kB DBSs belonging to the consensus of GGRRNNYYCC. These results reveal that the SNMM model is not only qualified to identify the canonical NF-kB DBSs as other models, but also more qualified to identify the noncanonical NF-kB DBSs than other models.
In this study, when analyzing the correlations of the SNMM, PWMSA and Match scores with the SELEX-Seq values of the 3660 sequences, we found that all correlations did not reach the strong level ( Figure 2C). This results from the complex constitution of the SELEX-Seq-selected sequences. SELEX-Seq is an unbiased in vitro selection technique that can be used to find any DNA binders of a transcription factor [28][29][30]. For example, Wong et al. studied the DNA-binding specificity of NF-kB with the technique and found that NF-kB could bind both canonical and noncanonical sequences [10]. However, the PWM used in the PWMSA and Match models was established only with the canonical sequences. Although the construct of the SNMM model used some noncanonical sequences, the reference sequence and most its mutated sequences used in SNMM also belong to the canonical sequences. This suggests that it is important to take the noncanonical sequences into account in establishing more accurate models for predicting the NF-kB DBSs. The binding affinity data obtained with the unbiased in vitro detection techniques such as SELEX-Seq [10] and universal PBM [19] would be important to this end.
This study demonstrated that SNMM provides a simple model for predicting the NF-kB DBSs. It is worthy to mention that several new studies also revealed that simple models based on mononucleotide PWM are effective in evaluating the DNAbinding specificities of TFs. For example, Jolma et al. systematically analyzed specificities of 303 human DNA-binding domains (DBDs), 84 mouse DBDs, and 151 human full-length TFs that represent 411 different TFs using a high-throughput SELEX (HT-SELEX), they found that the vast majority of interactions that occur between a TF and the individual DNA bases are independent of each other [31]. Weirauch et al. systematically evaluated 26 methods for modeling TF sequence specificity using the in vitro PBM data of 66 mouse TFs belonging to various families, the results indicated that the simple models based on mononucleotide PWM trained by the best methods performed similarly to more complex models for most of TFs examined [32]. Therefore, the approach that was used to establish the NF-kBspecific SNMM model in this study may be used to construct the similar SNMM models for other TFs.
In conclusion, we constructed a new simple model for predicting the NF-kB DBSs and verified its effectiveness with various resources of experimental data in this study.

Supporting Information
File S1 Supporting methods and Tables. Supporting methods. Construction of PWMSA. Table S1. The TRANSFAC NF-kB MA00051 PWM. Table S2. The binding affinities of NF-kB p50 homodimer to the sequence GGGACTTTCC and its all singlenucleotide mutants detected with the dsDNA microarray.

Author Contributions
Conceived and designed the experiments: JG JKW. Performed the experiments: WXD JG. Analyzed the data: WXD TTW. Contributed reagents/materials/analysis tools: WXD TTW JG. Wrote the paper: WXD JKW.