Prediction and Experimental Validation of Novel STAT3 Target Genes in Human Cancer Cells

The comprehensive identification of functional transcription factor binding sites (TFBSs) is an important step in understanding complex transcriptional regulatory networks. This study presents a motif-based comparative approach, STAT-Finder, for identifying functional DNA binding sites of STAT3 transcription factor. STAT-Finder combines STAT-Scanner, which was designed to predict functional STAT TFBSs with improved sensitivity, and a motif-based alignment to minimize false positive prediction rates. Using two reference sets containing promoter sequences of known STAT3 target genes, STAT-Finder identified functional STAT3 TFBSs with enhanced prediction efficiency and sensitivity relative to other conventional TFBS prediction tools. In addition, STAT-Finder identified novel STAT3 target genes among a group of genes that are over-expressed in human cancer cells. The binding of STAT3 to the predicted TFBSs was also experimentally confirmed through chromatin immunoprecipitation. Our proposed method provides a systematic approach to the prediction of functional TFBSs that can be applied to other TFs.


Introduction
The ability of any biological system to properly respond to stimuli heavily depends on biochemical cascades of signaling pathways that culminate in the activation of transcription factors (TFs) and the subsequent alteration of gene expression patterns [1]. Information about which genes need to be expressed in a specific cell type at any given time is believed to be encoded in the genome. The molecular machinery used to interpret such genetic information has evolved to ensure the accuracy and specificity of gene regulation. Transcription is a multi-step process requiring the concerted action of many proteins. Transcriptional activators and repressors bind in a sequence-specific manner to promoters or enhancers of target genes. They govern the recruitment of transactivators, chromatin modifiers, and general transcription factors, including RNA polymerase II, to regulate gene expression [2,3].
Whole genome approaches to measure genome-wide expression patterns have divulged groups of genes that are co-regulated to exert spatially and temporally controlled cellular responses [4]. Identifying the responsible regulatory modules that govern the coordinated actions of combinatorial transcription factors is crucial for understanding the regulatory circuits of biological processes [5]. For this purpose, computational tools have been developed to aid in the identification of transcription factor binding sites (TFBSs) in the promoters of the co-regulated genes [6,7,8]. These computational approaches can be divided into two classes: (1) pattern detection and (2) pattern matching. Pattern detection, also known as de novo motif discovery, finds putative binding sites for unknown TFs that are over-represented in the promoters of co-regulated genes. If the binding specificity of a TF is already known, pattern matching methods are preferred [9]. In the pattern matching approach, DNA sequence information of TFBSs is expressed as a position weight matrix (PWM), which can be used to score potential regulatory sites within a statistical framework [10]. However, because DNA binding sites for TFs are generally short and degenerate, this method is prone to high false positive prediction rates [11].
Based on the observation that conserved non-coding DNA sequences are often important for the regulation of biological functions, cross-species sequence comparisons have been actively integrated to distinguish functional and non-functional TFBSs [12,13,14]. The act of incorporating the evolutionarily conserved sequence information in the regulatory regions filters out the nonconserved TFBSs, thereby greatly reduce the false positive prediction rate [15,16,17,18,19]. Although this approach has been successfully applied to increase the predictive power of motif finding, it is highly sensitive to the algorithm used for sequence alignment and the accuracy of annotated transcriptional start site (TSS) information. Therefore, it has been reported that sequence-based promoter alignments often fail to detect short or degenerate regulatory elements, when evolutionary divergent promoter sequences are aligned [12,17]. To overcome these limitations, an alignment-free algorithm based on network-level conservation has also been suggested [20].
Signal transducer and activator of transcription 3 (STAT3) belongs to the STAT family of transcription factors, which is activated by Interleukin-6 (IL-6) and related cytokines, such as IL-10, Oncostatin M (OSM), and leukemia inhibitory factor (LIF) [21]. Thus far, seven mammalian STATs (1, 2, 3, 4, 5a, 5b, and 6) have been identified. They all possess a DNA binding domain, an SH2 domain for dimerization, and a C-terminal trans-activation domain [22]. Upon stimulation with extracellular ligand, activated STAT3 forms homodimers or heterodimers with another STAT family member, STAT1, then translocates into the nucleus and binds to cognate regulatory elements in the promoters of STATresponsive genes. Accumulating evidences suggest that STAT3 also associates with other transcription factors to form enhanceosome complexes in the promoter regions of target genes and controls cooperative gene induction [23,24,25]. STAT3 is involved in diverse cellular responses, including cellular differentiation, survival, stem cell renewal, wound healing and systemic inflammation; this has been proven by the phenotypes of genetically modified STAT3 mutant mice [22,26,27,28,29]. It has been found that STAT3 participates in carcinogenesis, and that the ectopic expression of a constitutively active form of STAT3 (STAT3-C) induces tumor formation in nude mice [30]. Furthermore, the expression of constitutively-active STAT3 has been observed in various types of human cancer including multiple myeloma, colon, ovary, liver, lung, head, and neck cancers [31]. While the regulatory and general trans-activation mechanisms of STAT3 have been thoroughly studied, not too much effort has been made towards the identification of direct target genes of STAT3. The identification of those target genes is crucial for mediating the diverse biological effects of STAT3 signaling.
To characterize STAT3-mediated transcriptional programs, we have developed a computational framework designed to predict STAT3 TFBSs with improved sensitivity and low false positive rate. Through the integration of the microarray data obtained from the STAT3 activation condition and the TFBS prediction tools, we attempted to identify novel STAT3 target genes. Using our STAT-Finder program, we identified eight novel STAT3 target genes among a group of genes that are highly expressed in cancer cells. These were then confirmed through chromatin immunoprecipitation.

Overview of STAT-Finder
To identify direct STAT3 target genes, we developed a computational framework that predicts functional TFBSs of STAT3 with increased sensitivity and low false positive rate. Our framework, STAT-Finder, was constructed based on two computational components, a TFBS scanning program (STAT-Scanner) and a motif-based alignment program ( Figure 1). STAT-Scanner was designed to increase the sensitivity for detecting functional STAT3 TFBSs. A currently available STAT3-specific PWM of TRANSFAC database [32], V$STAT3_01, frequently Figure 1. An overview of STAT-Finder. STAT-Finder has two components: The first module, STAT-Scanner, takes a set of six orthologous mammalian promoter sequences as input. Each promoter sequence is searched to mark putative TFBSs using the modified 8 STAT-related PWMs. Binding affinity scores of predicted TFBSs are calculated based on the P-values, and a sequence of affinity scores is generated for each promoter. The second module progressively aligns the score sequences and calculates posterior probability to evaluate the degree of motif conservation. doi:10.1371/journal.pone.0006911.g001 fails to detect experimentally proven STAT3 binding sites (data not shown). For improved predictive power, STAT-Scanner was therefore designed to use combined PWMs of binding specificity similar to STAT3. Although STAT family members have different physiologic functions and regulate distinct sets of target genes, the targets of individual STAT proteins sometimes overlap, and DNA sequences recognized by STAT family members are similar [21,22,23].
For unbiased identification of the PWMs that share sequence similarity with the STAT3-specific PWM, V$STAT3_01, a total of 565 PWMs derived from vertebrate TRANSFAC database [32] were clustered based on their motif similarity ( Figure S1). The motif similarity was defined as the P-value of the gapped alignment between the two PWMs based on the Kullback-Leibler divergence [33] (See Methods). Total numbers of PWM clusters increased with stringent P-value cut-off, reaching maximum cluster numbers of around 10 216 P-value ( Figure S1A). With the P-value cut-off of 10 27 , PWMs assigned for the STAT family members were found in the same cluster. It is noteworthy that PWM clustering did not reveal any non-STAT PWMs that were similar enough to include nor were there any STAT PWMs that were distinctly different ( Figure S1B). We chose among them eight PWMs from the STAT family members with high PWM quality scores (.0.6), where each quality score was calculated using the method proposed by Rahmann et al. [34]. The relevance of the selected PWMs for detecting known STAT3 TFBS has been evaluated in the previously identified STAT3 target genes [35] ( Figure S2).
To minimize false positive predictions, results from STAT-Scanner were then analyzed using the comparative motif-based alignment tool ( Figure 1). This method finds conserved binding sites within the orthologous promoters of six mammalian species by comparing multiple sequences. Within a probabilistic framework, STAT-Finder then evaluates the posterior probabilities of TFBSs as predicted by STAT-Scanner by assigning higher prior probabilities on conserved sites over non-conserved ones.

Validation of STAT-Scanner
We first compared the performance of STAT-Scanner with the most practical TFBS prediction tools, MATCH 2.7 [36] and MotifLocator [37]. For this purpose, we collected positive genes with experimentally proven STAT3 binding sites in their promoter regions through literature mining and TRED search (http://rulai. cshl.edu/ TRED) [38]. Resulting information on the 22 reference sequences are listed in Table S1. Genomic DNA sequences spanning from 2,000 bp up-stream to 500 bp down-stream of the annotated TSS of each gene were used as input promoter sequences. Prediction of the true positive TFBSs was then plotted as a function of the total predicted TFBS count for different cut-off values. As shown in Figure 2A, STAT-Scanner, which uses combined STAT3-related PWMs, outperforms MATCH and MotifLocator, both of which use the representative STAT3 PWM (V$STAT3_01). We believe the enhanced predictive power of STAT-Scanner was partly due to the usage of combined STAT3related PWMs, especially since the predictive power of MotifLocator also increased when combined PWMs were used ( Figure S3).
We also evaluated the performance of STAT-Scanner using genome-wide STAT3 binding data obtained using embryonic stem cells [39]. Among the 461 genes with STAT3 binding peaks in the 2.5 kb promoter regions, 412 have been accurately predicted by STAT-Scanner to have at least one STAT3 TFBS ( Figure 2B). The overall performance of STAT-Scanner was better than those of both MATCH and MotifLocator, as the detection of the same number of true binding sites was achieved by both with significantly lower total numbers of predicted sites. Although MATCH and MotifLocator performed similarly to STAT-Scanner in detecting about 50% of true STAT3 TFBSs, the latter outperforms both by accurately predicting the remaining true sites. We believe this is partly due to the usage of combined STATrelated PWMs which has the capability to enhance the performance of MotifLocator, albeit less than the enhancement for STAT-Scanner, with combined data derived from multiple PWMs ( Figure S4). The relative performance of both methods is  (Table S1) and (B) in the genome-wide STAT3 ChIP-Seq dataset [39]. doi:10.1371/journal.pone.0006911.g002 low compared to that of STAT-Scanner; this can be explained by the fact that their scores on the predicted sites are not directly comparable among different PWMs, thus showing the importance of our scoring scheme in integrating matches to different PWMs. These results also indicate that overlapping PWMs with similar binding specificity are critical to the development of improved strategies to detect functional TFBSs of STAT3 with high predictive accuracy.

Features of the functional STAT3 TFBS
The ultimate goal of computational prediction is to detect functional TFBSs with a high degree of confidence. To filter out the false positive TFBSs with high affinity scores, we examined various functional constraints such as evolutionary conservation and genome structure of predicted STAT3 TFBS regions. Sequence conservation among multiple species has been proven to constrain functional TFBS [16,17,40]. Therefore, we first evaluated the distribution of multispecies conservation scores (PhastCons score) [41] and regulatory potentials (RegPotential score) [42] for positions in the functional and non-functional STAT3 TFBSs detected by STAT-Scanner using the reference set of 22 genes (Table S1). For convenience, we considered a TFBS functional if it was supported by experimental STAT3 binding data; otherwise, the TFBS was considered non-functional. The distribution of PhastCons scores for the non-functional STAT3 TFBSs were skewed towards zero, while PhastCons scores for about 50% of the functional STAT3 TFBS exceeded 0.1 ( Figure 3A). In contrast, the distribution of RegPotential scores, which measure the similarity of patterns to those in the known regulatory elements, was similar for positions of the functional and non-functional STAT3 TFBSs ( Figure 3B). Next, we investigated the methylation-resistant CpG island features of the STAT3 TFBS-containing regions. Over-representation of the binding sequences for specific transcription factors, such as zinc-finger proteins, in CpG islands has been previously reported [43]. Most of the predicted STAT3 TFBSs are located inside CpG islands [44], but the genomic distribution is not significantly altered among the functional and non-functional STAT3 TFBSs ( Figure 3C). Repeat elements [45] in the genomic sequence might compromise the functions of transcription factors, as none of the functional STAT3 TFBSs have been identified inside the repeated regions ( Figure 3D). In summary, motif conservation, a major constraint that distinguishes between functional and non-functional STAT3 TFBSs, has therefore been included in STAT-Finder.

Validation of STAT-Finder
We next evaluated the performance of STAT-Finder compared to other comparative methods, namely, EEL [46] and CONREAL [12]. Given that EEL performs pair-wise alignment based on the matches to a single PWM, we compared the performance of EEL using each PWM (V$STAT3_01 and V$STAT1_01) separately. Meanwhile, the performance of CONREAL was examined by combining both PWMs. We tested the prediction accuracy of STAT-Finder in the two positive data sets with STAT3 bindings. STAT-Finder exhibited better performance compared to EEL using V$STAT3_01, EEL using V$STAT1_01, or compared to CONREAL in predicting true STAT3 TFBSs in the 22 previously identified positive genes ( Figure 4A). Note that both EEL and CONREAL failed to detect about 40-60% of true positive STAT3 sites even at the minimum cut-off value, while STAT-Finder found all of these. These data indicate that STAT-Finder showed better performance in terms of finding true positive STAT3 TFBSs that the other comparative programs missed. It was made more evident when we searched STAT3 TFBSs using EEL or CONREAL in the data sets with genome-wide STAT3 binding. Although the overall performance of the STAT-Finder was similar to EEL in detecting 56% of true STAT3 TFBSs, only STAT-Finder was capable of detecting the remaining 30% of the true sites ( Figure 4B). Our data suggest that the improved sensitivity of STAT-Finder could be attributed to the usage of combined STAT-related PWMs, which evidently overcame the performance limitations of V$STAT3_01.
We next attempted genome-wide prediction of STAT3 binding in the human promoter regions. For this purpose, we first estimated the cut-off value of the motif conservation score (MCS) to identify conserved functional STAT3 TFBSs. The degree of conservation of the predicted TFBS, which was determined by calculating MCS, was integrated with the affinity scores by STAT-Scanner (See Methods). The confidence score at each MCS was evaluated using the 2.5 kb promoter sequences of all annotated human genes and orthologous mouse genes. The confidence score determines the probability that a given TFBS is not conserved by chance. As cut-off values of MCS increased, the total number of predicted STAT3 TFBSs decreased at a slower rate than the average number of aligned instances of control motifs, resulting in escalated confidence scores at MCS values higher than 0.9 (Figure S5). Using STAT-Finder, we performed a genome-wide search for STAT3 TFBSs in the human promoter regions. Among the 15461 human genes with identified orthologs in the mouse, about 7600 genes were predicted to have putative STAT3 binding sites within the 2.5 kb promoter region, at the probability threshold of 0.9. Significant enrichment of STAT3 TFBSs could be predicted at the proximal upstream regions of TSS using STAT-Scanner and STAT-Finder [35,39] (Figure S6).

Identification of novel STAT3 target genes in the cancer cells
Constitutive activation of STAT3 and over-expression of its target gene have been suggested to play critical roles in human carcinogenesis [12,31,47,48,49,50]. To determine whether or not STAT-Finder is useful in identifying novel STAT3 target genes, we applied this program to a group of genes that are overexpressed in human cancer cells. We integrated microarray data obtained from the expression module map of genes up-regulated in cancer [51] and data derived from the A549 cells overexpressing a constitutively active form of STAT3 [52].
Among the 33 genes that are commonly up-regulated, eleven have already been reported to be regulated by STAT3 (Table 1). Using this group of genes, we examined whether or not STAT-Finder could detect experimentally proven STAT3 TFBSs. It is noteworthy that we were able to analyze only a fraction of the promoter sequences, mainly due to alternative promoter usage and the poorly annotated TSS information available. STAT-Finder detected three putative STAT3 binding sites in the JUNB promoter region including one site that has previously been reported to be a STAT3 binding site [53] ( Figure 5A). Using three different cell lines derived from human cancer patients, we confirmed STAT3 binding to the JUNB promoter by chromatin immunoprecipitation ( Figure 5B). STAT-Finder also successfully detected one STAT3 TFBS in the Nicotinamide N-methyltransferase (NNMT) promoter region, a recently identified STAT3 target gene [54] (Figure 5C, D). Interestingly, STAT-Finder was unable to detect known STAT3 TFBS in the MYC promoter region ( Figure 5E), even though MYC has been reported to be a STAT3 target [55]. It has also been reported that STAT3 binding to the promoter region of the MYC gene requires a site that is different from the consensus STAT3 binding sequences, but is similar to E2F TFBS, indicating that, in this case, STAT3 binding depends on the presence of other transcription factors [55]. Using primer sets that detect known STAT3 binding sites in the MYC promoter, we were able to confirm its binding upon IL-6 stimulation in HepG2 cells ( Figure 5F). These results suggest that STAT-Finder could efficiently detect binding sites for STAT3 only if their binding does not depend on the presence of other cis or trans factors. We next examined whether or not we can identify novel target genes of STAT3 using STAT-Finder. For this purpose, we selected genes with conserved TSS (Table 1) and determined the presence of putative STAT3 TFBSs using STAT-Finder in their promoter regions. STAT-Finder successfully detected putative STAT3 TFBSs with high probabilities in the promoter regions of AKAP12 (A-kinase anchoring protein 12), HIC2 (hyper-methylated in cancer 2), and THBS1 (Thrombospondin 1). STAT3 binding to these predicted sites was experimentally confirmed by ChIP assay (Figure 6A-F). To verify the specificity of STAT-Finder, we also assayed the binding of STAT3 to the sites that were not conserved, but were present in the promoters of human orthologous genes. In contrast to the conserved STAT3 TFBSs, we could not detect STAT3 binding to the non-conserved STAT3 TFBSs in human cancer cell lines ( Figure 6G). STAT3 binding to other predicted STAT3 TFBSs present in the promoter regions of ATF3 (activating transcription factor 3), DUSP5 (dual specificity phosphatase 5), SERPINE1 (serpin peptidase inhibitor, class E), NP (nucleoside phosphorylase), and SLC2A3 (solute carrier family 2, facilitated glucose transporter, member 3) were also experimentally validated ( Figure S7). Finally, we studied whether or not other computation tools such as EEL or CONREAL could also accurately detect STAT3 target sites that have been identified and validated in this study. Of 10 promoter sequences containing experimentally proven 10 STAT3 binding sites ( Figure 5, 6 and  (Table S1) and (B) in the genome-wide STAT3 ChIP-Seq dataset [39]. doi:10.1371/journal.pone.0006911.g004 S7), STAT-Finder predicted a total of 29 STAT3 binding sites including all of the 10 experimentally validated STAT3 binding sites. Meanwhile, EEL and CONREAL detected only 5 (50%) and 2 (20%) validated STAT3 binding sites among 23 and 6 total predictions, respectively, thereby indicating that STAT-Finder has better performance in terms of identifying novel target genes of STAT3 ( Figure S8).

Discussion
We presented a computational framework for identifying functional STAT3 TFBSs in mammalian promoters. The first compartment, STAT-Scanner, was designed to predict functional STAT3 TFBSs with improved sensitivity. By using comparative motif-based alignments, STAT-Scanner was linked to STAT-Finder to minimize false positive predictions. Our proposed method was tested using previously identified STAT3 target genes and was successfully applied to the identification of novel target genes.
Our strategy in developing STAT-Finder relied on several assumptions. First, the DNA binding specificity of STAT3 is shared by other STAT family members. STAT transcription factors bind to similar DNA sequences, and the similar DNA binding specificity of various STAT transcription factors, such as STAT1, STAT5A/5B, or STAT6, have been experimentally proven [56]. It has also been noted that integration of the overlapping matches detected by matrices from the same family members greatly reduces the number of total predicted TFBSs, and hence decreases the rate of false positive detection [57]. Furthermore, it has been recently reported that roughly half of TFs recognize multiple sequence motifs [58]. Therefore, a conventional motif scanning approach using a single PWM for each TF has an intrinsic limitation in detecting all functional TFBSs. As a result, the predictive power of STAT-Scanner was significantly enhanced by integrating STAT-related PWMs. The second assumption, used in the motif-based alignments, is that the relative locations of functional TFBSs are conserved among closely related mammalian species. In yeast, highly conserved TFBSs for a set of TFs exhibit relatively low spatial deviations (,150-200 bp) [20]. Likewise, we found that, for six mammalian species, known STAT3 TFBSs are located within a similar spatial distribution on each promoter.  Using STAT-Finder, we have identified a list of STAT3 target genes that are over-expressed in human cancer cells. Likewise, STAT3 binding to the predicted TFBSs has been experimentally verified in IL-6 stimulated human cancer cell lines. Interestingly, STAT3 was recruited to the predicted TFBS in a cell type-specific manner. For example, STAT3 binding to the predicted TFBSs in the promoter regions of the AKAP12 and HIC2 genes was observed in un-stimulated but not in IL-6 stimulated A549 and MDA-MB-231 cells. However, in the HepG2 cells, STAT3 was recruited to the same TFBS only after IL-6 stimulation ( Figure 6). In contrast, STAT3 binding to the promoter regions of MYC, SERPINE1, NP, and SLC2A3 was only detectable in IL-6 stimulated HepG2 cells, but not in A549 or MDA-MB-231 cells (Figure 6, Figure S7). Furthermore, it is evident that STAT3 binding to the predicted TFBSs in the promoters of the candidate target genes does not guarantee the expression of that gene. Although the expression of most of the target genes had been altered upon STAT3 binding to the promoter, we found that STAT3 binding to target sites did not always correlate with gene expression in the cell lines tested (Oh, YM, unpublished data). This suggests that STAT3 binding to target sites is not sufficient in inducing gene expression, and tissue-specific transcription factors, or trans-activators that specifying modification in the chromatin region may also be required [59,60,61,62].
A cis-regulatory module comprises a cluster of multiple TFBSs that cooperatively-interact with TFs to control gene expression. The identification of cis-regulatory modules for specific gene regulation is a challenging step towards understanding genome-wide transcription regulatory networks in mammalian genomes. Therefore, it is necessary to efficiently predict functional TFBSs for individual TFs. We expect that our comparative approach can be applied to other TFs with some restrictions. First, the efficiency of our program depends on the degree of evolutionary conservation among the six mammalian species. Therefore, DNA binding sites for TFs engaged in species-specific gene regulation may not be predicted. It is noteworthy that the frequent gain or loss of TFBSs in the intergenic regions leads to the evolution of transcriptional circuits [63]. Second, our program may not be applied to TFs that rely on other DNA binding proteins for recruitment into DNA. Third, because we only compared 2 kb of upstream promoter sequence relative to the annotated TSS, DNA binding sites of TFs that are enriched in regions distal to the TSS might be overlooked by our program. Although cis-regulatory regions that lie .100 kb away from the TSS exist, it has been suggested that most functional TFBSs are highly enriched in regions proximal to the TSS [40,64]. Another limitation is the amount and quality of annotated TSS information obtained from diverse mammalian species. With the exception of those from humans and mice, annotated TSS information for most of the mammalian genomes is not available, and correct TSS information is crucial for the identification of evolutionarily conserved TFBSs based on motif-based alignments. Obtaining accurate and reliable prediction of functional TFBSs in the promoter region is a critical step in deciphering the regulatory code of the complex transcription regulatory networks that govern diverse biological responses. Given that our proposed method is based on a multiple-motif model, we believe it can be applied to other TFs, with some modifications, and may serve as a basic tool to discover important cis-regulatory features.

Clustering of STAT3-related PWMs
We used dynamic programming to find the optimal gapped alignment between two PWMs. We denote by H 1 [ R W1|4 , H 2 [ R W2|4 the two PWMs of length W 1 ,W 2 over S~A,C,G,T f g to align, where H T k,w represents row w, each entry is non-negative and P 4 l~1 H k,wl~1 . The optimal pair-wise alignment can be found by the following steps. We first construct a matrix F[R (W1z1)|(W2z1) whose pz1, qz1 ð Þ -element F pz1, qz1 ð Þis the score of the optimal alignment between the sub-matrices H T 1,1:p and H T 2,1:q . Initially, we set F 1,1 ð Þ~0, F pz1,1 ð Þpd and F 1,qz1 ð Þqd for all p,q, where d is a gap penalty. We then build up the matrix F using the following recurrence: where s H T 1,p ,H T 2,q is the match score which is defined by the Kullback-Leibler divergence To define the similarity between the two PWMs, we assessed the optimal alignment by calculating the P-value. The P-value of the observed alignment score was calculated by estimating the score distribution of 1000 randomly-permuted PWMs via the Gaussian distribution. In this study, the gap penalty d was set to 0.5.

Prediction of putative STAT TFBSs: STAT-Scanner
We searched putative TFBSs of the STAT family in input promoter sequences and evaluated their binding affinity scores using STAT-Scanner. Given a set of position count matrices obtained from TRANSFAC 9.4 [32], we reconstructed STAT-related PWMs to compute the P-values of the match scores using a method for calculating the exact distribution of scores [65,66]. Briefly, our method consists of three steps. First, we transformed each position count matrix into the corresponding position frequency matrix (PFM) by adding position-dependent pseudo-counts [34]. This position specific regularization leaves the conserved positions of the matrix relatively unchanged. Second, from the regularized PFM, we reconstruct a position weight matrix (PWM) whose element is the log-odds score between the PFM and background model, defined by a zero-order Markov chain. Then, the match score is defined by the sum of the log-odds scores. To account for the effect of the uneven distribution of ''GC'' and ''AT'' content, we used six different background models that were constructed based on clusters of nucleotide compositional vectors of the whole mouse promoter sequences available from Ensembl [67]. For clustering, we used the k-means clustering algorithm. Finally, to determine statistically significant TFBSs, we calculated the exact distributions of the match scores under the background model assumption. From the distributions, we calculated the type-I sequence error probability a n (s), which measures the probability that at least one site within a sequence of length n (n = 500 as proposed by [34]) has a match score larger than or equal to s, under the assumption that the sequence is generated from the background model. We then converted the match score, s, into the affinity score, t, defined by 1-a n (s). This conversion makes it easy to define a threshold, c, of the affinity score within a statistical hypothesis testing framework. In addition, it is also plausible to directly compare the affinity scores of different STAT-related PWMs.
Given a set of STAT-related PWMs and an input promoter sequence, STAT-Scanner first computes the nucleotide composition of the input sequence in order to select the nearest background model, and searches TFBSs whose affinity scores are larger than the threshold, with the PWMs constructed by the chosen background model. We used eight STAT-related PWMs, V$STAT_01, V$STAT1_01, V$STAT3_01, V$STAT5A_01, V$STAT5B_01, V$STAT5A_02, V$STAT1_02, and V$STAT_Q6, and combined all overlapping sites of the eight PWMs into one with the maximum affinity score. Notably, the max operator is applicable because the affinity scores of different PWMs are directly comparable.

Prediction of the conserved STAT TFBSs: STAT-Finder
STAT-Finder was designed to minimize the false positive discovery of predicted TFBSs using comparative sequence comparisons. It searches conserved sites within the promoters of six orthologous species by sequences of affinity scores. To diminish probabilities of misalignments, we used score sequences defined by STAT-Scanner as a first approximation of the conserved regulatory regions. We regarded a region with nonzero affinity scores within the score sequence as the regulatory region. We focus on the regulatory regions of multiple alignments by ignoring the non-conserved regions. We progressively aligned the six score sequences obtained from orthologous promoter sequences, according to the phylogenetic tree of all six mammalian species, and evaluated the degree of conservation by calculating the Motif Conservation Score (MCS). The MCS value of an aligned TFBS ranges from 0 (non-conserved) to 1 (most conserved).
Motif-based alignment tool consists of two main parts to align multiple sequences with different affinity scores of STAT TFBSs. The first part is a pair-wise global alignment module that finds an optimal alignment between two sequences. We adapted a variant version of the Needleman-Wunsch algorithm [68], with a modification in the scoring function for the match between two affinity scores. The second part is a progressive alignment module that determines a multiple motif alignment among orthologous promoter sequences derived from six mammalian species. The basic concept of this approach is to sequentially perform pair-wise alignments between two sequences of affinity scores, between a sequence and a profile, or between two profiles, according to the phylogenetic tree of the six species. The profile is a set of aligned sequences with gaps. This progressive alignment efficiently aligns multiple sequences with reasonable accuracy. In the given multiple motif alignment, we computed the motif conservation score (MCS) at each aligned position by taking the average of the aligned affinity scores. The affinity score of the gap was set to zero and the ''N'' character was not considered when calculating the average.
STAT-Finder has a unique feature that detects not only conserved binding sites but also non-conserved ones with very strong binding signals to rectify the unavoidable alignment error. Before describing our probabilistic model, we first explain our notations to define our model. Among the 8 STAT3-related PWMs, we denote by H k [ R W |4 the k th PWM of length W over S, where H T k,w represents row w, each entry is non-negative and P 4 l~1 H k,wl~1 . The background model h 0 [ R 4 , which describes frequencies over the alphabet within non-binding sites, is defined by a zero-order Markov chain. We assume the background model is known in advance and estimated from the whole mouse promoter sequences.
Suppose we have a promoter sequence S i which is a string of length L over the alphabet S. In order to allow for multiple binding sites per sequence, we represent the sequence S i as a set of overlapping subsequences S W ij~S ij ,S i jz1 ,0 ½ T . Our probabilistic model has the following specification for defining the joint distribution. First, latent variables Z k ij indicating the starting positions of binding sites of the k th PWM are governed by the probability p~p 1 ,p 2 ½ T such that p 1 ,p 2 §0 and p 1 zp 2~1 . The prior probability of Z k ij is specified by For each latent position Z k ij , the probability distribution of the subsequence S W ij is given by is an indicator function which is 1 if S i jzw{1 ð Þ l, and 0 otherwise. The objective of probabilistic inference in our model is to calculate the posterior probability P Z k ij jS W ij ,H k ,h 0 because this probability evaluates the degree of being a true binding site of the subsequence using our prior knowledge and given data. The posterior probability can be obtained by using Bayes' theorem The degree of conservation of each subsequence can be easily incorporated into our probabilistic framework by assigning relatively higher prior probability p on Z k ij than non-conserved one. In this work, we used the following settings: p 2~0 :0002 for non-conserved subsequence (the expected number of binding sites is 1 when the promoter sequence is 2500 (double stranded)) and p 2~0 :01|MCS. The eight different posterior probabilities of the eight STAT3-related PWMs across the latent positions are integrated by taking the maximum value and the probability cutoff value was set to 0.5.

Motif-based pair-wise alignment
We denote two promoter sequences of lengths m i and m j by t j~t j 1 t j 2 Á Á Á t j mj and t j~t j 1 t j 2 Á Á Á t j mj , where t i k and t j l are the affinity scores of STAT3 TFBSs. If the score is smaller than a threshold c, we cut the score to 0. We also set the score to 21 if the corresponding site contains the ambiguous ''N'' characters. With this setting, the optimal pair-wise alignment between t i and t j can be found by dynamic programming. We first construct a matrix ð Þ whose kz1,lz1 ð Þ -element F kz1,lz1 ð Þis the score of the optimal alignment between the segments t i 1 Á Á Á t i k and t j 1 Á Á Á t j l . For initialization, we set F 1,1 ð Þ~0,F kz1,1 ð Þkd and F 1,lz1 ð Þ~ld for all kandl, where d is a gap penalty. We then build up the matrix F using the following recurrence: where s t i k ,t j l is the sequence match score between t i k and t j l which is defined by We construct the optimal alignment by tracing back the choices that result in the final value of F m i z1,m j z1 À Á .

Motif-based profile alignment
We denote two profiles of lengths m and n by p~p 1 p 2 Á Á Á p m and q~q 1 q 2 Á Á Á q n , where p k and q l are the aligned affinity scores of STAT3 TFBSs constructed from disjoint promoter sequences indexed by I and J, respectively. The pair-wise profile alignment can be also found by dynamic programming. The profile match score s p is defined by the average of the sequence match score:

Estimation of MCS confidence score
To generate randomly shuffled control motifs, we manually aligned STAT-related PWMs without gaps by looking up the core regions (TTCCNGGAA). We excluded V$STAT5A_02 (homotetramer) because it was not aligned with other PWMs. The operation for random permutation was then applied to the aligned PWMs to generate 100 control motifs. Based on the assumption that the control motifs should have occurrence rates similar to the real motif, we selected 42 control motifs that detect similar numbers of TFBSs in the reference data set (615%). Among them, we chose 10 motifs that were most dissimilar to V$STAT_01, based on the inter-motif distance measure (cut-off: 0.25) [69]. The confidence level at each MCS was then calculated using the following equation: (the total number of TFBSs of the real motifthe average number of TFBSs of the control motifs)/the total number of TFBSs of the real motif. This value represents the fraction of the number of conserved TFBSs above the ones that occurred by chance.

Retrieving information for promoter sequence
Human and mouse promoter sequences (22000, +500 bp of the annotated transcription start sites) were downloaded from Table Browser of the UCSC genome browser [70]. We used hg18 and mm9 for human and mouse genome UCSC version, respectively. Orthologous promoter sequences of chimpanzee, orangutan, and rhesus were obtained by blatting the 2.5-kb human promoter sequences into the UCSC genome browser of each species [70]. Rat promoter sequences were obtained by blatting the 2.5-kb mouse promoter sequences into the UCSC genome browser of rat. For each 2.5-kb promoter sequence, PhastCons scores, Regulatory potential scores, CpG island, and regions for repeated elements were also obtained through Table  Browser of the UCSC genome browser.
Genome-wide STAT3 ChIP-Seq data set was obtained from [39]. In this data set, we first selected 461 genes with STAT3 binding peaks located in 2.5kb promoter regions, among which 412 genes have at least one site predicted by STAT-Scanner (cutoff: 0.2) that is overlapped with experimentally identified regions (within 150 bp of STAT3 binding peaks). We next defined true positive sites as those that are overlapped with the STAT3 binding regions and that match to the highest scoring site predicted by STAT-Scanner, as suggested by [71].

Microarray data analysis
Cancer module map information was downloaded from the web browser (http://ai.stanford.edu/,erans/cancer/). We used Cancer Module 3,17,18,54,57,98,124,126,and 197, which contain commonly up-regulated genes across liver cancer, B lymphoma, grade3 breast cancer, and stimulated macrophages. Microarray CEL data files of STAT3-C over-expressed cells were obtained from Dr. E.B. Haura (University of South Florida, Tampa) [52]. Microarray data was analyzed using the SBEAMS program [72]. The data set was normalized by the global quantile scaling method (GC-RMA) and filtered to include differentially expressed genes with more than two fold change, with FDR ,0.1 and P-value ,0.01 (t-test).

Chromatin Immunoprecipitation (ChIP)
ChIP assays were performed as described with minor changes [64]. Cells were fixed in 1% formaldehyde for 15 min, harvested in buffer A (0.25% Triton X-100, 10 mM EDTA, 0.5 mM EGTA, 10 mM HEPES [pH 6.5]), and then resuspended in buffer B (200 mM NaCl, 1 mM EDTA, 0.5 mM EGTA, 10 mM HEPES [pH 6.5]). Cells were then lysed in lysis buffer (1% SDS, 10 mM EDTA, 50 mM Tris-HCl [pH 8.1] with proteinase inhibitors). Chromatin sonication was performed three times for 40 s at setting 5.0 using a Branson 250 sonicator with a microtip. Fragmented chromatin was immunoprecipitated with STAT3 antibodies (SC-482, SC-483; Santacruz Technology, CA, and S21320; Transduction Laboratories, Lexington, KY) for 4 hrs. After reversal of the cross-links and DNA precipitation, enriched DNA was analyzed by PCR amplification with primers that flank the predicted STAT3 TFBSs (Table S2). Figure S1 PWM similarity clustering. (A) Total 565 vertebrate TRANSFAC PWMs were clustered by pair-wise similarity comparison with the Kullback-Leibler divergence. The number of PWM clusters at different similarity P-value cut-offs is plotted. (B) PWM cluster at 10-7 P-value of similarity was represented by Cytoscape [73]