Modelling of the breadth of expression from promoter architectures identifies pro-housekeeping transcription factors

Understanding how regulatory elements control mammalian gene expression is a challenge of post-genomic era. We previously reported that size of proximal promoter architecture predicted the breadth of expression (fraction of tissues in which a gene is expressed). Herein, the contributions of individual transcription factors (TFs) were quantified. Several technologies of statistical modelling were utilized and compared: tree models, generalized linear models (GLMs, without and with regularization), Bayesian GLMs and random forest. Both linear and non-linear modelling strategies were explored. Encouragingly, different models led to similar statistical conclusions and biological interpretations. The majority of ENCODE TFs correlated positively with housekeeping expression, a minority correlated negatively. Thus, housekeeping expression can be understood as a cumulative effect of many types of TF binding sites. This is accompanied by the exclusion of fewer types of binding sites for TFs which are repressors, or support cell lineage commitment or temporarily inducible or spatially-restricted expression.


Introduction
After integrating FANTOM5 (F5) and ENCODE datasets, Hurst et al. [1] demonstrated that a simple metric of the architecture of the proximal promoter, namely its size (that is the cardinality of the set of interacting TFs), was strongly correlated with the breadth of expression (BoE). Statistical analyses, and the knowledge about the molecular mechanism of the activation of transcription [2], suggested that this correlation was causative. This is in contrast to associations with various metrics of gene expression described previously for chromatin features such as histone acetylation or histone H3K4/36/79 methylation [3], or the DNase I signal [4], or gene compactness/codon usage [5]. The latter associations may still be useful for prediction, but they most likely cannot explain the molecular mechanism of the initial activation of transcription (they are effects, not causes, although chromatin modifications may help in the propagation of the activated transcriptional state).
We need to start with a few definitions. BoE means the fraction of tissue-and cell-types in which a gene is expressed. It is arguably the most fundamental metric of gene expression in a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 vectors are abstract measures that define a hyper-plane in much higher dimensions than we are used to perceiving. These vectors are not directly interpretable as correlation coefficients are.) The aim of the current study is to fill this knowledge gap, presenting a more quantitative picture of the regulation of BoE. For this purpose, we intentionally concentrate on explanatory rather than predictive modelling. However, we do verify our models by cross-validation as it is suggested in statistical literature [12].
The distinction between predictive and explanatory statistical modelling was underlined by Galit Shmueli in an influential review [13]. The ultimate goal of explanatory modelling is to infer causal relationships and to create new useful theoretical generalizations. (However, the descriptive modelling of associations between variables without assuming causality can be also useful in most cases.) Non-linear methods such as SVMs are ill-suited for this type of statistical modelling. Instead, one ought to use generalized linear models (GLMs), which are a robust generalization of multiple regression [14]. In the GLM models presented here, the coefficients of multiple regression should be interpreted as the effect of the given TF on BoE over and above all other TFs included in the model formula.
The reader is sure to be aware that there are two major approaches to statistical modelling: classical and Bayesian. In the classical approach, parameters are point estimates, usually calculated according to the method of maximum likelihood. The data are seen as a random variable. In the Bayesian approach, this situation is reversed. Parameters are modelled as random variables with assumed prior distributions that are updated given observed data according to the Bayes rule. By using both the classical approach and the Bayesian approach, we ensure that the findings reported here are robust in respect to the statistical methodology used.
Herein, we apply several technologies of explanatory statistical modelling to the problem of modelling BoE from promoter architectures. (See the analysis workflow in Fig 1). We started with tree models for the initial description of the dataset [15]. Next, the core R implementation of GLMs was compared with regularization-based GLMs. The regularization methods included lasso, ridge and the elastic net. A Bayesian GLM was also constructed to provide an alternative to the above classical methods. Complementary non-linear-response models, were also constructed where k-means clustering was followed by multinomial regression to refine the response variable. Finally, TF coefficients calculated by different models were compared. In each case, the architectures of proximal promoters were the inputs for the models. BoE was the response variable which we sought to model and explain.

How was BoE defined?
F5 expression data were in the format of tag counts from the technology of cap analysis of gene expression-CAGE [7]. To ensure comparability across different experiments, CAGE counts were normalized by the consortium to tags per million (TPM). BoE was defined as the fraction of tissues in which the transcript was expressed (that is present at more than 10 TPM, an accepted standard of the F5 consortium-alternative cut-offs were examined previously). The distribution of BoE is shown in Fig 2A. Also as reported previously [1], most transcripts were either narrowly (BoE < 0.33) or broadly (BoE > 0.66) expressed, with 65% and 23% of all transcripts in these categories, respectively. Transcripts with intermediate expression were underrepresented (only 12%). Furthermore, the distribution of BoE was characterized by the minimum = 0 (n = 2,296), median = 0.11, mean = 0.3, maximum = 1 (n = 7), and the variance of 0.13.

RefSeq transcripts were used as the common key for data integration
As before, the data on promoter architectures were integrated with BoE using RefSeq transcripts (S1 Dataset) as the common key for data merging [1]. After merging, there were 30,793 human RefSeq transcripts, and corresponding promoter architectures, for which F5 expression data were available. However, while we intentionally wanted to focus on In the first step, F5's expression data were integrated with promoter architectures calculated from merged ENCODE ChIP-seq screens. Expression data were used to calculate BoE (BoE), that is the fraction of samples in which a transcript is expressed. In the second step, the integrated datasets were used to prepare model matrices and formulae. In the third step, tree models and GLMs were constructed and interpreted. correlation between BoE and promoter size was the (Spearman's rho = 0.68, P-value < 2e-16 for all data points, and rho = 0.52, P-value < 2e-16 after disregarding empty architectures). Moreover, just one size of the promoter window was chosen for our analysis. This was one kilobase window with symmetrical boundaries located ±500 bps from the TSS. We reasoned that the robustness of the trend has already been proven beyond any doubt. For this reason the inclusion of additional analysis variants and the associated detail would only obscure the main ideas, make the presentation difficult, and decrease readability.
The motivation for the choice of ENCODE TF quality score (QS) cut-off ENCODE TFBSes have a QS assigned to them which varied from zero through 1000. The consortium processed all separate ChIP-seq datasets through a unified bioinformatics pipeline [16] which assigned the score. The score reflected the confidence the consortium had in predicting a given site. In practice, high QS scores corresponded to a strong binding site in any single ENCODE sample and low scores corresponded to weak sites across ENCODE samples. Herein, we concentrated on strong ENCODE TFBSes, rather than on more numerous weak / low-occupancy binding sites whose functional significance is debatable [8,17,18].
TFs could also correlate with each other, especially if they belonged to the same gene family (Table 1, S1 Table). Statistical significance of the correlations was established using an asymptotic approximation to the exact test, as well as a permutation procedure (permutation tests are more appropriate to genomic data which are not a sample from a population). The random genomic expectation was obtained by the permutation of the sites transcriptome-wide a sufficient number of times (in this case, n = 10). Each time, Spearman's rhos were recalculated for the entire permuted set. This random expectation was used to define significance thresholds corresponding to critical values in a two-sided hypothesis test at 5% significance level: -0.0104 and 0.0124. A plot visualizing the random expectation, the observed distribution of correlations in the ENCODE dataset, and the determined thresholds is given in S2 Fig: "Random expectation and the observed distribution of Spearman's correlation coefficients".

Multicollinearity
Highly correlated pairs of TFs could be a problem for the modelling strategy adopted here as multiple regression produces correlation coefficients after controlling for all the other covariates. This is otherwise known as the multicollinearity problem. To control the problem, the variance inflation factor (VIF) was calculated for all TFs. The recommendation in the literature is to remove variables with VIF higher than ten [19]. The TFs with the VIF higher than five were as follows: octamer TF 2 − Oct-2 (VIF = 95.4), POU class 2 homeobox 2 − POU2F2 (95.2), BDP1 (19.5), RPC155 (16.1), BRF1 (10.5), USF1 (5.7), Pol III (5.6), and YY1 (5.4). These inflated VIFs were due to the highly correlated TF pairs. As a remedy, the following variables were removed from the P matrix: Oct-2, BDP1, upstream TF 1 − USF1 (the sc8983 antibody), RPC155, and YY1 (the c20 antibody). After this selection, there was only one variable with VIF > 5 (namely BRF1 with VIF = 5.2).

Modelling philosophy
In an influential review, Galit Shmueli [13] stressed that the strategy of explanatory modelling must be suggested by prior theoretical knowledge and scientific insights about the natural phenomenon under investigation. What informed the modelling process in this work was the theoretical knowledge of the biochemical steps through which TFs bind DNA to open the chromatin, attract RNA polymerases, and activate transcription. One of the key decisions was that Pol II sites were not included as explanatory variables for multiple regression. (ChIP-seq peaks identified by four antibodies targeting Pol II from the inputs of GLMs were removed.) We reasoned that including them would be a methodological mistake. As suggested by Gelman et al. one must not control for the consequences of the activation of the process being modelled [20]. Table 1. The correlated pairs of TFs. All the correlations shown were highly statistically significant (Pvalue < 0.0001). Positive correlations between TFs were of much greater absolute magnitude than negative (the maximum rho was 0.99, the minimum was only -0.095). However, the negative correlations were still statistically significant (S1 Table).

Binary partitioning trees
Tree models were used to identify the top sources of variation in BoE among TFBSes. The tree algorithm works by binary partitioning [21]. Such tree models [15] are useful for the initial inspection of datasets with multivariate regression problems and in such a role they were applied here. The first tree (S3 Fig) included all explanatory variables. In this tree, RNA polymerase type II (Pol II), TAF1 and phosphorylated Pol II were chosen by the algorithm as the first, the second, and the third partitioning variables, in this order of importance. However, these associations were trivial and Pol II, which is not a TF, could mask more meaningful associations. Therefore, in the second tree (Fig 3), Pol II, TAF1 and phosphorylated Pol II covariates were removed to reveal the deeper structure of the dataset. The second tree identified the following partitioning variables (that is TFs), in the descending order of importance: (1) HEY1 [22], (2) ELF1 [23], (3) c-Myc, (4) early growth response protein 1 − Egr-1 [24], (5) E2F1 [25], and (6) zinc finger and BTB domain-containing protein 7A − ZBTB7A. For each variable the value of 0.5 was reported to be the cutoff making the splits "yes or no" (presence or absence) decisions for respective TFBSes.

GLMs
The tree model was useful in defining the top sources of variability in the dataset. However, unlike regression, the tree model does not produce the estimates of correlation coefficients for all significant TFs. Instead, this was achieved by fitting GLMs [14,26]. The motivation for using GLMs was that BoE was conveniently interpreted as a proportion with a binomial structure of errors. Consequently, the residuals were not normally distributed. This violated the assumptions of linear regression based on the sum of least squares [21]. Here, we started by using the glm() function of the core R [27]. The coefficients for individual TFs obtained by this model are shown in S2 Table. The coefficients of different GLMs are summarized in Table 2 and fully listed in S3 Table.

Regularized GLMs
GLMs can suffer from the problems of multicollinearity and over-fitting to which shrinkage and regularization methods are a remedy [28]. Here, we used glmnet, which is a popular R package implementing GLMs with regularization and variable selection. Elastic net regularization [29] performed by the package is an even more effective variable selection mechanism than the original lasso method [30], or ridge, which glmnet generalizes.
Lasso, ridge and the elastic net are alternative shrinkage methods in which correlation coefficients of multiple regression are shrunk towards each other and towards zero to prevent over-fitting, and to select covariates which are the most relevant for the model [28]. The lasso was designed to select a single variable from a correlated group if they are collinear  [15]. This simple figure gives intuitive insight into the transcriptions factors which had the greatest impact on BoE. Each path from the top of the tree to one of the terminal leaves at the bottom follows through a series of splits. At each split, one follows a YES or NO direction according to whether a given TF is present in or absent from the promoter architecture. At each terminal leaf, the mean BoE for a given subset of the data is given. The TFs with the most significant contributions to the reduction in deviance are as follows: HEY1, ELF1, c-Myc, Egr-1, E2F1, and ZBTB7A. Additionally, at the bottom of each panel, the tree structure is given in text with the number of transcripts (n), deviance, and mean BoE for each group.
(eliminating others). Ridge shrinks such coefficients towards each other and towards zero. Ridge and lasso were compared during a meta-analysis of lung cancer gene expression datasets [31]. The lasso penalty resulted in a classifier which included fewer genes, but had similar predictive performance to the ridge model.

Activating, saturating, and inhibitory regulatory logic of TFs
In Fig 4A, two contrasting examples of TFs and of their impact on BoE are shown. The first TF, Egr-1, is an example of the activating regulatory logic. For this TF the average BoE of transcripts increases with each additional binding site. (In other words, BoE is the function of the number of TFBSes-TFBSes, that is monotonically increasing.) In contrast, SUZ12 is an example of inhibitory logic. The average BoE of transcripts decreases with each additional binding site. The differences between the effects of the above TFs were startling: transcripts with three Egr-1 promoter binding sites had the average BoE of 0.61 (N = 7). For SUZ12, this value was only 0.032 (N = 9) that is 19 times lower (Wilcoxon test P-value = 0.02). In addition to activating and inhibitory TFs, a third category of saturating TFs could be identified (Fig 4B). For saturating TFs, the average BoE was higher for genes with one TFBS than for those without any, but genes with two binding sites had statistically significantly lower BoE than those with a single site (Wilcoxon test). In other words, for saturating TFs the relationship between BoE and the number of TFBSes was non-linear (and the corresponding function was not monotonic).
We applied the above classification scheme to forty selected TFs which were found to have regression coefficients statistically significantly different from zero. (If all TFs were included, due to multicollinearity, TFs simply associating with BoE rather than causative would be identified). In the group of these TFs, 36 were classified as activating, 2 as inhibitory (NRSF and SUZ12), and 2 as saturating (GABP and Pbx3). The list of the average values of BoE for the selected TFs is given in Table 3.

The interactions between TFs
Multiple regression can detect statistical interactions between model terms. Such interactions suggest the antagonism or synergy of the impact of pairs of TFs on BoE. When pairwise interactions between terms were modelled as an extension of the core R GLM linear model, there were 110 interaction pairs at the P-value cut-off of 0.05. Several TFs were frequently included in these interaction pairs, for example: ELF1 (in 13 pairs), ZBTB7A (12 pairs), Nrf1 (12 pairs), HEY1 (12 pairs), Egr-1 (11 pairs), NFKB (10 pairs), Ini1 (10 pairs), or E2F1 (10 pairs). In Fig  5,

K-means clustering and non-linear-response models
As a refinement of GLMs, non-linear-response modeling of BoE was pursued. In the first step, K-means clustering (a non-linear classification algorithm) was used to define three groups of transcripts with respect to BoE and the sizes of their promoter architectures (Table 4). These three clusters were visualized in Fig 6. The tissue-specific cluster consisted of transcripts with small promoter architectures (n = 11,176). The housekeeping cluster included broadly expressed transcripts with large promoter architectures (n = 5,299). The atypical cluster consisted of transcripts without the positive correlation between the architecture size and BoE (n = 3,928). In the second step, the above three cluster were used to encode a response variable Modelling of the breadth of expression for regularized multinomial regression implemented in the glmnet package. Coefficients from non-linear-response models were compared with those of linear response models in Table 2.

Random forest
Variable importance (S4 Table) was measured using percentage increase in mean squared error when variable was omitted. The distribution of variable importance was right-skewed.  positively and negatively correlated with housekeeping expression. However, ten out of the eleven most important TFs were positively correlated with housekeeping expression in the GLMs of Table 2.

The pairs of antibodies targeting the same protein
The ENCODE dataset contains several pairs of antibodies targeting the same protein [32]. It is interesting to compare the results obtained for these pairs. For example, there were two antibodies targeting USF-1 (sc-229 and sc-8983). Their peaks correlated highly with each other (rho = 0.862). There were also two antibodies targeting YY1: sc-1703 and sc-281, also highly correlated (rho = 0.896). Two antibodies targeting MAF BZIP TF K − MafK (ab50322 and sc-477) correlated at rho = 0.772. In conclusion, the results from the pairs of antibodies were largely in agreement. The exception to the above rule were two antibodies targeting CCCTC-binding factor − CTCF, antibody C-20 (sc-15914) and N-17 (sc-5916). Both were goat polyclonal antibodies [32]. While these antibodies correlated (rho = 0.618), the first was found to have a positive impact on BoE, while the second had negative impact. We note that the first antibody targets the epitope near the C-terminus of human CTCF-denoted as CTCF(C), while the second targets N-terminus of the protein-denoted as CTCF(N). It is likely that these termini have different reactivity profiles in ChIP-seq assays.

The subset of dual Pol II-Pol III promoters
One of the interesting sub-groups of transcripts were 34 housekeeping transcripts arising from dual Pol II-Pol III promoters. (Typically, Pol III promoters are associated with rRNAs and tRNAs genes [33]. However, such transcripts were excluded in the F5 pipeline.) The dual promoters which were included in the pipeline were not pure Pol III promoters. In fact, they were enriched in Pol II in addition to having a Pol III ChIP-seq peak. Only two out of the 34 did not have a Pol II peak, versus 13,846 out of 30,759 (P-value = 0.0000008, Fisher's exact test, giving an enrichment ratio of 13).
Reassuringly, Pol III peaks correlated positively with those for several known co-factors of RNA polymerase III. The correlated TFs were as follows (those with rho > 0.1): RNA polymerase III transcription initiation factor-BRF1 (rho = 0.852), transcription initiation factor IIIB-BDP1 (0.79), the catalytic core and largest (155kDa) component of Pol III − RPC155 (0.759), heat shock factor 1 − HSF1 (0.417), TF for polymerase III C − TFIIIC (0.301), histone Table 4. Three clusters of transcripts. Three clusters of transcripts were identified through k-means clustering (Fig 6) and labelled I, II and III. Of these, II was the cluster of housekeeping transcripts. The average BoE of this cluster was 0.84, that is 7.6 fold higher than the value of 0.11 obtained for the remaining transcripts (Wilcoxon text, Pvalue < 2e-16). Transcripts in cluster II had also greater average size of promoter architecture: 16 versus 5 for others (Wilcoxon text, P-value < 2e-16). acetyltransferase p300-N-15 rabbit polyclonal antibody (0.297), and TATA-binding protein − TBP (0.147).

The effect of the level of gene expression
We [1] and others [5] reported before that BoE correlated with the average level of gene expression. In the dataset used here, there was a Spearman correlation of rho = 0.96 between BoE and mean expression, and rho = 0.9 with the median. The correlations persisted even if the average or median expression were calculated only across tissues in which a gene was expressed: rhos of 0.37 and 0.41, respectively. (P-values for all the above correlations were lower than 2.2e-16.)

Modelling of the breadth of expression
To ensure that our statistical models were specific to BoE, we verified that the effects of individual TFs on BoE persisted even after controlling for the level of gene expression. If mean, median, or maximal expression were included as explanatory variables, we obtained similar models to those obtained with default inputs. This was evident from correlations between coefficient vectors for default inputs and for inputs including mean (rho = 0.93), median (rho = 0.95), or maximal expression (rho = 0.97) as covariates. (At the same time, models with average or median expression as the response were only weakly similar with rhos of 0.3 and 0.43, respectively.)

Discussion
Herein, we have integrated F5 and ENCODE and sought to identify pro-housekeeping and pro-tissue-specific TFBSes in proximal promoters of human transcripts. Such a strategy of integrative bioinformatics can help reveal genome-wide trends that, as essentially statistical arguments, could not be identified by the study of individual loci or verified experimentally using the techniques of molecular biology.
The exceptionally comprehensive functional genomic resources of F5 and ENCODE were intentionally integrated using the RefSeq reference transcript collection. We argue that TSSes of experimentally-verified transcripts, rather than predicted gene models or alternative splice forms, are the natural focus for this analysis. Not only are the original expression data but also ENCODE data mapped to TSSes. One of the key findings of F5 was that many genes have multiple TSSes which differ strongly in their expression patterns [7]. For the obvious reasons of alternative genomic location, these alternative TSSes also differ in the architectures of their proximal promoters. All the above considerations prompted us to deliberately and purposefully focus on the set of non-redundant TSSes linked to RefSeq transcripts and to avoid the focus on genes. (We note, however, that the correlation between promoter sizes and BoE was previously shown to exist both on the level of transcripts and on the level of genes [1,9]. ) We began the analysis with a tree model (S3 Fig), which stratified the dataset into four subsets of TSSes with increasing average BoE (increasing gradually from 0.08 through 0.4, 0.5 to 0.7). This stratification depended on the presence of promoter ChIP-seq peaks of RNA polymerase II (Pol II), TAF1, and S2-phosphorylated Pol II. TAF1, or TFIID subunit 1, is otherwise known as the TATA-box binding protein associated factor 1. TAF1 interacts with the TATAbox binding protein (TBP) and other co-factors in forming the Pol II initiation complex; it is one of basal factors required for Pol II transcription initiation [34]. It was reassuring that so much variability in the data could be explained just by these basic components of the polymerase II complex. After all, polymerase II is responsible for the transcription of messenger RNAs of the majority of genes-protein coding genes. However, the presence of these factors does not explain the mechanism of the activation of expression: it is merely a consequence of it.
We note that some of the polymerase II peaks, especially those detected by the non-phospho-specific antibody labelled Pol II [35], might be bound to proximal promoter but in a paused state [36]. However, the S2-phospho-specific antibody − ab5095 [37] recognized phosphorylated polymerase. (Phosphorylated in the amino-acid position two of its C-terminal repeat YSPTSPS.) Importantly, this phosphorylated form is thought to be actively elongating [38]. Therefore, at least a fraction (probably the majority) of transcripts with the highest BoE in S3 Fig were arising from promoters with elongating polymerase type II.
The above trivial correlations with the components of RNA polymerase II complex might be masking more interesting, but subtler, effects that were more likely to be causative. Therefore, we removed Pol II and TAF1 from the second tree model and from all GLMs. We stress that it was previously found that the correlation between the size of promoter architecture and BoE persisted after Pol II sites were disregarded [1]. (If Pol II and TAF1 were retained in the core R GLM, they produced coefficients of 0.675 and 0.3 and were ranked as the first and the third most significant terms.) Interestingly, the second tree model (Fig 3) had a different regulatory logic to the first one. In the first model (S3 Fig), several types of binding sites cumulatively led to high BoE. In the second model, there were alternative TFs which separately conferred higher than average BoE, but there was no evidence for their cooperativity. In other words, these TFs were not additive in explaining additional deviance. How can these results be interpreted? Clearly, having either a HEY1, or ELF1, or E2F1, or YY1, or Egr-1, or a c-Myc binding site made a transcript's BoE higher than average. Not having any of these sites made transcript's BoE as low as not having any Pol II binding sites. (The average BoE of such Pol II-depleted transcripts was lower than 0.1 which corresponded to less than 10% of cell-lines expressing the transcript.) Interestingly, the pro-housekeeping TFs identified by the second tree model (Fig 3) tended to have a variety of diverse targets and context-dependent roles in many tissue-types. For example, Egr-1 and c-Myc integrate many different signalling pathways. Their aberrant expression may be associated with cancer, but low levels of expression are present throughout tissue and cell-types and developmental stages. Egr-1 is a transcriptional regulator of many biological responses in cell types that are widely distributed among the tissues of the body: fibroblasts, endothelial cells, epithelial cells and lymphocytes [39,40]. The second TF, c-Myc modifies expression of up to a third of transcriptome and was proposed to act as a universal amplifier of gene expression [41,42]. According to the modern view, c-Myc, although originally identified as a proto-oncogene, does not have its own unique transcriptional program. Instead, it strongly amplifies transcriptional output of genes that are already active, perhaps by prompting paused Pol II to enter the phase of elongation [43]. Crucially, c-Myc does not localize to the promoters of silent genes. However, c-Myc was identified as a key cooperative binder with a group of pioneer TFs which can recognize and bind silent chromatin and initiate the reprogramming of somatic cells to pluripotency [44]. Moreover, a systematic search for pioneer TFs which can open chromatin and activate transcription [45] identified several TFs which in our data correlate positively with the high BoE. These included Nrf1, E2F1, and Elf1, while c-Myc was identified as a key settler TF. All these facts suggest a causative mechanism for c-Myc.
Several TFs identified as model terms with negative coefficients to BoE were known transcription repressors, lineage-specific differentiation factors, or TFs involved in inducible rather than housekeeping expression (Table 2). This is straightforward to interpret in causative terms, as lineage-specific or inducible expression patterns disagree with housekeeping expression. For example, SUZ12 is a member of the polycomb repressive complex that functions in the heterochromatin-mediated repression of transcription [46]. TAL1 plays a role in the differentiation of the hemopoietic lineage [47]. ZNF274 is another transcriptional repressor [48]. It recruits histone methyltransferase SETDB1 to the 3'-ends of zinc finger proteins, thus effectively silencing multiple TFs [49]. Another TF negatively impacting BoE, NRSF is a wellknown transcriptional repressor which restricts the expression of sodium channel genes to neurons [50,51]. Yet another negative regulator of BoE, c-Fos, is a proto-oncogene whose expression tends to be inducible and temporarily restricted [52,53]. Finally, MEF2A, or a myocyte enhancer factor 2A, activates the expression of muscle-specific genes and plays a role in muscle development and neuronal differentiation [54].
We note that some concern was expressed in the literature that modENCODE [55] ChIPseq peaks may coincide with spurious Phantom peaks [54]. Such phantom peaks might be due to non-specific interactions of ChIP-seq antibodies with the regions of active transcription (where the chromatin has open structure and the presence of activated but unstructured proteins might randomly attract other proteins). Here, we do not comment on the modEN-CODE data. However, we did previously discuss and reject a conceptually related model of spurious TF binding termed a "sticky" model [1]. The findings presented here add further weight to the evidence against the "sticky" model, because the binding patterns of TFs are clearly non-random and their impact on BoE is easily interpretable.
The total number of TFs in the human genome was proposed to be as high as 1,391 [56]. This would suggest that the ChIP-seq screen of ENCODE investigated perhaps as little as 10% of the full repertoire of human TFs. Nevertheless, it should be correct to extrapolate the trends detected in this subset. Could the intercepts of the linear models be used to infer any information about the "missing" TFs? The intercepts of different GLMs were strongly converging (with the mean of -2, or -2.4 for a model with interactions). This corresponds to the BoE of 12-9% (the inverse logit between 0.12 and 0.09). In other words, transcripts with empty promoter architectures were on average expressed in approximately one tenth of the samples. This basal expression level is probably the footprint of the TFs missing from ENCODE.
The regularization and shrinkage performed by lasso, ridge and elastic net ought to eliminate TFs with little impact from the models of BoE. Did they lead to the elimination of many TFs from the models? The answer is mostly negative. See Table 5 for the summary of different models. The table contains the numbers of variables included in the models. At most 22% of TFs were irrelevant in the elastic net model. This is strong evidence for the non-redundant role of individual TFs. In other words, the regulation of BoE could not have been reduced to the actions of just a few TFs. Moreover regularization did not improve model fit (Table 6).
In fact, neither the interactions, nor regularization, nor a fully non-linear model built with random forest improved predictive performance over the core R GLM as measured by crossvalidation (Table 6). In the case of regularization, the failure to improve predictions was not entirely unexpected as the goal of regularization was variable selection and coefficient shrinkage. However, that a fully non-linear modelling technique, the random forest, did not improve predictive performance was significant, suggesting that most signal encoded in the data was already recovered by the linear models.
Were the models produced by non-regularized and regularized GLMs similar? Their coefficients were highly correlated (Fig 7). As the ratio of the number of transcripts to the number of explanatory variables was high (116) over-fitting was probably less of a concern resulting in a limited correction by the shrinkage of coefficients. (Typically over-fitting manifests itself in very large values of regression coefficients). However, some shrinkage was introduced by the For the non-linear-response models, this fraction varied widely, with 54-59% for non-regularized models, and falling as low as 31% and 25% for the elastic net and lasso. The last model was an outlier with most coefficients shrunk to zero by regularization. For most models, the fraction of TFs with positive coefficients was higher than those with negative (the average difference was 20%, paired t-test Pvalue = 0.0048). The last observation contributed to the explanation of the general correlation between the size of the architecture of the promoter and BoE. The net effect of the increasing size of the architecture of the promoter was in increasing BoE. By adding Bayesian GLMs as an alternative linear and an alternative non-linear-response model, we further ensured that the findings presented were robust in respect to the statistical methodology employed. The classical linear models were more similar to each other than to the Bayesian model, however, the congruence with the Bayesian model was still high (Fig 7). For example, the mean correlation between classical linear models was rho = 0.97, SD = 0.02, N = 6, while the mean correlation with the point estimates from the linear Bayesian model was rho = 0.93, SD = 0.02, N = 4 (Wilcoxon test between the two groups of comparisons Pvalue = 0.0095).
In summary, we constructed several explanatory statistical models of the breadth of expression in human. The models were largely in agreement in identifying a group of pro-housekeeping transcription factors (the majority of ENCODE TFs), and a small group of repressors which disagreed with housekeeping expression. Moreover, the theoretical knowledge of transcriptional regulation suggested these associations were likely to be causative. The results presented here should impact on our understanding of the regulation of gene expression.

Statistical tests
Contingency tables: the P-value for rejecting the null hypothesis of independence of rows and columns in contingency tables was calculated by Monte Carlo simulation (based on 2000 replicates). The P-value was calculated by core R function fisher.test(), which implements a C version of the FORTRAN subroutine FEXACT [57,58].  The comparison of means of two samples: in most cases, the variables compared were not normally distributed and the Wilcoxon test was used. The author refers to the two-sample Wilcoxon rank sum test with continuity correction, as implemented in the function wilcox.test() of core R [59]. This function computes exact P-value through permutations (but only if there are no ties). In other cases a normal approximation is used.

ENCODE data, F5 data, and data integration
To define promoter architectures, we used merged ChIP-seq peaks [60,61] from the ENCODE consortium [62]. For gene expression, we used F5's [7] work package 4 expression tables [63]. ENCODE and F5 datasets were pre-processed and integrated as described in detail previously [1].

BoE
In the simplest approach, BoE was modelled as a random variable which varied from zero through to one. This simple representation was used in tree models. However, for GLMs BoE was re-interpreted as the number of successes (expression switched "on") in the total number of attempts. Each attempt was one tissue or cell-line. Thus, for the GLMs, the response variable was re-coded as a two-row matrix. One row was the number of successes (samples with expression "on"). The second row was the total number of attempts. The total number of samples equalled 1660, i.e. the number of libraries in the F5 dataset.

Correlogram
We computed the correlogram by performing a hierarchical clustering of the Spearman's correlation coefficients with a complete linkage method aiming at finding similar clusters. We then plotted the coefficients in the lower triangle of the correlation matrix. Colour-coding was used to highlight the significant values of Spearman's rhos. P-values were obtained from the evaluation of correlations using their random expectation.

The tree models
The tree model algorithm is also known as binary recursive partitioning. The tree algorithm splits the dataset in consecutive steps at each step choosing the variable which maximally contributed to explaining the deviance of the response variable. The splitting continues until the data are too sparse to proceed further (which means fewer than six observations left to partition). The algorithm was implemented in the R function tree from the package under the same name [21,64]. The models are shown in Fig 3 and S3 Fig. Linear-response GLM: The core R GLMs are a family of models which generalize linear regression to response variables for which errors are not normally distributed. To specify a model one specifies the family of distributions from which errors derive and a link function. The link function can be non-linear and describes the relation between the mean of and the linear combination of the predictive variables in the model [21]. BoE was conveniently modeled as a proportion [21]: the number of samples a gene was expressed in over the number of samples. This suggested a binomial structure of errors [21]. Accordingly, we chose GLMs of the quasi-binomial family (with the logit canonical link function). The logistic link function has a "squashing" effect, transforming real numbers into the interval [0, 1]. Quasi-likelihood was used to compensate for over-dispersion of the linear model (the residual deviance of 1,980,000 at 30,792 degrees of freedom). The GLMs produced vectors of correlation coefficients θ 1-N (for N TFs which were the inputs for the model) plus the estimate for the intercept (θ 0 ). Model parameters were by default returned on the scale defined by the reverse of the link function. These values could be transformed into the scale of model inputs by a reverse logit transformation. Standardized coefficients were the result of variant analyses where variances of independent and dependent variables were standardized to one. As all independent variables were on the same scale (TF counts), standardization was not essential. Unless otherwise stated, interaction terms were not included in models.

Linear-response GLMs with regularization (lasso, ridge, and the elastic net)
Lasso [30] and ridge [65] are likelihood penalized GLMs implemented in the R package glmnet [66]. The key feature of glmnet is regularization which facilitates variable selection and shrinkage among the groups of correlated explanatory variables. This makes glmnet the method of choice for genomics applications (where the number of features tends to be high and the number of observations may be limited). The authors of the package use the metaphor of the fishing net catching "big fish". The net ensures that only significant variables are selected to have non-zero coefficients in the final model [66]. Depending on the value of the parameter alpha, glmnet will favor either the lasso (alpha = 1) or the ridge (alpha = 0). The elastic net (alpha = 0.5) is a compromise between lasso and ridge. The ridge penalty tends to shrink the coefficients of correlated variables so that they become more similar. In contrast, the lasso tends to promote one of the variables in a correlated group and discard the others. Both lasso and ridge were invoked through the function glmnet(). Lambda is the parameter of regularization controlling the penalty imposed on maximum likelihood. The glmnet algorithm constructs a sequence of lambda values and plots the correlation coefficients of the model along this sequence (S4 Fig). It is recommended that correlation coefficients at lambda.min are picked for the final model.

Glmnet cross-validation identifies lambda values at which the prediction error is at the minimum
Glmnet automatically builds a predictive model and performs leave-one-out cross-validation. Although, a prediction was not our goal, cross-validation is a crucial means of the verification of the fit of explanatory models [12]. The cross-validation reports lambda.min-the value of lambda which minimizes the cross-validated mean squared error (S5 Fig). For example, in our linear GLMs, the values of lambda.min were 0.0005 for the lasso, 0.00046 for the ridge, and 0.0009 for the elastic net. These values of lambda were used to report the glmnet correlation coefficients of individual TFs.

Bayesian GLMs with R and Stan
Stan is a powerful and flexible statistical system and programming language for Bayesian statistical modelling [67,68]. Stan performs fast Markov-chain Monte Carlo (MCMC) calculations of posterior distributions with a No-U-Turn sampler. Rstan is a general R interface to Stan; Rstanarm is a more specialized R interface for GLMs. For our Bayesian model, we used weak non-informative Gaussian priors Ɲ(0,10) for both the regression coefficients (θ  ) and the intercept (θ). The model was defined with an identical formula to the one used for glm() and glmnet(). The MCMC sampler was invoked using Rstanarm's function stan_glm(). Four MCMC chains were run for 1,000 iterations (split evenly between warm-up and sampling).
The convergence between the chains was estimated usingR.R is the so-called potential scale reduction factor for MCMC chains. It is interpreted as a potential further improvement in convergence, expected if the chains were allowed to run indefinitely. At convergence,R approaches one.

K-means clustering
The k-means implementation of the core R was used, namely the function kmeans(). The option of the Forgy algorithm was specified. The clusters were defined in two dimensions. These clusters were used to construct a non-linear-response model with multinomial regression that was fitted with the glmnet package using the lasso penalty (alpha = 1).

The details of the non-linear-response modelling
The core R and the Rstanarm implementations of GLMs do not implement multinomial regression. Therefore, the response variable of multinomial regression had to be recoded as a binary variable (cluster II versus the other two clusters). S2 Dataset. TF-transcripts matrix. Note this dataset contains redundant RefSeqs. Please, contact the corresponding author for details on post-processing and R scripts. (TXT) S1