Assessment of transcriptional importance of cell line-specific features based on GTRD and FANTOM5 data

Creating a complete picture of the regulation of transcription seems to be an urgent task of modern biology. Regulation of transcription is a complex process carried out by transcription factors (TFs) and auxiliary proteins. Over the past decade, ChIP-Seq has become the most common experimental technology studying genome-wide interactions between TFs and DNA. We assessed the transcriptional significance of cell line-specific features using regression analysis of ChIP-Seq datasets from the GTRD database and transcriptional start site (TSS) activities from the FANTOM5 expression atlas. For this purpose, we initially generated a large number of features that were defined as the presence or absence of TFs in different promoter regions around TSSs. Using feature selection and regression analysis, we identified sets of the most important TFs that affect expression activity of TSSs in human cell lines such as HepG2, K562 and HEK293. We demonstrated that some TFs can be classified as repressors and activators depending on their location relative to TSS.


Introduction
The identification of complex mechanisms of regulation of gene expression in higher eukaryotes is a major challenge for modern computational biology. The key question is to better understand the role of transcription factors (TFs), which regulate the transcriptional machinery in cells. Over the past decade, ChIP-Seq has become the most popular experimental technology for studying the genome-wide interactions between TFs and DNA. To date, several databases, such as GTRD (http://gtrd.biouml.org/) [1,2], ENCODE (https://www. encodeproject.org/) [3], ChIP-Atlas (https://chip-atlas.org/) [4], and ReMap (http://tagc.univmrs.fr/remap/) [5] have been created to systematically process and collect ChIP-Seq datasets obtained by applying different peak callers to the primary ChIP-Seq data.
To study the effect of TF binding on gene expression, it is common practice to analyze the integrated ChIP-Seq and RNA-Seq data [6,7], since RNA sequencing is a source of transcription level data. Another source of experimental data on the level of transcription is the CAGE (Cap Analysis of Gene Expression) technology. Thus, FANTOM5 (fifth edition of the FANTOM database) a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 contains profiled TSSs in the human genome using CAGE with single-molecule sequencers (Heli-Scope) and a generated atlas of CAGE expression levels (TSS activities) in primary cells, tissues and cell lines [8]. Initially, the GRCh37/hg19 assembly was used as the reference human genome. This atlas was later redesigned to fit newer genome assembly-GRCh38/hg38 [9].
The aim of our study was to assess the direct influence of TF binding on activities of TSSs in most of the studied human cell lines. For this purpose, we initially generated a large number of features that were defined as the presence or absence of TFs in the different promoter regions around each available TSS. To generate features, we used the ChIP-Seq datasets of human TF binding regions (TFBRs) collected in the GTRD database and TSS activities from the FANTOM5 atlas. For further selection of the most important features, we used the stepwise forward regression where the selection of features was carried out by an automatic stepwise procedure. As a result, the constructed regression models made it possible to compose narrow lists of TFs, which had significant influence on TSS activities in the considered cell lines. In other words, the composed lists consisted of features that directly related with TSS activity.
Finally, it is important to note that efforts to create atlases of candidate cis-regulatory elements (promoters, enhancers, silencers, insulators) of human and mammalian genomes has been increased over the past decade [10][11][12][13][14][15][16][17][18][19]. A breakthrough in high-throughput sequencing technologies [20], which made it possible to analyze the genomic landscape and gene expression from different points of view, as well as large amounts of data obtained for various types of cells and activation stimuli, made it possible to approach the creation of such atlases for the most studied taxa, human and mouse. Nevertheless, due to the extreme complexity (a wide variety of types of primary cells and cell lines; cell-specific functions of enhancers [21]; features of gene expression in various cells; differences in the implementation of the cell program depending on an external or internal stimuli, etc.), the solution of this problem is far from complete. Most of the research has focused on gene expression activators such as enhancers, while the regions that suppress gene expression-silencers-are poorly understood [22].

Materials and methods
In general, the key datasets for our study were the overlapped sets of TFBRs that were compiled through a three-step meta-processing of the ChIP-Seq datasets collected in the GTRD database. Thus, for a given cell line and a given TF, we initially selected only those ChIP-Seq experiments in which the cell line was not treated. In the first step of meta-processing, the following peak callers were applied to the same raw data obtained from individual ChIP-Seq experiment: GEM [23], MACS2 [24], PICS [25], and SISSRs [26], see Fig 1. In the second step, four resulting sets of TFBRs were overlapped and the False Positive Control Metric (FPCM) [27] was applied to perform quality control for the overlapped dataset.
If FPCM exceeded the pre-specified threshold value of 3.0, then all so-called orphans (such TFBRs that did not overlap with other initial TFBRs) were removed from the overlapped dataset. Thus, the single refined dataset was identified for data from the given ChIP-Seq experiment. Finally, in the third step, the single final dataset was obtained for the given TF as a union of all the refined datasets corresponding to distinct experiments.
To determine the primary set of the regression features (say, PRIMARY_FEATURES), we initially defined the following eight promoter regions (in base pairs) around each available TSS: The genomic coordinates of 209,911 TSSs and their activities were extracted from the FAN-TOM5 atlas [9]. The first eight real-valued features were defined as relative numbers of TFs that bonded (at least, partially) these promoter regions. In detail, if m different TFs were available for a given cell line and TFBRs of m 0 TFs overlapped with the [x 1 , x 2 ]-promoter region, then the feature Abundance[x1, x2] was determined as the ratio m 0 / m. In other words, the feature Abundance[x 1 , x 2 ] is an estimate of the concentration of TFBRs within a given [x 1 , x 2 ]-promoter region. According to its definition, each feature Abundance[x 1 , x 2 ] varies in the range [0, 1]. In general, these features indicate the abundances of promoter regions with TFBRs. It is important to note that these Abundance-features can be interpreted as indicators of cis-regulatory modules. Indeed, according to their definitions, cis-regulatory modules represent the stretches of DNA, where a number of TFs can bind and regulate the expression of nearby genes and regulate the rate of their transcription [28]. The next features were binary. Each binary feature took values {1, 0} depending on the presence or absence of TFBRs of individual TFs in a given promoter region. Thus, PRIMARY_FEATURES consisted of 8×(m+1) features. One can expect that considerable number of the primary features in PRIMARY_FEATURES may be irrelevant in particular regression models. In general, if there are hundreds or even thousands of features, then it is advisable to perform feature selection to create a regression model that includes only the most important features. For this purpose, we used well-known stepwise forward regression approach. According to this approach, we selected at each step the single feature from PRIMARY_FEATURES the inclusion of which into ordinary least squares regression gave the highest correlation (say, R o-p ) between the predicted and observed transcriptional activity.
Finally, it is important to note that we have used all of the available 209,911 TSSs, although one might expect some of them to be falsely generated due to CAGE technology. For such TSSs, regression models must correctly predict negligible (or almost negligible) expression https://doi.org/10.1371/journal.pone.0243332.g001 levels due to the binarity of features. Indeed, falsely generated TSSs are not transcriptionally active, therefore, promoter regions around such TSSs should not contain TFBRs. According to the definition of our features, at least the majority of features for such TSSs have to take zero values. In turn, for our regression analyses, we used only linear regression models. Therefore, levels of expression are predicted as the inner products of regression coefficients and zero-valued features. As a result, these products also have zero values (or near zero-values).

Primary regression models
Basically, we focused on the following three human cell lines: HepG2 (hepatoblastoma), K562 (myelogenous leukemia), and HEK293 (embryonic kidney). We selected these cell lines because they were the most representative cell lines in GTRD. Thus, HepG2 was represented by 230 initial ChIP-Seq datasets obtained for 169 TFs (see Table 1); HEK293 was represented by 210 datasets for 177 TFs, and 304 datasets for 186 TFs were available for K562.
PRIMARY_FEATURES sets were generated as described in the Materials and Methods for each cell line independently. Thus, PRIMARY_FEATURES for HepG2 consisted of 1360 (= 8 × 170) features that represented the presence/absence of TFBS in the promoter regions defined in (1).
The stepwise forward regression was applied to the composed PRIMARY_FEATURES to select the most important features and obtain a primary regression model. This regression described the relationship between TSS activities and the 20 most important features. The logtransformed expression levels (say, LTE-levels) from the FANTOM5 atlas were used as TSS activities hereinafter. For a given expression level EL, the LTE-level was defined as {0, if EL < 2; lg(EL) otherwise}. Table 2 contains the primary regression model obtained for the HepG2 cell line. S1 and S2 Tables contain the primary regression models obtained for K562 and HEK293, respectively. All the most important features selected by stepwise forward regression were sorted in the order of their selection. The accuracy of each intermediate regression model was measured by the Pearson correlation coefficient R o-p between the predicted and observed transcriptional activities. The values of R o-p demonstrated that it was sufficient to implement only 20 steps of stepwise forward regression, since increments of R o-p in the last steps became almost negligible, see Table 2. All selected features turned out to be statistically significant, p-value < 10 −67 .
In general, the accuracy of the primary regression models turned out to be quite acceptable, since R o-p varied in the range [0.626, 0.726], see Table 3. To assess the reliability of regression models we cross-validated them. For this purpose, we have split at random the entire set of features into training and test sets of the equal size. After that, a regression model was built on the training set, and LTE-levels were predicted independently in both sets using the constructed regression model. Thus, Table 3 contains also the accuracies of primary regression models obtained on training and test sets. It turned out that the constructed regression models are quite reliable because the differences between R o-p are negligible. In other words, the regression models were not overfitted. The regression coefficients were obtained using the ordinary least square regression, which was built in step 20. The sign of the regression coefficient may clarify the function of some TFs. If the coefficient is positive, then TF can be classified as a transcription activator or coactivator. If the coefficient is negative, then TF can be classified as a repressor or corepressor. According to Table 2, some TFs can act as activator and repressor depending on location of the binding site with respect to TSSs. For example, HEY1 acted as an activator in the three promoter regions [1,100], [501, 1000], [-500, -201], while it acted as a repressor in the [-100, 0]-promoter region. It is important to note that our regression model re-revealed this well-known role for HEY1 [29]. It is important to note that HEY1 prefer to act as activator for some genes and as repressor for other genes. In particular, HEY1 preferred to avoid binding to both [-100, 0] and [1, 100]-promoter regions simultaneously. In order to confirm this avoidance, we calculated the ratio of the observed and estimated probabilities of simultaneous binding to these promoter regions. It turned out that the ratio was equal to 0.543, hence the observed simultaneous binding is essentially rare than can be expected. Table 2 also demonstrates the same effect for KLF10. It can be classified as an activator if it is located in the [-100, 0]-promoter region, while it acts as a repressor in the [501, 1000]-promoter region. Our regression model once again confirmed the well-known fact that KLF10 is a repressor of multiple genes in many cell types [30]. Finally, SMAD5, JARID1B and MLL can be classified as activators and repressors in K562 or HEK293 (see S1 and S2 Tables) depending on their location. It is important to note that the influence of TF binding on TSS activity in the K562 cell line was also studied [31] using data from the ENCODE consortium [3]. TSS activities were represented by CAGE expression levels, and approximately 120 ChIP-Seq datasets were used in this study. A single feature for given TSS and TF was defined as the average number of ChIP-Seq reads within the [-50, 50] region of the promoter. As a result, a list (say, List-40) of 40 the most important TFs (features) was identified by random forest regression model. It is difficult to compare the results directly because of the differences of feature and regression types. To overcome this difficulty, we extracted 320 (8 × 40) binary features from PRIMARY_FEATURES, which represented TFs in List-40 and applied stepwise forward regression to them. The resulting primary regression model (say, K562_List-40) is available as S3 Table. Comparison of K562_List-40 and our primary model in S1

Comparative analysis of cell lines
To increase the accuracy of regression models it is necessary to generate additional features and involve them in regression models. For this purpose, we performed a comparative analysis of cell lines using their transcription activity profiles. We determined the transcription activity profile for the given cell line as a set of 209,911 TSS activities from the FANTOM5 expression atlas.
In general, this atlas contains expression levels for the following three types of objects: cell line, primary cell, and tissue. We analyzed the similarity of objects of the same type using correlations between their transcription activity profiles. In addition, we considered a randomly selected sample to control the similarities between objects of various types. It turned out that there was a high correlation between the considered objects, see Fig 2. Moreover, the highest correlations were observed between different cell lines (see Table 4). It is important to note that RNA-Seq data also confirmed similarity between cell lines. To demonstrate this, we calculated correlations between 25 distinct cell lines. For this purpose, we used the RNA-Seq datasets generated by the ENCODE 3 consortium. It turned out that correlation coefficient varied in the range [0.423, 0.845], and mean correlation was equal to 0.681, when protein-coding transcripts from Ensembl were used for calculation of correlation.
Obviously, from a biological point of view, it is not surprising that there are relationships between primary cells, or/and tissues, or/and cell lines, because tissues are composed of different types of primary cells, and cell lines are immortalized or cancer-transformed cells that resemble their tissue of origin [32]. In other words, one can expect that many pairs of primary cells, tissues, and cell lines can be similar in terms of their transcriptional activity. However, Table 4 and Fig 2 allowed not only confirming this fact, but also estimating the strength of these relationships from a statistical point of view.
Due to the revealed similarities, it is not difficult to accurately predict the transcriptional activity profile of one cell line using the profile of another cell line. In particular, the following two regression models expressed the relationship between transcriptional activity profile for HepG2 and the profile for HEK293 or the profile for K562:

Advanced regression models
Based on comparative analysis performed in the previous section, we can confidently conclude that cell lines are similar in terms of their transcription activity profiles. In other words, the activities of many TSSs are almost identical in many cell lines. To incorporate this cell line commonality into regression models, we generated a new feature called 'mean profile'. It was defined as a set of 209,911 mean values of activities, where an individual mean activity for each TSS was determined by averaging all of its activities in cell lines available in the FANTOM5 atlas.
The accuracy of regression model was significantly improved when stepwise forward regression was applied to the combination of PRIMARY_FEATURES and the mean profile. Thus, a comparison of R o-p values in the first row of Table 5 with R o-p values achieved using primary regression models (see Table 3) indicated that the accuracy increased 1.23-1.48 times. However, such a regression model has a serious disadvantage, since it is completely useless for predicting the activities of novel TSSs, which are absent in the FANTOM5 atlas. To avoid this disadvantage, we generated a new feature called 'predicted mean profile'. This feature was determined using the following two-step procedure. In the first step, the stepwise forward regression was applied three times to PRIMARY_FEATURES determined for HepG2, K562 and HEK293 independently. As a result of the first step, three predicted profiles were obtained. In the second step, the 'predicted mean profile' was generated by averaging three predicted profiles. Thus, 'predicted mean profile' was determined by applying stepwise forward regression technique to all the PRIMARY_FEATURES defined for HepG2, K562 and HEK293. Finally, the stepwise forward regression was applied to the combination of PRIMARY_FEA-TURES and the predicted mean profile to select the most important features and get an advanced regression model. Table 6 contains the advanced regression model obtained for the HepG2 cell line. S4 and S5 Tables contain the advanced regression models obtained for cell lines K562 and HEK293, respectively. The accuracy of advanced regression models is demonstrated in the second row of Table 5.
It is important to note that the predicted mean profile was the most important feature in all three advanced regression models. Therefore, one can conclude that common (i.e. not specific to the cell line) transcription processes are dominant in different cell lines. However, cell line specificity can also be detected using advanced regression models. Thus, it is well known that HEY1 is involved in the regulation of self-renewal of liver cancer cells [33]. Therefore, it was not surprising that HEY1 was the most represented TF in the most important features for HepG2. According to Table 6, HEY1 was observed in the six most important features. In other  Tables 2 and 6), one can conclude that the features of the advanced regression model were more specific for cell lines than those of the primary regression model. Indeed, HEY1 was observed only in the four most important features of the primary model. Additionally, it is well-known that HNF3G (hepatocyte nuclear factor 3-gamma) plays an important role in the development, differentiation and regeneration of the liver [34]. Therefore, it was not surprising that HNF3G was selected using the advanced regression (see Table 6) but it was not selected using primary regression.
It is interesting to note that the most important features correlated with some additional features from PRIMARY_FEATURES. Thus, Table 7 Table 7, we can expect the existence of a putative cis-regulatory module within [501, 1000] promoter regions of some genes, and this module contains HEY1, ZNF205, and IRF2. According to the information for TAF1 [1,100], another putative cis-regulatory module within [1,100] promoter regions contains TAF1, NONO, CEBPD, and HEY1. On the one hand, HEY1, TAF1 and NONO are involved in the most important features. On the other hand, ZNF205, IRF2 and CEBPD are not directly involved and, possibly, can be classified as less important. Nevertheless, their impact on TSS activity was also taken into account because they participated in the selected 'Abundance[-100, 0]' feature. Thus, from the point of view of regression models, the most important features were related to TSS activity directly and individually while less important features were related with TSS activity mutually.
According to the signs of the advanced regression coefficients, the most important features identified by advanced regressions can also be classified as activators or repressors. In particular, based on Table 6, one can conclude that five features (namely, HEY1 [-100, 0], AhR [-100, 0], NF-YC [501, 1000], TAF1 [-1000, -501] and HNF3G [101, 500]) can be classified as repressors, while the remaining features-as activators. However, one can expect that this classification can be distorted by the presence of Abundance-features among the most important features, since Abundance-features include information about many TFs. To understand how reliably stepwise regression models can actually classify features into activators or repressors, we conducted the following test. We removed all Abundance-features from the most important features and built the ordinary least squares regression models using the remaining important features. As a result, we observed a slight decrease in the accuracy of the regression but the regression coefficients and their significance were changed imperceptibly. Thus, in the case of the HepG2 cell line, the removal of the feature Abundance [-100, 0] resulted in a slight decrease in the R o-p correlation coefficient from 0.743 to 0.739. However, the same five features can be classified as repressors due to the negative signs of their regression coefficients: It is interesting to note that we also considered an additional way of repressor/activator classification. In this case, each TF was analyzed independently. For each TF, we constructed the ordinary least squares regression model for which only eight binary features were used. Based on the signs of the regression coefficients and the p-values, we considered the following three categories: TF was classified as a significant repressor in a given promoter region if the sign of the regression coefficient of the corresponding feature was negative and p-value < 10 −5 . If the sign was positive and the p-value < 10 −5 , then TF was classified as a significant activator. If the p-value > 10 −5 , then TF was considered insignificant.  On the one hand, for construction of regression models it is sufficient to use the 'predicted mean profile' and approximately 20 the most important features due to small increments of R o-p in last steps of feature selection. According to their p-values, selected features are extremely significant. On the other hand, it seems likely that these sets of features can be extended by additionally composed the lists of attendant features that also play a role in cellspecific regulation. For this purpose, we continued to select features with the help of stepwise forward regression. In this case, we stopped selection when the p-value of least significant regression coefficient exceeded the threshold 10-20. The lists of attendant features for HepG2, K562 and HEK293 cell lines are available as S6 to S8 Tables, respectively. The attendant features can be classified as less important, but still highly significant for cell-specific regulation.
Additionally, we briefly considered the possibility of using advanced regression models obtained for one cell line (for example, HepG2) to predict TSS activities in another highly correlated cells (for example, primary hepatocytes). Unfortunately, this approach is not applicable (at least, intensively) to our features due to frequent incompleteness of the ChIP-Seq data. This incompleteness is due to the fact that, for example, for the cell line, the ChIP-Seq experiments were carried out using one set of TFs, while for the primary cells, the experiments were carried out with a different set of TFs. In particular, according to Table 6, to predict TSS activities in hepatocytes using the advanced regression model, it is necessary to have TFBRs of selected eleven TFs, while currently (according to the GTRD database) only CTCF, EZH2 and NR1H4 have been studied in ChIP-Seq experiments. Therefore, the advanced regression model mentioned in Table 6 is currently still not useful for predicting TSS activities in hepatocytes.
Finally, to demonstrate the usefulness of the predicted mean profile, we performed a regression analysis of three rare cell lines DU145 (prostate carcinoma), THP-1 (acute monocytic leukemia) and U937 (adult acute monocytic leukemia). Only a few TFs were studied in ChIP-Seq experiments on these cell lines, see Table 8. It was therefore not surprising that the primary  regression models could only achieve low accuracy: 0.426 � R o-p � 0.538. However, the accuracy increased 1.30-1.57 times when we used the predicted mean profile to build advanced regression models, see Table 8. These models are available in S9 to S11 Tables. The significant increment of R o-p values indicated that a non-cell-specific feature (the predicted mean profile) could compensate, at least in part, for the absence of a large number of features in poorly studied cell lines.

Sum-transformation of expression levels for closely spaced TSSs
One of the specific properties of TSSs in the FANTOM5 atlas was that many of them are located close to each other. In particular, 116,620 TSSs (55.6%) had other TSSs nearby at a distance of less than 100 bp. For such closely spaced TSSs, we replaced their individual expression levels with sums of their expression levels and then calculate LTE-levels for sums. For the remaining TSSs, which were separated by at least 100 bp, we did not change their individual LTE-levels. As a result, for the given cell line, we created a new transcription profile, say, sumtransformed profile.
To predict sum-transformed profile, we also applied stepwise forward regression to PRIMARY_-FEATURES. In other words, we have constructed primary regression models for prediction of sum-transformed profiles. S12 to S14 Tables contain the resulting sum-transformed regression models. The first row of Table 9 contains the accuracy of sum-transformed regression models for the HepG2, K562 and HEK293 cell lines. Comparison of R o-p values in the first row of Table 9 with R o-p values achieved using primary regression models (see Table 3) indicated that the accuracy increased 1.073-1.15 times. Thus, the transition from transcription activity profiles to sum-transformed profiles has become the second way to increase the accuracy of regression models.
Finally, it is interesting to note that the list of features selected by the sum-transformed regression models and the list of features selected by the primary regression models were significantly overlapped. For example, Table 2 and S12 Table contained 12 (60%) identical features. The second row of Table 9 contains the percentage of identical features for the three analyzed cell lines.

Conclusions
1. Using the stepwise forward regression method, we identified the sets of the most important TFs that affect expression activity of TSSs in human cell lines such as HepG2, K562 and HEK293.
2. With the help of the constructed regression models, we demonstrated that some TFs can be classified simultaneously as repressors and activators depending on their location relative to TSS.

3.
A comparative analysis of cell lines revealed high similarity between them. We expressed the commonality of cell lines using the novel feature 'predicted mean profile'. We demonstrated that this feature is useful for improving the accuracy of regression models, as well as for analyzing rare cell lines.