
Assessment of transcriptional importance of cell line-specific features based on GTRD and FANTOM5 data

  • Ruslan N. Sharipov ,

    Roles Conceptualization, Data curation, Investigation, Writing – original draft, Writing – review & editing

    shrus79@gmail.com

    Affiliations Laboratory of Bioinformatics, Federal Research Center for Information and Computational Technologies, Novosibirsk, Russian Federation, Specialized Educational Scientific Center, Novosibirsk State University, Novosibirsk, Russian Federation, BIOSOFT.RU, Ltd, Novosibirsk, Russian Federation

  • Yury V. Kondrakhin,

    Roles Conceptualization, Data curation, Investigation, Methodology, Software, Writing – original draft, Writing – review & editing

    Affiliations Laboratory of Bioinformatics, Federal Research Center for Information and Computational Technologies, Novosibirsk, Russian Federation, BIOSOFT.RU, Ltd, Novosibirsk, Russian Federation

  • Anna S. Ryabova,

    Roles Software, Writing – original draft

    Affiliations Laboratory of Bioinformatics, Federal Research Center for Information and Computational Technologies, Novosibirsk, Russian Federation, BIOSOFT.RU, Ltd, Novosibirsk, Russian Federation

  • Ivan S. Yevshin,

    Roles Data curation, Investigation, Writing – original draft

    Affiliations Laboratory of Bioinformatics, Federal Research Center for Information and Computational Technologies, Novosibirsk, Russian Federation, BIOSOFT.RU, Ltd, Novosibirsk, Russian Federation

  • Fedor A. Kolpakov

    Roles Conceptualization, Funding acquisition, Methodology, Project administration, Software, Supervision, Writing – original draft, Writing – review & editing

    Affiliations Laboratory of Bioinformatics, Federal Research Center for Information and Computational Technologies, Novosibirsk, Russian Federation, BIOSOFT.RU, Ltd, Novosibirsk, Russian Federation


Abstract

Creating a complete picture of the regulation of transcription seems to be an urgent task of modern biology. Regulation of transcription is a complex process carried out by transcription factors (TFs) and auxiliary proteins. Over the past decade, ChIP-Seq has become the most common experimental technology for studying genome-wide interactions between TFs and DNA. We assessed the transcriptional significance of cell line-specific features using regression analysis of ChIP-Seq datasets from the GTRD database and transcriptional start site (TSS) activities from the FANTOM5 expression atlas. For this purpose, we initially generated a large number of features that were defined as the presence or absence of TFs in different promoter regions around TSSs. Using feature selection and regression analysis, we identified sets of the most important TFs that affect expression activity of TSSs in human cell lines such as HepG2, K562 and HEK293. We demonstrated that some TFs can be classified as repressors and activators depending on their location relative to TSS.

Introduction

The identification of complex mechanisms of regulation of gene expression in higher eukaryotes is a major challenge for modern computational biology. The key question is to better understand the role of transcription factors (TFs), which regulate the transcriptional machinery in cells. Over the past decade, ChIP-Seq has become the most popular experimental technology for studying the genome-wide interactions between TFs and DNA. To date, several databases, such as GTRD (http://gtrd.biouml.org/) [1, 2], ENCODE (https://www.encodeproject.org/) [3], ChIP-Atlas (https://chip-atlas.org/) [4], and ReMap (http://tagc.univ-mrs.fr/remap/) [5] have been created to systematically process and collect ChIP-Seq datasets obtained by applying different peak callers to the primary ChIP-Seq data.

To study the effect of TF binding on gene expression, it is common practice to analyze the integrated ChIP-Seq and RNA-Seq data [6, 7], since RNA sequencing is a source of transcription level data. Another source of experimental data on the level of transcription is the CAGE (Cap Analysis of Gene Expression) technology. Thus, FANTOM5 (fifth edition of the FANTOM database) contains profiled TSSs in the human genome using CAGE with single-molecule sequencers (HeliScope) and a generated atlas of CAGE expression levels (TSS activities) in primary cells, tissues and cell lines [8]. Initially, the GRCh37/hg19 assembly was used as the reference human genome. This atlas was later redesigned to fit the newer genome assembly, GRCh38/hg38 [9].

The aim of our study was to assess the direct influence of TF binding on activities of TSSs in the most studied human cell lines. For this purpose, we initially generated a large number of features that were defined as the presence or absence of TFs in the different promoter regions around each available TSS. To generate features, we used the ChIP-Seq datasets of human TF binding regions (TFBRs) collected in the GTRD database and TSS activities from the FANTOM5 atlas. For further selection of the most important features, we used stepwise forward regression, in which features are selected by an automatic stepwise procedure. As a result, the constructed regression models made it possible to compose narrow lists of TFs that had a significant influence on TSS activities in the considered cell lines. In other words, the composed lists consisted of features that were directly related to TSS activity.

Finally, it is important to note that efforts to create atlases of candidate cis-regulatory elements (promoters, enhancers, silencers, insulators) of human and mammalian genomes have increased over the past decade [10–19]. A breakthrough in high-throughput sequencing technologies [20], which made it possible to analyze the genomic landscape and gene expression from different points of view, together with large amounts of data obtained for various types of cells and activation stimuli, made it possible to approach the creation of such atlases for the most studied taxa, human and mouse. Nevertheless, due to the extreme complexity of the problem (a wide variety of types of primary cells and cell lines; cell-specific functions of enhancers [21]; features of gene expression in various cells; differences in the implementation of the cell program depending on external or internal stimuli, etc.), its solution is far from complete. Most of the research has focused on gene expression activators such as enhancers, while the regions that suppress gene expression (silencers) are poorly understood [22].

Materials and methods

In general, the key datasets for our study were the overlapped sets of TFBRs that were compiled through a three-step meta-processing of the ChIP-Seq datasets collected in the GTRD database. Thus, for a given cell line and a given TF, we initially selected only those ChIP-Seq experiments in which the cell line was not treated. In the first step of meta-processing, the following peak callers were applied to the same raw data obtained from an individual ChIP-Seq experiment: GEM [23], MACS2 [24], PICS [25], and SISSRs [26] (see Fig 1). In the second step, the four resulting sets of TFBRs were overlapped and the False Positive Control Metric (FPCM) [27] was applied to perform quality control for the overlapped dataset.

Fig 1. The workflow of meta-processing the ChIP-Seq datasets.

https://doi.org/10.1371/journal.pone.0243332.g001

If FPCM exceeded the pre-specified threshold value of 3.0, then all so-called orphans (TFBRs that did not overlap with other initial TFBRs) were removed from the overlapped dataset. Thus, a single refined dataset was identified for the data from a given ChIP-Seq experiment. Finally, in the third step, a single final dataset was obtained for the given TF as the union of all the refined datasets corresponding to distinct experiments.

To determine the primary set of the regression features (say, PRIMARY_FEATURES), we initially defined the following eight promoter regions (in base pairs) around each available TSS: (1)

The genomic coordinates of 209,911 TSSs and their activities were extracted from the FANTOM5 atlas [9]. The first eight real-valued features were defined as the relative numbers of TFs that bound (at least partially) to these promoter regions. In detail, if m different TFs were available for a given cell line and TFBRs of m0 TFs overlapped with the [x1, x2]-promoter region, then the feature Abundance[x1, x2] was determined as the ratio m0 / m. In other words, the feature Abundance[x1, x2] is an estimate of the concentration of TFBRs within a given [x1, x2]-promoter region. According to its definition, each feature Abundance[x1, x2] varies in the range [0, 1]. It is important to note that these Abundance-features can be interpreted as indicators of cis-regulatory modules. Indeed, by definition, cis-regulatory modules are stretches of DNA where a number of TFs can bind and regulate the rate of transcription of nearby genes [28]. The remaining features were binary. Each binary feature took the value 1 or 0 depending on the presence or absence of TFBRs of an individual TF in a given promoter region. Thus, PRIMARY_FEATURES consisted of 8×(m+1) features.

One can expect that a considerable number of the primary features in PRIMARY_FEATURES may be irrelevant in particular regression models. In general, if there are hundreds or even thousands of features, then it is advisable to perform feature selection to create a regression model that includes only the most important features. For this purpose, we used the well-known stepwise forward regression approach. According to this approach, at each step we selected the single feature from PRIMARY_FEATURES whose inclusion in the ordinary least squares regression gave the highest correlation (say, Ro-p) between the predicted and observed transcriptional activity.
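The Abundance and binary features defined above can be illustrated with a minimal Python sketch. The interval representation, the function names, the toy TF names and the plus-strand assumption are illustrative choices, not part of the original pipeline.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # inclusive coordinates on one chromosome (assumption)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def promoter_features(
    tss: int,
    region: Interval,
    tfbrs: Dict[str, List[Interval]],
) -> Tuple[float, Dict[str, int]]:
    """Return (Abundance, binary features) for one TSS and one promoter region.

    `region` is relative to the TSS, e.g. (-100, 0). Strand handling is
    omitted: a plus-strand TSS is assumed.
    """
    window = (tss + region[0], tss + region[1])
    binary = {
        tf: int(any(overlaps(window, peak) for peak in peaks))
        for tf, peaks in tfbrs.items()
    }
    m = len(binary)            # TFs available for the cell line
    m0 = sum(binary.values())  # TFs with a TFBR overlapping the window
    abundance = m0 / m if m else 0.0
    return abundance, binary


# Toy data (hypothetical TFs and coordinates), one TSS at position 10_000.
toy_tfbrs = {
    "TF_A": [(9_950, 9_990)],    # inside the [-100, 0] window
    "TF_B": [(10_300, 10_400)],  # outside the window
    "TF_C": [(9_800, 9_905)],    # partial overlap with the window
}
abundance, binary = promoter_features(10_000, (-100, 0), toy_tfbrs)
print(abundance, binary)
```

Repeating the call for each of the eight promoter regions would give the 8×(m+1) values per TSS described above.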

Finally, it is important to note that we have used all of the available 209,911 TSSs, although one might expect some of them to be falsely generated due to CAGE technology. For such TSSs, regression models must correctly predict negligible (or almost negligible) expression levels due to the binarity of features. Indeed, falsely generated TSSs are not transcriptionally active, therefore, promoter regions around such TSSs should not contain TFBRs. According to the definition of our features, at least the majority of features for such TSSs have to take zero values. In turn, for our regression analyses, we used only linear regression models. Therefore, levels of expression are predicted as the inner products of regression coefficients and zero-valued features. As a result, these products also have zero values (or near zero-values).

Results and discussion

Primary regression models

Basically, we focused on the following three human cell lines: HepG2 (hepatoblastoma), K562 (myelogenous leukemia), and HEK293 (embryonic kidney). We selected these cell lines because they were the most representative cell lines in GTRD. Thus, HepG2 was represented by 230 initial ChIP-Seq datasets obtained for 169 TFs (see Table 1); HEK293 was represented by 210 datasets for 177 TFs, and 304 datasets for 186 TFs were available for K562.

PRIMARY_FEATURES sets were generated as described in the Materials and Methods for each cell line independently. Thus, PRIMARY_FEATURES for HepG2 consisted of 1360 (= 8 × 170) features that represented the presence/absence of TFBRs in the promoter regions defined in (1).

The stepwise forward regression was applied to the composed PRIMARY_FEATURES to select the most important features and obtain a primary regression model. This regression described the relationship between TSS activities and the 20 most important features. The log-transformed expression levels (say, LTE-levels) from the FANTOM5 atlas were used as TSS activities hereinafter. For a given expression level EL, the LTE-level was defined as {0, if EL < 2; lg(EL) otherwise}.
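The LTE transformation can be written as a short function. This is a sketch that assumes "lg" denotes the base-10 logarithm; the function name is illustrative.

```python
import math


def lte_level(expression_level: float) -> float:
    """Log-transformed expression: 0 below the cutoff of 2, lg(EL) otherwise."""
    if expression_level < 2:
        return 0.0
    return math.log10(expression_level)  # assumption: lg = log base 10


print([round(lte_level(el), 3) for el in (0.0, 1.9, 2.0, 100.0)])
# prints [0.0, 0.0, 0.301, 2.0]
```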

Table 2 contains the primary regression model obtained for the HepG2 cell line. S1 and S2 Tables contain the primary regression models obtained for K562 and HEK293, respectively. All the most important features selected by stepwise forward regression were sorted in the order of their selection. The accuracy of each intermediate regression model was measured by the Pearson correlation coefficient Ro-p between the predicted and observed transcriptional activities. The values of Ro-p demonstrated that it was sufficient to implement only 20 steps of stepwise forward regression, since increments of Ro-p in the last steps became almost negligible, see Table 2. All selected features turned out to be statistically significant, p-value < 10^-67.
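The selection procedure can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the authors' software: the ordinary least squares fit is solved through the normal equations, candidates that make the design matrix singular are skipped, and the feature names and data are toy values.

```python
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def ols_predict(columns: List[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Fit OLS with an intercept via the normal equations; return fitted values."""
    n = len(y)
    design = [[1.0] * n] + [list(c) for c in columns]
    k = len(design)
    # Augmented normal-equation matrix [X'X | X'y].
    a = [[sum(u * v for u, v in zip(design[i], design[j])) for j in range(k)]
         + [sum(u * v for u, v in zip(design[i], y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [p - f * q for p, q in zip(a[r], a[col])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    return [sum(b * design[j][i] for j, b in enumerate(beta)) for i in range(n)]


def forward_select(
    features: Dict[str, Sequence[float]], y: Sequence[float], steps: int
) -> List[Tuple[str, float]]:
    """At each step add the feature whose inclusion gives the highest Ro-p."""
    chosen: List[str] = []
    history: List[Tuple[str, float]] = []
    for _ in range(min(steps, len(features))):
        best_name, best_r = None, -2.0
        for name in features:
            if name in chosen:
                continue
            try:
                fitted = ols_predict([features[c] for c in chosen + [name]], y)
            except ValueError:
                continue  # skip collinear or constant candidates
            r = pearson(fitted, y)
            if r > best_r:
                best_name, best_r = name, r
        if best_name is None:
            break
        chosen.append(best_name)
        history.append((best_name, best_r))
    return history


# Toy data: y = 2*f1 - f2 + 0.1*f3 over eight observations.
toy = {
    "f1": [1, 0, 1, 0, 1, 0, 1, 0],
    "f2": [0, 0, 1, 1, 0, 0, 1, 1],
    "f3": [0, 0, 0, 0, 1, 1, 1, 1],
}
y = [2 * a - b + 0.1 * c for a, b, c in zip(toy["f1"], toy["f2"], toy["f3"])]
history = forward_select(toy, y, steps=3)
for name, ro_p in history:
    print(name, round(ro_p, 3))
```

The returned history lists features in their order of selection together with the Ro-p reached at each step, which mirrors the layout described for Table 2.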

Table 2. Primary regression model for the HepG2 cell line.

https://doi.org/10.1371/journal.pone.0243332.t002

In general, the accuracy of the primary regression models turned out to be quite acceptable, since Ro-p varied in the range [0.626, 0.726], see Table 3. To assess the reliability of the regression models, we cross-validated them. For this purpose, we randomly split the entire set of features into training and test sets of equal size. After that, a regression model was built on the training set, and LTE-levels were predicted independently in both sets using the constructed regression model. Thus, Table 3 also contains the accuracies of the primary regression models obtained on the training and test sets. It turned out that the constructed regression models are quite reliable because the differences between the Ro-p values are negligible. In other words, the regression models were not overfitted. The regression coefficients were obtained using the ordinary least squares regression built in step 20.
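The split-half check can be sketched as below. Several things here are assumptions made for illustration: the split is taken over observations (rows), a one-predictor least-squares line stands in for the 20-feature model, and the data are synthetic.

```python
import random
from typing import Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx


def split_half_ro_p(
    x: Sequence[float], y: Sequence[float], seed: int = 0
) -> Tuple[float, float]:
    """Fit on a random half, then report Ro-p on the training and test halves."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    train, test = idx[:half], idx[half:]
    slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])

    def ro_p(rows):
        predicted = [intercept + slope * x[i] for i in rows]
        return pearson(predicted, [y[i] for i in rows])

    return ro_p(train), ro_p(test)


# Synthetic data: a linear signal plus Gaussian noise.
rng = random.Random(1)
x = [float(i) for i in range(200)]
y = [0.5 * v + rng.gauss(0.0, 5.0) for v in x]
ro_train, ro_test = split_half_ro_p(x, y)
print(round(ro_train, 3), round(ro_test, 3))
```

A small gap between the two values is what the text interprets as the absence of overfitting.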

Table 3. Accuracies of the primary regression models measured by the Ro-p correlation coefficient.

https://doi.org/10.1371/journal.pone.0243332.t003

The sign of the regression coefficient may clarify the function of some TFs. If the coefficient is positive, then the TF can be classified as a transcription activator or coactivator. If the coefficient is negative, then the TF can be classified as a repressor or corepressor. According to Table 2, some TFs can act as both activator and repressor depending on the location of the binding site with respect to the TSS. For example, HEY1 acted as an activator in the three promoter regions [1, 100], [501, 1000], [-500, -201], while it acted as a repressor in the [-100, 0]-promoter region. Thus, our regression model recovered this well-known dual role of HEY1 [29]. HEY1 prefers to act as an activator for some genes and as a repressor for other genes. In particular, HEY1 tended to avoid binding to the [-100, 0] and [1, 100]-promoter regions simultaneously. To confirm this avoidance, we calculated the ratio of the observed and estimated probabilities of simultaneous binding to these promoter regions. The ratio was equal to 0.543, hence the observed simultaneous binding is substantially rarer than expected. Table 2 also demonstrates the same effect for KLF10. It can be classified as an activator if it is located in the [-100, 0]-promoter region, while it acts as a repressor in the [501, 1000]-promoter region. Our regression model once again confirmed the well-known fact that KLF10 is a repressor of multiple genes in many cell types [30]. Finally, SMAD5, JARID1B and MLL can be classified as activators and repressors in K562 or HEK293 (see S1 and S2 Tables) depending on their location.
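The ratio of observed to estimated probabilities of simultaneous binding can be sketched as follows. The sketch assumes that the estimated probability is the product of the two marginal binding frequencies (that is, the value expected under independence); the text does not spell out the estimator, and the data are toy values.

```python
from typing import Sequence


def cobinding_ratio(a: Sequence[int], b: Sequence[int]) -> float:
    """Observed P(a=1 and b=1) divided by P(a=1) * P(b=1)."""
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(1 for u, v in zip(a, b) if u and v) / n
    return p_ab / (p_a * p_b)


# Toy binary features over ten TSSs (not real data).
upstream = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]    # binding in [-100, 0]
downstream = [1, 0, 0, 0, 1, 1, 1, 1, 0, 0]  # binding in [1, 100]
print(cobinding_ratio(upstream, downstream))  # prints 0.5
```

A ratio below 1 indicates that simultaneous binding occurs less often than independence would predict.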

It is important to note that the influence of TF binding on TSS activity in the K562 cell line was also studied in [31] using data from the ENCODE consortium [3]. TSS activities were represented by CAGE expression levels, and approximately 120 ChIP-Seq datasets were used in this study. A single feature for a given TSS and TF was defined as the average number of ChIP-Seq reads within the [-50, 50] region of the promoter. As a result, a list (say, List-40) of the 40 most important TFs (features) was identified by a random forest regression model. It is difficult to compare the results directly because of the differences in feature and regression types. To overcome this difficulty, we extracted 320 (8 × 40) binary features from PRIMARY_FEATURES, which represented the TFs in List-40, and applied stepwise forward regression to them. The resulting primary regression model (say, K562_List-40) is available as S3 Table. Comparison of K562_List-40 and our primary model in S1 Table indicated that the sets of selected features are quite different. Thus, only three (15%) features, namely, NF-YA [-100, 0], Sp1 [-200, -100] and SIX5 [-100, 0], were represented in both models. It seems likely that the features in S1 Table are preferable and more reliable than the features selected by K562_List-40, since the accuracy of the primary regression in Table 2 (Ro-p = 0.704) is considerably higher than the accuracy of K562_List-40 (Ro-p = 0.617).

Comparative analysis of cell lines

To increase the accuracy of regression models it is necessary to generate additional features and involve them in regression models. For this purpose, we performed a comparative analysis of cell lines using their transcription activity profiles. We determined the transcription activity profile for the given cell line as a set of 209,911 TSS activities from the FANTOM5 expression atlas.

In general, this atlas contains expression levels for the following three types of objects: cell line, primary cell, and tissue. We analyzed the similarity of objects of the same type using correlations between their transcription activity profiles. In addition, we considered a randomly selected sample to control the similarities between objects of various types. It turned out that there was a high correlation between the considered objects, see Fig 2. Moreover, the highest correlations were observed between different cell lines (see Table 4). It is important to note that RNA-Seq data also confirmed the similarity between cell lines. To demonstrate this, we calculated correlations between 25 distinct cell lines. For this purpose, we used the RNA-Seq datasets generated by the ENCODE3 consortium. It turned out that the correlation coefficient varied in the range [0.423, 0.845], and the mean correlation was equal to 0.681, when protein-coding transcripts from Ensembl were used to calculate the correlation.
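The pairwise comparison of transcription activity profiles can be sketched as follows. The profile names and values are toy placeholders rather than FANTOM5 data.

```python
from itertools import combinations
from typing import Dict, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def pairwise_correlations(
    profiles: Dict[str, Sequence[float]]
) -> Dict[Tuple[str, str], float]:
    """Correlation for every unordered pair of objects of the same type."""
    return {
        (a, b): pearson(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


# Toy LTE-level profiles over five TSSs for three hypothetical cell lines.
toy_profiles = {
    "line_A": [0.0, 1.2, 2.5, 0.0, 3.1],
    "line_B": [0.0, 1.0, 2.7, 0.3, 2.9],
    "line_C": [0.5, 0.0, 2.0, 0.0, 3.5],
}
pairwise = pairwise_correlations(toy_profiles)
for pair, r in pairwise.items():
    print(pair, round(r, 3))
```

The collection of these values over all pairs is the kind of sample summarized by an empirical density or by its range and mean.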

Fig 2. Empirical densities of the Pearson correlation coefficient between objects in the FANTOM5 atlas.

https://doi.org/10.1371/journal.pone.0243332.g002

Table 4. Summary on correlations between objects in the FANTOM5 atlas.

https://doi.org/10.1371/journal.pone.0243332.t004

From a biological point of view, it is not surprising that there are relationships between primary cells, tissues, and cell lines, because tissues are composed of different types of primary cells, and cell lines are immortalized or cancer-transformed cells that resemble their tissue of origin [32]. In other words, one can expect that many pairs of primary cells, tissues, and cell lines can be similar in terms of their transcriptional activity. However, Table 4 and Fig 2 made it possible not only to confirm this fact, but also to estimate the strength of these relationships from a statistical point of view.

Due to the revealed similarities, it is not difficult to accurately predict the transcriptional activity profile of one cell line using the profile of another cell line. In particular, the following two regression models expressed the relationship between transcriptional activity profile for HepG2 and the profile for HEK293 or the profile for K562:

S1 Fig demonstrates these two regression models.

Advanced regression models

Based on comparative analysis performed in the previous section, we can confidently conclude that cell lines are similar in terms of their transcription activity profiles. In other words, the activities of many TSSs are almost identical in many cell lines. To incorporate this cell line commonality into regression models, we generated a new feature called ‘mean profile’. It was defined as a set of 209,911 mean values of activities, where an individual mean activity for each TSS was determined by averaging all of its activities in cell lines available in the FANTOM5 atlas.
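The 'mean profile' is a per-TSS average across cell lines, which can be sketched as follows. The function name and the toy values are illustrative assumptions.

```python
from typing import Dict, List, Sequence


def mean_profile(profiles: Dict[str, Sequence[float]]) -> List[float]:
    """Average the activity of each TSS over all supplied cell lines."""
    columns = list(profiles.values())
    n_tss = len(columns[0])
    if any(len(c) != n_tss for c in columns):
        raise ValueError("all profiles must cover the same TSSs")
    return [sum(values) / len(values) for values in zip(*columns)]


# Toy activities for three TSSs (not FANTOM5 values).
toy = {
    "HepG2": [0.0, 1.0, 2.0],
    "K562": [0.0, 2.0, 2.0],
    "HEK293": [0.0, 3.0, 2.0],
}
print(mean_profile(toy))  # prints [0.0, 2.0, 2.0]
```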

The accuracy of regression model was significantly improved when stepwise forward regression was applied to the combination of PRIMARY_FEATURES and the mean profile. Thus, a comparison of Ro-p values in the first row of Table 5 with Ro-p values achieved using primary regression models (see Table 3) indicated that the accuracy increased 1.23–1.48 times. However, such a regression model has a serious disadvantage, since it is completely useless for predicting the activities of novel TSSs, which are absent in the FANTOM5 atlas. To avoid this disadvantage, we generated a new feature called ‘predicted mean profile’. This feature was determined using the following two-step procedure. In the first step, the stepwise forward regression was applied three times to PRIMARY_FEATURES determined for HepG2, K562 and HEK293 independently. As a result of the first step, three predicted profiles were obtained. In the second step, the ‘predicted mean profile’ was generated by averaging three predicted profiles. Thus, ‘predicted mean profile’ was determined by applying stepwise forward regression technique to all the PRIMARY_FEATURES defined for HepG2, K562 and HEK293. Finally, the stepwise forward regression was applied to the combination of PRIMARY_FEATURES and the predicted mean profile to select the most important features and get an advanced regression model. Table 6 contains the advanced regression model obtained for the HepG2 cell line. S4 and S5 Tables contain the advanced regression models obtained for cell lines K562 and HEK293, respectively. The accuracy of advanced regression models is demonstrated in the second row of Table 5.
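The two-step construction of the 'predicted mean profile' can be sketched for a single TSS as follows. The coefficient dictionaries stand in for the three fitted stepwise models; all names and numbers are hypothetical and only illustrate that no observed expression value is needed for the TSS being predicted.

```python
from typing import Dict

Model = Dict[str, float]  # feature name -> coefficient; "intercept" is the bias term


def predict(model: Model, features: Dict[str, float]) -> float:
    """Linear prediction: intercept plus the weighted sum of feature values."""
    return model.get("intercept", 0.0) + sum(
        coef * features.get(name, 0.0)
        for name, coef in model.items()
        if name != "intercept"
    )


def predicted_mean(
    models: Dict[str, Model], features_by_line: Dict[str, Dict[str, float]]
) -> float:
    """Step 1: predict per cell line. Step 2: average the predictions."""
    predictions = [
        predict(models[line], features_by_line[line]) for line in models
    ]
    return sum(predictions) / len(predictions)


# Hypothetical models and binary features for one TSS.
toy_models = {
    "HepG2": {"intercept": 0.2, "TF_A[-100, 0]": 0.5},
    "K562": {"intercept": 0.1, "TF_B[1, 100]": 0.4},
    "HEK293": {"intercept": 0.3, "TF_C[-100, 0]": -0.2},
}
toy_features = {
    "HepG2": {"TF_A[-100, 0]": 1},
    "K562": {"TF_B[1, 100]": 0},
    "HEK293": {"TF_C[-100, 0]": 1},
}
print(round(predicted_mean(toy_models, toy_features), 3))
```

The resulting value would then be added as one extra column next to PRIMARY_FEATURES before the final round of stepwise selection.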

Table 5. Accuracy of advanced regression models for the HepG2, K562 and HEK293 cell lines.

https://doi.org/10.1371/journal.pone.0243332.t005

Table 6. Advanced regression model for the HepG2 cell line.

https://doi.org/10.1371/journal.pone.0243332.t006

It is important to note that the predicted mean profile was the most important feature in all three advanced regression models. Therefore, one can conclude that common (i.e. not specific to the cell line) transcription processes are dominant in different cell lines. However, cell line specificity can also be detected using advanced regression models. Thus, it is well known that HEY1 is involved in the regulation of self-renewal of liver cancer cells [33]. Therefore, it was not surprising that HEY1 was the most represented TF in the most important features for HepG2. According to Table 6, HEY1 was observed in the six most important features. In other words, its binding to 6 promoter regions [-5000, -1001], [-1000, -501], [-500, -201], [-100, 0], [1, 200] and [501, 1000] was important for transcription in the HepG2 cell line. Moreover, based on comparison of the primary and advanced regression models (see Tables 2 and 6), one can conclude that the features of the advanced regression model were more specific for cell lines than those of the primary regression model. Indeed, HEY1 was observed only in the four most important features of the primary model. Additionally, it is well-known that HNF3G (hepatocyte nuclear factor 3-gamma) plays an important role in the development, differentiation and regeneration of the liver [34]. Therefore, it was not surprising that HNF3G was selected using the advanced regression (see Table 6) but it was not selected using primary regression.

It is interesting to note that the most important features correlated with some additional features from PRIMARY_FEATURES. Thus, Table 7 contains the five features that correlate most strongly with each of the most important features identified for the HepG2 cell line. It is not difficult to see that almost all important features (excluding GR[-500, -201]) were highly correlated with the corresponding Abundance-features. In particular, the correlation coefficient between HEY1[501, 1000] and Abundance[501, 1000] is equal to 0.689 while for the pair (HEY1[-1000, -501], Abundance[501, 1000]) it is equal to 0.706. According to the definition of Abundance-features, they can be interpreted as indicators of cis-regulatory modules. Therefore, one can conclude that almost all TFs involved in the most important features prefer to bind to putative cis-regulatory modules. For example, according to the information for the important feature HEY1[501, 1000] in Table 7, we can expect the existence of a putative cis-regulatory module within the [501, 1000] promoter regions of some genes, and this module contains HEY1, ZNF205, and IRF2. According to the information for TAF1[1, 100], another putative cis-regulatory module within the [1, 100] promoter regions contains TAF1, NONO, CEBPD, and HEY1. On the one hand, HEY1, TAF1 and NONO are involved in the most important features. On the other hand, ZNF205, IRF2 and CEBPD are not directly involved and can possibly be classified as less important. Nevertheless, their impact on TSS activity was also taken into account because they participated in the selected ‘Abundance[-100, 0]’ feature. Thus, from the point of view of regression models, the most important features were related to TSS activity directly and individually, while less important features were related to TSS activity jointly.

Table 7. Features that most correlate with the individual most important features identified for the HepG2 cell line.

https://doi.org/10.1371/journal.pone.0243332.t007

According to the signs of the advanced regression coefficients, the most important features identified by advanced regressions can also be classified as activators or repressors. In particular, based on Table 6, one can conclude that five features (namely, HEY1 [-100, 0], AhR [-100, 0], NF-YC [501, 1000], TAF1 [-1000, -501] and HNF3G [101, 500]) can be classified as repressors, while the remaining features can be classified as activators. However, one can expect that this classification may be distorted by the presence of Abundance-features among the most important features, since Abundance-features include information about many TFs. To understand how reliably stepwise regression models can actually classify features into activators or repressors, we conducted the following test. We removed all Abundance-features from the most important features and built ordinary least squares regression models using the remaining important features. As a result, we observed a slight decrease in the accuracy of the regression, but the regression coefficients and their significance changed imperceptibly. Thus, in the case of the HepG2 cell line, the removal of the feature Abundance [-100, 0] resulted in a slight decrease in the Ro-p correlation coefficient from 0.743 to 0.739. Nevertheless, the same five features can still be classified as repressors due to the negative signs of their regression coefficients: HEY1 [-100, 0] (regression coefficient = -0.087, p-value = 1.512 × 10^-86), AhR [-100, 0] (-0.048, 1.117 × 10^-78), NF-YC [501, 1000] (-0.171, 2.736 × 10^-78), TAF1 [-1000, -501] (-0.093, 9.954 × 10^-91) and HNF3G [101, 500] (-0.087, 1.512 × 10^-86). Thus, the presence of Abundance-features had no essential influence on the repressor/activator classification.

It is interesting to note that we also considered an additional way of repressor/activator classification. In this case, each TF was analyzed independently. For each TF, we constructed an ordinary least squares regression model in which only eight binary features were used. Based on the signs of the regression coefficients and the p-values, we considered the following three categories. A TF was classified as a significant repressor in a given promoter region if the sign of the regression coefficient of the corresponding feature was negative and the p-value < 10^-5. If the sign was positive and the p-value < 10^-5, then the TF was classified as a significant activator. If the p-value > 10^-5, then the TF was considered insignificant. Fig 3 shows the results of this classification for all 11 TFs that were selected as the most important for the HepG2 cell line. The new classification approach confirmed most of the features from Table 6. Only three features (namely, c-Myc[-100, 0], NF-YC[501, 1000] and HNF3G[101, 500]) were classified as insignificant. However, it is necessary to note that the results of the new classification are less reliable, as the accuracy of the ordinary least squares regressions was quite moderate: Ro-p varied in the range [0.102, 0.628], see Fig 3.
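The three-category rule can be written as a small function. This sketch takes the regression coefficient and its p-value as given inputs; treating a p-value exactly equal to the threshold as insignificant is an assumption, since the text does not cover that case, and the example numbers are illustrative.

```python
def classify(coefficient: float, p_value: float, alpha: float = 1e-5) -> str:
    """Label one (TF, promoter region) feature by sign and significance."""
    if p_value >= alpha:  # assumption: the boundary case counts as insignificant
        return "insignificant"
    return "activator" if coefficient > 0 else "repressor"


# Illustrative inputs (not values from the paper).
examples = [(-0.09, 1e-30), (0.12, 3e-8), (0.05, 0.02)]
print([classify(c, p) for c, p in examples])
# prints ['repressor', 'activator', 'insignificant']
```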

Fig 3. Classification of repressors/activators obtained by ordinary regressions using eight features for the HepG2 cell line.

https://doi.org/10.1371/journal.pone.0243332.g003

On the one hand, for the construction of regression models it is sufficient to use the ‘predicted mean profile’ and approximately the 20 most important features, due to the small increments of Ro-p in the last steps of feature selection. According to their p-values, the selected features are extremely significant. On the other hand, it seems likely that these sets of features can be extended by additionally composing lists of attendant features that also play a role in cell-specific regulation. For this purpose, we continued to select features with the help of stepwise forward regression. In this case, we stopped the selection when the p-value of the least significant regression coefficient exceeded the threshold of 10^-20. The lists of attendant features for the HepG2, K562 and HEK293 cell lines are available as S6 to S8 Tables, respectively. The attendant features can be classified as less important, but still highly significant for cell-specific regulation.

Additionally, we briefly considered the possibility of using advanced regression models obtained for one cell line (for example, HepG2) to predict TSS activities in other highly correlated cells (for example, primary hepatocytes). Unfortunately, this approach is not applicable (at least, not extensively) to our features due to the frequent incompleteness of the ChIP-Seq data. This incompleteness arises because, for example, the ChIP-Seq experiments for the cell line were carried out using one set of TFs, while the experiments for the primary cells were carried out with a different set of TFs. In particular, according to Table 6, to predict TSS activities in hepatocytes using the advanced regression model, it is necessary to have TFBRs of the eleven selected TFs, while currently (according to the GTRD database) only CTCF, EZH2 and NR1H4 have been studied in ChIP-Seq experiments. Therefore, the advanced regression model in Table 6 cannot yet be used to predict TSS activities in hepatocytes.

Finally, to demonstrate the usefulness of the predicted mean profile, we performed a regression analysis of three rare cell lines DU145 (prostate carcinoma), THP-1 (acute monocytic leukemia) and U937 (adult acute monocytic leukemia). Only a few TFs were studied in ChIP-Seq experiments on these cell lines, see Table 8. It was therefore not surprising that the primary regression models could only achieve low accuracy: 0.426 ≤ Ro-p ≤ 0.538. However, the accuracy increased 1.30–1.57 times when we used the predicted mean profile to build advanced regression models, see Table 8. These models are available in S9 to S11 Tables. The significant increment of Ro-p values indicated that a non-cell-specific feature (the predicted mean profile) could compensate, at least in part, for the absence of a large number of features in poorly studied cell lines.

Table 8. Accuracy of the primary and advanced regression models for the DU145, THP-1 and U937 cell lines.

https://doi.org/10.1371/journal.pone.0243332.t008
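The accuracy comparison between primary and advanced models can be sketched as follows. The sketch assumes that Ro-p is the Pearson correlation between observed and predicted values; the numbers are toy values chosen for illustration and are not taken from Table 8.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values (hypothetical): observed levels and predictions of two models.
observed = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
primary_pred = [2.0, 1.0, 3.0, 2.0, 4.0, 3.0]
advanced_pred = [0.5, 1.0, 2.5, 2.5, 4.0, 4.5]

ro_primary = pearson(observed, primary_pred)
ro_advanced = pearson(observed, advanced_pred)
fold_increase = ro_advanced / ro_primary  # "accuracy increased N times"
print(round(ro_primary, 3), round(ro_advanced, 3), round(fold_increase, 2))
```

A fold increase above 1 means that the advanced model tracks the observed profile more closely than the primary model does.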

Sum-transformation of expression levels for closely spaced TSSs

One specific property of the TSSs in the FANTOM5 atlas is that many of them are located close to each other. In particular, 116,620 TSSs (55.6%) have other TSSs within a distance of less than 100 bp. For such closely spaced TSSs, we replaced their individual expression levels with the sums of their expression levels and then calculated LTE-levels for the sums. For the remaining TSSs, which are separated from all others by at least 100 bp, we did not change their individual LTE-levels. As a result, for a given cell line, we created a new transcription profile, which we call the sum-transformed profile.
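A minimal sketch of such a sum-transformation is given below. Several details are assumptions rather than the authors' procedure: TSSs are grouped per chromosome and strand, neighbours closer than 100 bp are chained into one cluster, and log(1 + x) stands in for the LTE transform, whose exact definition is given elsewhere in the paper. The record layout and the function name are hypothetical.

```python
import math
from collections import defaultdict

def sum_transform(tss_records, max_gap=100):
    """tss_records: list of (chrom, strand, position, expression) tuples.

    Returns one log-transformed level per input TSS, in input order.
    """
    groups = defaultdict(list)
    for idx, (chrom, strand, pos, _expr) in enumerate(tss_records):
        groups[(chrom, strand)].append((pos, idx))

    summed = [0.0] * len(tss_records)
    for members in groups.values():
        members.sort()
        # Chain neighbours whose distance is below max_gap into clusters.
        clusters = [[members[0]]]
        for pos, idx in members[1:]:
            if pos - clusters[-1][-1][0] < max_gap:
                clusters[-1].append((pos, idx))
            else:
                clusters.append([(pos, idx)])
        # Every TSS in a cluster receives the cluster's total expression.
        for cluster in clusters:
            total = sum(tss_records[i][3] for _, i in cluster)
            for _, i in cluster:
                summed[i] = total

    # Assumption: log(1 + x) is used here as a stand-in for the LTE transform.
    return [math.log(1.0 + s) for s in summed]

records = [
    ("chr1", "+", 1000, 3.0),
    ("chr1", "+", 1050, 5.0),   # 50 bp from the previous TSS: summed with it
    ("chr1", "+", 1500, 2.0),   # isolated: level unchanged
    ("chr1", "-", 1020, 7.0),   # other strand: treated separately
]
profile = sum_transform(records)
print(profile)
```

An isolated TSS forms a cluster of size one, so its transformed level is identical to the level it would have had without the transformation.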

To predict the sum-transformed profile, we also applied stepwise forward regression to PRIMARY_FEATURES. In other words, we constructed primary regression models for the prediction of sum-transformed profiles. S12 to S14 Tables contain the resulting sum-transformed regression models. The first row of Table 9 contains the accuracy of the sum-transformed regression models for the HepG2, K562 and HEK293 cell lines. Comparison of the Ro-p values in the first row of Table 9 with the Ro-p values achieved using the primary regression models (see Table 3) indicates that the accuracy increased 1.073–1.15 times. Thus, the transition from transcription activity profiles to sum-transformed profiles is a second way to increase the accuracy of regression models.

Table 9. Ro-p correlations and percentages of identical features for sum-transformed regression models for the HepG2, K562 and HEK293 cell lines.

https://doi.org/10.1371/journal.pone.0243332.t009

Finally, it is interesting to note that the lists of features selected by the sum-transformed regression models and by the primary regression models overlap substantially. For example, Table 2 and S12 Table share 12 (60%) identical features. The second row of Table 9 contains the percentage of identical features for the three analyzed cell lines.
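The percentage of identical features is simple set arithmetic, sketched below. The feature names are hypothetical placeholders, and the sketch assumes that each model contains 20 features, which is consistent with 12 shared features corresponding to 60%.

```python
def percent_identical(primary_features, transformed_features):
    """Return (number of shared features, percentage relative to the primary list)."""
    shared = set(primary_features) & set(transformed_features)
    return len(shared), 100.0 * len(shared) / len(primary_features)

# Hypothetical placeholder names: two lists of 20 features sharing 12 entries.
primary = [f"feature_{i}" for i in range(20)]
sum_transformed = [f"feature_{i}" for i in range(8, 28)]

count, percent = percent_identical(primary, sum_transformed)
print(count, percent)
```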

Conclusions

  1. Using the stepwise forward regression method, we identified the sets of the most important TFs that affect expression activity of TSSs in human cell lines such as HepG2, K562 and HEK293.
  2. With the help of the constructed regression models, we demonstrated that some TFs can be classified simultaneously as repressors and activators depending on their location relative to TSS.
  3. A comparative analysis of cell lines revealed high similarity between them. We expressed the commonality of cell lines using the novel feature ‘predicted mean profile’. We demonstrated that this feature is useful for improving the accuracy of regression models, as well as for analyzing rare cell lines.

Supporting information

S1 Fig. Relationships between the transcriptional activity profile for HepG2 and the profile for HEK293 (upper figure) and the profile for K562 (lower figure).

https://doi.org/10.1371/journal.pone.0243332.s001

(PDF)

S1 Table. Primary regression model for the K562 cell line.

https://doi.org/10.1371/journal.pone.0243332.s002

(DOCX)

S2 Table. Primary regression model for the HEK293 cell line.

https://doi.org/10.1371/journal.pone.0243332.s003

(DOCX)

S3 Table. Primary regression model K562_List-40.

https://doi.org/10.1371/journal.pone.0243332.s004

(DOCX)

S4 Table. Advanced regression model for the K562 cell line.

https://doi.org/10.1371/journal.pone.0243332.s005

(DOCX)

S5 Table. Advanced regression model for the HEK293 cell line.

https://doi.org/10.1371/journal.pone.0243332.s006

(DOCX)

S6 Table. List of attendant features that are significantly cell-specific for regulation of HepG2.

https://doi.org/10.1371/journal.pone.0243332.s007

(DOCX)

S7 Table. List of attendant features that are significantly cell-specific for regulation of K562.

https://doi.org/10.1371/journal.pone.0243332.s008

(DOCX)

S8 Table. List of attendant features that are significantly cell-specific for regulation of HEK293.

https://doi.org/10.1371/journal.pone.0243332.s009

(DOCX)

S9 Table. Advanced regression model for the DU145 cell line.

https://doi.org/10.1371/journal.pone.0243332.s010

(DOCX)

S10 Table. Advanced regression model for the THP-1 cell line.

https://doi.org/10.1371/journal.pone.0243332.s011

(DOCX)

S11 Table. Advanced regression model for the U937 cell line.

https://doi.org/10.1371/journal.pone.0243332.s012

(DOCX)

S12 Table. Sum-transformed regression model for the HepG2 cell line.

https://doi.org/10.1371/journal.pone.0243332.s013

(DOCX)

S13 Table. Sum-transformed regression model for the K562 cell line.

https://doi.org/10.1371/journal.pone.0243332.s014

(DOCX)

S14 Table. Sum-transformed regression model for the HEK293 cell line.

https://doi.org/10.1371/journal.pone.0243332.s015

(DOCX)

Acknowledgments

The authors are grateful to their colleague Semyon Kolmykov for technical support.

References

  1. 1. Yevshin I, Sharipov R, Valeev T, Kel A, Kolpakov F. GTRD: a database of transcription factor binding sites identified by ChIP-seq experiments. Nucleic Acids Res. 2017;45(D1):D61–D67. pmid:27924024
  2. 2. Yevshin I, Sharipov R, Kolmykov S, Kondrakhin Y, Kolpakov F. GTRD: a database on gene transcription regulation-2019 update. Nucleic Acids Res. 2019;47(D1):D100–D105. pmid:30445619
  3. 3. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. pmid:22955616
  4. 4. Oki S, Ohta T, Shioi G, Hatanaka H, Ogasawara O, Okuda Y, et al. ChIP-Atlas: a data-mining suite powered by full integration of public ChIP-seq data. EMBO Rep. 2018;19(12):e46255. pmid:30413482
  5. 5. Cheneby J, Gheorghe M, Artufel M, Mathelier A, Ballester B. ReMap 2018: an updated atlas of regulatory regions from an integrative analysis of DNA-binding ChIP-seq experiments. Nucleic Acids Res. 2018;46(D1):D267–D275. pmid:29126285
  6. 6. Pepke S, Wold B, Mortazavi A. Computation for ChIP-seq and RNA-seq studies. Nat Methods. 2009;6(11 Suppl):S22–S32. pmid:19844228
  7. 7. Angelini C, Costa V. Understanding gene regulatory mechanisms by integrating ChIP-seq and RNA-seq data: statistical solutions to biological problems. Front Cell Dev Biol. 2014;2:51. pmid:25364758
  8. 8. The FANTOM Consortium and The RIKEN PMI and CLST, et al. A promoter-level mammalian expression atlas. Nature. 2014;507:462–71. pmid:24670764
  9. 9. Abugessaisa I, Noguchi S, Asegawa A, Harshbarger J, Kondo A, Lizio M, et al. FANTOM5 CAGE profiles of human and mouse reprocessed for GRCh38 and GRCm38 genome assemblies. Sci Data. 2017;4:170107. pmid:28850105
  10. 10. Kim TK, Hemberg M, Gray JM, Costa AM, Bear DM, Wu J, et al. Widespread transcription at neuronal activity-regulated enhancers. Nature. 2010;465:182–7. pmid:20393465
  11. 11. Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, et al. An atlas of active enhancers across human cell types and tissues. Nature. 2014;507(7493):455–61. pmid:24670763
  12. 12. Gao T, He B, Liu S, Zhu H, Tan K, Qian J. EnhancerAtlas: a resource for enhancer annotation and analysis in 105 human cell/tissue types. Bioinformatics. 2016;32(23):3543–51. pmid:27515742
  13. 13. Abugessaisa I, Noguchi S, Carninci P, Kasukawa T. The FANTOM5 computation ecosystem: genomic information hub for promoters and active enhancers. Methods Mol Biol. 2017;1611:199–217. pmid:28451981
  14. 14. Niu M, Tabari E, Ni P, Su Z. Towards a map of cis-regulatory sequences in the human genome. Nucleic Acids Res. 2018;46(11):5395–409. pmid:29733395
  15. 15. Wang J, Dai X, Berry LD, Cogan JD, Liu Q, Shyr Y. HACER: an atlas of human active enhancers to interpret regulatory variants. Nucleic Acids Res. 2019;47(D1):D106–D112. pmid:30247654
  16. 16. Yoshida H, Lareau CA, Ramirez RN, Rose SA, Maier B, Wroblewska A, et al. The cis-regulatory atlas of the mouse immune system. Cell. 2019;176(4):897–912. pmid:30686579
  17. 17. Moore JE, Purcaro MJ, Pratt HE, Epstein CB, Shoresh N, Adrian J., et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature. 2020;583:699–710. pmid:32728249
  18. 18. Deviatiiarov R, Gams A, Syunyaev R, Tatarinova T, Gusev O, Efimov I. Human atlas of cardiac promoters and enhancers reveals important role of regulatory elements in heritable diseases. Research Square rs.3.rs-37530 [Preprint]. 2020 [cited 2020 Oct 14]. Available from: https://www.researchsquare.com/article/rs-37530/v1.
  19. 19. Li YE, Preissl S, Hou X, Zhang Z, Zhang K, Fang R, et al. An atlas of gene regulatory elements in adult mouse cerebrum. bioRxiv 2020.05.10.087585 [Preprint]. 2020 [cited 2020 Oct 14]. Available from: https://www.biorxiv.org/content/10.1101/2020.05.10.087585v2.
  20. 20. Dailey L. High throughput technologies for the functional discovery of mammalian enhancers: new approaches for understanding transcriptional regulatory network dynamics. Genomics. 2015;106(3):151–8. pmid:26072436
  21. 21. Zhang XO, Gingeras TR, Weng Z. Genome-wide analysis of polymerase III-transcribed Alu elements suggests cell-type-specific enhancer function. Genome Res. 2019;29(9):1402–14. pmid:31413151
  22. 22. Pang B, Snyder MP. Systematic identification of silencers in human cells. Nat Genet. 2020;52(3):254–63. pmid:32094911
  23. 23. Guo Y, Mahony S, Gifford DK. High resolution genome wide binding event finding and motif discovery reveals transcription factor spatial binding constraints. PLoS Comput Biol. 2012;8(8):e1002638. pmid:22912568
  24. 24. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9(9):R137. pmid:18798982
  25. 25. Zhang X, Robertson G, Krzywinski M, Ning K, Droit A, Jones S, et al. PICS: probabilistic inference for ChIP-seq. Biometrics. 2011;67(1):151–63. pmid:20528864
  26. 26. Narlikar L, Jothi R. ChIP-Seq data analysis: identification of protein-DNA binding sites with SISSRs peak-finder. Methods Mol Biol. 2011;802:305–22.
  27. 27. Kolmykov SK, Kondrakhin YV, Yevshin IS, Sharipov RN, Ryabova AS, Kolpakov FA. Population size estimation for quality control of ChIP-Seq datasets. PLoS One. 2019;14(8):e0221760. pmid:31465497
  28. 28. Davidson EH. cis-Regulatory modules, and the structure/function basis of regulatory logic. In: Davidson EH. The regulatory genome: gene regulatory networks in development and evolution. Cambridge: Academic Press; 2006. p. 31–86.
  29. 29. Nakagawa O, McFadden DG, Nakagawa M, Yanagisawa H, Hu T, Srivastava D, et al. Members of the HRT family of basic helix-loop-helix proteins act as transcriptional repressors downstream of Notch signaling. Proc Natl Acad Sci USA. 2000;97(25):13655–60. pmid:11095750
  30. 30. Memon A, Lee WK. KLF10 as a tumor suppressor gene and its TGF-β signaling. Cancers. 2018;10(6): E161. pmid:29799499
  31. 31. Cheng C, Alexander R, Min R, Leng J, Yip KY, Rozowsky J, et al. Understanding transcriptional regulation by integrative analysis of transcription factor binding data. Genome Res. 2012;22(9):1658–67. pmid:22955978
  32. 32. He B, Lang J, Wang B, Liu X, Lu Q, He J, et al. TOOme: a novel computational framework to infer cancer tissue-of-origin by integrating both gene mutation and expression. Front Bioeng Biotechnol. 2020;8:394. pmid:32509741
  33. 33. Zhu P, Wang Y, Du Y, He L, Huang G, Zhang G, et al. C8orf4 negatively regulates self-renewal of liver cancer stem cells via suppression of NOTCH2 signalling. Nat Commun. 2015;6:7122. pmid:25985737
  34. 34. Costa RH, Kalinichenko VV, Holterman AXL, Wang X. Transcription factors in liver development, differentiation, and regeneration. Hepatology. 2003;38(6):1331–47. pmid:14647040