Relationship between regulatory pattern of gene expression level and gene function

  • Masayo Inoue, 
  • Katsuhisa Horimoto

Abstract

Regulation of gene expression levels is essential for all living systems and transcription factors (TFs) are the main regulators of gene expression through their ability to repress or induce transcription. A balance between synthesis and degradation rates controls gene expression levels. To determine which rate is dominant, we analyzed the correlation between expression levels of a TF and its regulated gene based on a mathematical model. We selected about 280,000 expression patterns of 355 TFs and 647 regulated genes using DNA microarray data from the Gene Expression Omnibus (GEO) data repository. Based on our model, correlation between the expressions of TF–regulated gene pairs corresponds to tuning of the synthesis rate, whereas no correlation indicates excessive synthesis and requires tuning of the degradation rate. The gene expression relationships between TF–regulated gene pairs were classified into four types that correspond to different gene regulatory mechanisms. It was surprising that fewer than 20% of these genes were governed by the familiar regulatory mechanism, i.e., through the synthesis rate. Moreover, we performed pathway analysis and found that each classification type corresponded to distinct gene functions: cellular regulation pathways were dominant in the type with synthesis rate regulation and terms associated with diseases such as cancer, Parkinson’s disease, and Alzheimer’s disease were dominant in the type with degradation rate regulation. Interestingly, these diseases are caused by the accumulation of proteins. These results indicated that gene expression is regulated structurally, not arbitrarily, according to the gene function. This finding is indicative of a systematic control of transcription processes at the whole-cell level.

Introduction

Gene expression is an essential process for all living systems [1, 2]. In general, expression levels are controlled via the balance between the synthesis rate and the degradation rate. When the synthesis rate is dominant, the expression level of a regulated gene is controlled by the expression level of a transcription factor (TF). In each transcription process, a TF induces or represses the expression of the gene alone or with the help of other proteins constituting a complex [3–5]. More than 2,000 TFs are thought to be encoded in the human genome [6, 7] and the expression levels of many genes are actually controlled through the synthesis rate. In contrast, some genes are not regulated by the synthesis rate, but by TFs that simply set the on/off state of the synthesis process and are not responsible for the synthesis rate [8, 9]. In such cases, the expression level is regulated via the degradation process; i.e., the degradation rate is dominant for the control of the expression level. Thus, the transcription of some genes is regulated by the synthesis rate, and the transcription of other genes is based on on/off regulation. However, which rate is dominant for each gene is still unclear.

The regulatory mechanisms of some genes have been studied intensively, but a comprehensive study is still difficult from a technological standpoint. Recent advances in protein quantification technologies have enabled draft maps of the human proteome to be analyzed [10, 11]; however, no high-throughput technology is currently available for analyzing the abundance of proteins at the whole-cell level. It has been widely reported that alterations in protein abundance are strongly associated with changes in mRNA expression levels [10, 12–14]. Based on this reported relationship, we used available mRNA data [15–18] to obtain an overview of the regulatory mechanisms of each gene at the whole-cell level.

Here, our objective is to determine which rate is dominant, the synthesis rate or the degradation rate, in the control of each gene expression level. Based on a simple mathematical model, the expression levels of a TF and the regulated gene show a correlation when the synthesis rate is dominant, but no such correlation is shown when the degradation rate is dominant. We studied this correlation by constructing approximately 280,000 scatter diagrams of “TF–regulated gene” pairs. All the scatter diagrams were classified into four types depending on the regulatory mechanisms. We also characterized each type in terms of gene function and found that the regulatory mechanisms were assigned systematically (not arbitrarily), according to the gene functions. This result illustrates that the regulatory mechanisms of gene expression levels correspond to gene function at the whole-cell level.

Results

Four types in scatter diagrams of TFs and regulated genes

We constructed about 280,000 scatter diagrams of expression levels of TFs and their regulated genes using DNA microarray data from the Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/) at the NCBI [19]. We selected 135 series of GEO DataSets (GDS) for Homo sapiens that were composed of more than 50 GEO samples. The list of the selected GDS records is in S1 Table. For details of the data preparation, see Methods. To match TFs to regulated genes, we used the TRANSFAC database provided by BioBase and selected 2,073 TF–regulated gene pairs from among 355 TFs and 647 regulated genes. Note that we analyzed only TF–regulated gene pairs whose regulatory relationships have already been confirmed.

We identified four typical types of scatter diagrams depending on the regulatory mechanisms between the TF–regulated gene pairs (Fig 1): constant expression levels for both a TF and the regulated gene, albeit with small fluctuations (no-change type); correlation between expression levels of a TF and the regulated gene (correlation type); no correlation between expression levels of a TF and the regulated gene because the gene has a constant expression level (horizontal-distribution type); and no correlation between expression levels of a TF and the regulated gene because the TF has a constant expression level (vertical-distribution type). We analyzed 2,073 (pairs) ×135 (GDS) to give approximately 280,000 scatter diagrams, and all the diagrams could be classified into these four types.

Fig 1. The four typical types of the scatter diagrams for TF–regulated gene pairs.

The expression level of a transcription factor (TF; X-axis) and the regulated gene (Y-axis) are plotted. Data from GDS1962 were used. One point represents one sample in the GDS, namely, 180 points are shown in each diagram. (a) The no-change type; both a TF (RELA) and the regulated gene (IKBKE) are expressed at constant levels with small fluctuations. (b) The correlation type: a strong correlation between a TF (STAT1) and its regulated gene (PSMB9). (c) The horizontal-distribution type: a regulated gene (CTNNB1) shows a constant expression level regardless of changes in the TF (NKX2-5) expression level. (d) The vertical-distribution type: a regulated gene (CCK) undergoes changes in the expression level even at a constant expression level of the TF (CREB1).

https://doi.org/10.1371/journal.pone.0177430.g001

Classification of TF–regulated gene relationship in four correlation types

We studied how the four classification types are implemented. We also characterized each scatter diagram with six indicators to define the classification criteria. Four of them are standard variables: absolute value of slope (|s|) and coefficient of determination (R2) from a least squares approximation, and variance in TF distribution (VTF) or regulated gene distribution (VRG). The other two parameters, the uniformity count for a TF (UTF) or its regulated gene (URG), were introduced to distinguish a uniform distribution from the no-change type with a few outliers. The uniformity count is defined as the number of filled units when the area between a maximum and a minimum value is divided into 10 units. These indicators are shown schematically in Fig 2.
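The uniformity count can be illustrated with a short sketch. The function name `uniformity_count` is hypothetical, and two details are assumptions not stated in the text: units are treated as half-open intervals, and the maximum value is placed in the last unit.

```python
def uniformity_count(values, n_units=10):
    """Count how many of n_units equal-width units between min and max contain data."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: every point falls into a single unit.
        return 1
    width = (hi - lo) / n_units
    filled = set()
    for v in values:
        # Assumption: the maximum value belongs to the last unit, not an 11th one.
        idx = min(int((v - lo) / width), n_units - 1)
        filled.add(idx)
    return len(filled)


# Evenly spread values fill all ten units.
spread = uniformity_count(list(range(11)))
# A constant level with one outlier fills only two units.
outlier = uniformity_count([0.0] * 20 + [10.0])
```

Under these assumptions, a no-change distribution with a few outliers yields a small count, whereas a genuinely spread distribution approaches 10, which is the distinction the indicator is meant to capture.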

Fig 2. The six classification indicators used to classify the scatter diagrams.

(a) An absolute value of slope (|s|) and a coefficient of determination (R2) defined from a least squares approximation, and variance in TF distribution (VTF) or regulated gene distribution (VRG) representing characteristic ranges for data distributions. (b) The uniformity count for a TF (UTF) or its regulated gene (URG) defined as the number of filled units among 10 units dividing the area between a maximum and a minimum value, respectively. A unit is filled if there is at least one data point in it and it is unfilled if there are no data points.

https://doi.org/10.1371/journal.pone.0177430.g002

The no-change type.

The no-change type (Fig 1a) is trivial and is the type that occurs most frequently because this relationship exists when there is no need to change the expression levels of both a TF and the regulated gene under the experimental conditions. Environmental changes are known to change the expression levels of some relevant genes, while the expression levels of many other genes are unchanged (thereby contributing to homeostasis, an essential attribute for all living organisms). In addition, the experimental conditions for each GDS differed and only some specific genes were affected. Thus, the observation that many genes have constant expression levels is only natural.

The correlation and horizontal-distribution types.

The mechanisms of the correlation type (Fig 1b) and the horizontal-distribution type (Fig 1c) can be described by a simple mathematical model for a transcription process (see Methods for details). Suppose a TF molecule stochastically binds to or dissociates from a promoter sequence, and the regulated gene is transcribed only when the TF binds to the sequence. Assuming an equilibrium state, the mRNA level of the regulated gene in a steady state ([RG]*) can be written as a function of the expression level of the TF in the steady state ([TF]*) as

$$[RG]^* = \frac{1}{\gamma}\,\frac{[TF]^*}{K + [TF]^*} \tag{1}$$

Here, K is the dissociation constant for the TF and the promoter sequence, and γ is the ratio of the degradation rate to synthesis rate for the regulated gene. Eq (1) describes two characteristic relations between [TF]* and [RG]* depending on the K value (Fig 3). [RG]* shows a strong correlation with [TF]* when K ≫ 1, corresponding to the correlation type (Fig 1b). The expression level of the regulated gene changes depending on the TF expression level; in other words, the expression level of the regulated gene is finely regulated by the TF. Conversely, when K ≪ 1, [RG]* remains at a constant level regardless of [TF]*, corresponding to the horizontal-distribution type (Fig 1c). The regulated gene is always synthesized in large excess because the binding–dissociation equilibrium is strongly shifted toward the binding state, i.e., synthesis. Therefore, fine regulation of the gene expression level is impossible, and only the on/off state of the process can be regulated when K ≪ 1.
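The two regimes can be checked numerically with a minimal sketch. It assumes the saturating form [RG]* = [TF]* / (γ (K + [TF]*)) that follows from the binding model in Methods; the function name and the particular K, γ, and [TF]* values are illustrative choices, not values from the study.

```python
def steady_state_rg(tf, K, gamma=1.0):
    """Steady-state level of the regulated gene for a given TF level.

    Assumed form: [RG]* = (1 / gamma) * [TF]* / (K + [TF]*).
    """
    return (1.0 / gamma) * tf / (K + tf)


tf_levels = [0.1, 1.0, 10.0]  # illustrative TF levels spanning two decades

# K >> 1: [RG]* is nearly proportional to [TF]* (correlation type).
large_K = [steady_state_rg(tf, K=1000.0) for tf in tf_levels]

# K << 1: [RG]* saturates near 1 / gamma regardless of [TF]* (horizontal type).
small_K = [steady_state_rg(tf, K=0.001) for tf in tf_levels]
```

With these illustrative numbers, a 100-fold change in [TF]* changes [RG]* almost 100-fold when K is large, but by only about 1% when K is small.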

Fig 3. The correlation and horizontal-distribution relationships between TF–regulated gene pairs.

Typical examples of the relation between a transcription factor ([TF]*: the X-axis) and the regulated gene ([RG]*: the Y-axis) according to Eq (1). Both axes have a logarithmic scale. [RG]* changes as a function of [TF]* when K ≫ 1 (left). [RG]* maintains a constant expression level when K ≪ 1 (right).

https://doi.org/10.1371/journal.pone.0177430.g003

In Eq (1), [TF]* represents the protein concentration of a TF, but in our study we have used mRNA expression data. Although protein concentration and mRNA expression data represent different biological processes, studies have shown that there is a tolerably good correlation between the two [10, 12–14]. By assuming this correlation, we can conclude that the correlation and horizontal-distribution type relations result from differences in the transcriptional regulation mechanisms. Even without this assumption, we can say that the correlation between a TF and its regulated gene indicates fine regulation, whereas the horizontal distribution indicates the absence of regulation by the TF.

Now, we define the classification criteria of the two types as follows. The scatter diagram of the correlation type is approximated by a straight line with a finite slope (0.1 < |s|<10) with a certain level of accuracy (R2 > 0.3). The expression levels of both a TF and its regulated gene need to change significantly (VTF > 0.25 and VRG > 0.25) to show such a linear correlation. On the other hand, the diagram of the horizontal-distribution type is approximated by a horizontal line (|s|<0.1). The TF expression level needs to change significantly compared with the expression level of its regulated gene (VTF > 0.25 and VTF/VRG > 3) and also show uniform distribution (UTF ⩾ 7).

The vertical-distribution type.

We have not elucidated the mechanism behind the vertical-distribution type (Fig 1d). It is conceivable that genes in this type are regulated not only by TFs, but also by other factors, such as translational mechanisms.

The diagram of this type is characterized by a vertical line (|s|>10). In contrast to the horizontal-distribution type, the expression level of the regulated gene changes significantly compared with the TF (VRG > 0.25 and VTF/VRG < 1/3) and is distributed uniformly (URG ⩾ 7). The numerical classification criteria are summarized in Table 1 (see Methods for details).

Table 1. Classification criteria used to classify the scatter diagrams into the four types.

https://doi.org/10.1371/journal.pone.0177430.t001

Assignment of the regulated genes to the four types using the classification criteria

After classifying the nearly 280,000 diagrams into the four types (Fig 1), we assigned one correlation type to each regulated gene. It should be noted that one TF can regulate multiple genes and one gene can be regulated by multiple TFs. In addition, one regulated gene can be classified into different types depending on the experimental conditions or cellular states. To avoid ambiguous classifications, we defined a logical rule (see Methods) and assigned one type for each regulated gene depending on the GDS.

Each regulated gene was classified into different relation types depending on the GDS, as shown in Fig 4. However, some genes were classified into one definite type in most GDS. To fix the type for each gene, we integrated the results from the 135 GDS by selecting a majority type from among the correlation, horizontal-distribution, and vertical-distribution types. We ignored the no-change type because our aim was to study how gene expression levels are controlled through the TF–regulated gene correlation. The no-change type occurs when there is no need to change expression levels under the experimental condition of a GDS. For example, PSMB9 was assigned to the correlation type as 53 GDS showed the correlation type and only 3 GDS showed the vertical-distribution type (Table 2). In a similar way, CTNNB1 was assigned to the horizontal-distribution type and CCK was assigned to the vertical-distribution type according to the majority rule (Table 2). Using the classification criteria, we successfully classified most of the 647 regulated genes: 111 into the correlation type, 178 into the horizontal-distribution type, and 318 into the vertical-distribution type; 40 genes could not be classified because they fell into two majority types (S2 Table).
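The majority rule across GDS can be sketched as follows. The function name and the string labels are hypothetical; the sketch assumes that a tie between the two most frequent types leaves the gene unclassified, which matches the description of the 40 unclassified genes.

```python
from collections import Counter

INFORMATIVE = ("correlation", "horizontal", "vertical")


def majority_type(per_gds_types):
    """Return the majority type over all GDS, ignoring no-change; None if absent or tied."""
    counts = Counter(t for t in per_gds_types if t in INFORMATIVE)
    if not counts:
        return None
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        # Two majority types: the gene stays unclassified.
        return None
    return ranked[0][0]


# PSMB9 as described in the text: 53 GDS of the correlation type and
# 3 of the vertical-distribution type; the no-change count is illustrative.
psmb9 = majority_type(["correlation"] * 53 + ["vertical"] * 3 + ["no-change"] * 10)
```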

Fig 4. The classification results for all 647 regulated genes and for all 135 GEO DataSets.

Blue represents the correlation type, magenta represents the horizontal-distribution type, green represents the vertical-distribution type, and the white areas with no points represent the no-change type or unclassified. Elements on both axes were arranged in the order of the descending proportion of the horizontal-distribution type.

https://doi.org/10.1371/journal.pone.0177430.g004

Table 2. The assigned type and the numbers of GDS classified into each type are shown for regulated genes used in Fig 1.

https://doi.org/10.1371/journal.pone.0177430.t002

Gene functions of the regulated genes in three types of scatter diagrams

We performed pathway analysis of the gene functions in the correlation, horizontal-distribution and vertical-distribution types using the curated gene sets in the Canonical pathways (C2:CP) from the Molecular Signatures Database (MSigDB; http://www.broadinstitute.org/gsea/msigdb) [20] with the hypergeometric test at the 1% level of significance (see Methods). We obtained 25 significant pathways for the correlation type, 19 significant pathways for the horizontal-distribution type, and 14 significant pathways for the vertical-distribution type (S3 Table). To compare the different relation types, we categorized the pathways according to the hierarchical framework by denoting a pathway by the top-class entity of its hierarchical framework (Fig 5).

Fig 5. Analysis of pathways of gene functions in three types of scatter diagrams at the 1% level of significance.

The number of appearances (X-axis) is shown for each hierarchical framework denoted by the top-class entity (Y-axis). Only the hierarchical frameworks that contain at least one significant pathway are shown from four pathway databases: REACTOME (from Cell Cycle to Signal Transduction), KEGG (from Metabolism(K) to Human Diseases), the Pathway Interaction Database (from Pathways of replication, repair, gene expression, and protein biosynthesis to Transcription factor-mediated signaling pathways), and the BioCarta (from Adhesion to Neuroscience). If a pathway belonged to more than one (np) hierarchical framework, we counted 1/np for each.

https://doi.org/10.1371/journal.pone.0177430.g005

To our surprise, we found that some of the genes in each relation type were associated with type-specific functions: cellular regulation (e.g., Cell Cycle and DNA Replication) for the correlation type, Human Diseases for the horizontal-distribution type, and Metabolism or Signal Transduction for the vertical-distribution type. It is interesting that serious diseases, such as cancers, Parkinson’s disease, Alzheimer’s disease, and Huntington’s disease, were observed in the horizontal-distribution type, i.e., the degradation dominant type. The implications of this observation are considered in the Discussion. To summarize, the scatter diagrams for the TF–regulated gene pairs were characterized systematically according to the functions of the regulated genes. The results indicate that the mechanisms that regulate gene expression levels correspond to gene functions at the whole-cell level.

Discussion

In this work, we studied regulatory patterns of gene expression where TFs regulate the transcription process in a fine or on/off manner. We drew scatter diagrams for TF–regulated gene pairs using publicly available DNA microarray data and classified the diagrams into four types based on our simple mathematical model of a transcription process. We also performed pathway analysis and found that the relation types could be linked to the gene functions. Genes related to cellular regulation processes belonged to the correlation type, which indicates fine regulation of the transcription rate. Genes related to diseases belonged to the horizontal-distribution type, which indicates on/off regulation of the transcription process. Genes related to metabolism or signal transduction belonged to the vertical-distribution type, where the regulatory mechanism is unclear. These findings imply that the regulatory mechanisms for transcription processes are determined not arbitrarily but systematically depending on gene function, and point to the presence of a whole-cell regulatory mechanism.

Here, we classified 647 regulated genes into four classification types. To our surprise, the correlation type (fine regulation of a transcription process) was observed less frequently than we expected (less than 20% of the genes), although such fine regulation has often been assumed. The regulatory mechanism of the correlation type requires the expression levels of both the TF and the regulated gene to be fine-tuned to specific values depending on cellular states. Such fine-tuning would be a challenging task for many genes and would be impossible on a whole-cell level. This might explain why the correlation type was rarer than expected.

We used mRNA expression data in this study because of technical limitations. The final product of gene expression processes is usually a protein, and we plan to study protein data in the near future. Studies on the human proteome are still a developing field [11, 21, 22], and there are several challenges for protein quantification and for the organization of such data into databases [10, 12]. The mechanisms regulating protein abundance are more complicated than for mRNA, but in the simplest terms, protein abundance can be regulated via the balance between synthesis rates and degradation rates. The analytical method that we developed here could also be applied to protein data. Analyses of protein data will shed more light on the mechanisms that govern the transfer of the quantitative property of genomic information.

Finally, it is worth discussing genes classified in the horizontal-distribution type (on/off regulation of a transcription process). They are often over-expressed; therefore, it can be hypothesized that the abundance of the encoded protein needs to be controlled by degradation to the appropriate levels after excessive synthesis. Regulation through degradation is not as common as the regulation via synthesis [23, 24]. Examples of regulation through degradation include the well-studied proteins p53, which is a tumor suppressor that also regulates the cell cycle [25–27], and β-catenin, which is a signal transducer in the Wnt signaling pathway that also regulates cell-cell adhesion [28]. In addition, some reports have indicated that HIF-1α, a TF that responds to a shortage of oxygen [29], may be regulated through degradation. In the present study, p53 and β-catenin were classified into the horizontal-distribution type, but we did not have sufficient data to classify HIF-1α.

It should also be noted that the pathway analysis showed that genes in the horizontal-distribution type were associated with diseases, especially serious diseases such as cancer, Parkinson’s disease, and Alzheimer’s disease, and both p53 and β-catenin have been strongly implicated in cancer [30]. These diseases are caused by the abnormal accumulation of some proteins [31], and for good health, their abundance needs to be kept at low levels. Interestingly, we found that their abundance was regulated not through synthesis but through degradation after over-expression, although such regulation seems irrational and risky in cases when protein accumulation causes diseases. We expect that our future theoretical research will give some clues to such inconsistencies.

Methods

Preparation of DNA microarray data sets

We used the DNA microarray data from GDS as the expression data in this study. First, we normalized the expression data and removed measurement specificity, generally involving different DNA microarray instruments, to compare the different GDS. Several normalization procedures are available and each has its own advantages [32–37]. In this study, we needed a general-purpose method applicable to various measurement platforms and used a popular method, Z scores [33, 37, 38], as follows.

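For each sample in a GDS, we first transformed the original expression data given as {x1,x2,…,xs} to the log-scale,

$$y_i = \log x_i \tag{2}$$

Then, we normalized the values using the Z-score method by defining

$$\mu = \frac{1}{s}\sum_{i=1}^{s} y_i, \qquad \sigma = \sqrt{\frac{1}{s}\sum_{i=1}^{s}\left(y_i - \mu\right)^2} \tag{3}$$

The normalized value is given as

$$z_i = \frac{y_i - \mu}{\sigma} \tag{4}$$

for every xi.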
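The per-sample normalization can be sketched as below. The function name is hypothetical, and two choices are assumptions rather than details from the text: the natural logarithm is used, and the standard deviation is the population form (dividing by s). The Z score does not depend on the base of the logarithm, because changing the base rescales the mean and the standard deviation by the same factor.

```python
import math


def zscore_log(expression):
    """Log-transform the positive expression values of one sample, then standardize."""
    logs = [math.log(x) for x in expression]  # assumption: natural log
    mean = sum(logs) / len(logs)
    # Assumption: population variance (divide by s, not s - 1).
    var = sum((y - mean) ** 2 for y in logs) / len(logs)
    sd = math.sqrt(var)
    return [(y - mean) / sd for y in logs]


z = zscore_log([1.0, 10.0, 100.0])
```

After this step every sample has mean 0 and unit variance on the log scale, which is what makes samples from different platforms comparable.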

TF-regulated gene scatter diagrams

We constructed scatter diagrams for expression levels of each TF (TFi) and its regulated gene (RGi) from a GDS. Suppose a GDS contains Ni(⩾50) samples, then the diagram has Ni points as explained below. When a sample had only one data point for TFi (RGi), we used this value for plotting, and when a sample contained more than one data point for TFi (RGi), we used the average value. Thus, one sample yielded one point, and the diagram had Ni points in total.

One GDS normally contains subclasses such as an experimental group and a control group. It could be that each subclass produces a different domain structure in the diagram and falsely represents an imaginary correlation. Namely, if the samples in one subclass show smaller TFi and RGi and the samples in another subclass show higher TFi and RGi because of the experimental conditions, a correlation may be observed between TFi and RGi even if there is no real correlation. We confirmed that such imaginary correlations appeared rarely and did not influence the results.

TF-regulated gene binding transcription model

We considered a general and simple mathematical model of a transcription process. We analyzed two situations: TF promoting gene expression (up-regulation), and TF suppressing gene expression (down-regulation). First, we explain the up-regulation case in detail and next the down-regulation case in brief.

For the up-regulation case, suppose a TF molecule stochastically binds to or dissociates from a promoter sequence, and transcription takes place only when the TF binds to the sequence. Then, Pb is the probability of the TF’s binding to the promoter sequence, and is defined as a fraction of bound TF molecules among all TF molecules. By assuming that the binding process and dissociation process are in equilibrium, we get the following equation:

$$k_b\,[TF]\,(1 - P_b) = k_u\,P_b \tag{5}$$

Here, [TF] represents the TF concentration, and kb and ku are the reaction coefficients for the binding and dissociation processes, respectively. The left side of Eq (5) represents the reaction rate of the TF binding process proportional to the product of the TF concentration ([TF]) and the unbound promoter sequence (1−Pb). On the other hand, the right side represents the TF dissociation reaction rate regulated by the bound promoter sequence (Pb). From Eq (5), we obtain

$$P_b = \frac{[TF]}{K + [TF]} \tag{6}$$

where the dissociation constant K = ku/kb. Because transcription occurs only when the TF binds to the promoter region, the mRNA synthesis rate of the regulated gene should be proportional to Pb. Therefore, we can write the time dependence of the expression of the regulated gene mRNA ([RG]) as

$$\frac{d[RG]}{dt} = a\,P_b - b\,[RG] \tag{7}$$

Here, a and b are reaction coefficients for the synthesis and degradation. By considering a steady state of Eq (7), d[RG]/dt = 0, we finally obtain the steady-state mRNA level of the regulated gene ([RG]*) as a function of the steady-state TF concentration ([TF]*) as shown in Eq (1),

$$[RG]^* = \frac{a}{b}\,\frac{[TF]^*}{K + [TF]^*} = \frac{1}{\gamma}\,\frac{[TF]^*}{K + [TF]^*} \tag{1}$$

where γ = b/a.

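For the down-regulation case, we obtain the following equation from a similar analysis except that the production rate is proportional to the dissociation probability (1−Pb):

$$\frac{d[RG]}{dt} = a\,(1 - P_b) - b\,[RG] \tag{8}$$

and therefore

$$[RG]^* = \frac{1}{\gamma}\,\frac{K'}{K' + [TF]^*} \tag{9}$$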

Eq (9) describes the same two types of characteristic behaviors as Eq (1) depending on the dissociation constant K′, although [RG]* shows a strong negative correlation with [TF]* when K′ ≪ 1 and [RG]* remains at a constant level regardless of [TF]* when K′ ≫ 1. In the up-regulation or down-regulation cases, the correlation between a TF and the regulated gene indicates fine-tuned rate regulation, whereas the horizontal distribution indicates the absence of regulation.

Classification criteria

We defined the criteria for classifying the scatter diagrams into the four types as follows. First, we excluded the data with VTF = 0 or VRG = 0 because they probably originate from a measurement flaw. It is virtually impossible for all the samples in a GDS to show exactly the same expression level of a gene. We also assumed that the expression level of a TF (or a regulated gene) changed significantly when VTF > 0.25 (VRG > 0.25), whereas it is constant, albeit with small fluctuations, when VTF ⩽ 0.25 (VRG ⩽ 0.25). After that, we classified those with 0.1 < |s|<10 and R2 > 0.3 into the correlation type when VTF > 0.25 and VRG > 0.25. We then classified diagrams with |s|<0.1, VTF > 0.25 (VRG > 0), VTF/VRG > 3, and UTF ⩾ 7 into the horizontal-distribution type, and diagrams with |s|>10, VRG > 0.25 (VTF > 0), VTF/VRG < 1/3, and URG ⩾ 7 into the vertical-distribution type. The remaining diagrams were assigned to the no-change type. The baseline values used here were set arbitrarily, but the discussion will not change if the values are changed to some extent.
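The thresholds above translate directly into a decision function. The sketch below takes the six indicators as already computed inputs; the function name, argument names, and string labels are hypothetical, and the order of the checks follows the order in which the criteria are stated.

```python
def classify_diagram(s_abs, r2, v_tf, v_rg, u_tf, u_rg):
    """Map the six indicators of one scatter diagram to a type label.

    s_abs: |s|, r2: R^2, v_tf / v_rg: variances, u_tf / u_rg: uniformity counts.
    Returns None for diagrams excluded as probable measurement flaws.
    """
    if v_tf == 0 or v_rg == 0:
        return None
    if 0.1 < s_abs < 10 and r2 > 0.3 and v_tf > 0.25 and v_rg > 0.25:
        return "correlation"
    if s_abs < 0.1 and v_tf > 0.25 and v_tf / v_rg > 3 and u_tf >= 7:
        return "horizontal"
    if s_abs > 10 and v_rg > 0.25 and v_tf / v_rg < 1 / 3 and u_rg >= 7:
        return "vertical"
    # Everything else is treated as the no-change type.
    return "no-change"


example = classify_diagram(s_abs=0.01, r2=0.0, v_tf=1.0, v_rg=0.1, u_tf=9, u_rg=3)
```

Because the slope ranges of the three informative types do not overlap, at most one of the three conditions can hold for a given diagram, so the order of the checks does not affect the result.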

Logical rule for combining multiple TFs

In many cases, TFs and regulated genes do not have a one-to-one correspondence. When a regulated gene has several TFs, some of the TFs finely regulate the transcription process whereas others simply switch the process on or off. The former TF–regulated gene pairs may match the correlation type, whereas the latter often correspond to the vertical-distribution type. In such a mixed case, the regulated gene should be classified into the correlation type not into the vertical-distribution type. Similarly, when TF–regulated gene pairs match the horizontal-distribution type and others correspond to the no-change type, the regulated gene should be classified into the horizontal-distribution type. Using these rules, we classified every regulated gene as follows.

Suppose a gene has M TFs (M ⩾ 1) and among them, Mn TFs are of the no-change type, Mc of the correlation type, Mh of the horizontal-distribution type, and Mv TFs are of the vertical-distribution type. When Mh > Mc and Mv, the regulated gene is classified into the horizontal-distribution type; when MvMhMc = 0, the regulated gene is classified into the vertical-distribution type; and when Mc > Mh ⩾ 0, the regulated gene is classified into the correlation type, regardless of Mv. This is because a TF, even if it serves as a single TF, can determine the correlation type as explained above. We classified a regulated gene into the no-change type only when Mn = M. For the remaining cases, we aborted the classification because there was not sufficient evidence. We thus assigned one classification type to each regulated gene depending on the GDS.
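The rule for combining multiple TFs can be sketched as follows. The function name and labels are hypothetical. The condition for the vertical-distribution type is read here as "Mv > 0 with Mh = Mc = 0"; this reading is an assumption made for illustration and may not match the intended condition exactly.

```python
def combine_tfs(m_n, m_c, m_h, m_v):
    """Assign one type to a regulated gene in one GDS from its per-TF type counts.

    m_n, m_c, m_h, m_v: numbers of TFs of the no-change, correlation,
    horizontal-distribution, and vertical-distribution types.
    Returns None when the classification is aborted.
    """
    m = m_n + m_c + m_h + m_v
    if m < 1:
        raise ValueError("a regulated gene needs at least one TF")
    if m_h > m_c and m_h > m_v:
        return "horizontal"
    if m_c > m_h:
        # Correlation wins regardless of the number of vertical-type TFs.
        return "correlation"
    if m_v > 0 and m_h == 0 and m_c == 0:
        # ASSUMED reading of the vertical-distribution condition.
        return "vertical"
    if m_n == m:
        return "no-change"
    return None


mixed = combine_tfs(m_n=2, m_c=1, m_h=0, m_v=3)
```

In the mixed example, one correlation-type TF outweighs three vertical-type TFs, reflecting the statement that a single TF can determine the correlation type.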

The hypergeometric test

We determined whether a list of genes (genes of each relation type) over-represents a biological process (gene sets for a pathway from MSigDB) using the hypergeometric test. Suppose we listed n genes from a total of N genes; i.e., we selected n genes without replacement from the N genes. M genes among the total of N genes are involved in the biological process, and m genes among the listed n genes are involved in the same process. Then, the probability distribution of m (p(m)) is described by the hypergeometric distribution as

$$p(m) = \frac{\binom{M}{m}\binom{N-M}{n-m}}{\binom{N}{n}} \tag{10}$$

Our goal is to determine whether the case of m genes being involved in the biological process (out of the n listed genes) is statistically significant or happened by chance. Because we are testing whether our gene set corresponds to over-representation, the hypergeometric p value (p) is calculated as the probability of random involvement of m or more genes in the biological process (out of n genes) and is expressed as

$$p = \sum_{i=m}^{\min(n,\,M)} p(i) \tag{11}$$

When the p value is less than the value we set as the level of significance (1%), we conclude that our set of genes is over-represented, i.e., the m genes occurred non-randomly. However, when the p value is greater than the threshold value, we conclude that the m genes are selected by chance.
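The over-representation p value can be computed with exact integer arithmetic, as in this minimal sketch. The function name is hypothetical, and the numbers in the example are illustrative rather than taken from the study.

```python
from math import comb


def hypergeom_pvalue(N, M, n, m):
    """P(X >= m) when n genes are drawn without replacement from N genes,
    of which M belong to the pathway."""
    upper = min(n, M)
    total = comb(N, n)
    # Sum the hypergeometric probabilities for m, m + 1, ..., min(n, M).
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(m, upper + 1))
    return tail / total


# Illustrative case: 10 genes in total, 5 in the pathway, 5 listed, all 5 hit.
p = hypergeom_pvalue(N=10, M=5, n=5, m=5)
significant = p < 0.01
```

Summing integer binomial coefficients before dividing avoids accumulating rounding error over many small terms.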

Supporting information

S1 Table. List of the GEO DataSets used in this study.

Title and the number of samples are shown for each DataSet from the Gene Expression Omnibus (GEO) at the NCBI.

https://doi.org/10.1371/journal.pone.0177430.s001

(PDF)

S2 Table. Assignment of genes into the four types.

For each assigned type, gene names and the numbers of GDS classified into each type are shown. The "no data" column shows the number of GDS for which data were not available.

https://doi.org/10.1371/journal.pone.0177430.s002

(PDF)

S3 Table. List of the selected pathways from pathway analysis.

The category (hierarchical framework), the pathway name, and the name of involved genes are shown for each selected pathway.

https://doi.org/10.1371/journal.pone.0177430.s003

(PDF)

Acknowledgments

We thank Dr. Toru Natsume for his constructive comments.

Author Contributions

  1. Conceptualization: MI KH.
  2. Data curation: MI.
  3. Formal analysis: MI.
  4. Funding acquisition: MI KH.
  5. Investigation: MI KH.
  6. Methodology: MI KH.
  7. Writing – original draft: MI KH.

References

  1. 1. Crick F. Central dogma of molecular biology. Nature. 1970; 227: 561–563. pmid:4913914
  2. 2. Alberts B, Johnson A, Lewis J, Raff M, Roberts K, Walter P. Molecular Biology of the Cell, 5th ed. New York: Garland Science.; 2008
  3. 3. Latchman DS. Transcription factors: an overview. Int. J. Biochem. Cell Biol. 1997; 29: 1305–1312. pmid:9570129
  4. 4. Lefstin JA, Yamamoto KR. Allosteric effects of DNA on transcriptional regulators. Nature. 1998; 392: 885–888. pmid:9582068
  5. 5. Ravasi T, Suzuki H, Cannistraci CV, Katayama S, Bajic VB, Tan K, et al. An atlas of combinatorial transcriptional regulation in mouse and man. Cell. 2010; 140: 744–752. pmid:20211142
  6. 6. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial sequencing and analysis of the human genome. Nature. 2001; 409: 860–921. pmid:11237011
  7. 7. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, et al. The sequence of the human genome. Science. 2001; 291: 1304–1351. pmid:11181995
  8. 8. Orphanides G, Reinberg D. A unified theory of gene expression. Cell. 2002; 108: 439–451. pmid:11909516
  9. 9. Eser P, Demel C, Maier KC, Schwalb B, Pirkl N, Martin DE, et al. Periodic mRNA synthesis and degradation co-operate during cell cycle gene expression. Mol. Syst. Biol. 2014; 10: 71F7.
  10. 10. Wilhelm M, Schlegl J, Hahne H, Moghaddas GA, Lieberenz M, Savitski MM, et al. Mass-spectrometry-based draft of the human proteome. Nature. 2014; 509: 582–587. pmid:24870543
  11. 11. Kim MS, Pinto Sneha, M, Getnet D, Nirujogi RS, Manda SS, Chaerkady R, et al. A draft map of the human proteome. Nature. 2014; 509: 575–581. pmid:24870542
  12. 12. Schwanhausser B, Busse D, Li N, Dittmar G, Schuchhardt J, Wolf J, et al. Global quantification of mammalian gene expression control. Nature. 2011; 473: 337–342. pmid:21593866
  13. 13. Vogel C, Abreu RS, Ko D, Le SY, Shapiro BA, Burns SC, et al. Sequence signatures and mRNA concentration can explain two-thirds of protein abundance variation in a human cell line. Mol. Syst. Biol. 2010; 6: 400. pmid:20739923
  14. 14. Orntoft TF, Thykjaer T, Waldman FM, Wolf H, Celis JE. Genome-wide study of gene copy numbers, transcripts, and protein levels in pairs of non-invasive and invasive human transitional cell carcinomas. Mol. Cell. Proteomics. 2002; 1: 37–45. pmid:12096139
  15. 15. Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995; 270: 467–470. pmid:7569999
  16. 16. Shalon D, Smith SJ, Brown PO. A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome Res. 1996; 6: 639–645. pmid:8796352
  17. 17. Lipshutz RJ, Fodor SP, Gingeras TR, Lockhart DJ. High density synthetic oligonucleotide arrays. Nat. Genet. 1999; 21: 20–24. pmid:9915496
  18. 18. Mata J, Marguerat S, Bahler J. Post-transcriptional control of gene expression: a genome-wide perspective. Trends Biochem. Sci. 2005; 30: 506–514. pmid:16054366
  19. 19. Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic. Acids. Res. 2002; 30: 207–210. pmid:11752295
  20. 20. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA. 2005; 102: 15545–15550. pmid:16199517
  21. 21. Munoz J, Heck AJR. From the human genome to the human proteome. Angew. Chem. Int. Ed. 2014; 53: 10864–10866.
  22. 22. Rolland T, Ta An M, Charloteaux B, Pevzner SJ, Zhong Q, Sahni N, et al. A proteome-scale map of the human interactome network. Cell. 2014; 159: 1212–1226. pmid:25416956
  23. 23. Seufert W, Jentsch S. Ubiquitin-conjugating enzymes UBC4 and UBC5 mediate selective degradation of short-lived and abnormal proteins. EMBO J. 1990; 9: 543–550. pmid:2154373
  24. 24. Hochstrasser M. Ubiquitin, proteasomes, and the regulation of intracellular protein degradation. Curr. Opin. Cell Biol. 1995; 7: 215–223. pmid:7612274
  25. 25. Kubbutat MH, Jones SN, Vousden KH. Regulation of p53 stability by Mdm2. Nature. 1997; 387: 299–303. pmid:9153396
  26. 26. Asher G, Tsvetkov P, Kahana C, Shaul Y. A mechanism of ubiquitin-independent proteasomal degradation of the tumor suppressors p53 and p73. Genes Dev. 2005; 19: 316–321. pmid:15687255
  27. 27. Wade M, Wang YV, Wahl GM. The p53 orchestra: Mdm2 and Mdmx set the tone. Trends Cell Biol. 2010; 20: 299–309. pmid:20172729
  28. 28. Gomperts B, Kramer I, Tatham P. Signal Transduction. Academic Press.; 2002.
  29. 29. Huang LE, Gu J, Schau M, Bunn HF. Regulation of hypoxia-inducible factor 1alpha is mediated by an O2-dependent degradation domain via the ubiquitin-proteasome pathway. Proc. Natl Acad. Sci. USA. 1998; 95: 7987–7992. pmid:9653127
  30. 30. Rosenbluh J, Nijhawan D, Cox AG, Li X, Neal JT, Schafer EJ, et al. β-Catenin-driven cancers require a YAP1 transcriptional complex for survival and tumorigenesis. Cell. 2012; 151: 1457–1473. pmid:23245941
  31. 31. Ross CA, Poirier MA. Protein aggregation and neurodegenerative disease. Nat. Med. 2004; 10 Suppl.: S10–S7. pmid:15272267
  32. 32. Bilban M, Buehler KL, Head S, Desoye G, Quaranta V. Normalizing DNA microarray data. Curr. Issues Mol. Biol. 2002; 4: 57–64. pmid:11931570
  33. 33. Quackenbush J. Microarray data normalization and transformation. Nat. Genet. 2002; 32 Suppl.: 496–501. pmid:12454644
  34. 34. Yang HY, Dudoit S, Luu P, Lin MD, Peng V, Ngai J, et al. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic. Acids. Res. 2002; 30: e15. pmid:11842121
  35. 35. Irizarry AR, Hobbs B, Collin F, Beazer-Barclay DY, Antonellis JK, Scherf U, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003; 4: 249–264. pmid:12925520
  36. 36. Irizarry AR, Wu Z, Jaffee AH Comparison of Affymetrix GeneChip expression measures Biostatistics. 2006; 22: 789–794.
  37. 37. Comparison of Affymetrix data normalization methods using 6,926 experiments across five array generations. BMC Bioinformatics. 2009; 10 (Suppl 1): S24.
  38. 38. Cheadle C, Vawter PM, Freed JW, Becker GK. Analysis of microarray data using Z score transformation. J. Mol. Diagn. 2003; 5: 73–81. pmid:12707371