Atlas of Signaling for Interpretation of Microarray Experiments

Microarray-based expression profiling of living systems is a quick and inexpensive method to obtain insights into the nature of various diseases and phenotypes. A typical microarray profile can yield hundreds or even thousands of differentially expressed genes and finding biologically plausible themes or regulatory mechanisms underlying these changes is a non-trivial and daunting task. We describe a novel approach for systems-level interpretation of microarray expression data using a manually constructed “overview” pathway depicting the main cellular signaling channels (Atlas of Signaling). Currently, the developed pathway focuses on signal transduction from surface receptors to transcription factors and further transcriptional regulation of cellular “workhorse” proteins. We show how the constructed Atlas of Signaling in combination with an enrichment analysis algorithm allows quick identification and visualization of the main signaling cascades and cellular processes affected in a gene expression profiling experiment. We validate our approach using several publicly available gene expression datasets.


Introduction
Microarray-based expression profiling of living systems is a quick and inexpensive method to obtain insights into the nature of various diseases and phenotypes. It is also a great way of studying the functions of individual proteins or drugs by looking at the affected targets after system distrurbances or genetic modifications (siRNA, knock-outs, gene over-expression, etc).
The greatest challenge of microarray-based expression profiling is interpreting the obtained results. In a typical microarray experiment, mRNA profiles are generated for thousands of genes on a chip from a collection of samples derived from studied experimental conditions. Thus, the difficulty is finding an underlying biological theme or specific mechanisms hidden behind the expression profiles. Many of the genes changed in an experiment may fall outside the area of expertise of an individual researcher. A common approach has always been focusing on a handful of most highly changed probes. The main limitation of this approach is a risk to miss small, but concerted changes in a group of functionally related genes.
The recent advancement in interpreting the microarray data is development of the gene set enrichment analysis (GSEA) [1]-a statistically robust algorithm which compares the entire differential expression profile against biologically meaningful gene sets, defined by prior knowledge (e.g. pathways, cellular processes, etc). The goal of the GSEA is to determine whether all members of each gene set tend to be synchronously changed in a microarray experiment. As a result, the microarray experiment is projected on a much smaller list of statistically significantly changed gene sets which can summarize the observed expressional changes on a gene-systems level. A drawback of focusing only on highly differentially expressed genes lies in the fact that signaling proteins participating in the observed cellular response might not be changed on the level of expression even though corresponding pathways are activated or inhibited.
In this paper, we present a novel approach for analysis of differential gene expression profiles aimed at identification of key protein regulators and pathways involved in the differential response. The major ideas of our approach are: -Utilizing a gene expression regulatory network built using facts extracted from literature to generate a comprehensive collection of gene sets, each representing immediate downstream neighbors (sub-networks) of every individual protein in the network. -Grouping proteins into functionally coherent groups (either by protein families performing similar functions or by participation in common cellular processes) and connecting these groups by well-established biological regulatory links into a single overview pathway (Atlas of Signaling) depicting main cellular signaling channels. -Interpreting differential gene expression by ''projecting'' subnetworks significantly enriched with differentially expressed genes onto the Atlas of Signaling in order to identify key regulatory proteins and pathways involved in the differential response.
Using publicly available gene expression datasets, we demonstrate that this approach can successfully identify main signaling cascades involved in the regulation of the cellular response.

Methods
All the analyses described in this paper have been performed using PathwayStudioH software version 6.2. PathwayStudio is a commercial product for pathway analysis which contains a comprehensive database of protein-protein relationships extracted from literature using MedScanH-a fully automated biomedical information extraction engine.

An Overview of the Atlas of Signaling
The principal components of the Atlas of Signaling are protein groups (classes) representing either protein families or molecular-level cellular processes. Conceptually, we distinguish 5 subcategories of proteins: ligands, receptors, signaling proteins, transcription factors and ''workhourse'' proteins (Table 1).
Workhorse proteins are grouped into cellular processes. We thought to develop a minimal set of ubiquitous tissue-independent molecular events intrinsic to normal physiology of a eukaryotic cell, e.g. actin cytoskeleton assembly, DNA replication, or translation. More complex structural processes, such as mitosis, apoptosis or vesicular transport can be represented in terms of these elementary processes. For instance, the molecular events of mitosis include chromatin condensation, spindle assembly, centrosome separation, kinetochore assembly etc. The higher-level structural process can be represented as a chain of elementary events, each performed by a limited set of proteins-executors. The same pertains to the majority of tissue-specific processes. For example, the process ''neurotransmitter secretion'' describes a neuron-specific version of secretory vesicle exocytosis. Proteins of the SNARE complex and clathrin cage proteins comprise the group of executors responsible for membrane budding and fusion during various exocytosis events in different cell types. We assigned only direct executors (''workhorse proteins'') or their direct specific regulators to our minimal set of cellular processes: biochemical pathways include only biochemical enzymes, transport processes include only transporters, and structural processes include structural proteins responsible for physical integrity of a cell or its ''molecular machinery'' and their direct specific regulators (e.g. regulatory cytoskeleton-or microtubule-associated proteins). For instance, the ''microtubule sliding'' process contains tubulins, kinesins, dyneins and microtubule-asociated proteins. Cellular processes also include biochemical and transport processes that contain major cellular biochemical pathways and metabolite/ion transport processes respectively.
Ligands, receptors, signaling proteins, and transcription factors are grouped into protein families based on sequence and functional similarity. The complete list of protein groups can be found in the supporting File S1.
We have connected protein groups by regulatory relationships that are used to describe the signaling information flow inside the cell. The relationships between protein classes are based on wellestablished relationships between individual proteins from the two classes ( Figure 1) described in literature. There are several types of protein relationships that were originally introduced in ResNet database for Pathway Studio [2,3]. The complete list of relationship types and their statistics on the Atlas of Signaling is shown in Table 2. All relationships between functional groups can be visualized in Pathway Studio as one big pathway that we call the ''Atlas of Signaling'' (

Protein Expression Sub-Networks as Gene Sets
One of the key ideas behind our gene expression interpretation approach is utilization of a gene expression regulatory network built from facts extracted from literature. The network is used to generate a comprehensive collection of gene sets, each representing immediate downstream neighbors of each individual protein in the network ( Figure 3). We call a ''center'' protein of such sub-network the ''seed'' protein and assume that if the downstream expression targets of the seed protein are enriched with differentially expressed genes (i.e. the sub-network is found to be statistically significant in enrichment analysis) then the seed protein is one of the key regulators of the observed differential response. Since sub-networks are constructed from all the proteins in the entire expression network, including ligands, receptors, signaling proteins and transcription factors, the seed proteins of statistically significant sub-networks presumably constitute the components of a regulatory network involved in the modulation of the observed differential response.
The gene expression regulation relationships are extracted from PubMed abstracts and full-text papers using MedScan technology [4,5]. They describe involvement of various proteins: ligands, receptors, signaling proteins and transcription factors in regulation of protein expression. The complete gene expression regulatory network was extracted from entire 2007 PubMed and the content of 61 freely available full-text journals and contains 163,945 unique relationships. As noted, the extracted relationships represent not only direct regulation of expression by transcription factors, but also indirect relationships (e.g. ''protein A regulates expression of protein B'', and similar statements).
In addition to expression sub-networks, the separate collection of gene sets was constructed from the members of all the functional groups themselves to capture the concerted changes in expression among the members of the groups: ligands, receptors, signaling groups and transcription factors, as well as cellular processes ''groups''-biochemical pathways, transport processes and structural processes.

Enrichment Analysis of Differential Gene Expression Using Expression Sub-Networks and the Atlas of Signaling
The following steps describe our approach for interpreting differential gene expression in the context of the Atlas of Signaling 1. Differential gene expression between samples of interest is calculated using T-test. 2. The enrichment analysis is performed for normalized log ratio differential samples using Mann-Whitney test [6] against the collection of gene sets generated as described above. All gene sets with a p-value ,0.05 are considered significant and are selected for further analysis. 3. Significant gene sets are mapped to the Atlas of Signaling pathway as follows:    a. Gene sets corresponding to the cellular processes and functional groups are directly selected on the ''Atlas of Signaling'' pathway b. Individual signaling/regulatory proteins representing the seeds of gene sets are first mapped to corresponding functional groups by containment, which are then selected on the ''Atlas of Signaling'' pathway.
All unmapped functional groups and cellular processes not selected by this mapping procedure are then removed from the Atlas of Signaling. The resulting ''partial'' pathway constitutes the ''interpretation model'' for a specific differential expression profile; it contains classes of proteins whose members or ''expression targets'' have revealed statistically significant changes in the differential expression profile. Figure 3 illustrates the procedure of sub-network generation and mapping the seeds of significant subnetworks onto the Atlas of Signaling.

Results and Discussion
We have validated our approach by interpreting several publicly available expression microarray experiments using the Atlas of Signaling pathway. Since current focus of the Atlas is signal transduction from surface receptors to transcription factors, we have chosen experiments that measure response of normal human tissues to different hormones.
We first analyzed GDS1036 experiment from NCBI GEO that profiles gene expression in microglial cells in response to 200 u/ml interferon-gamma (IFN-gamma). IFN-gamma is a soluble cytokine and the only member of the type II class of interferons. In contrast to interferon-alpha and interferon-beta, which can be expressed by all cells, IFN-gamma is secreted only by T-lymphocytes, dendritic cells and NK cells. Produced by lymphocytes activated by specific antigens or mitogens, IFN-gamma, in addition to having antiviral activity, has important immunoregulatory functions. It is a potent activator of macrophages, it has anti-proliferative effects on transformed cells, and it can potentiate the antiviral and antitumor effects of the type I interferons.
The imported data set contained three time points: 1, 6, and 24 hours of IFN-gamma exposure. Data for each time point were generated using microglial cell isolated from four different brain specimens. Analysis of differential expression between untreated samples and samples after 6 or 24 hours of exposure returned very similar lists of differentially expressed genes. Therefore, for further analysis, time points 6 and 24 hours were combined and compared to corresponding untreated control samples. The differential gene expression was calculated, followed by enrichment analysis using Mann-Whitney test [6] against ''expression subnetwork'' gene sets as described above. The significant subnetworks (p,0.05) were mapped on the Atlas of Signaling and the resulting pathway is shown on Figure 4. The majority of mapped functional classes on the Atlas of Signaling formed a wellconnected sub-network with principal components being the classical JAK-STAT pathway and interferon-response factors (IRFs) that have been shown to be downstream of IFN-gamma in numerous publications [7,8]. In addition, such cell processes as prostaglandin synthesis, spindle assembly, lipid transport, and Ser/ Gly metabolism were also affected by IFN-gamma treatment in microglial cells. Down-regulation of prostanoid production by IFN-gamma has been shown previously [9]. Activation of spindle assembly and Ser/Gly metabolism are indicative of cell proliferation that occurs upon microglia activation [10]. It was shown that activated microglia export apolipoprotein E and J involved in lipid transport [11]. Apolipoprotein E was linked to microglial activation and to increased neurotoxicity in several publications [12,13]. Our enrichment-mapping analysis also revealed all components of tumor necrosis factor (TNF)-NF-kappa-B pathway as significant. IFN-gamma-dependent production of TNF-alpha mediated by STAT1 was previously reported [14]. Activation of NF-kappa-B suggests activation of apoptosis in microglia that was also documented previously [15]. Thus, TNF-alpha can also provide a negative feedback loop to slow-down the spread of microglia activated by IFN-gamma. One possible scenario is that IFN-gamma induction activates microglia migration towards the   injury site and initiates TNF-alpha release to kill the damaged neurons, but also to initiate death of activated microglia in order to stop the spread of inflammation beyond the site of injury.
Next, we analyzed the experiment GDS1313 which profiles the response of the glial fibrillary acidic protein (GFAP)-negative lamina cribrosa (LC) cells to a 24-hour exposure to Transforming Growth Factor beta (TGF-beta) 1 at 10 ng/ml concentration. TGF beta is a multifunctional protein that controls proliferation, differentiation, and other functions in many cell types. Application of the Transforming Growth Factors to normal rat kidney fibroblasts induces proliferation in cultured cells that results in overgrow that is no longer subjected to the normal cell-contact inhibition. The GFAP-negative LC cells are used as a model for primary open-angle glaucoma (POAG) when TGF-beta level is elevated in human LC tissue. We have performed a similar analytical workflow: differential gene expression followed by enrichment analysis against ''expression sub-network'' gene sets followed by mapping of significant sub-network seeds to the Atlas of Signaling. The results of the analysis are shown on Figure 5. As expected, the classical TGFbeta/SMAD pathway is identified as significant along with the extracellular matrix proteins that are directly regulated by SMADs. Involvement of extracellular matrix (ECM) proteins identified by our analysis is in perfect agreement with the fact that pathological hallmarks of the glaucomatous optic nerve head include retinal ganglion cell axon loss and extracellular matrix (ECM) remodeling of the lamina cribrosa layer [16]. Interestingly, the results also suggest involvement of p38 MAPK, AKT and PKA pathways in the observed response which regulate ECM deposition, cytoskeleton and focal junction assembly, protein folding and several metabolic processes ( Figure 5A). There is a handful of unconnected protein classes (upper right corner of the map), but they can be joined to the core network by a few secondary messengers or functional classes whose expression was not significantly affected by TGF-beta treatment ( Figure 5B).
Finally, we have analyzed the experiment GDS1543 from NCBI GEO repository measuring the effect of tumor necrosis factor (TNF)-alpha on microvascular endothelial cells (HMEC) (5 h, 2 ng/ml TNF). TNF-alpha is a cytokine involved in systemic inflammation and is a member of a group of cytokines that stimulate the acute phase reaction. It is mainly secreted by macrophages and can induce cell death of certain tumor cell lines. It is a potent pyrogen causing fever by direct action or by stimulation of interleukin-1 secretion. Under certain conditions it can stimulate cell proliferation and induce cell differentiation.
For the analysis, 2 samples (GSM50775 and GSM50773) were excluded from the experiment because they have shown an inconsistent clustering: they represent control and treatment samples which clustered to each other as opposed to clustering with other control and treatment samples. The pathway obtained by mapping significant sub-network seeds on the Atlas of Signaling is shown on Figure 6.
The universal feature of the TNF pathway conserved in evolution and among different human tissues is believed to be the activation of NF-kappa-B transcription factor [17]. Our analysis has revealed an activation of NF-kappa-B in HMEC cells. Nevertheless, TNF-alpha exhibits diverse effects on different tissues [18,19]. The tissue specificity of TNF-alpha action is determined by the tissue-specific expression of TNF receptorassociated adaptors acting as scaffolds to associate different sets of downstream signaling molecules with TNF-receptor in different tissues, resulting in regulation of tissue-specific processes [20]. Gene expression analysis helps to select a particular set of such target processes. For instance, our analysis revealed large number of transport processes affected in HMEC cells by TNF-alpha treatment. This can be a specific feature of endothelial cells. On the other hand, membrane depolarization [21] and changes in cell volume due to deregulation of ion homeostasis [22] is now believed to play a pivotal role in early stages of apoptosis, which is the main process regulated by NF-kappa-B.
In conclusion, we have developed a novel approach for a systems-level interpretation of microarray expression data in the context of manually constructed ''overview'' pathway depicting the main cellular signaling channels. Using publicly available gene expression datasets, we have demonstrated that the developed approach can highlight the biologically plausible sub-networks of the global cellular signaling network. We believe that the developed approach can be used for general systems-level interpretation of differential expression profiling experiments.

Supporting Information
File S1 Complete list of protein classes in Ontology.

Author Contributions
Conceived and designed the experiments: AK. Performed the experiments: EK NI. Analyzed the data: EK NI AK ND. Wrote the paper: AY ND.