Meta-Analysis of Genetic Programs between Idiopathic Pulmonary Fibrosis and Sarcoidosis

Background: Idiopathic pulmonary fibrosis (IPF) and pulmonary sarcoidosis are typical interstitial lung diseases of unknown etiology that cause lethal lung damage. The two disorders share some similarities but also differ notably. Gene expression profiles have been reported for each disease independently, but differences at the transcriptional level between the two entities have not been investigated.
Methods/Results: All expression data of lung tissue samples for IPF and sarcoidosis were taken from published datasets in the Gene Expression Omnibus (GEO) repository. After cross-platform normalization, the merged sample data were grouped together and subjected to statistical analysis to find discriminating genes. Gene enrichment and the corresponding functions were analyzed with the online analysis engine "Database for Annotation, Visualization and Integrated Discovery" (DAVID) 6.7, and gene interactions and functional networks were further analyzed with STRING 9.0 and Cytoscape 3.0.0 Beta1. One hundred and thirty signature genes could potentially differentiate one disease state from another. Compared with normal lung tissue, tissue affected by IPF and sarcoidosis displayed similar signatures that concentrated on proliferation and differentiation. Genes that distinguish IPF from sarcoidosis are more enriched in the processes of cilium biogenesis or degradation and regulation of T cell activation. Key discriminative network modules involve proliferation related to bone morphogenetic protein receptor type II (BMPR2) and to v-myb myeloblastosis viral oncogene (MYB).
Conclusions: This study is the first attempt to examine the transcriptional regulation of IPF and sarcoidosis across different studies based on different platforms. Groups of significant genes were found to clearly distinguish one condition from the other. While IPF and sarcoidosis share notable similarities in cell proliferation, differentiation and migration, remarkable differences between the diseases were found at the transcriptional level, suggesting that the two diseases are regulated by overlapping yet distinctive transcriptional networks.


Background
Idiopathic pulmonary fibrosis (IPF), the most common type of idiopathic interstitial pneumonia, is a chronic, progressive, irreversible and lethal lung disease of unknown etiology [1,2,3]. The incidence of IPF is estimated at 7-10 cases per 100,000 per year with a presentation age of 50-70 years and no predilection by race or ethnicity [1,4]. Previous studies have shown that the alveolar epithelial cells rather than the inflammatory cells play a vital role in the initiation of the fibrogenic events, and a variety of cytokines and growth factors expressed by injured or activated alveolar epithelial cells are key actors in the development of IPF [1,5,6].
Sarcoidosis, a multisystemic disorder, is defined as the manifestation of immune granulomas found in many organs. The incidence varies depending on age, sex, race and geographic factors, but the predominantly affected systems are the lungs and lymphatic system [7,8,9,10]. Like IPF, the etiology of sarcoidosis remains unknown, and the prevailing hypothesis is that it is a chronic and exaggerated immunological response related to genetic susceptibility, specific infections or environmental factors [11,12]. We reported that human leukocyte antigen polymorphisms may have a role in susceptibility and manifestation of sarcoidosis [13]. Cytokines such as TNF-α, IL-12 and interferon gamma (IFN-γ) have been shown to be involved in the formation of sarcoidosis [14,15].
The extensive use of microarray technology has had a profound impact on characterizing the transcriptional changes of lung diseases. Microarray studies based on different platforms and different sample sources (lung tissue, isolated cells or blood) have provided a large amount of information about lung disease processes and have been used successfully to determine gene expression patterns and to find potential biomarkers [16]. Microarray-based gene expression profiles have revealed the gene expression patterns of IPF and its potential biomarkers [2,17,18,19,20]. Gene expression analysis has also been used to search for potential pathogenic mediators of pulmonary sarcoidosis and to provide evidence that sarcoidosis causes intense immune responses, like hypersensitivity pneumonitis and IPF [21,22]. Meta-analysis of microarray data, which aims to identify significantly expressed genes and important signaling pathways across different platforms, has been applied to cancer and has yielded highly predictive models that are better validated than single-set analyses [23]. However, no integrated analysis has compared the two typical interstitial lung disorders IPF and sarcoidosis.
This study was designed to identify genetic programs that differentially regulate IPF and sarcoidosis based on previously published microarray data from the Gene Expression Omnibus (GEO) datasets. We found similarities and diversities in gene expression between IPF and sarcoidosis. Our results also provided novel signature genes for pathogenic analysis and diagnostic or therapeutic purposes.

Study Datasets
Expression profile data related to either idiopathic pulmonary fibrosis or sarcoidosis were acquired from the GEO repository [24] (http://www.ncbi.nlm.nih.gov/projects/geo/). In this study, datasets of all human lung tissue were used for further analysis and, to make comparisons across platforms, datasets by commonly used platforms like Affymetrix and Agilent were retained.

Expression Intensity Extracting and Probe Mapping
Selected datasets downloaded from the GEO repository contain different forms of expression measurements and probe annotation files. In particular, expression intensities of Affymetrix HGU133 Plus 2.0 were saved in .CEL files and can be extracted by robust multi-array average (RMA) using the software Affymetrix® Expression Console™ version 1.1.
Probes and their measurements in each expression profile from the different platforms were all mapped to a common gene list as previously described [25]. Probe identifiers were replaced by official gene symbols, and when one gene had duplicate measurements, the multiple expression measurements were collapsed to their median value [26,27].
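The median-collapse step described above can be illustrated with a minimal sketch. It is not the authors' code: the function name, the probe identifiers and the expression values are hypothetical, and only the Python standard library is used.

```python
from collections import defaultdict
from statistics import median

def collapse_probes_to_genes(probe_values, probe_to_symbol):
    """Collapse probe-level measurements to one value per gene symbol.

    probe_values: dict mapping probe ID -> expression value (one sample)
    probe_to_symbol: dict mapping probe ID -> official gene symbol
    Probes without a symbol are dropped; duplicates are summarized by median.
    """
    by_gene = defaultdict(list)
    for probe_id, value in probe_values.items():
        symbol = probe_to_symbol.get(probe_id)
        if symbol:  # skip unannotated probes
            by_gene[symbol].append(value)
    return {symbol: median(values) for symbol, values in by_gene.items()}

# Hypothetical probe IDs and values, for illustration only.
probe_values = {"p1": 7.0, "p2": 9.0, "p3": 8.0, "p4": 5.5, "p5": 6.1}
probe_to_symbol = {"p1": "MMP1", "p2": "MMP1", "p3": "MMP1", "p4": "SPP1"}
gene_values = collapse_probes_to_genes(probe_values, probe_to_symbol)
print(gene_values)
```

In this toy input, three probes map to MMP1 and are replaced by their median, while the unannotated probe p5 is discarded.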

Data Processing
All expression estimates were log2 transformed and then merged by cross-platform normalization (xpn), which was performed using R 2.14.1 and the Bioconductor package CONOR [27,28]. Two expression datasets from different studies with the same common gene symbols were normalized to produce a new dataset, and the newly produced data were then renormalized with the next dataset. Finally, the expression measurements of one study can be compared across all study populations within the common genes.
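The data preparation that precedes normalization (log2 transformation and restriction to genes shared by all studies) can be sketched as follows. This sketch only aligns the datasets; it does not implement the xpn algorithm, which the study ran through the CONOR package in R. The function names and the toy values are assumptions made for illustration.

```python
import math

def log2_transform(dataset):
    """Return a copy of {gene: [values]} with every value log2-transformed."""
    return {gene: [math.log2(v) for v in values] for gene, values in dataset.items()}

def merge_on_common_genes(datasets):
    """Concatenate samples across datasets, keeping only genes present in all.

    This only aligns the data; it does NOT perform xpn normalization.
    """
    common = set(datasets[0])
    for d in datasets[1:]:
        common &= set(d)
    return {gene: [v for d in datasets for v in d[gene]] for gene in sorted(common)}

# Toy intensities for two hypothetical studies (not taken from GEO).
study_a = {"MMP1": [8.0, 16.0], "SPP1": [4.0, 2.0], "AGER": [32.0, 64.0]}
study_b = {"MMP1": [4.0], "SPP1": [1.0], "COMP": [2.0]}
merged = merge_on_common_genes([log2_transform(study_a), log2_transform(study_b)])
print(merged)
```

Genes measured in only one study (AGER and COMP here) drop out of the merged table, which mirrors how genes not shared by all four studies were excluded from the super array.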

Statistical Analysis
Unpaired Student's t-tests. To find differentially expressed genes and to compare normal, IPF and sarcoidosis lung tissues, we carried out unpaired Student's t-tests on the integrated super array data. Significantly expressed genes were defined as those with P<0.05, the same criterion used previously for discovering differentially expressed genes [2,20]; stricter thresholds of P<0.01 and P<0.005 were used by Lockstone et al. [21] and Crouser et al. [22], respectively.
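For one gene, the unpaired t statistic can be computed as in the sketch below. It assumes the classic pooled-variance (equal variance) form of Student's test, which the text does not specify, and it returns only the statistic and the degrees of freedom; converting these to a P value requires the t distribution, which is usually taken from a statistics library. The expression values are invented.

```python
from math import sqrt
from statistics import mean, variance

def unpaired_t(group_a, group_b):
    """Student's unpaired t statistic with pooled variance.

    Returns (t, degrees_of_freedom). Equal variances are assumed here.
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df

# Toy log2 expression values for one gene (not from the study).
ipf = [9.0, 10.0, 11.0]
normal = [6.0, 7.0, 8.0]
t_stat, df = unpaired_t(ipf, normal)
print(round(t_stat, 3), df)
```

With four degrees of freedom the two-sided critical value at the 0.05 level is about 2.776, so this toy gene would pass the P<0.05 criterion.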
Unsupervised clustering analysis. Significantly expressed genes were analyzed by hierarchical clustering algorithms using the software Gene Cluster 3.0. Expression estimates were first adjusted by centering genes and arrays, then followed by two-way clustering (TWC) using the Euclidean distance similarity metric and the complete linkage clustering method, and clustering results were visualized by Java TreeView. Unsupervised TWC analysis can not only reveal the expression trend of genes but also isolate the outlier samples that possess unique expression features. We carried out our data mining by alternately using unpaired student's t-tests and TWC until one circumstance can be differentiated from another completely.
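The clustering procedure described above can be sketched in plain Python. The sketch makes several assumptions: centering is done by subtracting means (rows first, then columns), only the sample dimension is clustered, and the tree is cut at a fixed number of clusters. Gene Cluster 3.0 offers further options and builds a full dendrogram, so this is an illustration of Euclidean distance with complete linkage rather than a reproduction of that software.

```python
from math import dist
from statistics import mean

def center_rows_then_columns(matrix):
    """Subtract each row (gene) mean, then each column (array) mean."""
    rows = [[v - mean(row) for v in row] for row in matrix]
    col_means = [mean(col) for col in zip(*rows)]
    return [[v - m for v, m in zip(row, col_means)] for row in rows]

def complete_linkage(points, n_clusters):
    """Agglomerative clustering; cluster distance = max pairwise Euclidean distance."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return [sorted(c) for c in clusters]

# Rows are genes, columns are samples (toy values, not from the study).
expr = [
    [1.0, 1.2, 5.0, 5.1],
    [2.0, 2.1, 0.5, 0.4],
    [3.0, 2.9, 3.1, 3.0],
]
centered = center_rows_then_columns(expr)
samples = list(zip(*centered))  # one point per sample (column)
groups = complete_linkage(samples, n_clusters=2)
print(groups)
```

Applying the same function to the rows of the centered matrix would cluster the gene dimension, which gives the second half of a two-way clustering.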
Supervised classification. To obtain significantly expressed genes for each comparison (normal versus IPF, normal versus sarcoidosis, and IPF versus sarcoidosis), supervised learning was carried out on the final unsupervised TWC results using the software Significance Analysis of Microarrays (SAM) 4.0 (http://www-stat.stanford.edu/tibs/SAM/) [30], which yields significantly expressed genes with a lower false discovery rate (FDR, or q-value). For classification, the two-class unpaired response type was chosen, and arrays were median centered before analysis. The minimum fold change was set to two so as to obtain the most clearly differentially expressed genes. The top 130 significant genes with the highest SAM scores, which equal the t-statistic values (the 65 most significantly up-regulated genes and the 65 most significantly down-regulated genes), were used as the signature for each comparison in our study. This follows Meltzer's study, in which 151 genes (in fact probe sets), corresponding to 136 unique genes identified by Student's t-test, were used to develop the IPF model, and 148 features from another dataset, GSE10667, could be mapped to features of the IPF model [2]. The threshold of 130 unique genes that we use should capture the main features of our super array data.
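The signature selection rule (a minimum fold change of two, then the 65 highest-scoring genes in each direction) can be sketched as follows. The scores here stand in for SAM scores and are invented; the function name and the input layout are assumptions, and the sketch does not compute SAM scores or false discovery rates.

```python
def select_signature(gene_stats, n_per_direction=65, min_fold_change=2.0):
    """Pick the top up- and down-regulated genes by score.

    gene_stats: dict gene -> (score, fold_change); a positive score means
    up-regulated, and fold_change is the ratio of group means (> 0).
    """
    passing = {
        g: (score, fc) for g, (score, fc) in gene_stats.items()
        if fc >= min_fold_change or fc <= 1.0 / min_fold_change
    }
    up = sorted((g for g, (s, _) in passing.items() if s > 0),
                key=lambda g: passing[g][0], reverse=True)[:n_per_direction]
    down = sorted((g for g, (s, _) in passing.items() if s < 0),
                  key=lambda g: passing[g][0])[:n_per_direction]
    return up, down

# Hypothetical scores and fold changes, for illustration only.
stats = {
    "G1": (5.2, 3.0), "G2": (4.1, 2.5), "G3": (6.0, 1.4),
    "G4": (-4.8, 0.3), "G5": (-2.2, 0.45), "G6": (-7.0, 0.9),
}
up, down = select_signature(stats, n_per_direction=1)
print(up, down)
```

G3 and G6 have the most extreme scores but fail the fold-change filter, so they are excluded before ranking. Calling the function with the default of 65 per direction gives a signature of at most 130 genes.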

Pathway and Network Analyses
Signature genes that can be used to discriminate each condition were subsequently submitted to the online software "Database for Annotation, Visualization and Integrated Discovery" (DAVID) 6.7 (http://david.abcc.ncifcrf.gov/) to retrieve functional annotations and gene enrichments [31,32]. In addition, significantly expressed genes were mapped according to their direct or indirect interactions by the web server STRING 9.0 (http://string.embl.de/) [33]. Complex gene networks related to the different conditions were analyzed with Cytoscape 3.0.0 Beta1 (http://www.cytoscape.org/) [34].

Research Datasets and Integrated Data
In this study, a meta-analysis of lung tissue DNA microarray data from different studies on different platforms was performed to compare lung tissue samples from normal, IPF and sarcoidosis subjects, and to extract better validated genes, together with the relevant biological pathways, for discriminating the different conditions. Querying the GEO database with the key words "idiopathic pulmonary fibrosis" and "sarcoidosis" currently returns five hundred and thirty-nine records (218 IPF-related and 321 sarcoidosis-related records). After filtering for the organism Homo sapiens and for lung tissue as the sample source, ten datasets remained (Table S1). Four representative datasets were used for further analysis: GSE24206 (Affymetrix HGU133 Plus 2.0), GSE10667 (Agilent-014850 Whole Human Genome 4×44K Microarray), GSE19976 (Affymetrix HG 1.0 ST) and GSE16538 (Affymetrix HGU133 Plus 2.0). Four reports based on these expression datasets, with related patient information, have been published [2,20,21,22]. The flow diagram of this study is shown in Figure 1, and the patient population used for further analysis is summarized in Table 1.
To make the estimates from the four datasets comparable, the xpn method was applied to the log2 transformed expression estimates of the selected datasets. Cross-platform normalization (xpn), which is based on a simple block-linear model, normalizes measurements from two or more studies. The gene expression estimates of each study are represented as a matrix, and normalization is an iterative clustering process that runs until it converges to a local minimum of the sum of squared Euclidean distances. Unlike other methods in the literature, xpn is not gene-wise affine, and it can remove systematic differences between platforms while preserving biological information [27]. The log2 transformation makes the data more symmetric, which is helpful for plotting, and also makes the random variation more constant [29]. Probes of the four studies, with their corresponding expression intensities, were mapped to the MAQC 12,091 common genes. The four datasets from the four studies were normalized to produce an integrated super array dataset, which contains 96 lung tissue samples and 10,212 common genes in total (Table S2). All samples were assigned new labels together with their GEO accession numbers. Measurements of the super array were log2 transformed and then xpn normalized. Here, 1,897 genes were filtered out because they were not shared by the four studies.

Comparisons Across Normal, IPF and Sarcoidosis
TWC was carried out on the genes that are significantly expressed between two conditions (P<0.05). To obtain perfect clustering results, outliers with different clustering patterns were filtered out (see Table S3 for sample participation during clustering analysis; there are 27 normal, 48 IPF and 21 sarcoidosis tissue samples). Outliers arise from the particular gene expression patterns of individual patients. When we compared normal and IPF samples, 20 normal and 38 IPF samples formed a perfect clustering pattern, and 3,412 genes could be used to distinguish the two conditions. Of these 3,412 genes, 2,946 were up-regulated and 466 were down-regulated in IPF (Figure 2A). The same clustering analyses were carried out on normal versus sarcoidosis and on IPF versus sarcoidosis samples. Seventeen normal samples and 16 sarcoidosis samples finally gave a perfect clustering result (Figure 2B), which produced 1,096 distinct genes (920 up-regulated and 176 down-regulated in sarcoidosis relative to normal samples). IPF and sarcoidosis can be discriminated by 3,018 genes (2,057 up-regulated and 961 down-regulated genes when sarcoidosis is compared with IPF), and 16 IPF versus 12 sarcoidosis samples gave a perfect clustering result (Figure 2C). Potentially valuable gene expression information could be lost with the filtered outlier samples. For the comparison of normal vs. IPF, 17 samples/202 genes out of 75 samples/10,212 genes were removed. For normal vs. sarcoidosis, 15 samples/284 genes out of 48 samples/10,212 genes were removed. Notably, for IPF vs. sarcoidosis, 41 of 69 samples (with 10,212 genes) were removed, whereas the number of significantly expressed genes increased from 37 to 3,018. Outlier samples were removed to increase the uniformity of the datasets, and the significantly expressed genes obtained from such uniform datasets could represent more typical features for each comparison. Differentially regulated genes for each comparison, together with fold changes and p-values, are listed in Table S4. No gender preference was found in our results, but some significantly expressed genes agree with previously published array studies. For instance, the genes MMP1, MMP7, AGER and COL1A2 were reported to be differentially expressed in IPF relative to normal tissue [20], and MMP12 was reported to differ from normal in sarcoidosis [21]; in our data these genes show the same expression patterns as in the published data.
Figure 2 (partial caption). [...] and sarcoidosis (n = 12) with 2,051 IPF up-regulated genes and 961 IPF down-regulated genes. The color bar in the upper right corner shows the relative quantity of each gene; light green to bright red corresponds to relatively lower to higher expression. Abbreviations: Normal = normal lung tissue samples, IPF = idiopathic pulmonary fibrosis lung tissue samples, Sar = sarcoidosis lung tissue samples. Sample numbers correspond to those labeled in Table S2. doi:10.1371/journal.pone.0071059.g002

Classification Produced Signature Genes
To further analyze the significance of the genes that can differentiate one condition from another, the supervised learning method SAM was carried out on the unsupervised TWC results, which produced three groups of significantly expressed genes for the normal vs. IPF, normal vs. sarcoidosis, and IPF vs. sarcoidosis comparisons. IPF samples have 472 significantly expressed genes that distinguish them from normal samples, including 197 up-regulated and 275 down-regulated genes (Figure 3A). Sarcoidosis samples have 270 significantly expressed genes (96 up-regulated and 174 down-regulated) that differentiate them from normal samples (Figure 3B). IPF can also be separated from sarcoidosis by 708 significantly expressed genes, including 326 up-regulated and 382 down-regulated genes (Figure 3C). All signature genes for all comparisons are summarized in Table S5.
The top 130 genes with the highest SAM scores (the top 65 up-regulated genes and the top 65 down-regulated genes) in each comparison can be used as signature genes to separate the samples of one condition from the other more clearly (Figure 4). In both Figure 4A and 4B, the disease condition (either IPF or sarcoidosis) contains two clusters, with stronger and relatively weaker gene expression patterns. The stronger clusters do not show any preference for particular patients or disease states; the IPF and sarcoidosis samples with the stronger signature gene expression come from different samples under different statuses. Signature genes are listed in Table S6. The signature genes that differentiate IPF or sarcoidosis from normal samples are strongly similar: 34 of the 65 top up-regulated signature genes and 33 of the 65 top down-regulated signature genes are shared by the IPF versus normal and sarcoidosis versus normal comparisons, and these shared genes are either up-regulated in both or down-regulated in both. No signature gene is regulated in opposite directions in the two comparisons, although signature genes can still be found when IPF is compared directly with sarcoidosis. The signature gene expression patterns indicate that IPF and sarcoidosis do possess similarities that are not patient correlated.

Gene Functional and Interactional Analyses
To illustrate the gene network of biological systems for each circumstance, signature genes were submitted to the online analysis software DAVID for gene enrichment and functional analysis. Gene-gene interactions were further retrieved and displayed for validating gene behaviors using STRING software.
In comparison with normal lung samples, IPF may have more glycoprotein-, signal-, disulfide bond-, and extracellular region-related features according to the gene enrichment (P<0.001) (Figure 5A). Similarly, sarcoidosis also has many extracellular region-, signal-, glycoprotein- and disulfide bond-related features (P<0.001) (Figure 5B). IPF and sarcoidosis are therefore similar with respect to extracellular biological pathways, to which most of the significantly expressed discriminative genes are related. This confirms the previous view that IPF and sarcoidosis are both characterized by extracellular matrix accumulation in lung tissue [1,12,22,35], and to some extent confirms that these disorders share some morphologic similarities [36].
However, the two diseases differ in their most enriched pathways. For IPF the strongest term is "glycoprotein", whereas for sarcoidosis it is "secreted". Glycoprotein is a term related to cell-cell interactions, whereas secreted relates to processes in which chemical compounds are released or exuded. Comparison of IPF with sarcoidosis shows stronger gene enrichment in cilium biogenesis/degradation (P<0.001) and T cell activation (P<0.01) (Figure 5C). This may confirm that cell-cell interaction, for instance the activation of alveolar epithelial cells rather than inflammatory cells, is the main pathogenic event for IPF but not for sarcoidosis, which in turn could support the view that fibrotic sarcoidosis is more similar to hypersensitivity pneumonitis than to IPF [21].
For the comparison between normal and IPF, the significantly expressed genes that are strongly enriched in the "glycoprotein" process show strong interactions at the protein level (Figure 6A). Eleven signature genes, ASPN, POSTN, MMP1, MMP13, CTSK, COL1A1, COL3A1, COMP, FIGF, SPP1, and COL15A1, form the main interactive skeleton. Within this cluster, FIGF is down-regulated and the other ten genes are up-regulated. These interacting signature genes are mainly related to changes in the extracellular matrix and collagen. For instance, ASPN (a cartilage extracellular protein), POSTN (periostin) and SPP1 (a secreted phosphoprotein) are genes that could affect extracellular changes, especially ASPN, which inhibits the expression of transforming growth factor beta 1 (TGF-β1) and also induces collagen mineralization by binding collagen and calcium [37]. The matrix metalloproteinase family proteins MMP1 and MMP13, together with the noncollagenous extracellular matrix protein COMP, are distinctive here, and their reproduction and tissue remodeling activities reflect characteristics of IPF development in the surrounding cell matrix. Similarly, the fibrillar collagens COL1A1, COL3A1 and COL15A1 are overexpressed and may play important roles in collagen accumulation. The lysosomal cysteine proteinase CTSK is also distinctively expressed, and previous investigations have found that it cooperates with fibroblasts in tumor invasiveness [38], possibly suggesting that IPF may possess expansionary tumor-like properties. The distinctive down-regulated gene that interacts with other signature genes is FIGF, also known as vascular endothelial growth factor D. This gene was found to be active in angiogenesis, lymphangiogenesis and endothelial cell growth [39]. Its lower expression, in association with the other signature genes, may indicate that these processes are repressed in IPF.
The significantly expressed genes of sarcoidosis compared with normal tissue that also show protein-protein interactions (Figure 6B) comprise 10 genes: COL3A1, MMP7, POSTN, GREM1, MMP1, BDNF, COMP, FIGF, SPP1, and COL15A1. BDNF and FIGF are down-regulated, and the other eight genes in this interactive skeleton cluster are up-regulated. With respect to collagen formation and extracellular matrix properties, sarcoidosis also shows overexpression of POSTN, SPP1 and GREM1 (a bone morphogenetic protein antagonist), which are involved in collagen mineralization; of the MMP family proteins MMP1 and MMP7 and of COMP, which break down and form extracellular matrix; and of the collagen proteins COL3A1 and COL15A1, which contribute to collagen accumulation. FIGF, which induces vessel, lymph or endothelial cell growth, is also down-regulated, and the nerve growth factor BDNF likewise shows lower expression in sarcoidosis, suggesting that stress responses may be reduced in sarcoidosis, as in Alzheimer's and Huntington's diseases [40].
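The two interactive skeleton gene lists given above (IPF versus normal and sarcoidosis versus normal) can be compared directly with set operations. The gene symbols below are copied from the text; the variable names are only illustrative.

```python
# Interactive skeleton genes as listed in the text for each comparison.
ipf_vs_normal = ["ASPN", "POSTN", "MMP1", "MMP13", "CTSK", "COL1A1",
                 "COL3A1", "COMP", "FIGF", "SPP1", "COL15A1"]
sarcoidosis_vs_normal = ["COL3A1", "MMP7", "POSTN", "GREM1", "MMP1",
                         "BDNF", "COMP", "FIGF", "SPP1", "COL15A1"]

# Genes present in both skeletons, and genes specific to each one.
shared = sorted(set(ipf_vs_normal) & set(sarcoidosis_vs_normal))
ipf_only = sorted(set(ipf_vs_normal) - set(sarcoidosis_vs_normal))
sarcoidosis_only = sorted(set(sarcoidosis_vs_normal) - set(ipf_vs_normal))

print(len(shared), shared)
print(ipf_only)
print(sarcoidosis_only)
```

The intersection contains POSTN, MMP1, COL3A1, COMP, SPP1, COL15A1 and FIGF, while ASPN, MMP13, CTSK and COL1A1 appear only in the IPF skeleton and MMP7, GREM1 and BDNF only in the sarcoidosis skeleton.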
Surprisingly, among the above significantly expressed genes that show strong interactive relationships, seven genes (POSTN, MMP1, COL3A1, COMP, SPP1, COL15A1 and FIGF) behave consistently in IPF and sarcoidosis. This suggests that extracellular matrix formation or degradation and the accumulation of collagen could be the key biological pathways shared by IPF and sarcoidosis. Compared with IPF, sarcoidosis shows significantly up-regulated expression of CD274, IL7R and PAG1, which show strong protein-protein interactions (Figure 6C), whereas the down-regulated genes CCNA1, NME5, SPA17 and SPAG6 also interact with one another. The up-regulated sarcoidosis signature genes on the interactive skeleton (CD274, IL7R and PAG1) are all associated with T cell activation, and the interacting down-regulated genes (CCNA1, NME5, SPA17 and SPAG6) are all associated with cell division. The discriminative transcriptional changes between IPF and sarcoidosis could be revealed mainly by these differing interactive skeletons. The differences reflect how the diseases develop: proliferation is much higher in IPF than in sarcoidosis, but T cell activation is lower. This suggests that the two diseases could differ in pathogenesis with respect to immune responses and proliferation.
[...] Cytoscape, respectively, which gave three gene network complexes, as shown in Figure 7A-7C (the gene lists are in Table S7). Table S7 shows three network features for each comparison. The network features were extracted as modules for each complex. The network features of IPF and sarcoidosis are related to biological pathways of proliferation or differentiation. The discriminative network features that cause the different behaviors involve different signaling pathways. The more reliable pathways concentrate on two aspects. For sarcoidosis, a mediator called BMPR2 (bone morphogenetic protein receptor, type II), which is related to TGF-β1, plays an important role and shows relatively higher expression than in IPF. Recent studies have shown that the TGF-β superfamily, in collaboration with BMP cytokines, can trigger proliferation, tissue regeneration, and angiogenesis, and BMPR2 was reported to be expressed at higher levels on endothelial cells but at lower levels on fibroblasts [41,42]. The mediator MYB (v-myb myeloblastosis viral oncogene), which could trigger tumorigenesis, is overexpressed in IPF compared with sarcoidosis. MYB has been reported to be a transcription factor gene that could sustain and enhance the inflammatory process during breast cancer development [43], which potentially suggests that IPF carries a higher risk of cancer development.
In-depth study of the critical genes that form interactive networks offers insight into pathogenesis. The critical gene networks describe disease development in terms of angiogenesis, neurotrophy and immune activation, which in turn reflect disease morphology. Our results suggest that IPF mainly results from epithelial injury/activation followed by inflammation in response to fibrosis. Excessive inflammation then leads to CD4+ T cell activation followed by the formation of granulomas. Non-self-healing pulmonary sarcoidosis will develop into pulmonary fibrosis owing to genetic variations [14,44,45,46,47]. However, limitations caused by data integration and clustering analysis could lead us to neglect some pathogenic factors.
Although a number of genes have been filtered out for analysis, we can pry out main genomic changes and provide information about pathogenic mechanism. Outlier samples that do not contribute to unsupervised clustering analysis are likely due to other risks of diseases. For instance, smoking has been confirmed to be an environmental factor of IPF [48], and evidence has also shown that sarcoidosis is affected by environmental exposure [49].
Although no complete signaling pathway has been described in this study, and no exact pathogenic mechanisms have been clearly illustrated for IPF and sarcoidosis, our study opened up new insights in genetic change analysis based on multiple databases, and our investigations may be more validated than single studies. Significantly expressed genes, especially those in critical modules used for discriminating different conditions, can be used to draw a regulatory map in disease development. Further studies with the aim of verifying gene expressions and investigating biological functions can at least in theory give a guide for new discoveries in pathogenic research.

Conclusions
In this study, we performed a meta-analysis of published IPF and sarcoidosis datasets and found significant differences between the two diseases at the transcriptional level, although they do share some similarities. Compared with healthy lung tissue, IPF and sarcoidosis tissue are similar with respect to the formation of extracellular matrix and collagen. However, the transcription networks of genes significantly expressed between the two disorders further reveal discriminative signaling regulation by MYB and BMPR2, which suggests different prognoses. Furthermore, we believe that our research can help to elucidate disease mechanisms, and that our method could be used as an essential tool for studying disorders at the transcriptional level.

Table S1
GEO Datasets Summary of IPF and Sarcoidosis. Summary of the datasets that fit all of the criteria we used. The filtered datasets are summarized by accession number, title, summary, organism, platform, total transcripts, total samples (n = 182), sample source, control (normal) sample number (n = 97), control (normal) sample sources, data format, data contributor and the reference in which the datasets were published. (XLS)

Table S3
Sample participation during clustering analysis. To obtain perfect TWC results, outlier samples that overlap between two conditions were filtered out. When comparing normal and IPF samples, 7 out of 27 normal samples and 10 out of 48 IPF samples were filtered out. In the comparison between normal and sarcoidosis samples, 10 out of 27 normal samples and 5 out of 21 sarcoidosis samples were filtered out. In the comparison between IPF and sarcoidosis samples, 32 out of 48 IPF samples and 9 out of 21 sarcoidosis samples were filtered out. Key: positive signs = samples were used; negative signs = samples were filtered out; backslashes = samples were absent. (XLS)

Table S4
Expression patterns for each comparison. Unsupervised clustering analysis (TWC method) was carried out to figure out general expression pattern for each comparison. Between normal and IPF, 20 out of 27 normal and 38 out of 48 IPF can be generally separated by 2,946 up regulated and 466 down regulated genes. Between normal and sarcoidosis, 17 out of 27 normal and 16 out of 21 sarcoidosis samples can be described by 920 up regulated and 176 down regulated genes. By comparing with sarcoidosis, IPF has 2,051 up regulated and 961 down regulated discriminative genes. Table S4 lists all discriminated genes for the above analysis, and the fold change and the corresponding p-value is calculated for each gene. (XLS)