Systems Level Analysis of Systemic Sclerosis Shows a Network of Immune and Profibrotic Pathways Connected with Genetic Polymorphisms

Systemic sclerosis (SSc) is a rare systemic autoimmune disease characterized by skin and organ fibrosis. The pathogenesis of SSc and its progression are poorly understood. The SSc intrinsic gene expression subsets (inflammatory, fibroproliferative, normal-like, and limited) are observed in multiple clinical cohorts of patients with SSc. Analysis of longitudinal skin biopsies suggests that a patient's subset assignment is stable over 6–12 months. Genetically, SSc is multi-factorial with many genetic risk loci for SSc generally and for specific clinical manifestations. Here we identify the genes consistently associated with the intrinsic subsets across three independent cohorts, show the relationship between these genes using a gene-gene interaction network, and place the genetic risk loci in the context of the intrinsic subsets. To identify gene expression modules common to three independent datasets from three different clinical centers, we developed a consensus clustering procedure based on mutual information of partitions, an information theory concept, and performed a meta-analysis of these genome-wide gene expression datasets. We created a gene-gene interaction network of the conserved molecular features across the intrinsic subsets and analyzed their connections with SSc-associated genetic polymorphisms. The network is composed of distinct, but interconnected, components related to interferon activation, M2 macrophages, adaptive immunity, extracellular matrix remodeling, and cell proliferation. The network shows extensive connections between the inflammatory- and fibroproliferative-specific genes. The network also shows connections between these subset-specific genes and 30 SSc-associated polymorphic genes including STAT4, BLK, IRF7, NOTCH4, PLAUR, CSK, IRAK1, and several human leukocyte antigen (HLA) genes. Our analyses suggest that the gene expression changes underlying the SSc subsets may be long-lived, but mechanistically interconnected and related to a patients underlying genetic risk.


Introduction
Genome-scale gene expression profiling of systemic sclerosis (SSc) skin has identified distinct intrinsic molecular subsets (inflammatory, fibroproliferative, and normal-like) within the subset of patients diagnosed with diffuse cutaneous SSc (dSSc) based upon the extent of skin involvement. These subsets are identified by an intrinsic gene analysis [1] that shifts the focus to differences between patients rather than patient biopsies. The inflammatory subset is characterized by increased expression of genes associated with inflammation and extracellular matrix (ECM) deposition, while the fibroproliferative subset is characterized by increased expression of genes associated with cell proliferation [1,2]. Biopsies from patients in the normal-like subset show gene expression most similar to healthy control skin biopsies. The presence of three distinct molecular SSc subsets within patients diagnosed with dSSc underscores the molecular heterogeneity of SSc. However, it is unclear whether the subsets represent distinct diseases with different etiologies or whether they represent disease progression. To address this question, we identified the conserved molecular pathways characteristic of each subset that are reproducible between different datasets from multiple patient cohorts, and examined the connectivity of these genes and SSc-associated polymorphic genes in a predicted functional network.
SSc is a rare disease without validated disease progression markers and no known cure. SSc affects between 49,000-276,000 Americans; one in three patients dies within 10 years of diagnosis [3]. The rarity of SSc makes this disease an excellent test case for a genomic meta-analysis to understand disease mechanism. Using this approach, we have begun to understand the molecular and clinical complexity of SSc. Our findings may assist in generating patient-specific therapies [4] and delivering real-time quantitative feedback regarding therapeutic response during clinical trials [4,5].
High-throughput gene expression data has demonstrated that genes that function together are almost always co-expressed and thus highly correlated with each other [6,7]. Indeed, the expressed genes in a biopsy form a co-expression network, where genes serve as nodes and correlations as links between genes. (The language of networks is technical and beyond the scope of this paper. The supporting information (S1 Text) contains a glossary of keywords that are used in a technical sense in the main body of the paper.) This observation forms part of the basis for ''network medicine'' [8,9]. The co-expression network contains groups of highly correlated genes that represent the biological processes at work in the tissue [6,10]. These groups of genes can be found using a variety of procedures for co-expression clustering (e.g. [6,10]). All of these procedures group highly correlated genes together, i.e. they partition the genome into non-overlapping groups of genes with similar expression patterns sometimes called modules. The output of co-expression clustering is a data-driven partition of the expressed genes in the genome. As a result of technical differences in data acquisition protocols as well as true biological variation (patient heterogeneity, treatment), the exact modules identified in one SSc dataset differ somewhat from those identified in another dataset, although the same pathways, biological processes, and many core genes are found in each dataset. We developed a tool to compare gene co-expression modules derived from multiple disparate datasets to identify the modules that are reproducibly expressed in each dataset. We then derived gene sets from the overlaps between conserved modules across datasets. These ''consensus clusters'' are the gene clusters that are conserved across all datasets.
A naive approach to solving this problem is to simply intersect the ''intrinsic gene lists'' derived for each of the cohorts [1,4,11].
The methodological issue with this approach is that these lists are derived under a large multiple hypothesis testing burden, and although the same biological processes and some genes are found reproducibly, the gene sets do not exactly recapitulate across data sets [11]. This simple intersection approach would be much too conservative and consequently exclude many biologically important genes. Our alternative approach is to consider ''modules first''. In brief, our goal is to identify the modules that are conserved across datasets first and then extract the consensus genes as those that are consistently assigned to those modules. This transfers the multiple testing burden onto the much smaller list of modules and allows genes to be included in the consensus even if they do not achieve extremely high statistical significance in all datasets simultaneously.
We developed this idea into a novel data mining procedure called Mutual Information Consensus Clustering (MICC) to identify conserved gene expression modules across multiple gene expression datasets. Consensus clustering is a set of techniques from computer science and bioinformatics that refers to strategies for extracting robust clusters from an ensemble of partitions. Typically this is done using a large ensemble of partitions. Early work focused on weak clustering algorithms and consensus clustering was used to ''boost'' the weak partitions into an aggregate, consensus partition [12]. In bioinformatics, consensus clustering algorithms have been developed to aggregate ensembles of partitions that are derived from data resampling [13]. These techniques have in common that they do not ''trust'' a particular partition from one of their clustering algorithms. Here, we use a strong clustering algorithm called weighted gene co-expression network analysis (WGCNA). Our ensemble of partitions is the collection that we obtain from having multiple, clustered datasets from independent cohorts. While WGCNA extracts meaningful signals in each data set, the potentially interesting modules in one dataset are not precisely replicated in all others.
Mutual information [14] provides a rigorous criterion by which modules from different datasets can be said to have significant overlap (i.e. are conserved) and allows one to identify when the available information between two partitions is exhausted. Mutual information is a sum of positive and negative contributions from each pair of modules across datasets, and MICC automatically disregards all overlaps that do not contribute positively to the total mutual information, thus giving an objective measure of conserved gene expression that is both comprehensive and parsimonious. As such, MICC is a metaclustering procedure that ''clusters the clusters'' [12], but does not produce a complete partition. Instead, it generates only a partition of the subset of the genome that has strongly conserved gene co-expression.
Using previously published gene expression data from skin biopsies from patients with SSc recruited at three independent academic centers [1,4,11] and new samples analyzed as part of this study (Table 1), we identified the consensus clusters that were present in all datasets. Due to the unbiased nature of highthroughput screening, these datasets contain information about SSc-specific biology as well as the general biology of skin. We showed that MICC yields consensus clusters that are biologically specific. At the level of the whole transcriptome, we demonstrated that the consensus clusters are enriched for hubs defined by coexpression network analysis. We then filtered the consensus clusters down to those that were intrinsic subset-specific. The existence of the intrinsic subsets is a robust observation in each of these studies and the consensus clusters associated with them provide a rigorous picture of the core gene set underlying the subsets. The comprehensive and concise annotation of the conserved differential gene expression that we developed suggests

Author Summary
Systemic sclerosis (SSc) is a rare autoimmune disease characterized by skin thickening (fibrosis) and progressive organ failure. Previous studies of SSc skin biopsies have identified molecular subsets of SSc based upon gene expression termed the inflammatory, fibroproliferative, normal-like, and limited intrinsic subsets. These gene expression signatures are large and although the biological processes are conserved, the exact list of genes can vary across datasets due to random variation, as well as minor differences in the composition of the study cohorts (e.g. early vs. late disease). We developed a computational tool to identify the consensus genes underlying the subsets across heterogeneous data and characterized the biological role of the consensus genes in SSc in order to obtain a systems level perspective of the SSc subsets. Our analysis reveals a complex network of genes connecting two of the major SSc intrinsic subsets, inflammatory and fibroproliferative. Many genetic loci associated with SSc risk show connections with the consensus genes of the intrinsic subsets, indicating that differential expression of genes defining the subsets may be related to genetic risk for SSc, thus for the first time placing the genetic risk factors in the context of, and showing putative relationships with, the intrinsic gene expression subsets.
that the intrinsic subsets represent pathophysiological states of one disease. Our major findings include the following: 1. We show that the subset-specific consensus clusters are part of a gene-gene network and for the first time to our knowledge, demonstrate putative connections between the intrinsic gene expression subsets of SSc and SSc-associated genetic polymorphisms identified by candidate and genome-wide association studies (GWAS). 2. We provide additional unbiased data to support the hypothesis that immune system activation is an early event and plays a central role in SSc pathogenesis as SSc risk alleles are linked to the immune system nodes of our network. 3. The consensus gene-gene network provides insights into genes that may be central to the major disease processes and identifies genes and pathways that may connect these major groups of genes. 4. We show a link between the inflammatory and fibroproliferative patient groups through a shared TGFb/ECM subnetwork, suggesting a theoretical path by which these gene expression subsets may be linked. Collectively, these findings demonstrate that MICC is a powerful tool that identifies the reproducible signals in gene expression data across multiple datasets and shows how they may relate to the genetic polymorphisms associated with SSc.

Results
We analyzed a compendium of three whole transcriptome datasets from SSc skin biopsies (Milano et al. [1], Pendergrass et al. [11], and an expanded version of Hinchcliff et al. [4]; see Materials and Methods). These datasets consist of 70 patients with dSSc, 10 patients with limited SSc (lSSc), 4 morphea samples, and 26 healthy controls (Table 1). Our aim was a comprehensive picture of the gene expression abnormalities in SSc skin and we integrated several publicly available tools with a novel consensus clustering procedure. As demonstrated in Fig. 1, our analysis began with gene coexpression clustering (Fig. 1A), followed by a novel post-processing step called Mutual Information Consensus Clustering (MICC) that identified conserved gene expression modules across the three cohorts (Fig. 1B). The outputs from MICC were consensus clusters, i.e. modules that were conserved across datasets, which were the objects of further study, including ontology annotation and functional interaction analysis (Fig. 1C).
To understand the molecular processes at work in SSc skin biopsies, we constructed data-driven partitions of the expressed genes across multiple SSc skin gene expression datasets using weighted gene co-expression network analysis (WGCNA) [10] (Figs. 1A, 2). Each co-expression cluster, or module, in the partition corresponds to a collection of correlated molecular processes present in the SSc tissue at the time of biopsy. To compare these modules across SSc datasets, we used mutual information to detect when a module from one dataset is present in another dataset. The partitions of the genome-wide expression data vary from one dataset to the next due to clinical heterogeneity and treatment effects, as well as technical variation in RNA processing protocols. All samples were analyzed on Agilent DNA microarrays with the same DNA probes in the same laboratory, providing consistency of the gene expression data and genes analyzed.
To identify genes with conserved expression across multiple datasets, we developed a procedure called Mutual Information Consensus Clustering (MICC) that detects significant conservation of a piece of a module and groups these conserved modules into collections called communities, which are sets of modules with considerable mutual overlap between datasets. Each community is associated with a gene set; namely all genes that are annotated to a module in that community for each dataset. We call these gene sets consensus clusters. The basis of MICC is the concept of mutual information from information theory [14]. Specifically, we use mutual information of partitions (MIP), which is an information measure specific for partitions. MIP quantifies the amount of information one partition has about another; i.e. it measures the correlation of cluster labels across datasets. MICC identifies consensus clusters using MIP to build a module similarity network of significant module overlaps, which we call the information graph (Figs. 1B, 3A). Then MICC algorithmically identifies communities in that network ( Fig. 1B; Fig. 3A, colored nodes). These communities are collections of modules that have substantial overlap among each other, and they represent nearly all of the mutual information between the genomic partitions. In this way, MICC extracts almost all of the available information present in the separate clusterings of individual datasets and reports the clusters that are conserved across the three cohorts (see Materials and Methods for a detailed description).
WGCNA identifies that large numbers of genes in the genome are deregulated in SSc skin Weighted gene coexpression network analysis (WGCNA) [10] is a gene co-expression clustering procedure that automatically detects the number of modules in a dataset and removes outlier genes. WGCNA performed on a single SSc skin dataset (Milano et al. [1], S1 Data file), demonstrates the complexity of comparative studies across multiple datasets (Fig. 2). The molecular subsets termed 'SSc intrinsic subsets' were first identified by Milano et al. [1]. The Milano et al. dataset is the best characterized dataset showing the SSc intrinsic subsets that has been analyzed to date. These data were clustered into modules by WGCNA, and the resulting modules were summarized by their first principal component or module eigengene (Fig. 2). Module eigengenes are a one-dimensional summary of the gene expression within a module that captures the bulk of the variance within that module. To identify those modules that were intrinsic subset-specific, we performed Kruskal-Wallis tests on the module eigengenes with groups defined according to intrinsic subsets. Of the 54 total modules, 23 had a significant p-value for association with the intrinsic subsets (all p,0.05 after Bonferroni correction; Fig. 2 shows six representative examples). These 23 modules comprise approximately 40% of the expressed genes in the genome ( Table 2), demonstrating that the intrinsic subsets found in SSc skin are defined by deregulation of a very large fraction of the expressed genes in any given cell. Gene expression for six significant modules along with their corresponding module eigengenes demonstrates clear association with the intrinsic subsets (Fig. 2). These modules are enriched for broad functional categories previously associated with SSc, including chemokine signaling, NFkB signaling, RAS-RAC signaling in the inflammatory subset, and cell cycle processes in the fibroproliferative subset [1,4,11].
To identify the core set of genes reproducibly found in each SSc intrinsic subset, we performed WGCNA on two additional SSc skin gene expression datasets: Pendergrass et al. [11] and an expanded version of Hinchcliff et al. [4] (Data files S2-S3). The Hinchcliff data are from an ongoing clinical trial of mycophenolate mofetil (MMF). Preliminary data have been published [4], but we also analyzed data for an additional 82 unpublished samples from that trial (the full data available from NCBI GEO at GSE59787). A summary of the three cohorts, including the expanded Hinchcliff cohort, is available in Table 1. Each of the three datasets had approximately 60 modules. The module eigengenes were tested for association with the intrinsic subsets, and it was found that, within a dataset, between 18% and 67% of  the modules were subset-specific (Table 2). This shows that for each dataset a substantial fraction of modules was associated with the intrinsic subsets and, that 4,004-10,373 genes out of the approximately 19,500 in the human genome were in differentially regulated modules associated with the intrinsic subsets (Tables 2  and 3). The gene co-expression modules represent biological processes that are active in skin and some are reflective of disease pathogenesis. To determine which processes were conserved across all three datasets, we constructed the information graph for the three separate WGCNA partitions of the genome (Fig. 3A). The information graph is a network where a node in the network is a module from one dataset, and a link between modules indicates that the overlap between those modules is significantly larger than would be expected at random. In other words, an edge represents conservation of a significant part of a module across two datasets (see Materials and Methods for a detailed discussion of module overlap scores). Triangles in the information graph correspond to a significant three-way overlap of modules or, equivalently, a module conserved across all three datasets. We enumerated all triangles in the information graph to identify all such conserved modules. There were 157 triangles and approximately 2000 genes in their corresponding triple overlaps. Most (129 out of 178) of the modules across all SSc datasets are present in at least one triangle, i.e. most co-expression modules had a significant portion co-expressed in the other datasets ( Table 2, bottom row). This indicates that the WGCNA-derived modules are reproducible features of SSc gene expression. Nine of the triangles had all three nodes (modules) significantly associated with the subsets (five inflammatory, four fibroproliferative; see below and Fig. 3).
The consensus genes are hubs in the gene-gene co-expression networks. To see this, we noted that module eigengenes represent hubs in the gene-gene correlation network [15]. A module eigengene does not correspond to an actual gene, but rather represents a theoretical gene that is most central in the module. . Each module is a set of genes that was highly correlated within a dataset. (B) Modules were compared across datasets using a novel procedure (MICC) to determine which were approximately conserved across all three datasets. The network in (B) is called the information graph and encodes the nontrivial overlaps of modules across datasets. Triangles in this network correspond to approximately conserved modules across all three datasets. Communities in this network (dotted ovals) represent collections of modules that are conserved together and thus have similar biological function. Note that communities in the network can overlap (e.g. module P1 in the schematic belongs to two communities). (C) Genes derived from the module communities are called consensus genes and were used for downstream bioinformatics analyses including gene ontology enrichment analysis using the g:Profiler tool, testing for intrinsic subset-specificity, and functional interaction network analysis using the IMP functional network. Each of these downstream analyses is independent and complementary. doi:10.1371/journal.pcbi.1004005.g001 Therefore, genes that are highly correlated to their module eigengene are more central within their module. We calculated the correlation of each gene to its corresponding module eigengene (S1 Fig.). The density of these gene-eigengene correlations is shown for all genes in the genome (blue curve) and for only the consensus genes (red curve). The consensus genes are significantly more correlated with their module eigengene than randomly selected genes are with their module eigengene, indicating that the consensus genes are significantly enriched for hub genes in their (dataset-specific) co-expression network. This is a useful positive control for the MICC method because it shows that the consensus genes are enriched for ''hubness'' in the SSc co-expression network and thus MICC finds genes that have salient network features.

The information graph reveals conserved, subset-specific molecular modules
While most modules are partially conserved between the three datasets and many of them are intrinsic subset-specific, not all intrinsic subset-specific modules are conserved across all datasets (S4 Data file). To find the conserved, intrinsic subset-specific modules, we noted that the information graph has groups of triangles with considerable mutual edge-sharing (Fig. 3A). Many of the triangles in the information graph overlap and form communities of triangles ( Fig. 3A, S2 Fig.). This was intriguing because it opened up a broader interpretation of ''consensus cluster''. If the information graph had been a disconnected collection of single triangles, this would have implied that there was a one-to-one mapping between the modules from different datasets. Instead, a single module from one dataset gets broken into pieces in the other datasets. The community structure of the information graph indicates what we have known from many prior microarray studies, namely that specific groups of genes are commonly expressed together and that the aggregate set of genes underlying these multiple co-expression clusters constitutes the truly conserved processes in SSc [1,6,16].
We detected communities in the information graph using a variant of clique percolation [17], a network community detection procedure that, in this case, explicitly identifies communities of triangles (S1 Text). Clique percolation identified 26 communities, 13 of which were single, isolated triangles, while the rest were groups of more than one triangle (Fig. 3A, S2 Fig.).
To derive a gene set associated with a community in the information graph, we took all modules within the community, computed their union within datasets, and computed their intersection across datasets (S3 Fig.). In this way, we captured all genes whose co-expression was conserved across the three datasets. (A mathematical description of this procedure is presented in the Gene expression modules associated with the intrinsic subsets of SSc. We identified 54 major sets of genes (modules) using WGCNA that define the spectrum of gene expression in SSc skin using Milano as a test case. The top 6 most significant modules are shown and each shows a statistically significant association with the intrinsic subsets (including the limited subset). Module assignment for each gene is unique. The genes that compose the subset-specific modules represent more than 40% of the protein-coding genes in the human genome. Therefore, the intrinsic subsets seem to be determined by a large fraction of the encoded genes. The module eigengene of each module is shown in a stem-plot below each heatmap with intrinsic subsets indicated by color above the heatmap. Proliferative, red; inflammatory, purple; limited, yellow; normal-like, green. Materials and Methods.) We termed these community-derived gene sets consensus clusters (CCs). Using g:Profiler [18], we found that the consensus clusters are enriched for many biological processes (summary in Table 3; raw data in S5 Data file) present in both healthy and SSc biopsies. For example, CCs 1, 4, 5, 7, 8, and 11 are enriched for basic metabolic and cellular processes, while CC 12 showed enrichment for keratinocyte-specific processes (Table 3; Fig. 3A, cyan). These consensus clusters show  Fig.), indicating approximate conservation of gene expression modules across datasets. The information graph is tripartite by construction, so a triangle in the graph necessarily connects modules across all three datasets. The triangles form communities of mutual edge sharing. Colored nodes and edges highlight four of these communities. The purple community contains modules that are up-regulated in the inflammatory subset (cf. panel B). The red community contains modules that are up-regulated in the fibroproliferative subset (cf. panel B). The cyan community contains modules that are enriched for keratinocyte-specific processes. The orange community contains modules that are enriched for fatty acid metabolism genes. The remaining communities (22 in all and not colored to avoid cluttering the display) are enriched primarily for housekeeping processes and are neither skin-nor disease-specific (see Table 3). (B) Modules from the communities were tested for their enrichment in the subsets. Each row corresponds to a triangle in the information graph and each column corresponds to a dataset. The black lines separate communities, e.g. all of the rows in the block marked ''1'' correspond triangles in community 1. The cells are colored according to whether the module was significantly differentially expressed in a subset with dark colors representing up-regulation and light colors representing down-regulation (Bonferroni-corrected Wilcoxon rank sum p-value p,0.05). We assessed statistical significance of modules within each dataset for each of the three diffuse SSc intrinsic subsets, as well as all SSc vs. healthy controls (Purple-Inflammatory, Red-Proliferation, Green-Normal-like, Blue-All SSc). Note the inflammatory up community (*) and the fibroproliferative up community (**). Note also that community 2 is significantly highly expressed in the inflammatory subset and lowly expressed in the proliferative subset in Milano only. Likewise, community 9 appears to be expressed at low levels in the inflammatory subset in Milano, but none of the other data sets. doi:10.1371/journal.pcbi.1004005.g003 Portion conserved 59% 76% 81% The number of modules identified by WGCNA varied across datasets. However, a large majority of modules from each dataset were at least partially conserved across the three cohorts, meaning that they were present in at least one triangle of the information graph (Fig. 3A). A smaller fraction of the modules were subset-specific within their dataset. The Hinchcliff dataset had the largest fraction of subset-specific modules, which may be due to subject enrollment in the cohort rather than to SSc biology. The Milano dataset had the largest number of genes present in subset-specific modules. This shows that the subsets are associated with a large number of genes (up to 40% of the genome in Milano  that MICC extracts biologically coherent sets of genes that are known to be active in skin as consensus clusters. This provides an additional positive control for the MICC method. More importantly, CC3 and CC9 showed enrichment for processes implicated in SSc (Table 3; S5 Data file). CC3 was enriched for response to interferons, B cell receptor signaling, monocyte chemotaxis, and TGFb and PDGF signaling, as well as ECM remodeling processes. CC9 showed enrichment for cell cycle and cell proliferation processes, as well as integrin interactions with fibrin. Note that CC3 and CC9 both show enrichment for distinct ECM-related molecular processes. These data are consistent with the analysis of experimentally derived pathway signatures [19].
The consensus clusters CC3 and CC9 map to the major intrinsic subsets previously described [1]. We tested every module for association with the intrinsic subsets (see Materials and Methods) and we constructed a ''heatmap'' of the triangles in the information graph by dataset (Fig. 3B). The rows were ordered by community membership and the columns were ordered by dataset. We concatenated each of these plots so all subsets, datasets, and consensus clusters can be viewed simultaneously. Only consensus clusters 3 and 9 were enriched for SSc intrinsic subset specificity. Consensus cluster 3 contained modules that are almost all significantly expressed at high levels in the inflammatory group of patients (Fig. 3A, purple nodes; Fig. 3B). Consensus cluster 9 contains modules that are almost all significantly expressed at high levels in the fibroproliferative group (Fig. 3A, red nodes; Fig. 3B). We also included tests for all SSc biopsies versus healthy controls to determine if there were any consensus clusters that were generally conserved across all SSc biopsies. There were no consensus clusters that were enriched for all SSc versus healthy controls, which illustrates quantitatively SSc heterogeneity. Furthermore, there were no consensus clusters that were consistently expressed at low levels in any of the subsets.
Some consensus clusters are enriched for a subset in some of the datasets, but are not replicated across all datasets (Fig. 3B). For example, CC2 is expressed at high levels in the inflammatory subset and low levels in the proliferative subset in Milano, but neither of the other datasets. Inflammatory-specific CC3 is expressed at low levels in the proliferation subset in Milano and in the normal-like subset in Pendergrass and Hinchcliff, and is expressed at high levels in all SSc versus healthy controls in Pendergrass and Hinchcliff only. Similarly, CC9, which is proliferative-specific, is expressed at low levels in the inflammatory subset in Milano only. These observations demonstrate that genes with increased expression should be the focus in SSc.

Conserved inflammatory and fibrosis genes form a network with putative SSc risk alleles
The biology of CC3 and CC9 show the processes common to the intrinsic subsets that have been observed across multiple gene An analysis with g:Profiler resulted in many biological processes enriched in the each of consensus clusters. This table contains a condensed list of the significant pathways (p,0.05, corrected for multiple testing by default in g:Profiler), retaining those that are specific for skin or other housekeeping biology. Note particularly consensus clusters 3 and 9, which are enriched for the inflammatory and fibroproliferative subsets respectively. Consensus clusters 3, 9 11, and 12 are highlighted in expression datasets: inflammation, cell interactions with ECM, and cell proliferation (Table 3). To determine if there was a more interconnected relationship between these conserved processes (such as genes related to specific cell types) than could be gained from an ontological annotation analysis like g:Profiler, we used CC3 and CC9 as a query gene set for the IMP gene-gene interaction Bayesian network (IMP) (Fig. 4) [20]. IMP is a genegene interaction network developed using a large compendium of high-throughput biological data including all publicly available microarray data that predicts the probability that pairs of genes have a co-expression interaction. A list of genes is imported into IMP, and a list of high-probability interactions between the genes on the imported list and (up to 50 additional genes in) the rest of the genome is generated. IMP is completely agnostic to SScspecific biology and reports predicted interactions that are based on the preponderance of evidence across all publicly available gene expression data. As our query, we pooled the two consensus clusters CC3 and CC9 to discover possible molecular links between the inflammatory and fibroproliferative intrinsic subsets. We added polymorphic genes from genome-wide association studies (GWAS), as well as genes from candidate gene studies that have been replicated in at least one follow up study (see Materials and Methods; S6 Data file). In addition, we added four genes that are putative predictors of Modified Rodnan Skin Score (MRSS), a widely used clinical measure of skin fibrosis [5]. The output network from IMP was dominated by one large interconnected network that had five distinct subnetworks ( Fig. 4; S7-S9 Data files). The five molecular subnetworks were each enriched for a distinct biological process: interferon response, M2 macrophage activation, adaptive immunity, ECM deposition and remodeling and TGFb signaling, and cell proliferation.
One subnetwork was dominated by interferons and interferoninducible genes (Fig. 4, top middle; S9 Data file). The interferon subnetwork contained genes solely from the inflammatory consensus cluster (Fig. 4, purple nodes). This subnetwork contained the interferon inducible genes IFI16 and IFI44, the latter Fig. 4. Molecular network of inflammatory and fibroproliferative consensus genes. The consensus genes for the inflammatory and fibroproliferative subsets are connected in the IMP functional network. Inflammatory genes are colored purple, while fibroproliferative genes are colored red. Genes with polymorphisms are colored in green and MRSS biomarker genes are colored yellow. One MRSS biomarker gene (IFI44) was also an inflammatory consensus gene (pink), while three polymorphic genes were inflammatory consensus genes (turquoise). Note the five distinct subnetworks corresponding to type I interferons, M2 macrophages, ECM proteins and TGFb signaling, adaptive immunity, and cell proliferation. The interferon, M2 macrophage, and adaptive immunity subnetworks are composed almost exclusively of inflammatory genes, while the ECM subnetwork shares genes from both intrinsic subsets. Furthermore, the polymorphic genes interact primarily with inflammatory subset genes indicating that the genetic risk in SSc is related to immune abnormalities. doi:10.1371/journal.pcbi.1004005.g004 of which is a putative biomarker of fibrosis [5]. This subnetwork also contains the polymorphic interferon regulatory factor genes IRF5, IRF7, and IRF8.
A second subnetwork contained genes characteristic of M2 macrophage activation (Fig. 4, bottom left; S9 Data file). The genes in this network, which include major histocompatibility complex (MHC) class II genes with SSc-associated polymorphisms, are derived primarily from the inflammatory consensus cluster, implicating macrophages as mediators of inflammation. Polarized macrophages can broadly be categorized as ''classically activated'' (M1) or ''alternatively activated'' (M2), although it is important to recognize that macrophage polarization encompasses a broad spectrum of activation states. M1 macrophages may be elicited through stimulation with IFN-c and LPS, are microbicidal, and promote Th1-mediated immune responses. In contrast, M2 cells, which mediate immune suppression, may be activated by various stimuli, including IL-4 and/or IL-13, which are elevated in SSc sera [21,22]. Genes associated with M2 activation, including CX3CR1 [23], IL10R [24], and HLA-DMB [25], were consistently expressed in this subnetwork, in accordance with previous studies that found increased M2-polarized macrophages in SSc skin compared to healthy skin [26]. As M2-polarized cells regulate vascularization and are a potent source of TGFb, PDGF, and inflammatory cytokines [27][28][29], activated M2 macrophages may play a role in mediating fibrosis and inflammation in SSc.
A third molecular subnetwork contained genes related to adaptive immunity (Fig. 4, top left; S9 Data file). There are relationships to both B and T cells in the genes in this subnetwork. Two chains of the T cell receptor complex are represented: CD3G (gamma chain of the T cell receptor (CD3)) and CD247 (the zeta chain of the T cell receptor), which contains SSc-associated polymorphisms. The IL-12 pathway, which mediates Th1 cell differentiation and activation [30,31], is represented through IL12RB2. Binding of IL-12 to IL12RB2 on activated T cells initiates a signal transduction cascade that results in activation of STAT transcription factors, including STAT4 [32] (also represented in this subnetwork), which regulate T cell signaling and immune activation [33]. Aberrant expression of IL12RB2 has been reported in autoimmune and infectious diseases [34,35], implicating this gene as an important regulator of inflammation and immune defense.
B cell receptor activation and signaling are also represented in this subnetwork. DOCK10 expression is up-regulated in B cells by pro-inflammatory IL-4 [36], and BANK1 and BLK are B cell proteins that have polymorphisms associated with SSc. Both LYN and CSK appear in this subnetwork and are directly connected to each other. The tyrosine kinase LYN, which plays a critical role in down-regulating B cell activation and mediating self-tolerance [37,38], is phosphorylated by CSK [39]. Polymorphisms in CSK have been linked to both SSc and systemic lupus erythematosus (SLE) and are associated with aberrant B cell signaling [40]. CSK also associates with Lyp [41], which is the product of the tyrosine phosphatase PTPN22. The PTPN22 gene also contains an SScassociated polymorphism. Mutations in PTPN22 that interfere with its ability to bind to CSK also interfere with both B and T cell receptor activation [42,43]. Moreover, mutations in PTPN22 have been reported in a variety of other autoimmune diseases, including SLE, rheumatoid arthritis, and type 1 diabetes [44].
Negative regulators of B and T cell activation such as SOCS2 and SOCS3, are included in this network. SOCS3 has been shown to directly inhibit IL-12-induced STAT4 activation [45]. The cooccurrence of pro-and anti-inflammatory signals in this subnetwork is notable and is likely because our data are derived from whole skin biopsies (see Discussion).
The fourth molecular subnetwork contained TGFb pathway genes (which have long been implicated in the activation of fibrosis in SSc [46,47]) and ECM structural proteins (Fig. 4, bottom middle). This TGFb/ECM subnetwork contained genes from both the inflammatory and fibroproliferative consensus clusters (Fig. 4, red and purple nodes; S9 Data file). We also found expression of genes associated with Notch signaling such as NOTCH4, which contains SSc-associated polymorphisms, and with the epithelialmesenchymal transition (EMT) such as LATS2. Alternatively activated macrophages are known to produce large quantities of TGFb in SSc pulmonary fibrosis [29], suggesting that the M2 macrophage subnetwork could drive activation of the TGFb/ ECM subnetwork.
The final molecular subnetwork contained cell cycle/cell proliferation genes, which were primarily from the fibroproliferative consensus cluster (Fig. 4, right). The expression of proliferation genes is commonly observed in cancer [2,7,48] and their presence in the gene expression data of SSc was a surprising and unexpected finding [1]. The large and densely interconnected subnetwork of genes in Fig. 4 (right, red nodes) was composed almost exclusively of cell cycle-regulated genes including AURKA/ B, CCNA2, CCNB1, CHK1, and DHFR [7]. This subnetwork was conserved and showed increased expression in the fibroproliferative subset of patients across all three cohorts, and constituted the core gene expression signature in that subset of patients (Fig. 4, red nodes). Therefore, the cell proliferation signature of the fibroproliferative subset of patients first observed in Milano et al. [1] is a conserved feature of SSc across three independent cohorts from three separate clinical centers. This molecular subnetwork has connections to each of the other four subnetworks (interferon, M2 macrophages, adaptive immunity, and TGFb/ECM) suggesting that cell proliferation in SSc skin is modulated by the inflammatory and ECM remodeling processes in skin.
IMP predicts that the genes linked to SSc-associated polymorphisms (30/41 total) and the putative MRSS biomarker genes of Lafyatis and co-workers (4/4) have interactions within this large component of the molecular network (Fig. 4). Polymorphisms in IRF5, IRF7, and IRF8 were linked to the interferon subnetwork. IRF7 is also differentially expressed in the inflammatory subset. The polymorphisms associated with human leukocyte antigen (HLA) alleles predominantly have interactions with the M2 macrophage subnetwork of genes. Polymorphisms in and differential expression of NOTCH4 were linked to the TGFb/ECM subnetwork. The same was true for the MRSS biomarker genes; IFI44 was linked to the interferon subnetwork; SIGLEC1 was linked to the M2 macrophage subnetwork; and both COMP and THBS1 were linked to the TGFb/ECM subnetwork. These results suggest that prediction of worsening skin disease requires sampling genes from each molecular subnetwork.

The molecular network contains SSc-pathology-specific hubs
The molecular network contains genes that are hubs (i.e. highly connected nodes) of the subnetworks.
Interferon-induced protein 44 (IFI44) is a hub of the interferon subnetwork. It has conserved high expression across all three of our cohorts in the inflammatory subset and is one of the most highly connected genes in the interferon subnetwork (Fig. 5, top  right). IFI44 is predicted to have co-expression interactions with several other interferon-inducible and interferon-regulating genes, including IFI16, IRF7, IFITM2, ISG20, GBP1, and TRIM22.
Allograft Inflammatory Factor 1 (AIF-1) is a hub of the M2 macrophage. AIF1 is consistently highly expressed in the inflammatory subset across all three SSc skin cohorts and is one of the most highly connected genes in the M2 macrophage subnetwork (Fig. 5, bottom left). In the molecular network (Fig. 5, bottom left), AIF1 has many links including: ITGB2, a binding partner of the monocyte marker ITGAL, and MHC class II genes HLA-DMB, HLA-DPA1, and HLA-DQB1. In addition, AIF1 has connections to chemokine receptors CCR1 and CX3CR1, which are connected to chemokines CX3CL1 (fractalkine) and CCL2 (MCP-1).
The tyrosine kinase gene LYN is a hub of the adaptive immunity subnetwork (Fig. 5, top left). LYN has predicted edges with four polymorphic genes in this subnetwork: BLK, BANK1, CSK, and GRB10. LYN also has connections to the polymorphic, bridge genes PLAUR and LCP2 (see below), and suppressors of cytokine signaling genes SOCS2 and SOCS3. The conserved finding of high expression of LYN in the inflammatory subset and its centrality within the adaptive immune subnetwork suggests that LYN plays a key role in the adaptive immune component of SSc in skin.
Fibrillin-1 (FBN1) is a hub of the TGFb/ECM subnetwork (Fig. 5, bottom right). High expression of FBN1 is conserved across the inflammatory subset of all three cohorts of SSc skin, and FBN1 is highly connected within the TGFb/ECM subnetwork of the molecular network (Fig. 5, bottom right). The TGFb/ECM subnetwork includes genes that primarily show high expression in the inflammatory subset but also includes genes that are highly expressed in the fibroproliferative group, thus providing a putative molecular link between the two groups. FBN1 has predicted connections to many genes whose increased expression is conserved, including: pro-fibrotic genes including COL1A2, COL5A2, and elastin (ELN), CTGF, SPARC, THBS1, THBS4, COMP, TNC and ECM remodeling and wound response genes LOX, NNMT, and FBLN5. In addition, FBN1 has connections with growth factor genes and receptors such as HTRA1 and NOTCH4; cell adhesion genes CDH11 and LAMA4; as well as the complement system gene C1S.

The molecular network shows genes that bridge subnetworks
In addition to containing discrete subnetworks, the molecular network also shows genes that bridge the subnetworks (Fig. 6). These genes are of particular interest because they have predicted connections between multiple, distinct subnetworks. The primary reason for using CC3 and CC9 simultaneously as queries to IMP Fig. 5. Hubs in the inflammatory and ECM components of the network. The putative MRSS biomarker gene IFI44 is a hub of the type 1 interferon subnetwork. AIF1, which contains SSc-associated polymorphisms and is related to M2 macrophage polarization, is a hub of the M2 macrophage network. FBN1, which contains SSc-associated polymorphisms in some populations and is a key component of ECM that regulates matrix stiffness, is a hub of the TGFb/ECM network. The tyrosine kinase gene LYN is associated with B cell activation and mediating self-tolerance and is a hub in the adaptive immunity subnetwork. doi:10.1371/journal.pcbi.1004005.g005 was to identify possible molecular connections between the core molecular processes of the inflammatory and fibroproliferative intrinsic subsets. The bridge genes live at the interfaces between the subnetworks that constitute these core molecular processes.
The genes CXCR4 and LCP2 are the major connections between the adaptive immunity subnetwork and the M2 macrophage subnetwork (Fig. 6). LCP2 (SLP-76), which modulates T cell activation [49], has predicted interactions with AIF1 and IL10RA in the M2 macrophage subnetwork and to SOCS2, SOCS3, STAT4, LYN, and CSK in the adaptive immunity subnetwork (see edges extending from LCP2 in Fig. 6). The chemokine CXCR4 has predicted interactions with the cytokines/ chemokines IL10RA, CX3CR1, CCR1, and the polymorphic CCR6 in the M2 macrophage subnetwork (Fig. 6). CXCR4 has predicted interactions with SOCS3 and JAK3 in the adaptive immunity subnetwork.
GRB10 contains an SSc-associated polymorphism and is also expressed at high levels in the inflammatory subset (see blue GRB10 node, Fig. 6). GRB10 is part of a complex path from the adaptive immune subnetwork to the M2 macrophage subnetwork that includes genes containing pleckstrin homology domains including PLEKHO1, PLEKHO2, CYTH4 and ADAP2.
The major connection between the M2 macrophage subnetwork hub AIF1 and the interferon subnetwork hub IFI44 is through RAC2. RAC2 encodes a member of the Rac family of signaling molecules and has multiple predicted interactions with both the interferon subnetwork and the M2 macrophage subnetwork (Fig. 6). In the interferon subnetwork (Fig. 6, upper middle), RAC2 connects to CTSC (cathepsin C), IFITM1 and IFI16, as well as the Rho GTPase related genes ARHGDIB and RAB31. In the M2 macrophage subnetwork (Fig. 6, lower left), RAC2 connects to ITGB2, the actin cytoskeleton related proteins LCP1 and COTL1, and GMFG. COTL1 is also related to leukotriene biosynthesis through a known interaction with ALOX5. These diverse interactions suggest that RAC2 is involved simultaneously in macrophage motility, leukotriene biosynthesis, and interferon signaling.
The major bridges between the M2 macrophage subnetwork and the ECM subnetwork are THY1 (CD90) and CD14 (Fig. 6, lower left). THY1 connects to SIGLEC1, MXRA5 and COL1A2. THY1 mediates adhesion of leukocytes and monocytes to endothelial cells and fibroblasts [50], may also have a role in lung fibrosis (a major complication of SSc); THY1 knockout mice have increased lung fibrosis [51,52]. CD14 is a cell surface protein PLAUR is a gene that contains SSc-associated polymorphisms that forms a bridge between the interferon subnetwork and TGFb/ECM subnetwork. The gene RAC2 is a bridge between the interferon and M2 macrophage subnetworks. The genes LCP2 and CXCR4 are bridges between the M2 macrophage subnetwork and the adaptive immunity subnetwork. There are also several paths through GRB10 to ADAP2 between the M2 macrophage subnetwork and the adaptive immunity subnetwork. The genes CD14 and THY1 (CD90) are bridges between the M2 macrophage subnetwork and the TGFb/ECM subnetwork. The genes IRAK1 and PXK are bridges between the TGFb/ECM subnetwork and the cell proliferation subnetwork. doi:10.1371/journal.pcbi.1004005.g006 mainly expressed by macrophages, is inducible by and connected to AIF1 [53]. It also has connections to the polymorphic genes TLR2 and HLA-DRA (Fig. 6, lower left).
PLAUR (UPAR) contains a putative SSc-associated polymorphism, is a member of the interferon subnetwork, and has numerous links with the ECM, M2 macrophage, and the adaptive immunity subnetworks (Fig. 6). PLAUR encodes the plasminogen activator, urokinase receptor protein and is a pleiotropic gene at the interface of ECM remodeling, as a component of the fibrinolysis system, and in both adaptive and innate immune processes, including monocyte migration [54]. PLAUR is inducible by proinflammatory cytokines IL1b and TNFa. PLAUR connects to the tyrosine kinase LYN, the hub gene of the adaptive immunity subnetwork, and to the integrin gene ITGB2 in the M2 macrophage subnetwork. It is also connected to the polymorphic genes TNFSF10B and TNFAIP in the interferon network and to TPM4, INHBA, THBS1, and CCL2 in the ECM subnetwork. The centrality of PLAUR within the consensus gene network suggests that PLAUR may be a key mediator of inflammatory and ECM remodeling signals in SSc skin.
The proliferation subnetwork has predicted interactions with the inflammatory and ECM subnetworks. The most pronounced connection is between the ECM subnetwork and the cell proliferation subnetwork through TGFb pathway genes (Fig. 6). The TGFb pathway is known to modulate cell proliferation. There are multiple paths from the TGFb pathway genes TGFB3 and TGFBR2 to the cell proliferation subnetwork through the polymorphic genes IRAK1 and PXK, which have predicted interactions with the serine/threonine kinases LATS2, WNK4, and PRKAA1. Serine/threonine kinases are well known to be important regulators of cell proliferation and they are bridges between the ECM subnetwork and cell proliferation network.

Discussion
The intrinsic subsets of SSc have been found in multiple skin gene expression datasets. Until now, the majority of experimental data has indicated that the subsets are mutually exclusive-i.e. patients are categorized as being in one of the subsets, and that the core molecular processes and subsets of genes, are reproducible across cohorts. Despite this consistency, the exact set of intrinsic genes varies across datasets. We address both of these issues here. Our consensus clustering approach allowed us to detect a conserved set of genes from a module perspective across the three independent SSc patient cohorts by considering molecular processes first and constituent genes second. The predicted coexpression interactions between these consensus genes indicate that the key processes represented by the consensus genes (inflammation, ECM remodeling, and cell proliferation) may interact at a molecular level, with specific links between the subnetworks. Thus, we have demonstrated theoretical connections between the genes of the SSc intrinsic subsets that are difficult to capture experimentally.
It is clear that in addition to its clinical heterogeneity, SSc is a genetically complex disease. Many risk alleles for SSc have been identified, but each has only a modest odds ratio and the complete picture of SSc will likely develop from the interactions between various risk factors. The network of consensus genes demonstrates that a significant fraction of the genes with risk alleles for SSc have probable interactions with the consensus genes that underlie the intrinsic gene expression subsets. This implicates these polymorphic genes as interacting with genes differentially expressed in the subsets. This simultaneously provides a picture of the key gene expression abnormalities in the intrinsic subsets and the validated genetic associations at a systems level.
These data and the resulting network were developed from a detailed meta-analysis of SSc skin gene expression datasets using MICC, a consensus clustering framework we developed. Our method reports only consensus clusters that are conserved across all input datasets and dispenses with non-conserved gene expression. The concept of mutual information gives MICC a theoretical foundation, but like any data mining algorithm, its value is gauged by performance on real data.
The rationale for gene coexpression clustering algorithms like WGCNA is that co-expression networks are inherently modular and that co-expression hub genes are likely related to the regulation of the modules. This has been borne out by several studies in humans [55], mice [56], and even across species [57]. The genes identified by MICC are disproportionately more hublike than a random population of the same size (S1 Fig.). Therefore, MICC does not identify spurious overlaps but rather detects network-relevant overlaps that are enriched for key hub genes. At the same time, the information graph used by MICC is not simply a disconnected set of triangles, which would indicate a one-to-one mapping of modules between datasets. Instead, the modules in one dataset are broken into a small set of pieces that are re-assorted to build the modules in another dataset. This is likely due to variations in study design and protocols between the datasets, but also the inherent heterogeneity of SSc, therapy effects, and environmental exposures. The MICC method is explicitly designed to handle this unavoidable variance by broadening the definition of consensus cluster to allow for imperfect conservation of gene coexpression. We also note that MICC is completely general with respect to the data that are clustered and which clustering algorithms are used. In principle, gene expression from different tissues (e.g. blood and skin) or different species (e.g. mouse and human) or data from multiple experimental modalities (e.g. transcriptomics and proteomics) can be compared using MICC. These types of data exist for multiple tissues in SSc and multiple animal models of SSc. Follow-up studies will integrate these to further elaborate the molecular underpinnings of SSc.
The consensus clusters from MICC show both skin-specific processes that represent basic biological processes in this tissue as well as disease-specific processes. Nearly all (24 out of 26) consensus clusters are enriched only for general cellular or otherwise skin-specific biology: metabolism, cell turnover, keratinocyte-specific gene expression, etc. We view these consensus clusters as a useful positive control for the MICC method. Such housekeeping processes are clearly biologically relevant and MICC would be missing important structure in the data if these were not found. By taking a ''module first'' approach, MICC is able to identify consensus genes that are specifically clustered into pathologically active modules (the subset-specific modules).

The MICC-derived consensus clusters are enriched for known mediators of SSc pathology and genetic risk factors
Two of the consensus clusters were SSc subset-specific (Fig. 3B). These clusters contain the key gene expression abnormalities in SSc that are conserved across all three cohorts. The consensus clusters are enriched for inflammatory process Gene Ontology terms, as well as TGFb signaling, PDGF signaling, and cell proliferation (Table 3). Most (30 out of 41) of the genes with replicated SSc-associated polymorphisms are predicted to interact with genes in the consensus clusters; 28 out of 30 of these interact in the immune (interferon, M2 macrophage, and adaptive immunity) and TGFb/ECM subnetworks (Fig. 4). The inflammatory-specific consensus cluster also contains the genes FBN1 and AIF1. Previous work implicates FBN1 in SSc pathogenesis, as a duplication of FBN1 causes fibrosis in the Tsk1 mouse [58] and a point mutation in FBN1 causes the fibrotic phenotype in the Stiff Skin Syndrome mouse [59]. Fibrillin-1 forms a matrix of elastic microfibrils that provide a scaffold for elastins and collagens, and a means for sequestering matricellular growth factors. Mouse embryonic fibroblasts expressing the Tsk1 mutant FBN1 have altered microfibril morphology that results in increased collagen deposition [58]. While polymorphisms in FBN1 might cause dosage effects that result in fibrosis in some models (e.g. Tsk1), it is possible that chronic inflammation causes chronic high expression of FBN1 to similar effect in humans. Rare polymorphisms in FBN1 have been associated with SSc in some subpopulations [60][61][62].
Similarly, AIF1 is implicated in SSc disease progression. A SNP in AIF1 has been implicated in anticentromere antibody (ACA) positive SSc [63]. Moreover, AIF-1 is interferon-inducible, constitutively expressed in macrophages [64], and plays a role in vasculogenesis and endothelial cell proliferation and migration [65]. In the Sclerodermatous Graft-Versus-Host Disease (sclGVHD) mouse model of SSc, AIF1 was found to be highly expressed in skin [66] and to induce fibroblast and monocyte chemotaxis [53]. AIF1 has many predicted interactions with chemokine receptors CCR1 and CX3CR1, which are connected to chemokines CX3CL1 (fractalkine) and CCL2 (MCP-1). The genes CX3CL1 and CCL2 are M1 and M2 macrophage-related genes respectively [67] and are chemotactic for monocytes, macrophages, and T cells [68], suggesting enhanced recruitment of inflammatory cells to this subnetwork. A recent study of a mouse model of SSc demonstrated that both CCR2 and CX3CR1 regulate skin fibrosis, further implicating these mediators in the pathogenesis of SSc [69]. In addition, CCL2 has been shown to induce M2 macrophage polarization [70], which may result in persistent M2 activation. The repeated and conserved finding of high AIF-1 levels in the inflammatory subset and its tight connection to innate immune mediators of inflammation suggest it may be involved in enhanced macrophage chemotaxis and activation in SSc skin.
LYN, a hub of the adaptive immunity subnetwork, modulates B cell activation and plays a role in self-tolerance. B cell signaling has been implicated in SSc development and progression, as B cells have been shown to play a role in both the development of autoantibodies and cutaneous fibrosis in the Tight Skin 1 (Tsk1) mouse model of SSc. Notably, LYN is overactive in response to overexpression of CD19 in this model [71]. Thus, LYN may play a role in the autoimmune component of SSc in human patients.

The molecular network shows putative connections between subnetworks
The consensus gene network (Figs. 4 and 6) also implicates genes as bridges between the subnetworks. These notably include the polymorphic genes PLAUR, IRAK1, PXK, and GRB10. In addition, we find differentially expressed genes straddling the subnetworks including RAC2 and LCP2. The interconnections between the subnetworks present possible molecular paths through which these processes interact.
The finding that most SSc-associated polymorphisms are associated with immune system mediators suggests that the initial events in SSc are likely to be immune-regulated and to involve interferon activation (Fig. 7). The immune response in SSc likely differs from a normal response because of predisposing genetic variants in these and associated genes. This may lead to the secondary recruitment of macrophages via RAS-RAC signaling (Fig. 7).
We predict that the interferon network suppresses cell proliferation, given the clear distinction between the inflammatory and fibroproliferative subgroups. This inference is based on known interferon biology and not on the network itself. In contrast, it is possible that the ECM network stimulates cell proliferation through the TGFb pathway and serine/threonine kinases IRAK1, LATS2, WNK4, and PRKAA1. In this model, inflammatory gene expression creates a balancing feedback loop that modulates fibroproliferative gene expression (Fig. 7).
A major strength of the IMP network and its data integration capabilities derives from its ability to provide a more detailed picture of SSc development and progression compared with more conventional approaches. For example, while all of the purple nodes in Fig. 4 are highly expressed in the inflammatory group across all data sets, the IMP network provides information regarding gene-gene interactions in addition to expression data. In this example, the IMP network indicates which subnetworks correspond to discrete processes (interferon, M2 macrophages, ECM, and adaptive immunity) and which interactions are mediated through the network. Thus, we gain insight by recognizing that the interferon component is distinct from the M2 macrophage component, despite their co-expression and known interdependence. The value of the IMP network is as much in the connections that are not present as those that are.
What is the nature of the interactions between the intrinsic gene expression subsets?
Since the original publication of the intrinsic subsets, two important questions have been central to their interpretation and their clinical relevance: First, can a patient's subset change over the course of their disease? And second, can the subsets predict therapeutic response?
Pendergrass et al. [11] demonstrated that a patient's subset is stable over time scales of 6 to 12 months. This means either that patients never change subsets and the intrinsic subsets are effectively distinct diseases, or that the subsets are long-lived states of the same disease. Our analysis shows that the inflammatory and fibroproliferative subsets share a molecular network containing TGFb pathway genes and ECM component genes, suggesting that inflammatory patients may transition to the fibroproliferative subset, perhaps in response to successful immunosuppressive therapy. Indeed, immunosuppressive therapy has not been widely successful for treatment of SSc [72]. On the other hand, fibroproliferative biopsies still have some activation of the TGFb/ECM network despite the absence of the inflammatory signature (Fig. 4). The connection of the subsets through the TGFb/ECM subnetwork indicates that the fibroproliferative subset shares a common pathway with the inflammatory subset and that the fibroproliferative subset is tied to chronic TGFb activation and ECM deposition. Thus, based on the molecular network, it is possible that immunosuppressive therapy can move patients to the fibroproliferative subset rather than restoring their gene expression to that of healthy skin. Our data from an ongoing MMF clinical trial and analysis of mouse models of SSc suggests that gene expression changes precede clinical changes [4,66]; therefore gene expression could act as a readout for the effectiveness of a drug. This idea should be rigorously tested in clinical trials that carefully monitor gene expression in patient skin biopsies.
The pathogenesis of SSc has been enigmatic, but a number of genetic risk factors have been identified by genome-wide association studies and candidate gene studies. Three of these polymorphic genes, NOTCH4, IRF7, and GRB10, are in the inflammatory consensus cluster, and hence are consistently differentially expressed in the inflammatory subset (Fig. 4). This suggests that these may be cis-acting alleles and demonstrates the need for candidate gene studies to determine if differential expression is genetically driven in a subset of patients. The IMP functional network predicts that twenty-five of the remaining fortyone polymorphic genes interact with genes from the inflammatory consensus cluster (Fig. 4). Rather than being scattered evenly across all of the subsets or unrelated to any of the consensus genes, the risk alleles are overwhelmingly related to the inflammatory subset. The genetic studies, however, did not stratify their patients by intrinsic gene expression subset. The studies were carried out as case versus control or case versus case, when stratified by autoantibody status or other clinical outcomes. Risk alleles associated with a particular gene expression subset have not been reported. We reemphasize the fact that we found no consensus clusters that were differentially regulated in all SSc vs. healthy control biopsies.
These data support the hypothesis that the subsets are related to disease progression and that SSc starts with immune activation, perhaps in response to an environmental trigger [73,74]. The SNPs associated with SSc would then likely be risk factors for an aberrant immune response to this trigger. Should such a model be correct, we are still left with the question of why we have different subsets that generally show little or no correlation with disease duration. The simplest explanation for this result is that patients progress through the gene expression subsets at dramatically different rates and that our measures of disease duration are currently inadequate.
Another possibility is that any given patient transitions between these intrinsic gene expression groups in a dynamic manner that we do not observe using serial skin biopsies across 6-12 month time interval. This would mean that cross-sectional studies of patients would still capture all subsets while maintaining a weak correlation to disease duration. We think this is unlikely because serial biopsies are generally found in the same subset.
The final possibility is that the subset a patient stays in, and the duration in which they remain, is dependent on many outside and as yet poorly characterized factors. These could include environmental stimuli that trigger an inflammatory response, or genetic factors that determine the rate at which one progresses through the mechanistic stages of SSc. It is possible that patients in each intrinsic subset have a different set of predisposing genetic polymorphisms or similar environmental triggers. This can only be addressed if we can look for genetic risk factors in a cohort of  Fig. 4 is densely interconnected, implicating many possible interactions between the core molecular processes (interferon activation, M2 macrophage activation, adaptive immunity, ECM remodeling, and cell proliferation). Stepping back from the granular detail of single genes, we see a system of distinct parts through which SSc could be initiated and maintained. Among these are paths of particular interest. The interferon subnetwork and the M2 macrophage subnetwork are connected by RAC2. The M2 macrophage subnetwork in turn is connected to the ECM subnetwork through paths through CD14 and THY1. Suggesting macrophages may influence or drive ECM abnormalities in skin. The interferon subnetwork and the ECM subnetwork are connected through paths containing the pleiotropic and polymorphic gene PLAUR. The M2 macrophage subnetwork is connected to the adaptive immunity subnetwork through several distinct sets of paths through the genes GRB10, LCP2, and CXCR4. The ECM subnetwork is connected to the cell proliferation cluster through TGFb pathway genes and paths containing the polymorphic genes IRAK1 and PXK, which suggests that ECM remodeling modulates cell proliferation through the TGFb pathway. The interferon node may negatively regulate proliferation via the ERK/MAPK pathway resulting in the general mutual exclusivity of the inflammatory and fibroproliferative subsets. Thus we see a set of interconnected, balancing feedback loops that can enforce subset homeostasis, but also allow for patients to transition between the subsets, possibly in response to therapy. doi:10.1371/journal.pcbi.1004005.g007 patients stratified by gene expression subset for genetic risk alleles. There may be genetic risk factors that cause a patient to ''stall'' at particular point along the progression from inflammatory to proliferative to normal-like. Genetic modifiers of the molecular links in the consensus gene network (Fig. 4) might hold the key to showing why many patients go into spontaneous remission while others experience rapid clinical progression, and indeed, our network analysis suggests candidates for explaining this (Figs. 6, 7). For example, IRAK1 and PXK are polymorphic genes that exist on paths in the network between the TGFb/ECM network and the cell proliferation network. This strongly argues for future studies that test their possible roles in TGFb-modulated cell proliferation, with particular attention to their roles in influencing other serine-threonine kinases that modulate the cell cycle.

Associations with autoantibodies
The presence of antinuclear autoantibodies in patient serum is a widely used biomarker of SSc. To date the intrinsic subsets have shown no clear association with autoantibody status [1,4,11], which is consistent with a model by which the subsets represent disease progression. Several genetic polymorphisms are associated with autoantibody status (S9 Data file), including BLK and BANK1, which are related to ACA-and ATA-positive SSc respectively. These B cell proteins are already attractive candidates for autoantibody production, as they are directly associated with the cells that produce the antibodies, but our network analysis also shows that they are functionally related to adaptive immune genes that are highly expressed in the inflammatory subset.

Conclusions and limitations
A primary role of bioinformatics in complex diseases is to pare down the possibilities to a coherent set of candidates for future study. The many risk alleles for SSc each have modest odds ratio and the final picture of SSc will likely lie in the interactions between various risk factors, but the number of possible interactions between these combinatorial factors is prohibitively large. It is here that the network approach may be most useful in delineating candidates for interaction studies. We might speculate, for example, that SSc results from the presence of multiple, functionally distinct alleles, but that it does not matter what gene is mutated as long as the mutation has a particular functional outcome. The predicted interactions in the network suggest which alleles might be functionally related and which might be distinct from each other, as the alleles either cluster within a subnetwork or straddle the subnetworks. This report is limited by our utilization of whole skin biopsies, which are complex mixtures of cells, and in that the studies were observational. The use of whole skin means that we cannot directly ascribe gene expression to specific cell types. For example, we infer that the M2 macrophage subnetwork is related to that cell type based on the coherent expression of monocyte markers and cytokines related to M2 polarization of macrophages. Our study is therefore hypothesis generating. Mechanistic studies will be needed to evaluate the existence of the molecular links suggested by the network analysis.
Nevertheless, our analyses place the intrinsic subsets as a possible readout of SSc pathology. The consensus gene expression of the subsets implicates a number of molecular mechanisms that have been associated with SSc and suggests functional roles for a large fraction of the replicated SSc-associated polymorphisms. We demonstrate that the core molecular processes of the inflammatory and fibroproliferative subsets are molecularly connected to each other. This suggests the possibility that SSc subsets may be dynamic and interconnected.

Ethics statement
The analysis of prospectively collected human samples in this study was approved by the Committee for the Protection of Human Subjects at Dartmouth College (CPHS#16631) and by the IRB review panel at Northwestern Feinberg School of Medicine (STU00004428). All subjects in the study provided written consent, which was approved by the IRB review panels at Dartmouth College and Northwestern Feinberg School of Medicine.

Patient information
This study used data from three previously published cohorts ( Table 1) Each of the three study cohorts contained patients with SSc defined using the 1980 ACR criteria. Specifically, all patients met the American College of Rheumatology classification criteria for SSc [75] and were further characterized as the diffuse (dSSc), or the limited (lSSc) subsets. Limited SSc patients had 3 of the 5 features of CREST syndrome, or had Raynaud's phenomenon with abnormal nail fold capillaries and scleroderma-specific autoantibodies.

Preprocessing and clustering of microarray data
All three studies used Agilent Technologies 44,000 element DNA microarrays representing the full human genome. All samples were processed and all microarrays hybridized in the Whitfield lab providing consistency between the datasets. The DNA probes between these datasets are identical and thus were indexed using the same probe identifiers allowing direct mapping from one data set to another without significant loss of data. Microarray data from each cohort were Log2 Lowess-normalized and only spots with mean fluorescent signal at least 1.5 greater than median local background in Cy3-or Cy5-channels were included in the analysis. Genes with less than 80% good data were excluded. Since a common reference experimental design was used for all cohorts, each probe was centered on its median value across all arrays. Data were multiplied by -1 to convert them to Log2(Cy3/Cy5) ratios.
The three cohorts were clustered into coexpression modules using the WGCNA procedure. We used the WGCNA R package available on the Comprehensive R Archive Network (http://cran. r-project.org) and described in [10]. We used the default parameters for running the software except that we used the ''signed'' network option and a soft thresholding parameter d = 12. These parameters are described in depth in [10,15]. Genes that were classified as outliers were discarded from further analysis.

The information graph and consensus clusters
To each pair of modules from different datasets we associate an overlap score W. Specifically, if C i is a module in, say, Milano et al. and C j a module in Pendergrass et al., then we define where N is the total number of genes in the genome. The W-scores can be interpreted as edge weights in a module-module network (the information graph). This network encodes the mutual information between the WGCNA-derived genomic partitions. We computed the W-scores between each pair of modules across all three datasets by the above formula and set the small and negative W-scores below a threshold to zero (S4 Fig.). A mathematical derivation of the relationship between the W-scores and mutual information and a detailed description of the thresholding procedure are available in the supporting information (S1 Text). The resulting 3-partite information graph was mined for consensus clusters.
Since triangles in the information graph represent a module conserved across all three datasets, we clustered the information graph using a variant of triangle percolation [17], which is a community detection procedure designed to find sets of modules that are members of many triangles together. Specifically, from the information graph we constructed an auxiliary graph, called the triangle graph, and detected communities in the triangle graph by greedy modularity maximization [76]. A description of the construction of the triangle graph is available in the supporting information (S1 Text).
We define a final consensus cluster as all of the genes that are contained in a module from the community for each of the three data sets community. Note that triangle percolation allows for overlapping communities in the underlying information graph. For example, the inflammatory consensus cluster and the keratinocyte consensus cluster overlap by one module (Fig. 3A). This is one of MICC's strengths because it does not require a whole module from one dataset to be associated with only one consensus cluster. To derive a gene set associated to the consensus cluster, we took all modules within that community, computed their unions within their dataset, and then computed their intersection across datasets. In symbols, let Comm denote a set of modules that form a community in the information graph (e.g. the dotted circles Fig. 1B

Statistical tests for subset specificity
To determine if a WGCNA-derived module was significantly differentially regulated in a subset, we performed one-tailed Wilcoxon rank sum tests. Specifically, we computed the module eigengene of each module by first normalizing the gene expression so that each gene expression vector had Euclidean length 1. The module eigengene is the first principal component of the normalized gene expression vectors within the WGCNA module. The module eigengene is a one-dimensional summary score for the module's gene expression across all biopsies. To determine if the module was significantly up-or down-regulated in a particular subset, we determined if the median of the module eigengene for that subset was above or below that of the whole population, and then performed a one-tailed Wilcoxon rank sum test to determine the significance of the median being above or below that of the population as a whole. We used the subset assignments reported in the previous papers describing these datasets [1,4,11]. We used Bonferroni corrections for multiple comparisons. There were 178 modules in total across the datasets. In Table 2, we corrected for 17863 tests for each of the subset-specificity tests. In Fig. 3B, we corrected for 17864 tests because we included tests for all nonnormal-like SSc versus normal-like SSc and healthy controls (see also S4 Data file).

The IMP Bayesian network
The IMP Bayesian network is available through an online interface at (http://imp.princeton.edu). To build our network, we queried IMP with four gene sets: inflammatory and fibroproliferative consensus genes derived from the consensus clusters, SScassociated polymorphisms (as described below), and the four gene MRSS biomarker reported in [5]. IMP provides export of the subnetwork corresponding to the query genes as a weighted edge list (a three-column table indicating which genes are connected and with what probability). IMP automatically thresholds the probabilities at 0.5 and exports the network with up to an additional 50 genes that provide extra context for the query genes. In our case, the 50 genes were predominantly cell cycle genes. This is probably because the cell cycle is heavily studied in the microarray compendium from which IMP was built. In that case, IMP would be highly confident about predicting interactions between the fibroproliferative genes and other cell cycle genes.
We developed in-house Matlab and R scripts to transform the edge list data into the Graph Exchange Format (gexf), which allows for manipulation in Gephi, an open source network visualization program [77]. Data files S7-S8 contain postprocessed networks and Data files S10-S11 provide R data and code snippets for manipulating the network programmatically.

Genetic polymorphisms for network analysis
We collected genes with SSc-associated polymorphisms from the literature and curated them according to the following criteria. We included polymorphic genes that were reported in genomewide association studies of SSc [78][79][80][81][82], from a recent study using the Immunochip platform [83] and from case-control candidate gene studies that were replicated in at least one other study [54,. This resulted in a list of 41 polymorphic genes (S6 Data file).
Supporting Information S1 Fig Consensus genes are enriched for coexpression hubs. The consensus genes reported by MICC are more correlated to their module eigengene (red density) than is typical for an arbitrary gene-eigengene correlation (blue density). Genes are compared only to their module eigengene, i.e. to the hub that the gene is closest to in the coexpression network. Note the evident shift in the red density toward 1, which is perfect correlation, indicating that consensus genes are more ''hub-like''. (TIF) S2 Fig Adjacency matrix for the triangle graph. The triangle graph is a weighted graph whose nodes are triangles in the information graph and whose edges indicate that the corresponding triangles in the information graph share and edge. (A) The weighted adjacency matrix for the triangle graph. Rows and columns of the adjacency matrix are indexed by nodes of the triangle graph (i.e. by triangles in the information graph). The rows and columns of the matrix are sorted according to community order. Note the distinct block structure of the matrix indicating that the underlying graph is highly modular. (B) The same matrix, but unweighted so that the matrix contains only 0's and 1's (blue and red cells in the matrix, respectively) indicating that the nodes are either connected or disconnected. This aids in the visualization of the community structure of the graph (block structure of the matrix), although community detection was performed on the weighted triangle graph. Edge width denotes the size of the W-score for a pair of modules. Dotted edges represent negative W-scores. The highest possible mutual information occurs when modules are perfectly conserved. The information graph is disconnected with edges denoting the mapping between conserved modules. In the intermediate case, modules break into pieces that are reassorted among each other. The information graph here has strong community structure, but is not completely disconnected. The low mutual information case occurs when the partitions labels are random with respect to each other. In this case, all edges are small and are partially cancelled by the negative edges also present in the graph. (B,C) W-scores are calculated for each pair of modules; in this case one from Milano and one from Pendergrass. (B) Most Wscores are small in absolute value (blue histogram; logarithm of density), while their distribution has a right tail of significantly large scores. We can threshold the small and negative W-scores by keeping only those scores that contribute positively to the total mutual information (red curve; x-intercept). The sum of all Wscores is the total mutual information between the Milano and Pendergrass genomic partitions (dashed blue horizontal line). (C) The W-scores are positively correlated with the size of the overlap between gene clusters, but the relationship is not perfect. The Wscore threshold is shown by a dotted blue vertical line and the overlaps that exceed the threshold are plotted in red. In particular, note that there are relatively large overlaps that fail to meet the threshold. Likewise, there are relatively small overlaps that have high W-scores. (TIF) S1 Text Additional mathematical details about the MICC method and glossary of keywords used in main text. (PDF) S1 Data file WGCNA clustered PCL file for Milano  S9 Data file Molecular network plotted in PDF (text searchable for genes). (PDF) S10 Data file R data for programmatic access to network (iGraph format). (ZIP) S11 Data file R code snippet demonstrating reading and writing graphs from R to GEXF format. (R)