Novel classification for global gene signature model for predicting severity of systemic sclerosis

Zariel I. Johnson; Jacqueline D. Jones; Angana Mukherjee; Dianxu Ren; Carol Feghali-Bostwick; Yvette P. Conley; Cecelia C. Yates

doi:10.1371/journal.pone.0199314

Abstract

Progression of systemic scleroderma (SSc), a chronic connective tissue disease that causes a fibrotic phenotype, is highly heterogeneous among patients and difficult to diagnose accurately. To meet this clinical need, we developed a novel three-layer classification model, which analyses gene expression profiles from SSc skin biopsies to diagnose SSc severity. Two SSc skin biopsy microarray datasets were obtained from Gene Expression Omnibus. The skin scores obtained from the original papers were used to further categorize the data into subgroups of low (<18) and high (≥18) severity. Data were pre-processed by normalization, background correction, centering and scaling. A two-layered cross-validation scheme was employed to objectively evaluate the performance of classification models on unobserved data. Three classification models were used: support vector machine, random forest, and naive Bayes, in combination with feature selection methods to improve performance accuracy. For both input datasets, the random forest classifier combined with the correlation-based feature selection (CFS) method, and naive Bayes combined with either CFS or the support vector machine-based recursive feature elimination method, yielded the best results. Additionally, we performed a principal component analysis to show that low and high severity groups are readily separable by gene expression signatures. Ultimately, we found that our novel classification prediction model produced global gene signatures that significantly correlated with skin scores. This study represents the first report comparing the performance of various classification prediction models for gene signatures from SSc patients, using current clinical diagnostic factors. In summary, our three-classification model system is a powerful tool for elucidating gene signatures from SSc skin biopsies and can also be used to develop a prognostic gene signature for SSc and other fibrotic disorders.

Citation: Johnson ZI, Jones JD, Mukherjee A, Ren D, Feghali-Bostwick C, Conley YP, et al. (2018) Novel classification for global gene signature model for predicting severity of systemic sclerosis. PLoS ONE 13(6): e0199314. https://doi.org/10.1371/journal.pone.0199314

Editor: Donald Gullberg, University of Bergen, NORWAY

Received: March 8, 2018; Accepted: June 5, 2018; Published: June 20, 2018

Copyright: © 2018 Johnson et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All microarray data files are available from the GEO database (accession numbers GSE9285, GSE47162).

Funding: Research reported in this publication was supported by the National Institute of Arthritis and Musculoskeletal and Skin Diseases of the National Institutes of Health under Award Number AR068317 to CCY. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This publication was also supported by the School of Nursing, University of Pittsburgh Genomics of Patient Outcomes Hub to CCY and YPC.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Systemic scleroderma (SSc) is a multifaceted disease that exhibits heterogeneity and clinical variability among patients, often complicating diagnosis and decisions regarding treatment. SSc causes skin fibrosis, systemic vascular alterations and collagen accumulation from chronic hardening and tightening of the skin and connective tissues, eventually leading to organ failure and poor prognosis [1]. One level of heterogeneity is demonstrated by the presence of two SSc disease forms that are defined based on extent of skin involvement: limited cutaneous SSc (lcSSc) and diffuse cutaneous SSc (dcSSc) [2]. Prognosis and risk of internal organ involvement differ for patients in these clinically defined disease subsets [3–5]. Moreover, genomic studies indicate that variability exists even within each disease form [6]. Further complexity in SSc stems from the fact that gender [7], race [8], and environmental toxin exposure [9] can predispose individuals to the disease. Thus, the multifactorial nature of SSc pathogenesis can impede accurate measurement of severity and estimation of progression.

Clinical progression of SSc is most often measured by skin thickening using the modified Rodnan skin score (mRSS), which entails a 17-point assessment of skin thickness on various areas of the body, culminating in a 51-point maximum scale of severity. Expression levels of several genes have been associated with skin score [10–12]. However, these investigations did not report a comprehensive analysis comparing the various methods available to identify and define a genomic signature that correlates with disease severity. Furthermore, there is evidence that mRSS does not always correlate with disease trajectory [13]. Thus, the spread of fibrotic behavior to other organs suggests a systemic pathogenic mechanism that requires a more in-depth diagnostic analysis to be used in conjunction with skin score. Therefore, the goal of this study was to assess the ability of various classification methods to identify a genetic signature that could be the basis for a diagnostic test for SSc.

In this current study, we constructed classification models with the goal of predicting SSc skin severity based on gene expression profiles and identifying marker gene sets that correlate with high or low severity patients. Ongoing and future studies will focus on the importance of genes identified herein in disease and how they relate to patient-to-patient heterogeneity. Our classification models were capable of readily parsing patients, based on genomic profile alone, into high or low severity groups as defined by mRSS and led to the identification of gene sets associated with disease progression. Thus, our model is a powerful tool not only for diagnosis but also to find previously unknown disease-related genes that warrant further investigation.

Materials and methods

Study population

The two datasets used in this study were obtained from the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/). Dataset 1 (GEO Accession: GSE9285) [14] contains microarray data from 75 biopsies from 34 individuals, including 28 patients and 6 healthy controls. SSc patients met the criteria of the American College of Rheumatology [15]. Data were generated using the Whole Human Genome Oligo Microarray G4112A platform (Agilent Technologies). Microarray data from 21 samples (5 morphea samples, 15 healthy samples, and 1 dcSSc sample) were not used because there was no associated skin score information. This left a total of 54 samples, which were included in the analysis. The 54 samples were derived from a total of 24 patients. In several cases, more than one biopsy was taken from a single patient. The skin scores and other patient information related to these samples were documented in the original paper (S1 Table).

Dataset 2 (GEO Accession: GSE47162) [16] contains microarray data from skin biopsies of 59 patients. Data were generated using Illumina HumanHT-12 V4.0 Expression Beadchips. The skin scores and other patient information were extracted from GEO sample record information (S2 Table). One sample without a skin score was excluded from analysis, resulting in a total of 58 samples. For Dataset 1, 25 samples fell into the high severity group and 29 into the low severity group. For Dataset 2, 21 samples fell into the high severity group and 37 into the low severity group.

In both datasets, a two-class classification was used to establish classification models that distinguish between the "high" and "low" severity groups. Skin scores between 18 and 51 were categorized as "high severity", and those less than 18 were categorized as "low severity". An analysis of over 900 patients indicated that an mRSS threshold between 18 and 25 was optimal for distinguishing between patients that would progress or regress [17]. We took a conservative approach and chose the lower limit of this range as our cutoff.
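The labelling rule above can be written as a short function. This is only an illustrative sketch: the function name `severity_group` and the example scores are hypothetical and do not come from the original analysis; only the cutoff of 18 and the 0 to 51 range are taken from the text.

```python
def severity_group(mrss, cutoff=18):
    """Map a modified Rodnan skin score (0-51) to a severity label.

    Scores at or above the cutoff are 'high'; scores below it are 'low'.
    """
    if not 0 <= mrss <= 51:
        raise ValueError("mRSS must lie between 0 and 51")
    return "high" if mrss >= cutoff else "low"


# Hypothetical example scores, for illustration only.
example_scores = [5, 17, 18, 32]
example_groups = [severity_group(s) for s in example_scores]
```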

Statistical analyses

For Dataset 1, microarrays were visually inspected for defects or technical artifacts, and poor-quality spots were manually excluded in the original study. Spots with fluorescent signals less than two-fold of local background in both the Cy3 and Cy5 channels were excluded. Following this pre-processing step, 28,495 probes with at least 80% of their data points present were included in downstream analysis. Background correction and quantile normalization were performed using the limma package [18]. The data were represented as log2 of the Cy5/Cy3 ratios. Data for each probe were centered by subtracting the mean of all expression values across the arrays from each expression value, and scaled by dividing the centered value by the standard deviation of values across the arrays. For Dataset 2, quality control was performed by filtering probes using the detection P-value. The detection P-value represents the confidence that a given transcript is expressed above the background level calculated based on negative control probes. We adopted a P-value cutoff of 0.01 as suggested by a quality control manual for this type of microarray. Probe sets were retained if at least 20% of samples had a detection P-value less than 0.01. Quantile normalization was performed across arrays. The data were then log2 transformed, followed by centering and scaling procedures as described for Dataset 1. In total, 9266 features were included for further analysis.
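As a rough illustration of the per-probe centering and scaling and of the detection P-value filter described above, the following sketch uses only the Python standard library. The function names are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def center_and_scale(values):
    """Center one probe's values on its mean and scale by its standard deviation.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


def passes_detection_filter(pvalues, alpha=0.01, min_fraction=0.20):
    """Keep a probe if at least min_fraction of samples have detection P < alpha."""
    detected = sum(1 for p in pvalues if p < alpha)
    return detected / len(pvalues) >= min_fraction


# Toy values for illustration only.
scaled = center_and_scale([1.0, 2.0, 3.0])
```

After this transformation every probe has mean 0 and unit standard deviation across arrays, so probes with large raw dynamic range do not dominate downstream distance-based methods.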

Cross-validation design

We used a five-fold cross-validation design to compare the performance of the various classification methods. To find the best parameters for the model, we performed another “nested” leave-one-out (LOO) cross-validation on the five original training sets. We identified performance information for each combination of classification method and parameter set and selected the optimal combination to be used for the testing set. This methodology has been described and aims to limit the effects of overfitting the model while objectively evaluating model performance [19].
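The structure of this design can be sketched as index bookkeeping: an outer five-fold split for performance estimation, and a leave-one-out loop inside each outer training set for parameter selection. The sketch below is an assumption-laden illustration, not the authors' code: the function names are hypothetical, the folds are formed by simple shuffling without stratification by severity group, and no model fitting is shown.

```python
import random


def make_folds(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def nested_cv_splits(n_samples, n_folds=5, seed=0):
    """Build outer train/test splits, each with inner leave-one-out splits.

    The inner splits are drawn only from the outer training set, so the
    outer test fold never influences parameter selection.
    """
    folds = make_folds(n_samples, n_folds, seed)
    splits = []
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        inner = [([t for t in train if t != held_out], held_out)
                 for held_out in train]
        splits.append({"train": train, "test": test, "inner_loo": inner})
    return splits


# 54 is the number of samples retained from Dataset 1.
splits = nested_cv_splits(54)
```

In use, each candidate parameter set would be scored on the `inner_loo` splits, the best one refitted on the full outer `train` indices, and the resulting model evaluated once on `test`.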

Classification

Three classification methods were applied to each of the two datasets: support vector machine (SVM), random forest (RF) and naive Bayes (NB). We used the SVM method implemented in the e1071 package in R and selected the RBF (radial basis function) kernel, which has been shown to produce strong classification performance in previous studies [19]. We employed the RF implementation in the R package randomForest [20]. Three important parameters need to be determined: mtry (the number of input variables tried at each split), ntree (the number of trees to grow for each forest) and nodesize (the minimum size of the terminal nodes). We considered parameter configurations of mtry = {0.5,1,2}, ntree = {500,1000,2000} and nodesize = 1, as recommended in a previous study [21]. The best-performing parameters were selected by nested cross-validation, as described in the previous section. The NB classifier was implemented using the klaR package [22] in R. The parameter fL (factor for Laplace correction) was left at its default (no correction), and usekernel (a logical value indicating whether a kernel density estimate is used for density estimation) was optimized by nested cross-validation. Classification model performance was evaluated using three classification metrics: accuracy, sensitivity and specificity. The Matthews Correlation Coefficient (MCC) was also used to quantify the balance between sensitivity and specificity.
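For reference, the four performance measures can be computed from the counts of a two-class confusion matrix. The sketch below uses the standard definitions; the function name is hypothetical, "high severity" is assumed to be the positive class, and both classes are assumed to be present in the test data.

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts.

    tp/fn count positive (assumed: high severity) samples predicted
    correctly/incorrectly; tn/fp do the same for negative samples.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "mcc": mcc}


# Toy counts matching the Dataset 1 group sizes, all classified correctly.
perfect = classification_metrics(tp=25, tn=29, fp=0, fn=0)
```

MCC ranges from -1 to 1 and, unlike accuracy, remains informative when the two groups are of unequal size, as in Dataset 2.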

Feature (gene) selection methods

In feature selection, a small subset of features is identified that, together with the classification methods, is most effective in distinguishing samples belonging to different groups. There are three broad categories of selection methods: filter, wrapper, and embedded methods. Filter methods rank the features or genes regardless of the model. They select the most significant features or genes based on a univariate measure. Wrapper methods evaluate subsets that are optimal with respect to a subset evaluator such as a classifier. Embedded methods incorporate the search algorithm into the classifier. In this study, two filter methods and two wrapper methods were applied. The Chi-squared method is a filter method that evaluates features or genes individually by measuring their chi-squared statistic with respect to the classes. The correlation-based feature selection (CFS) method is also a filter method, which measures the correlation between attributes and recognizes those feature subsets in which each feature is highly correlated with the class but uncorrelated with other subset features. The SVM-based recursive feature elimination method (SVM-RFE) is a wrapper method used in microarray data analysis. It eliminates unessential genes and selects better and more compact gene subsets. The random forest-based backward feature elimination method (RFVS) is another wrapper method that constructs RFs in an iterative manner. Upon each iteration, RFVS builds a random forest after discarding genes with the smallest importance values in the last iteration. The returned subset of genes is the one with the smallest out-of-bag error [23].
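The idea behind CFS, rewarding correlation with the class while penalizing redundancy among features, is commonly expressed through a subset "merit" score of the form k·r_cf / sqrt(k + k(k−1)·r_ff), where k is the subset size, r_cf the mean feature-class correlation and r_ff the mean feature-feature correlation. The sketch below computes that score with absolute Pearson correlations. Both the choice of Pearson correlation and the function names are assumptions for illustration; the text does not specify which CFS implementation or correlation measure was used.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cfs_merit(features, labels):
    """CFS-style merit of a feature subset (higher is better).

    features: list of feature vectors, one per gene; labels: 0/1 class codes.
    """
    k = len(features)
    r_cf = mean(abs(pearson(f, labels)) for f in features)
    pairs = [abs(pearson(features[i], features[j]))
             for i in range(k) for j in range(i + 1, k)]
    r_ff = mean(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Toy data: one informative gene, a near-copy of it, and an unrelated gene.
labels = [0, 0, 1, 1]
informative = [1.0, 2.0, 8.0, 9.0]
redundant = [1.1, 2.1, 8.1, 9.1]
unrelated = [1.0, -1.0, -1.0, 1.0]
```

On the toy data, adding the redundant gene leaves the merit unchanged and adding the unrelated gene lowers it, which is the behaviour that steers CFS toward compact, non-redundant subsets.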

Principal component analysis

Principal component analysis (PCA) was performed using the function PCA() in the R (version 3.2.2) package FactoMineR (version 1.31.4). Configuration: scale.unit = TRUE, ncp = 5, ind.sup = NULL, quanti.sup = NULL, quali.sup = NULL, row.w = NULL, col.w = NULL, graph = TRUE, axes = c(1,2) [24].

The component loadings used in PCA are the correlation coefficients between the variables (rows) and factors (columns), and the loading values of the genes for PC1 and PC2 indicate the "weights" of those genes. The higher the absolute loading value, the more weight a gene carries in composing the PC. Background correction and quantile normalization were performed using the limma package [18]. Data were loaded into MeV as a tab-delimited text file of log2-transformed Cy5/Cy3 ratios. For PCA, missing data were first estimated using K-nearest neighbors (KNN) imputation with N = 4.
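To make the imputation step concrete, the following is a generic sketch of KNN imputation over genes with four neighbours. It is not MeV's implementation, whose internal details are not given in the text: the function name, the use of `None` for missing values, and the root-mean-square distance over co-observed arrays are all assumptions made for illustration.

```python
from math import sqrt


def knn_impute(rows, k=4):
    """Fill missing entries (None) in a genes-by-arrays matrix.

    Each missing value is replaced by the mean of that array's values in the
    k genes closest to the incomplete gene, with distance measured over the
    arrays observed in both genes. Assumes at least one other gene has a
    value in the missing position.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return float("inf")
        return sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = [(distance(row, other), other[j])
                          for m, other in enumerate(rows)
                          if m != i and other[j] is not None]
            candidates.sort(key=lambda c: c[0])
            nearest = [v for _, v in candidates[:k]]
            imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Toy matrix: the first gene is missing its third value.
toy = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [0.9, 1.9, 2.8],
    [1.0, 2.0, 3.0],
    [10.0, 20.0, 30.0],
]
filled = knn_impute(toy, k=4)
```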

Construction of heat maps

Heat maps showing gene expression for the patients by severity group were constructed using normalized and scaled expression levels. Quantile normalization was performed on the raw microarray data, and the normalized expression levels were then Log2 transformed, centered (on 0), and scaled to the same range across genes, so that each gene is evaluated with equal weight.

Ingenuity Pathway Analysis

Ingenuity Pathway Analysis (Qiagen) (Version 43605062) was used to perform Canonical Pathway Analysis and Upstream Regulator Prediction. Analysis was performed on four lists of microarray probe IDs: identified by CFS classifier in Dataset 1 and differentially expressed, identified by SVM-RFE classifier in Dataset 1 and differentially expressed, identified by CFS classifier in Dataset 2 and differentially expressed, and identified by SVM-RFE classifier in Dataset 2 and differentially expressed. Differential expression was considered statistically significant if the t-test p-value was < 0.05 following Bonferroni correction based on the number of probe IDs identified for each classifier-dataset combination. The Bonferroni correction was applied to limit false positives arising from multiple hypothesis testing. Probe IDs that were successfully mapped by IPA were included for analyses. The direction of expression was positive for probes with higher expression in high severity patients and negative for probes with higher expression in low severity patients. For each dataset, the appropriate reference probe set was based on the microarray used in the original study.
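The Bonferroni step amounts to multiplying each raw p-value by the number of tests, here the number of probe IDs in a given classifier-dataset list, and comparing the result with 0.05. A minimal sketch follows; the function name and the example p-values are hypothetical.

```python
def bonferroni_significant(pvalues, alpha=0.05):
    """Return adjusted p-values and significance flags after Bonferroni correction.

    Each raw p-value is multiplied by the number of tests and capped at 1.
    """
    m = len(pvalues)
    adjusted = [min(p * m, 1.0) for p in pvalues]
    flags = [p_adj < alpha for p_adj in adjusted]
    return adjusted, flags


# Hypothetical raw t-test p-values for three probes.
adjusted, flags = bonferroni_significant([0.0001, 0.01, 0.04])
```

With three tests, a raw p-value of 0.04 becomes 0.12 and is no longer significant, which shows how the threshold tightens as the probe list grows.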

Results

Classification and accuracy performances were improved by applying feature selection methods

Several studies, including retrospective cohort analyses and prospective clinical trials, have shown that the severity of skin sclerosis, as assessed by the modified Rodnan skin thickness score (mRSS), is predictive of disease outcome [25]. However, only a few studies have attempted to map the clinical covariates of mRSS to unique gene expression signatures. To further explore this, we used microarray data from two SSc patient studies (Dataset 1 and Dataset 2) from the NCBI GEO Database (patient data shown in Table 1 and S1 and S2 Tables). We categorized microarray profiles as either low (mRSS 0–17) or high (mRSS 18–51) severity.

Table 1. Summary of patient information for microarray biopsy samples used in models.

https://doi.org/10.1371/journal.pone.0199314.t001

We first tested various two-class classification models to determine which could best distinguish between samples designated as low or high severity. The accuracy, sensitivity, specificity, and Matthews correlation coefficient (MCC) were determined for each of the classification models (Table 2). MCC values demonstrate the contribution from each of the randomly formed gene sets. An MCC value of 1 indicates a perfect prediction [26]. In Dataset 1, the RF classifier combined with the CFS feature selection method showed an MCC value of 0.96 (Table 2). NB combined with either CFS or SVM-RFE also showed an MCC value of 0.96 in both cases (Table 2). In Dataset 2, RF combined with CFS had an MCC value of 1.00. NB combined with either CFS or SVM-RFE also produced an MCC value of 1.00. Thus, higher MCC values were obtained in cases where either the CFS or the SVM-RFE selection method was employed. Therefore, we chose to focus primarily on the results from these two feature selection methods for the remainder of the investigation.

Table 2. Performance evaluation of various classifier and feature selection methods.

https://doi.org/10.1371/journal.pone.0199314.t002

The CFS selection method identified 84 probe IDs in Dataset 1 and 89 probe IDs in Dataset 2 as being differentially expressed between low and high severity groups. Likewise, the SVM-RFE method identified 450 probe IDs in Dataset 1 and 50 probe IDs in Dataset 2 (Table 3). Specific genes and potential interactions were explored using Ingenuity Pathway Analysis.

Table 3. Number of features (microarray probe IDs) selected for each dataset and feature selection method.

https://doi.org/10.1371/journal.pone.0199314.t003

Patient severity classification groups are readily separable by gene expression signatures

We performed principal component analysis (PCA) using all targets that passed data cleaning and filtering to visually interpret the variability within and between severity groups. PCA based on all four feature selection methods showed clear separation between severity groups based on gene signature for both datasets (Fig 1 and S1 Fig). The PCA based on CFS for Dataset 1 showed 24.90% of variance in PC1 and 7.93% of variance in PC2 (Fig 1A). For all feature selection methods and both datasets, PC1 explained much more of the variability than PC2, indicating that the first principal component explained most of the differences between the expression profiles of the high and low severity patients. For the PCA performed on genes from the Chi-squared feature selection method, the microarray probe IDs with the ten highest absolute loading values for PC1 are given in Table 4. Each of these probe IDs was found to be statistically differentially expressed between low and high severity groups, based on Bonferroni-adjusted t-test p-value < 0.05. The most highly weighted probe IDs for Datasets 1 and 2 corresponded to genes for MAGI1 and SOX18, respectively, with positive loading values being associated with the more severe disease phenotype in both cases.

Fig 1. Principal component analyses (PCA) of gene expression separation between low and high severity groups.

Results based on CFS feature selection method are shown in A (Dataset 1) and B (Dataset 2). Results based on SVM-RFE feature selection method are shown in C (Dataset 1) and D (Dataset 2).

https://doi.org/10.1371/journal.pone.0199314.g001

Table 4. Microarray probe IDs associated with the top 10 highest absolute value of loading values for principal component 1 based on principal component analysis of genes identified by Chi-squared feature selection.

https://doi.org/10.1371/journal.pone.0199314.t004

In addition, the Log2 normalized expression values were used to create heat maps for each Dataset. Heat maps were generated to show probe IDs from Dataset 1 (Fig 2) and Dataset 2 (Fig 3) that were identified by CFS feature selection and highlight differences in gene expression profiles between the severity groups. As expected, expression profiles associated with these probe IDs can be separated by severity index when inspected visually. Similar results are shown in heat maps for probe IDs from Dataset 1 (S2 Fig) and 2 (S3 Fig) that were identified by SVM-RFE.

Fig 2. Heat map showing Log2 normalized expression values for patient samples from Dataset 1 for probe IDs identified by CFS feature selection method.

https://doi.org/10.1371/journal.pone.0199314.g002

Fig 3. Heat map showing Log2 normalized expression values for patient samples from Dataset 2 for probe IDs identified by CFS feature selection method.

https://doi.org/10.1371/journal.pone.0199314.g003

Pathway enrichment analysis identified a putative signaling network that included genes related to SSc severity

Based on the high performance of our models, we hypothesized that the genes identified by the models and differentially expressed between severity groups may have biological significance. To further explore the relationship between these groups of genes, we used Ingenuity Pathway Analysis (IPA) to analyze the lists of genes that were both identified by the classifier and that showed differential expression between severity groups (Bonferroni-corrected t-test p-value < 0.05). Fig 4 depicts the numbers of Probe IDs that were identified by the classifier, were statistically significant, and mapped to Gene IDs for each classifier and dataset combination. We first investigated whether our gene lists were overrepresented in known Canonical Pathways. The top three Canonical Pathways (ranked by Benjamini-Hochberg p-value) are shown in S3 Table. Our analysis failed to find any pathways that showed statistically meaningful overrepresentation of our input genes. We then used the Upstream Regulator Analysis in IPA to see whether a common upstream regulator could explain the differences that we saw in gene expression between low and high severity patient samples. The upstream regulator that showed the most predictive value, based on a combination of activation Z-score of 2.159 and p-value of overlap of 2.25E-03 was oncostatin M (OSM). Activation of OSM was predicted to regulate five of the genes with significant expression differences identified in Dataset 2 using the CFS classifier when comparing low and high severity patients (Table 5 and Fig 5).

Fig 4. Schematic representation of pipeline for choosing genes that were included in Ingenuity Pathway Analysis.

Red font indicates numbers of genes more highly expressed by high severity patients; blue font indicates numbers of genes more highly expressed by low severity patients.

https://doi.org/10.1371/journal.pone.0199314.g004

Fig 5. Predicted signaling network between OSM and downstream genes related to SSc severity.

Red shading of gene indicates upregulation in dataset compared to low severity patients, green shading downregulation, and intensity of color depicts strength of regulation. Relationships between genes that are predicted based on literature are indicated by lines connecting genes, with red symbolizing predicted upregulation and blue predicted downregulation.

https://doi.org/10.1371/journal.pone.0199314.g005

Table 5. Results of IPA-based upstream regulator analysis showing potential role for OSM in regulating genes identified by CFS-based classification of Dataset 2 and differentially expressed between low and high severity patients.

https://doi.org/10.1371/journal.pone.0199314.t005

Discussion

This study is one of the first to describe an independently validated classification prediction model for gene signatures in SSc using current clinical diagnostic factors. We applied three well-established classification methods, SVM, RF, and NB, coupled with four different feature selection methods, to clinically derived microarray data and compared the relative efficiency of the models. Interestingly, RF and CFS showed high MCC values of 0.96 and 1.00 for the two independent datasets that we used, suggesting an efficient diagnostic two-layered system for SSc. This two-layered model system is capable of reducing the dimensionality of large independent data to find a more useful subset for further exploration. Moreover, due to the heterogeneity of patients with SSc, a prediction model that can deduce disease severity in an unsupervised manner is essential. To get an idea of how the gene lists generated by our classifiers compared to conventionally derived differentially expressed genes, we also performed a formal differential gene expression analysis comparing low and high severity patients from each dataset (data not shown), which identified many probe IDs that differed significantly between groups. While some genes identified by the feature selection methods overlapped with this list of conventionally derived differentially expressed genes, in each case our classifiers also identified several genes that were not found with the conventional differential gene analysis. This highlights the value of our classifiers in finding genes that can be useful for classification purposes but may not be identified by conventional methods.

Further validating the efficiency of the two-step model system, principal component analysis showed a clear separation of patients with high and low skin scores when subjected to the model. This clustering of datasets further substantiates the efficiency of the model system to reduce complexity and distinguish between high and low severity gene profiles. After performing the PCA, we examined which microarray probe IDs were associated with the highest loading values to gain a sense of which were the most important in separating the severity groups. The most highly weighted probe IDs were associated with a broad array of biological processes. In Dataset 1, Tetraspanin 8 showed one of the highest loading values. Tetraspanins have previously been shown to alter fibrosis through the regulation of epithelial cell-basement membrane interaction [27]. For Dataset 2, we found that the gene Signal Transducer and Activator of Transcription 3 (STAT3) was one of those with the highest loading value. STAT3 has been extensively associated with dermal fibrosis and is critical in the pro-fibrotic effects of TGF-β signaling [28,29]. Our analysis suggests that the other highly weighted genes are key in distinguishing low and high severity patients and should be carefully considered for further investigation. Based on the results of the CFS feature selection method, heat maps showing Log2 normalized expression for both datasets showed a clear distinction between patients with high and low skin scores, signifying the presence of two separately clustered gene profiles.

Our model assigned patients into high and low severity groups using skin score, the standard diagnostic measurement for SSc. This bolsters the efficiency and effectiveness of our model system to determine outcome, which should help decrease the lack of specificity that currently hampers clinical treatment. The analytical pipeline reported herein highlights the potential applicability of an unsupervised, tiered correlation method, whereby disease severity can be classified based upon genetic signature alone. Furthermore, the novel gene sets that have been identified as correlating with disease severity may be further investigated to provide insight into the molecular mechanisms underlying a clinical trait of SSc, a complex fibrotic disease. Our methodology can be used to stratify patients to assess response to therapy and to guide appropriate recruitment to clinical trials.

To evaluate the functional relevance of these clusters in the context of disease, we used various tools within Ingenuity Pathway Analysis to explore interactions between genes that were both identified by our classification methods and showed differential expression between low and high severity groups. Our analysis did not return any pathways that were significantly enriched by the genes in our lists, suggesting that the genes identified here do not cluster within known biological pathways. This result indicates that our classification methods are capable of identifying genes that, when taken together, can accurately be used to predict disease severity although they may not participate in the same pathways in vivo. Additionally, it is possible that these genes may interact in pathways that are not yet known. However, our analysis of potential upstream regulators did identify one small network, controlled by the cytokine oncostatin M (OSM), that could drive relatively high expression of P-selectin (SELP), cholesterol 25-hydroxylase (CH25H), C-C motif chemokine ligand 2 (CCL2), and angiopoietin 2 (ANGPT2) and low expression of heat shock protein family B member 3 (HSPB3) in higher severity SSc patients. Evidence in the literature shows that OSM is capable of regulating these genes, with directionality suggesting that OSM positively regulates the genes associated with the more severe skin scores [30–32]. Studies have shown that levels of OSM, an IL-6 family cytokine, are associated with SSc and that OSM can modulate production of several extracellular matrix components important in fibrosis [33–36]. Relevant to our study, serum levels of soluble angiopoietin-1, P-selectin, and CCL2 correlate with increasing severity of SSc and worsening clinical features [37–40]. Therefore, our data suggest that OSM may influence the expression of additional genes associated with severe disease.

We observed some differences in the probe IDs identified by our classification methods and in subsequent downstream analyses. For the two datasets, the CFS method identified a similar number of probe IDs that were used for classification (84 and 85 for Datasets 1 and 2, respectively). On the other hand, the SVM-RFE method identified 450 probe IDs for Dataset 1 and 50 for Dataset 2. We believe that these differences stem from the fact that CFS and SVM-RFE use intrinsically different methods for feature selection. The CFS method is a filter method that gives priority to genes whose expression is highly correlated with a class. SVM-RFE is a wrapper method that discards genes that have only a small impact on classification. Interestingly, when only the genes that have significantly different expression levels between severity groups are considered, the numbers of genes identified by the classifiers are more similar. This indicates that the majority of the 450 probe IDs identified when the SVM-RFE method was applied to Dataset 1 were useful for classification, despite not being considered differentially expressed between groups.

Our novel three-classification model system of SVM, RF, and NB is an efficient and powerful tool for developing gene signatures from SSc skin biopsies and can also be used to develop a prognostic gene signature for SSc. Our models achieved high accuracy, specificity, and sensitivity, demonstrating their potential use as diagnostic indicators. Furthermore, our model system can be used to model other fibrotic disorders by substituting different phenotypes in the training cohort. Ongoing and future studies using alternative methodologies will investigate specific genes identified by the classifiers to determine how they may relate to disease pathogenesis and progression.
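For readers who wish to reproduce the performance measures named above, the following minimal Python sketch shows how accuracy, sensitivity, and specificity follow from true and predicted severity labels. The cut-off of 18 is the skin-score threshold stated in the abstract; treating the high severity group as the positive class, the function names, and the example scores and predictions are assumptions made for illustration.

```python
def severity_label(skin_score, threshold=18):
    """Binarise a skin score: 'high' if >= threshold, otherwise 'low'."""
    return "high" if skin_score >= threshold else "low"


def classification_metrics(y_true, y_pred, positive="high"):
    """Accuracy, sensitivity, and specificity for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Undefined when a class is absent from the evaluation set.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
    }


# Hypothetical skin scores and classifier predictions, for illustration only.
scores = [10, 25, 18, 5, 30, 17]
y_true = [severity_label(s) for s in scores]
y_pred = ["low", "high", "low", "low", "high", "high"]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```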

Supporting information

S1 Table. Patient information for samples used from Dataset 1.

B: Back biopsy, FA: Forearm biopsy, F: Female, M: Male, W: White, A: Asian, AA: African American, H: Hispanic. Shading indicates individual patients.

https://doi.org/10.1371/journal.pone.0199314.s001

(DOCX)

S2 Table. Patient information for samples used from Dataset 2.

F: Female, M: Male, W: White, A: Asian, AA: African American, H: Hispanic.

https://doi.org/10.1371/journal.pone.0199314.s002

(DOCX)

S3 Table. Canonical pathways identified by Ingenuity Pathway Analysis.

Probe IDs included in analysis were both identified by the classification method and significantly differentially expressed between low and high severity groups (Bonferroni-corrected t-test p-value < 0.05). CFS: correlation-based feature selection method, SVM-RFE: SVM-based recursive feature elimination method. Ratio: number of genes from input list that occur in pathway. B-H p-value: Benjamini-Hochberg adjusted p-value of right-tailed Fisher’s Exact Test.

https://doi.org/10.1371/journal.pone.0199314.s003

(DOCX)

S1 Fig. Principal component analysis (PCA) of gene expression separation between low and high severity groups.

Results based on Chi-squared feature selection method are shown in A (Dataset 1) and B (Dataset 2). Results based on RVFS feature selection method are shown in C (Dataset 1) and D (Dataset 2).

https://doi.org/10.1371/journal.pone.0199314.s004

(TIF)

S2 Fig. Heat map showing Log2 normalized expression values for patient samples from Dataset 1 for probe IDs identified by SVM-RFE feature selection method.

https://doi.org/10.1371/journal.pone.0199314.s005

(TIF)

S3 Fig. Heat map showing Log2 normalized expression values for patient samples from Dataset 2 for probe IDs identified by SVM-RFE feature selection method.

https://doi.org/10.1371/journal.pone.0199314.s006

(TIF)

Acknowledgments

The authors thank AccuraScience LLC, Johnson, IA for providing technical assistance with bioinformatics analysis of microarray data for the study.

References

1. Steen VD, Medsger TA. Changes in causes of death in systemic sclerosis, 1972–2002. Ann Rheum Dis. 2007/03/03. 2007;66: 940–944. pmid:17329309
2. Johnson SR. Progress in the clinical classification of systemic sclerosis. Curr Opin Rheumatol. 2017/09/30. 2017;29: 568–573. pmid:28961157
3. Steen VD. The many faces of scleroderma. Rheum Dis Clin North Am. 2008;34: 1–15; v. pmid:18329529
4. Mathai SC, Hassoun PM. Pulmonary arterial hypertension associated with systemic sclerosis. Expert Rev Respir Med. 2011/04/23. 2011;5: 267–279. pmid:21510736
5. Steen V. Predictors of end stage lung disease in systemic sclerosis. Ann Rheum Dis. 2003/01/15. 2003;62: 97–99. pmid:12525376
6. Sargent JL, Whitfield ML. Capturing the heterogeneity in systemic sclerosis with genome-wide expression profiling. Expert Rev Clin Immunol. NIH Public Access; 2011;7: 463–73. pmid:21790289
7. Krzyszczak ME, Li Y, Ross SJ, Ceribelli A, Chan EK, Bubb MR, et al. Gender and ethnicity differences in the prevalence of scleroderma-related autoantibodies. Clin Rheumatol. 2011/04/28. 2011;30: 1333–1339. pmid:21523365
8. Reveille JD. Ethnicity and race and systemic sclerosis: how it affects susceptibility, severity, antibody genetics, and clinical manifestations. Curr Rheumatol Rep. 2003/03/12. 2003;5: 160–167. pmid:12628048
9. Dospinescu P, Jones GT, Basu N. Environmental risk factors in systemic sclerosis. Curr Opin Rheumatol. 2013/01/05. 2013;25: 179–183. pmid:23287382
10. Hesselstrand R, Andreasson K, Wuttge DM, Bozovic G, Scheja A, Saxne T. Increased serum COMP predicts mortality in SSc: results from a longitudinal study of interstitial lung disease. Rheumatology. 2012;51: 915–920. pmid:22253028
11. Lakota K, Wei J, Carns M, Hinchcliff M, Lee J, Whitfield ML, et al. Levels of adiponectin, a marker for PPAR-gamma activity, correlate with skin fibrosis in systemic sclerosis: potential utility as a biomarker? Arthritis Res Ther. 2012;14: R102. pmid:22548780
12. Martin P, Teodoro WR, Velosa APP, de Morais J, Carrasco S, Christmann RB, et al. Abnormal collagen V deposition in dermis correlates with skin thickening and disease activity in systemic sclerosis. Autoimmun Rev. 2012;11: 827–835. pmid:22406224
13. Shand L, Lunt M, Nihtyanova S, Hoseini M, Silman A, Black CM, et al. Relationship between change in skin score and disease outcome in diffuse cutaneous systemic sclerosis: application of a latent linear trajectory model. Arthritis Rheum. 2007/06/30. 2007;56: 2422–2431. pmid:17599771
14. Milano A, Pendergrass SA, Sargent JL, George LK, McCalmont TH, Connolly MK, et al. Molecular subsets in the gene expression signatures of scleroderma skin. PLoS One. 2008/07/24. 2008;3: e2696. pmid:18648520
15. Preliminary criteria for the classification of systemic sclerosis (scleroderma). Subcommittee for scleroderma criteria of the American Rheumatism Association Diagnostic and Therapeutic Criteria Committee. Arthritis Rheum. 1980;23: 581–590. pmid:7378088
16. Assassi S, Wu M, Tan FK, Chang J, Graham TA, Furst DE, et al. Skin gene expression correlates of severity of interstitial lung disease in systemic sclerosis. Arthritis Rheum. 2013;65: 2917–27. pmid:23897225
17. Dobrota R, Maurer B, Graf N, Jordan S, Mihai C, Kowal-Bielecka O, et al. Prediction of improvement in skin fibrosis in diffuse cutaneous systemic sclerosis: a EUSTAR analysis. Ann Rheum Dis. 2016;75: 1743–1748. pmid:27016052
18. Diboun I, Wernisch L, Orengo CA, Koltzenburg M. Microarray analysis after RNA amplification can detect pronounced differences in gene expression using limma. BMC Genomics. 2006/10/13. 2006;7: 252. pmid:17029630
19. Liu W, Meng X, Xu Q, Flower DR, Li T. Quantitative prediction of mouse class I MHC peptide binding affinity using support vector machine regression (SVR) models. BMC Bioinformatics. 2006/04/04. 2006;7: 182. pmid:16579851
20. Liaw A, Wiener M. Classification and Regression by randomForest. R News. 2002;2.
21. Diaz-Uriarte R, Alvarez de Andres S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006/01/10. 2006;7: 3. pmid:16398926
22. Weihs C, Ligges U, Luebke K, Raabe N. klaR Analyzing German Business Cycles. Data Analysis and Decision Support. Berlin/Heidelberg: Springer-Verlag; 2005. pp. 335–343. https://doi.org/10.1007/3-540-28397-8_36
23. Statnikov A, Wang L, Aliferis CF. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics. 2008/07/24. 2008;9: 319. pmid:18647401
24. Lê S, Josse J, Husson F. FactoMineR: An R Package for Multivariate Analysis. J Stat Softw. 2008;25: 1–18.
25. Kumanovics G, Pentek M, Bae S, Opris D, Khanna D, Furst DE, et al. Assessment of skin involvement in systemic sclerosis. Rheumatol. 2017/10/11. 2017;56: v53–v66. pmid:28992173
26. Zhang H, Wang H, Dai Z, Chen M, Yuan Z. Improving accuracy for cancer classification with a new algorithm for genes selection. BMC Bioinformatics. 2012;13: 298. pmid:23148517
27. Tsujino K, Takeda Y, Arai T, Shintani Y, Inagaki R, Saiga H, et al. Tetraspanin CD151 protects against pulmonary fibrosis by maintaining epithelial integrity. Am J Respir Crit Care Med. 2012;186: 170–80. pmid:22592804
28. Chakraborty D, Šumová B, Mallano T, Chen C-W, Distler A, Bergmann C, et al. Activation of STAT3 integrates common profibrotic pathways to promote fibroblast activation and tissue fibrosis. Nat Commun. 2017;8: 1130. pmid:29066712
29. Pedroza M, To S, Assassi S, Wu M, Tweardy D, Agarwal SK. Role of STAT3 in skin fibrosis and transforming growth factor beta signalling. Rheumatology. 2017; pmid:29029263
30. Yao L, Pan J, Setiadi H, Patel KD, McEver RP. Interleukin 4 or oncostatin M induces a prolonged increase in P-selectin mRNA and protein in human endothelial cells. J Exp Med. 1996;184: 81–92. Available: http://www.ncbi.nlm.nih.gov/pubmed/8691152 pmid:8691152
31. Gazel A, Rosdy M, Bertino B, Tornier C, Sahuc F, Blumenberg M. A Characteristic Subset of Psoriasis-Associated Genes Is Induced by Oncostatin-M in Reconstituted Epidermis. J Invest Dermatol. 2006;126: 2647–2657. pmid:16917497
32. Rychli K, Kaun C, Hohensinner PJ, Rega G, Pfaffenberger S, Vyskocil E, et al. The inflammatory mediator oncostatin M induces angiopoietin 2 expression in endothelial cells in vitro and in vivo. J Thromb Haemost. 2010;8: 596–604. pmid:20088942
33. Wong S, Botelho FM, Rodrigues RM, Richards CD. Oncostatin M overexpression induces matrix deposition, STAT3 activation and SMAD1 Dysregulation in lungs of fibrosis-resistant BALB/c mice. Lab Investig. 2014;94: 1003–1016. pmid:24933422
34. Nagata T, Kai H, Shibata R, Koga M, Yoshimura A, Imaizumi T. Oncostatin M, an Interleukin-6 Family Cytokine, Upregulates Matrix Metalloproteinase-9 Through the Mitogen-Activated Protein Kinase Kinase-Extracellular Signal-Regulated Kinase Pathway in Cultured Smooth Muscle Cells. Arterioscler Thromb Vasc Biol. 2003;23: 588–593. pmid:12615664
35. Bamber B, Reife RA, Haugen HS, Clegg CH. Oncostatin M stimulates excessive extracellular matrix accumulation in a transgenic mouse model of connective tissue disease. J Mol Med (Berl). 1998;76: 61–9. Available: http://www.ncbi.nlm.nih.gov/pubmed/9462869
36. Hasegawa M, Sato S, Fujimoto M, Ihn H, Kikuchi K, Takehara K. Serum levels of interleukin 6 (IL-6), oncostatin M, soluble IL-6 receptor, and soluble gp130 in patients with systemic sclerosis. J Rheumatol. 1998;25: 308–13. Available: http://www.ncbi.nlm.nih.gov/pubmed/9489824 pmid:9489824
37. Michalska-Jakubus M, Kowal-Bielecka O, Chodorowska G, Bielecki M, Krasowska D. Angiopoietins-1 and -2 are differentially expressed in the sera of patients with systemic sclerosis: high angiopoietin-2 levels are associated with greater severity and higher activity of the disease. Rheumatology. 2011;50: 746–755. pmid:21149250
38. Gerlicz Z, Dziankowska-Bartkowiak B, Dziankowska-Zaborszczyk E, Sysa-Jedrzejowska A. Disturbed Balance between Serum Levels of Receptor Tyrosine Kinases Tie-1, Tie-2 and Angiopoietins in Systemic Sclerosis. Dermatology. 2014;228: 233–239. pmid:24603462
39. Gruschwitz MS, Hornstein OP, von Den Driesch P. Correlation of soluble adhesion molecules in the peripheral blood of scleroderma patients with their in situ expression and with disease activity. Arthritis Rheum. 1995;38: 184–9. Available: http://www.ncbi.nlm.nih.gov/pubmed/7848308 pmid:7848308
40. Elhai M, Meunier M, Matucci-Cerinic M, Maurer B, Riemekasten G, Leturcq T, et al. Outcomes of patients with systemic sclerosis-associated polyarthritis and myopathy treated with tocilizumab or abatacept: a EUSTAR observational study. Ann Rheum Dis. 2013;72: 1217–1220. pmid:23253926

