Novel classification for global gene signature model for predicting severity of systemic sclerosis

  • Zariel I. Johnson,

    Roles Data curation, Formal analysis, Investigation, Methodology, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Health Promotions and Development, University of Pittsburgh School of Nursing, Pittsburgh, PA, United States of America

  • Jacqueline D. Jones,

    Roles Formal analysis, Methodology, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Biological & Environmental Sciences, Troy University, Troy, AL, United States of America

  • Angana Mukherjee,

    Roles Data curation, Software

    Affiliation Department of Biological & Environmental Sciences, Troy University, Troy, AL, United States of America

  • Dianxu Ren,

    Roles Data curation, Methodology, Writing – review & editing

    Affiliation Health and Community Systems, University of Pittsburgh School of Nursing, Pittsburgh, PA, United States of America

  • Carol Feghali-Bostwick,

    Roles Conceptualization, Investigation, Writing – original draft, Writing – review & editing

    Affiliation Department of Rheumatology & Immunology, University of South Carolina, Charleston, SC, United States of America

  • Yvette P. Conley,

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Department of Health Promotions and Development, University of Pittsburgh School of Nursing, Pittsburgh, PA, United States of America, Department of Human Genetics, University of Pittsburgh, Pittsburgh, PA, United States of America

  • Cecelia C. Yates

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Visualization, Writing – original draft, Writing – review & editing

    yatescc@upmc.edu

    Affiliations Department of Health Promotions and Development, University of Pittsburgh School of Nursing, Pittsburgh, PA, United States of America, Department of Pathology, University of Pittsburgh School of Medicine, Pittsburgh, PA, United States of America

Abstract

Progression of systemic scleroderma (SSc), a chronic connective tissue disease that causes a fibrotic phenotype, is highly heterogeneous amongst patients and difficult to accurately diagnose. To meet this clinical need, we developed a novel three-layer classification model, which analyses gene expression profiles from SSc skin biopsies to diagnose SSc severity. Two SSc skin biopsy microarray datasets were obtained from Gene Expression Omnibus. The skin scores obtained from the original papers were used to further categorize the data into subgroups of low (<18) and high (≥18) severity. Data were pre-processed by normalization, background correction, centering and scaling. A two-layered cross-validation scheme was employed to objectively evaluate the performance of classification models on unobserved data. Three classification models were used: support vector machine, random forest, and naive Bayes in combination with feature selection methods to improve performance accuracy. For both input datasets, random forest classifier combined with correlation-based feature selection (CFS) method and naive Bayes combined with CFS or support vector machine based recursive feature elimination method yielded the best results. Additionally, we performed a principal component analysis to show that low and high severity groups are readily separable by gene expression signatures. Ultimately, we found that our novel classification prediction model produced global gene signatures that significantly correlated with skin scores. This study represents the first report comparing the performance of various classification prediction models for gene signatures from SSc patients, using current clinical diagnostic factors. In summary, our three-classification model system is a powerful tool for elucidating gene signatures from SSc skin biopsies and can also be used to develop a prognostic gene signature for SSc and other fibrotic disorders.

Introduction

Systemic Scleroderma (SSc) is a multifaceted disease that exhibits heterogeneity and clinical variability among patients, often complicating diagnosis and decisions regarding treatment. SSc causes skin fibrosis, systemic vascular alterations and collagen accumulation from chronic hardening and tightening of the skin and connective tissues, eventually leading to organ failure and poor prognosis [1]. One level of heterogeneity can be demonstrated by the presence of two SSc disease forms that are defined based on extent of skin involvement: limited cutaneous SSc (lcSSc) and diffuse cutaneous SSc (dcSSc) [2]. Prognosis and risk of internal organ involvement differ for patients in these clinically-defined disease subsets [3–5]. Moreover, genomic studies indicate that variability exists even within each disease form [6]. Further complexity in SSc stems from the fact that gender [7], race [8], and environmental toxin exposure [9] can predispose individuals to the disease. Thus, the multifactorial nature of SSc pathogenesis can impede accurate measurements of severity and estimation of progression.

Clinical progression of SSc is most often measured by skin thickening using the modified Rodnan skin score (mRSS), which entails an assessment of skin thickness at 17 areas of the body, each scored from 0 to 3, for a maximum severity score of 51. Expression levels of several genes have been associated with skin score [10–12]. However, these investigations did not report the use of a comprehensive analysis comparing various methods available to identify and define a genomic signature that correlates with disease severity. Furthermore, there is evidence that mRSS does not always correlate with disease trajectory [13]. Thus, the spread of fibrotic behavior to other organs suggests a systemic pathogenic mechanism that requires a more in-depth diagnostic analysis to be used in conjunction with skin score. Therefore, the goal of this study was to assess the ability of various classification methods to identify a genetic signature that could be the basis for a diagnostic test for SSc.

In this current study, we constructed classification models with the goal of predicting SSc skin severity based on gene expression profiles and identifying marker gene sets that correlate with high or low severity patients. Ongoing and future studies will focus on the importance of genes identified herein in disease and how they relate to patient-to-patient heterogeneity. Our classification models were capable of readily parsing patients, based on genomic profile alone, into high or low severity groups as defined by mRSS and led to the identification of gene sets associated with disease progression. Thus, our model is a powerful tool not only for diagnosis but also to find previously unknown disease-related genes that warrant further investigation.

Materials and methods

Study population

The two datasets used in this study were obtained from the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/). Dataset 1 (GEO Accession: GSE9285) [14] contains microarray data from 75 biopsies from 34 individuals, including 28 patients and 6 healthy controls. SSc patients met the criteria of the American College of Rheumatology [15]. Data were generated by Whole Human Genome Oligo Microarray G4112A platform (Agilent Technologies). Microarray data from 21 samples (5 morphea samples, 15 healthy samples, and 1 dcSSc sample) were not used because there was no associated skin score information. This left a total of 54 samples, which were included in the analysis. The 54 samples were derived from a total of 24 patients. In several cases, more than one biopsy was taken from a single patient. The skin scores and other patient information related to these samples were documented in the original paper (S1 Table).

Dataset 2 (GEO Accession: GSE47162) [16] contains microarray data from 59 skin biopsies from 59 patients. Data were generated using Illumina HumanHT-12 V4.0 Expression Beadchips. The skin scores and other patient information were extracted from GEO sample record information (S2 Table). One sample without a skin score was excluded from analysis, resulting in a total of 58 samples. For Dataset 1, 25 samples fell into the high severity group and 29 into the low severity group. For Dataset 2, 21 samples fell into the high severity group and 37 into the low severity group.

In both datasets, a two-class classification was used to establish classification models that distinguish between a "high" group and a "low" group. Skin scores between 18 and 51 were categorized as "high severity", and those less than 18 were categorized as "low severity". An analysis of over 900 patients indicated that a mRSS threshold between 18 and 25 was optimal for distinguishing between patients that would progress or regress [17]. We took a conservative approach and chose the lower limit of this range as our cutoff.
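The grouping rule above can be sketched as a small helper. This is only an illustration in Python: the names `severity_group`, `MRSS_CUTOFF` and `MRSS_MAX` and the example scores are hypothetical and are not taken from the original analysis.

```python
MRSS_CUTOFF = 18  # lower limit of the 18-25 range reported in [17]
MRSS_MAX = 51     # maximum modified Rodnan skin score


def severity_group(mrss):
    """Return "high" for skin scores of 18-51 and "low" for scores below 18."""
    if not 0 <= mrss <= MRSS_MAX:
        raise ValueError(f"mRSS must be between 0 and {MRSS_MAX}, got {mrss}")
    return "high" if mrss >= MRSS_CUTOFF else "low"


# Hypothetical skin scores, for illustration only
scores = [4, 17, 18, 35]
groups = [severity_group(s) for s in scores]
```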

Statistical analyses

For Dataset 1, microarrays were visually inspected for defects or technical artifacts, and poor-quality spots were manually excluded in the original study. Spots with fluorescent signals less than two-fold of local background in both the Cy3 and Cy5 channels were excluded. Following this pre-processing step, 28,495 probes with at least 80% of their data points present were included in downstream analysis. Background correction and quantile normalization were performed using the limma package [18]. The data were represented as log2 of the Cy5/Cy3 ratios. Data for each probe were centered by subtracting the mean of all expression values across the arrays from each expression value, and scaled by dividing the centered value by the standard deviation of values across the arrays. For Dataset 2, quality control was performed by filtering probes using the detection P-value. The detection P-value represents the confidence that a given transcript is expressed above the background level calculated based on negative control probes. We adopted a P-value cutoff of 0.01 as suggested by a quality control manual for this type of microarray. Probe sets were filtered for a minimum of 20% of samples with a detection P-value less than 0.01. Quantile normalization was performed across arrays. The data were then log2 transformed, followed by centering and scaling procedures as described for Dataset 1. In total, 9266 features were included for further analysis.
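The centering and scaling step described above amounts to a per-probe z-score. The following Python sketch shows the arithmetic under stated assumptions: the function name `center_and_scale` and the example values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which estimator was used.

```python
import statistics


def center_and_scale(values):
    """Z-score one probe across arrays: (x - mean) / standard deviation."""
    mean = statistics.fmean(values)
    # Sample standard deviation (n - 1 denominator); this choice is an assumption.
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("probe has zero variance across arrays")
    return [(x - mean) / sd for x in values]


# Hypothetical log2 expression values for one probe across four arrays
probe = [1.2, 0.8, 1.9, 1.1]
z = center_and_scale(probe)
```

After this transformation each probe has mean 0 and unit standard deviation across arrays, so probes with large dynamic ranges do not dominate distance-based methods.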

Cross-validation design

We used a five-fold cross-validation design to compare the performance of the various classification methods. To find the best parameters for the model, we performed another “nested” leave-one-out (LOO) cross-validation on the five original training sets. We identified performance information for each combination of classification method and parameter set and selected the optimal combination to be used for the testing set. This methodology has been described and aims to limit the effects of overfitting the model while objectively evaluating model performance [19].
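The two-layer scheme above can be sketched as follows. This is a minimal illustration in Python, not the code used in the study: the helper names (`five_fold_splits`, `loo_accuracy`, `nested_cv`), the use of accuracy as the tuning criterion, and the toy threshold classifier are all assumptions made for the example.

```python
import random


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        yield train, folds[k]


def loo_accuracy(train_idx, fit, predict, X, y, params):
    """Inner leave-one-out loop: accuracy of one parameter set on a training set."""
    hits = 0
    for held_out in train_idx:
        inner = [i for i in train_idx if i != held_out]
        model = fit([X[i] for i in inner], [y[i] for i in inner], params)
        hits += predict(model, X[held_out]) == y[held_out]
    return hits / len(train_idx)


def nested_cv(X, y, fit, predict, param_grid):
    """Outer five-fold CV; parameters are tuned by LOO on each training set only."""
    outer_scores = []
    for train_idx, test_idx in five_fold_splits(len(X)):
        best = max(param_grid,
                   key=lambda p: loo_accuracy(train_idx, fit, predict, X, y, p))
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx], best)
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        outer_scores.append(correct / len(test_idx))
    return outer_scores


# Toy stand-in classifier: a fixed threshold on a single hypothetical feature
def fit(X_train, y_train, params):
    return params["threshold"]


def predict(model, x):
    return "high" if x > model else "low"


X = [0.1, 0.2, 0.3, 0.4, 0.5, 1.1, 1.2, 1.3, 1.4, 1.5]
y = ["low"] * 5 + ["high"] * 5
scores = nested_cv(X, y, fit, predict, [{"threshold": 0.0}, {"threshold": 0.8}])
```

The point of the design is that the test fold of the outer loop is never seen while parameters are chosen, so the outer scores estimate performance on unobserved data.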

Classification

Three classification methods were applied to each of the two datasets: support vector machine (SVM), random forest (RF) and naive Bayes (NB). We used the SVM methods implemented in the e1071 package in R, and selected the RBF (radial basis function) kernel, which has been shown to produce outstanding classification performance in previous studies [19]. We employed the RF implementation in the R package randomForest [20]. There are three important parameters, the values of which need to be determined: mtry (the number of input variables tried at each split), ntree (the number of trees to grow for each forest) and nodesize (the minimum size of the terminal nodes). We considered different parameter configurations for their values of mtry = {0.5,1,2}, ntree = {500,1000,2000} and nodesize = 1, as recommended in a previous study [21]. The best-performing parameters were selected by nested cross-validation, as described in the previous section. NB implemented in this study used the klaR package [22] in R. The parameter fL (factor for Laplace correction) was set by default (no correction), and usekernel (logical value to set if a kernel density estimate will be used for density estimation) was optimized by nested cross-validation. Classification model performance was evaluated using three classification metrics: accuracy, sensitivity and specificity. The Matthews Correlation Coefficient (MCC) was also used to quantify the balance between sensitivity and specificity.
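The four performance measures named above can all be computed from a two-class confusion matrix. The Python sketch below uses the standard definitions; the function name `classification_metrics`, the choice of "high severity" as the positive class, and the example counts are assumptions for illustration and do not reproduce any value in Table 2.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "mcc": mcc}


# Hypothetical counts, with "high severity" treated as the positive class
m = classification_metrics(tp=9, tn=10, fp=0, fn=1)
```

MCC ranges from -1 to 1 and, unlike accuracy, stays informative when the two classes are of unequal size, as in Dataset 2 (21 high versus 37 low severity samples).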

Feature (gene) selection methods

In feature selection, a small subset of features is identified that, together with the classification methods, is most effective in distinguishing samples belonging to different groups. There are three broad categories of selection methods: filter, wrapper, and embedded methods. Filter methods rank the features or genes regardless of the model. They select the most significant features or genes based on a univariate measure. Wrapper methods evaluate subsets that are optimal with respect to a subset evaluator such as a classifier. Embedded methods incorporate the search algorithm into the classifier. In this study, two filter methods and two wrapper methods were applied. The Chi-Squared method is a filter method that evaluates features or genes individually by measuring their chi-squared statistic with respect to the classes. The correlation-based feature selection (CFS) method is also a filter method, which measures the correlation between attributes and recognizes those feature subsets in which each feature is highly correlated with the class but uncorrelated with other subset features. The SVM-based recursive feature elimination method (SVM-RFE) is a wrapper method used in microarray data analysis. It eliminates unessential genes and selects better and more compact gene subsets. Random forest-based backward feature elimination method (RFVS) is another wrapper method that constructs RFs in an iterative manner. Upon each iteration, RFVS builds a random forest after discarding genes with the smallest importance values in the last iteration. The returned subset of genes is the one with the smallest out-of-bag error [23].
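The idea behind CFS, rewarding correlation with the class while penalizing redundancy among the selected features, is commonly expressed as a subset "merit" score, k·r_cf / sqrt(k + k(k − 1)·r_ff), where k is the subset size, r_cf the mean feature-class correlation and r_ff the mean feature-feature correlation. The Python sketch below illustrates that score only. It assumes absolute Pearson correlation as the correlation measure and a 0/1 class coding; the implementation used in the study may differ, and the names `pearson`, `cfs_merit` and the example data are hypothetical.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def cfs_merit(features, labels):
    """Merit of a feature subset: k*r_cf / sqrt(k + k*(k-1)*r_ff)."""
    k = len(features)
    r_cf = sum(abs(pearson(f, labels)) for f in features) / k
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Hypothetical data: 0 = low severity, 1 = high severity
labels = [0, 0, 0, 1, 1, 1]
gene_a = [0.1, 0.2, 0.1, 0.9, 1.0, 0.8]

merit_single = cfs_merit([gene_a], labels)
merit_duplicate = cfs_merit([gene_a, gene_a], labels)  # fully redundant pair
```

Adding an exact copy of a gene leaves the merit unchanged, which is why a search guided by this score tends toward compact subsets rather than groups of co-expressed genes.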

Principal component analysis

Principal component analysis (PCA) was performed using the PCA() function in the R (version 3.2.2) package FactoMineR (version 1.31.4). Configuration: scale.unit = TRUE, ncp = 5, ind.sup = NULL, quanti.sup = NULL, quali.sup = NULL, row.w = NULL, col.w = NULL, graph = TRUE, axes = c(1,2) [24].

The component loadings used in PCA are the correlation coefficients between the variables (rows) and factors (columns), and the values for the genes of PC1 and PC2 indicate the "weights" for the genes. The higher the loading value, the higher the weight a gene carries in composing the PC. Background correction and quantile normalization were performed using the limma package [18]. Data were loaded into MeV as a tab-delimited text file of log2-transformed Cy5/Cy3 ratios. For PCA, missing data were first estimated using K-nearest neighbors (KNN) imputation with N = 4.

Construction of heat maps

Heat maps showing gene expression for the patients by severity group were constructed using normalized and scaled expression levels. Quantile normalization was performed on the raw microarray data, and the normalized expression levels were then Log2 transformed, centered (on 0), and scaled to the same range across genes, so that each gene is evaluated with equal weight.

Ingenuity Pathway Analysis

Ingenuity Pathway Analysis (Qiagen) (Version 43605062) was used to perform Canonical Pathway Analysis and Upstream Regulator Prediction. Analysis was performed on four lists of microarray probe IDs: identified by CFS classifier in Dataset 1 and differentially expressed, identified by SVM-RFE classifier in Dataset 1 and differentially expressed, identified by CFS classifier in Dataset 2 and differentially expressed, and identified by SVM-RFE classifier in Dataset 2 and differentially expressed. Differential expression was considered statistically significant if the t-test p-value was < 0.05 following Bonferroni correction based on the number of probe IDs identified for each classifier-dataset combination. The Bonferroni correction was applied to limit false positives arising from multiple hypothesis testing. Probe IDs that were successfully mapped by IPA were included for analyses. The direction of expression was positive for probes with higher expression in high severity patients and negative for probes with higher expression in low severity patients. For each dataset, the appropriate reference probe set was based on the microarray used in the original study.
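The Bonferroni step described above multiplies each raw p-value by the number of tests, here the number of probe IDs selected for a given classifier-dataset combination, and caps the result at 1. A minimal Python sketch follows; the function name `bonferroni` and the raw p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Return Bonferroni-adjusted p-values and a significance flag for each test."""
    m = len(p_values)  # number of tests, e.g. probe IDs selected by one classifier
    adjusted = [min(1.0, p * m) for p in p_values]
    significant = [p_adj < alpha for p_adj in adjusted]
    return adjusted, significant


# Hypothetical raw t-test p-values for four probes
raw = [0.0001, 0.004, 0.03, 0.2]
adjusted, significant = bonferroni(raw)
```

Because the multiplier is the size of each selected list rather than the size of the whole array, the same raw p-value can pass the threshold for a short list and fail it for a long one.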

Results

Classification and accuracy performances were improved by applying feature selection methods

Several studies, including retrospective cohort analyses and prospective clinical trials, have shown that the severity of skin sclerosis, as assessed by the modified Rodnan skin thickness score (mRSS), is predictive of disease outcome [25]. However, only a very select few have attempted to map the clinical covariates of mRSS to unique gene expression signatures. To further explore this, we used microarray data from two SSc patient studies (Dataset 1 and Dataset 2) from the NCBI GEO Database (Patient data shown in Table 1 and S1 and S2 Tables). We categorized microarray profiles as either low (mRSS 0–17) or high (mRSS 18–51) severity.

Table 1. Summary of patient information for microarray biopsy samples used in models.

https://doi.org/10.1371/journal.pone.0199314.t001

We first tested various two-class classification models to determine which could best distinguish between samples designated as low or high severity. The accuracy, sensitivity, specificity, and Matthews correlation coefficient (MCC) were determined for each of the classification models (Table 2). MCC values demonstrate the contribution from each of the randomly formed gene sets. An MCC value of 1 indicates a perfect prediction [26]. In Dataset 1, the RF classifier combined with the CFS feature selection method showed an MCC value of 0.96 (Table 2). NB combined with either CFS or SVM-RFE also showed an MCC value of 0.96 in each case (Table 2). In Dataset 2, RF combined with CFS had an MCC value of 1.00. NB combined with either CFS or SVM-RFE also produced an MCC value of 1.00. Thus, higher MCC values were obtained in cases where either CFS or SVM-RFE selection methods were employed. Therefore, we chose to focus primarily on the results from these two feature selection methods for the remainder of the investigation.

Table 2. Performance evaluation of various classifier and feature selection methods.

https://doi.org/10.1371/journal.pone.0199314.t002

The CFS selection method identified 84 probe IDs in Dataset 1 and 89 probe IDs in Dataset 2 as being differentially expressed between low and high severity groups. Likewise, the SVM-RFE method identified 450 probe IDs in Dataset 1 and 50 probe IDs in Dataset 2 (Table 3). Specific genes and potential interactions were explored using Ingenuity Pathway Analysis.

Table 3. Number of features (microarray probe IDs) selected for each dataset and feature selection method.

https://doi.org/10.1371/journal.pone.0199314.t003

Patient severity classification groups are readily separable by gene expression signatures

We performed principal component analysis (PCA) using all targets that passed data cleaning and filtering to visually interpret the variability within and between severity groups. PCA based on all four feature selection methods showed clear separation between severity groups based on gene signature for both datasets (Fig 1 and S1 Fig). The PCA based on CFS for Dataset 1 showed 24.90% of variance in PC1 and 7.93% of variance in PC2 (Fig 1A). For all feature selection methods and both datasets, PC1 explained much more of the variability than PC2, indicating that the first principal component explained most of the differences between the expression profiles of the high and low severity patients. For the PCA performed on genes from the Chi-squared feature selection method, the microarray probe IDs with the ten highest absolute loading values for PC1 are given in Table 4. Each of these probe IDs was found to be statistically differentially expressed between low and high severity groups, based on Bonferroni-adjusted t-test p-value < 0.05. The most highly weighted probe IDs for Datasets 1 and 2 corresponded to genes for MAGI1 and SOX18, respectively, with positive loading values being associated with the more severe disease phenotype in both cases.

Fig 1. Principal component analyses (PCA) of gene expression separation between low and high severity groups.

Results based on CFS feature selection method are shown in A (Dataset 1) and B (Dataset 2). Results based on SVM-RFE feature selection method are shown in C (Dataset 1) and D (Dataset 2).

https://doi.org/10.1371/journal.pone.0199314.g001

Table 4. Microarray probe IDs associated with the top 10 highest absolute value of loading values for principal component 1 based on principal component analysis of genes identified by Chi-squared feature selection.

https://doi.org/10.1371/journal.pone.0199314.t004

In addition, the Log2 normalized expression values were used to create heat maps for each Dataset. Heat maps were generated to show probe IDs from Dataset 1 (Fig 2) and Dataset 2 (Fig 3) that were identified by CFS feature selection and highlight differences in gene expression profiles between the severity groups. As expected, expression profiles associated with these probe IDs can be separated by severity index when inspected visually. Similar results are shown in heat maps for probe IDs from Dataset 1 (S2 Fig) and 2 (S3 Fig) that were identified by SVM-RFE.

Fig 2. Heat map showing Log2 normalized expression values for patient samples from Dataset 1 for probe IDs identified by CFS feature selection method.

https://doi.org/10.1371/journal.pone.0199314.g002

Fig 3. Heat map showing Log2 normalized expression values for patient samples from Dataset 2 for probe IDs identified by CFS feature selection method.

https://doi.org/10.1371/journal.pone.0199314.g003

Pathway enrichment analysis identified a putative signaling network that included genes related to SSc severity

Based on the high performance of our models, we hypothesized that the genes identified by the models and differentially expressed between severity groups may have biological significance. To further explore the relationship between these groups of genes, we used Ingenuity Pathway Analysis (IPA) to analyze the lists of genes that were both identified by the classifier and that showed differential expression between severity groups (Bonferroni-corrected t-test p-value < 0.05). Fig 4 depicts the numbers of Probe IDs that were identified by the classifier, were statistically significant, and mapped to Gene IDs for each classifier and dataset combination. We first investigated whether our gene lists were overrepresented in known Canonical Pathways. The top three Canonical Pathways (ranked by Benjamini-Hochberg p-value) are shown in S3 Table. Our analysis failed to find any pathways that showed statistically meaningful overrepresentation of our input genes. We then used the Upstream Regulator Analysis in IPA to see whether a common upstream regulator could explain the differences that we saw in gene expression between low and high severity patient samples. The upstream regulator that showed the most predictive value, based on a combination of activation Z-score of 2.159 and p-value of overlap of 2.25E-03 was oncostatin M (OSM). Activation of OSM was predicted to regulate five of the genes with significant expression differences identified in Dataset 2 using the CFS classifier when comparing low and high severity patients (Table 5 and Fig 5).

Fig 4. Schematic representation of pipeline for choosing genes that were included in Ingenuity Pathway Analysis.

Red font indicates numbers of genes more highly expressed by high severity patients; blue font indicates numbers of genes more highly expressed by low severity patients.

https://doi.org/10.1371/journal.pone.0199314.g004

Fig 5. Predicted signaling network between OSM and downstream genes related to SSc severity.

Red shading of gene indicates upregulation in dataset compared to low severity patients, green shading downregulation, and intensity of color depicts strength of regulation. Relationships between genes that are predicted based on literature are indicated by lines connecting genes, with red symbolizing predicted upregulation and blue predicted downregulation.

https://doi.org/10.1371/journal.pone.0199314.g005

Table 5. Results of IPA-based upstream regulator analysis showing potential role for OSM in regulating genes identified by CFS-based classification of Dataset 2 and differentially expressed between low and high severity patients.

https://doi.org/10.1371/journal.pone.0199314.t005

Discussion

This study is one of the first to describe an independently validated classification prediction model for gene signatures in SSc using current clinical diagnostic factors. We applied three well-established classification methods, SVM, RF, and NB, coupled with four different feature selection methods, to clinically derived microarray data and compared the relative efficiency of the models. Interestingly, RF and CFS showed high MCC values of 0.96 and 1.00 for both of the independent datasets that we used, suggesting an efficient diagnostic two-layered system for SSc. This two-layered model system is capable of reducing the dimensionality of large independent data to find a more useful subset for further exploration. Moreover, due to the heterogeneity of patients with SSc, a prediction model that can deduce disease severity in an unsupervised manner is essential. To get an idea of how the gene lists generated by our classifiers compared to conventionally-derived differentially expressed genes, we also performed a formal differential gene expression analysis comparing low and high severity patients from each dataset (data not shown), which identified many probe IDs that were statistically significant between groups. While some genes identified by the feature selection methods overlapped with this list of conventionally derived differentially expressed genes, in each case our classifiers also identified several genes that were not found with the conventional differential gene analysis. This highlights the value of our classifiers in finding genes that can be useful for classification purposes but may not be identified by conventional methods.

Further validating the efficiency of the two-step model system, principal component analysis showed a clear separation of patients with high and low skin scores when subjected to the model. This clustering of datasets further substantiates the efficiency of the model system to reduce complexity and distinguish between high and low severity gene profiles. After performing the PCA, we examined which microarray probe IDs were associated with the highest loading values to gain a sense of which were the most important in separating the severity groups. The most highly weighted probe IDs were associated with a broad array of biological processes. In Dataset 1, Tetraspanin 8 showed one of the highest loading levels. Tetraspanins have previously been shown to alter fibrosis through the regulation of epithelial cell-basement membrane interaction [27]. For Dataset 2, we found that the gene Signal Transducer and Activator of Transcription 3 (STAT3) was one of those with the highest loading values. STAT3 has been extensively associated with dermal fibrosis and is critical in the pro-fibrotic effects of TGF-β signaling [28,29]. Our analysis suggests that the other highly weighted genes are key in distinguishing low and high severity patients and should be carefully considered for further investigation. Based on the results of the CFS feature selection method, heat maps showing Log2 normalized expression for both datasets showed a clear distinction between patients with high and low skin scores, signifying the presence of two separately clustered gene profiles.

Our model assigned patients into high and low severity groups using skin score, the standard diagnostic measurement for SSc. This bolsters the efficiency and effectiveness of our model system to determine outcome, which should help decrease the lack of specificity that currently hampers clinical treatment. The analytical pipeline reported herein highlights the potential applicability of an unsupervised, tiered correlation method, whereby disease severity can be classified based upon genetic signature alone. Furthermore, the novel gene sets that have been identified as correlating with disease severity may be further investigated to provide insight into the molecular mechanisms underlying a clinical trait of SSc, a complex fibrotic disease. Our methodology can be used to stratify patients to assess response to therapy and to guide appropriate recruitment to clinical trials.

To evaluate the functional relevance of these clusters in the context of disease, we used various tools within Ingenuity Pathway Analysis to explore interactions between genes that were both identified by our classification methods and showed differential expression between low and high severity groups. Our analysis did not return any pathways that were significantly enriched by the genes in our lists, suggesting that the genes identified here do not cluster in known biological pathways. This result indicates that our classification methods are capable of identifying genes that, when taken together, can accurately be used to predict disease severity although they may not participate in the same pathways in vivo. Additionally, it is possible that these genes may interact in pathways that are not yet known. However, our analysis of potential upstream regulators did identify one small network, controlled by the cytokine oncostatin M (OSM), that could drive relatively high expression of P-selectin (SELP), cholesterol 25-hydroxylase (CH25H), C-C motif chemokine ligand 2 (CCL2), and angiopoietin 2 (ANGPT2) and low expression of heat shock protein family B member 3 (HSPB3) in higher severity SSc patients. Evidence in the literature shows that OSM is capable of regulating these genes with directionality suggesting that OSM positively regulates the genes associated with the more severe skin scores [30–32]. Studies have shown that levels of OSM, an IL-6 family cytokine, are associated with SSc and OSM can modulate production of several extracellular matrix components important in fibrosis [33–36]. Relevant to our study, serum levels of soluble angiopoietin-1, P-selectin, and CCL2 correlate with increasing severity of SSc and worsening clinical features [37–40]. Therefore, our data suggest that OSM may influence the expression of additional genes associated with severe disease.

We observed some differences in the probe IDs identified by our classification methods and in subsequent downstream analyses. For the two datasets, the CFS method identified a similar number of probe IDs that were used for classification (84 and 85 for Datasets 1 and 2, respectively). On the other hand, the SVM-RFE method identified 450 probe IDs for Dataset 1 and 50 for Dataset 2. We believe that these differences stem from the fact that CFS and SVM-RFE use intrinsically different methods for feature selection. The CFS method is a filter method that gives priority to genes whose expression is highly correlated with a class. SVM-RFE is a wrapper method that discards genes that have only a small impact on classification. Interestingly, when only the genes that have significantly different expression levels between severity groups are considered, the numbers of genes identified by the classifiers are more analogous. This indicates that the majority of the 450 probe IDs identified when the SVM-RFE method was applied to Dataset 1 were useful for classification, despite not being considered differentially expressed between groups.

Our novel three-classifier model system of SVM, RF, and NB is an efficient and powerful tool for developing gene signatures from SSc skin biopsies and can also be used to develop a prognostic gene signature for SSc. Our models achieved high accuracy, specificity, and sensitivity, demonstrating their potential use as diagnostic indicators. Furthermore, our model system can be applied to other fibrotic disorders by substituting different phenotypes in the training cohort. Ongoing and future studies using alternative methodologies will investigate specific genes identified by the classifiers to determine how they may relate to disease pathogenesis and progression.
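For readers less familiar with the three performance measures named above, the following minimal sketch shows how they are derived from predicted and true severity labels. The helper name and the example labels are hypothetical and do not reproduce results from this study. High severity is treated as the positive class by assumption.

```python
def binary_metrics(y_true, y_pred, positive="high"):
    """Accuracy, sensitivity and specificity for a two-class prediction."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # fraction of high severity samples detected
        "specificity": tn / (tn + fp),  # fraction of low severity samples detected
    }


# Hypothetical labels: 4 high severity and 6 low severity samples.
truth = ["high"] * 4 + ["low"] * 6
predicted = ["high", "high", "high", "low"] + ["low"] * 5 + ["high"]

metrics = binary_metrics(truth, predicted)
print(metrics)
```

Reporting sensitivity and specificity alongside accuracy matters when the severity groups are unbalanced, since accuracy alone can look high even if one group is rarely predicted correctly.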

Supporting information

S1 Table. Patient information for samples used from Dataset 1.

B: Back biopsy, FA: Forearm biopsy, F: Female, M: Male, W: White, A: Asian, AA: African American, H: Hispanic. Shading indicates individual patients.

https://doi.org/10.1371/journal.pone.0199314.s001

(DOCX)

S2 Table. Patient information for samples used from Dataset 2.

F: Female, M: Male, W: White, A: Asian, AA: African American, H: Hispanic.

https://doi.org/10.1371/journal.pone.0199314.s002

(DOCX)

S3 Table. Canonical pathways identified by Ingenuity Pathway Analysis.

Probe IDs included in analysis were both identified by the classification method and significantly differentially expressed between low and high severity groups (Bonferroni-corrected t-test p-value < 0.05). CFS: correlation-based feature selection method, SVM-RFE: SVM-based recursive feature elimination method. Ratio: number of genes from input list that occur in pathway. B-H p-value: Benjamini-Hochberg adjusted p-value of right-tailed Fisher’s Exact Test.

https://doi.org/10.1371/journal.pone.0199314.s003

(DOCX)

S1 Fig. Principal component analysis (PCA) of gene expression separation between low and high severity groups.

Results based on Chi-squared feature selection method are shown in A (Dataset 1) and B (Dataset 2). Results based on RVFS feature selection method are shown in C (Dataset 1) and D (Dataset 2).

https://doi.org/10.1371/journal.pone.0199314.s004

(TIF)

S2 Fig. Heat map showing Log2 normalized expression values for patient samples from Dataset 1 for probe IDs identified by SVM-RFE feature selection method.

https://doi.org/10.1371/journal.pone.0199314.s005

(TIF)

S3 Fig. Heat map showing Log2 normalized expression values for patient samples from Dataset 2 for probe IDs identified by SVM-RFE feature selection method.

https://doi.org/10.1371/journal.pone.0199314.s006

(TIF)

Acknowledgments

The authors thank AccuraScience LLC, Johnston, IA, for providing technical assistance with bioinformatics analysis of microarray data for the study.

References

  1. Steen VD, Medsger TA. Changes in causes of death in systemic sclerosis, 1972–2002. Ann Rheum Dis. 2007/03/03. 2007;66: 940–944. pmid:17329309
  2. Johnson SR. Progress in the clinical classification of systemic sclerosis. Curr Opin Rheumatol. 2017/09/30. 2017;29: 568–573. pmid:28961157
  3. Steen VD. The many faces of scleroderma. Rheum Dis Clin North Am. 2008;34: 1–15; v. pmid:18329529
  4. Mathai SC, Hassoun PM. Pulmonary arterial hypertension associated with systemic sclerosis. Expert Rev Respir Med. 2011/04/23. 2011;5: 267–279. pmid:21510736
  5. Steen V. Predictors of end stage lung disease in systemic sclerosis. Ann Rheum Dis. 2003/01/15. 2003;62: 97–99. pmid:12525376
  6. Sargent JL, Whitfield ML. Capturing the heterogeneity in systemic sclerosis with genome-wide expression profiling. Expert Rev Clin Immunol. NIH Public Access; 2011;7: 463–73. pmid:21790289
  7. Krzyszczak ME, Li Y, Ross SJ, Ceribelli A, Chan EK, Bubb MR, et al. Gender and ethnicity differences in the prevalence of scleroderma-related autoantibodies. Clin Rheumatol. 2011/04/28. 2011;30: 1333–1339. pmid:21523365
  8. Reveille JD. Ethnicity and race and systemic sclerosis: how it affects susceptibility, severity, antibody genetics, and clinical manifestations. Curr Rheumatol Rep. 2003/03/12. 2003;5: 160–167. pmid:12628048
  9. Dospinescu P, Jones GT, Basu N. Environmental risk factors in systemic sclerosis. Curr Opin Rheumatol. 2013/01/05. 2013;25: 179–183. pmid:23287382
  10. Hesselstrand R, Andreasson K, Wuttge DM, Bozovic G, Scheja A, Saxne T. Increased serum COMP predicts mortality in SSc: results from a longitudinal study of interstitial lung disease. Rheumatology. 2012;51: 915–920. pmid:22253028
  11. Lakota K, Wei J, Carns M, Hinchcliff M, Lee J, Whitfield ML, et al. Levels of adiponectin, a marker for PPAR-gamma activity, correlate with skin fibrosis in systemic sclerosis: potential utility as a biomarker? Arthritis Res Ther. 2012;14: R102. pmid:22548780
  12. Martin P, Teodoro WR, Velosa APP, de Morais J, Carrasco S, Christmann RB, et al. Abnormal collagen V deposition in dermis correlates with skin thickening and disease activity in systemic sclerosis. Autoimmun Rev. 2012;11: 827–835. pmid:22406224
  13. Shand L, Lunt M, Nihtyanova S, Hoseini M, Silman A, Black CM, et al. Relationship between change in skin score and disease outcome in diffuse cutaneous systemic sclerosis: application of a latent linear trajectory model. Arthritis Rheum. 2007/06/30. 2007;56: 2422–2431. pmid:17599771
  14. Milano A, Pendergrass SA, Sargent JL, George LK, McCalmont TH, Connolly MK, et al. Molecular subsets in the gene expression signatures of scleroderma skin. PLoS One. 2008/07/24. 2008;3: e2696. pmid:18648520
  15. Preliminary criteria for the classification of systemic sclerosis (scleroderma). Subcommittee for scleroderma criteria of the American Rheumatism Association Diagnostic and Therapeutic Criteria Committee. Arthritis Rheum. 1980;23: 581–590. pmid:7378088
  16. Assassi S, Wu M, Tan FK, Chang J, Graham TA, Furst DE, et al. Skin gene expression correlates of severity of interstitial lung disease in systemic sclerosis. Arthritis Rheum. 2013;65: 2917–27. pmid:23897225
  17. Dobrota R, Maurer B, Graf N, Jordan S, Mihai C, Kowal-Bielecka O, et al. Prediction of improvement in skin fibrosis in diffuse cutaneous systemic sclerosis: a EUSTAR analysis. Ann Rheum Dis. 2016;75: 1743–1748. pmid:27016052
  18. Diboun I, Wernisch L, Orengo CA, Koltzenburg M. Microarray analysis after RNA amplification can detect pronounced differences in gene expression using limma. BMC Genomics. 2006/10/13. 2006;7: 252. pmid:17029630
  19. Liu W, Meng X, Xu Q, Flower DR, Li T. Quantitative prediction of mouse class I MHC peptide binding affinity using support vector machine regression (SVR) models. BMC Bioinformatics. 2006/04/04. 2006;7: 182. pmid:16579851
  20. Liaw A, Wiener M. Classification and Regression by randomForest. R News. 2002;2.
  21. Diaz-Uriarte R, Alvarez de Andres S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006/01/10. 2006;7: 3. pmid:16398926
  22. Weihs C, Ligges U, Luebke K, Raabe N. klaR Analyzing German Business Cycles. Data Analysis and Decision Support. Berlin/Heidelberg: Springer-Verlag; 2005. pp. 335–343. https://doi.org/10.1007/3-540-28397-8_36
  23. Statnikov A, Wang L, Aliferis CF. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics. 2008/07/24. 2008;9: 319. pmid:18647401
  24. Lê S, Josse J, Husson F. FactoMineR: An R Package for Multivariate Analysis. J Stat Softw. 2008;25: 1–18.
  25. Kumanovics G, Pentek M, Bae S, Opris D, Khanna D, Furst DE, et al. Assessment of skin involvement in systemic sclerosis. Rheumatol. 2017/10/11. 2017;56: v53–v66. pmid:28992173
  26. Zhang H, Wang H, Dai Z, Chen M, Yuan Z. Improving accuracy for cancer classification with a new algorithm for genes selection. BMC Bioinformatics. 2012;13: 298. pmid:23148517
  27. Tsujino K, Takeda Y, Arai T, Shintani Y, Inagaki R, Saiga H, et al. Tetraspanin CD151 protects against pulmonary fibrosis by maintaining epithelial integrity. Am J Respir Crit Care Med. 2012;186: 170–80. pmid:22592804
  28. Chakraborty D, Šumová B, Mallano T, Chen C-W, Distler A, Bergmann C, et al. Activation of STAT3 integrates common profibrotic pathways to promote fibroblast activation and tissue fibrosis. Nat Commun. 2017;8: 1130. pmid:29066712
  29. Pedroza M, To S, Assassi S, Wu M, Tweardy D, Agarwal SK. Role of STAT3 in skin fibrosis and transforming growth factor beta signalling. Rheumatology. 2017; pmid:29029263
  30. Yao L, Pan J, Setiadi H, Patel KD, McEver RP. Interleukin 4 or oncostatin M induces a prolonged increase in P-selectin mRNA and protein in human endothelial cells. J Exp Med. 1996;184: 81–92. Available: http://www.ncbi.nlm.nih.gov/pubmed/8691152 pmid:8691152
  31. Gazel A, Rosdy M, Bertino B, Tornier C, Sahuc F, Blumenberg M. A Characteristic Subset of Psoriasis-Associated Genes Is Induced by Oncostatin-M in Reconstituted Epidermis. J Invest Dermatol. 2006;126: 2647–2657. pmid:16917497
  32. Rychli K, Kaun C, Hohensinner PJ, Rega G, Pfaffenberger S, Vyskocil E, et al. The inflammatory mediator oncostatin M induces angiopoietin 2 expression in endothelial cells in vitro and in vivo. J Thromb Haemost. 2010;8: 596–604. pmid:20088942
  33. Wong S, Botelho FM, Rodrigues RM, Richards CD. Oncostatin M overexpression induces matrix deposition, STAT3 activation and SMAD1 Dysregulation in lungs of fibrosis-resistant BALB/c mice. Lab Investig. 2014;94: 1003–1016. pmid:24933422
  34. Nagata T, Kai H, Shibata R, Koga M, Yoshimura A, Imaizumi T. Oncostatin M, an Interleukin-6 Family Cytokine, Upregulates Matrix Metalloproteinase-9 Through the Mitogen-Activated Protein Kinase Kinase-Extracellular Signal-Regulated Kinase Pathway in Cultured Smooth Muscle Cells. Arterioscler Thromb Vasc Biol. 2003;23: 588–593. pmid:12615664
  35. Bamber B, Reife RA, Haugen HS, Clegg CH. Oncostatin M stimulates excessive extracellular matrix accumulation in a transgenic mouse model of connective tissue disease. J Mol Med (Berl). 1998;76: 61–9. Available: http://www.ncbi.nlm.nih.gov/pubmed/9462869
  36. Hasegawa M, Sato S, Fujimoto M, Ihn H, Kikuchi K, Takehara K. Serum levels of interleukin 6 (IL-6), oncostatin M, soluble IL-6 receptor, and soluble gp130 in patients with systemic sclerosis. J Rheumatol. 1998;25: 308–13. Available: http://www.ncbi.nlm.nih.gov/pubmed/9489824 pmid:9489824
  37. Michalska-Jakubus M, Kowal-Bielecka O, Chodorowska G, Bielecki M, Krasowska D. Angiopoietins-1 and -2 are differentially expressed in the sera of patients with systemic sclerosis: high angiopoietin-2 levels are associated with greater severity and higher activity of the disease. Rheumatology. 2011;50: 746–755. pmid:21149250
  38. Gerlicz Z, Dziankowska-Bartkowiak B, Dziankowska-Zaborszczyk E, Sysa-Jedrzejowska A. Disturbed Balance between Serum Levels of Receptor Tyrosine Kinases Tie-1, Tie-2 and Angiopoietins in Systemic Sclerosis. Dermatology. 2014;228: 233–239. pmid:24603462
  39. Gruschwitz MS, Hornstein OP, von Den Driesch P. Correlation of soluble adhesion molecules in the peripheral blood of scleroderma patients with their in situ expression and with disease activity. Arthritis Rheum. 1995;38: 184–9. Available: http://www.ncbi.nlm.nih.gov/pubmed/7848308 pmid:7848308
  40. Elhai M, Meunier M, Matucci-Cerinic M, Maurer B, Riemekasten G, Leturcq T, et al. Outcomes of patients with systemic sclerosis-associated polyarthritis and myopathy treated with tocilizumab or abatacept: a EUSTAR observational study. Ann Rheum Dis. 2013;72: 1217–1220. pmid:23253926