Integrated Analyses of Gene Expression Profiles Digs out Common Markers for Rheumatic Diseases

Objective Rheumatic diseases have some common symptoms. Extensive gene expression studies, accumulated thus far, have successfully identified signature molecules for each rheumatic disease, individually. However, whether there exist shared factors across rheumatic diseases has yet to be tested. Methods We collected and utilized 6 public microarray datasets covering 4 types of representative rheumatic diseases including rheumatoid arthritis, systemic lupus erythematosus, ankylosing spondylitis, and osteoarthritis. Then we detected overlaps of differentially expressed genes across datasets and performed a meta-analysis aiming at identifying common differentially expressed genes that discriminate between pathological cases and normal controls. To further gain insights into the functions of the identified common differentially expressed genes, we conducted gene ontology enrichment analysis and protein-protein interaction analysis. Results We identified a total of eight differentially expressed genes (TNFSF10, CX3CR1, LY96, TLR5, TXN, TIA1, PRKCH, PRF1), each associated with at least 3 of the 4 studied rheumatic diseases. Meta-analysis warranted the significance of the eight genes and highlighted the general significance of four genes (CX3CR1, LY96, TLR5, and PRF1). Protein-protein interaction and gene ontology enrichment analyses indicated that the eight genes interact with each other to exert functions related to immune response and immune regulation. Conclusion The findings support that there exist common factors underlying rheumatic diseases. For rheumatoid arthritis, systemic lupus erythematosus, ankylosing spondylitis and osteoarthritis diseases, those common factors include TNFSF10, CX3CR1, LY96, TLR5, TXN, TIA1, PRKCH, and PRF1. In-depth studies on these common factors may provide keys to understanding the pathogenesis and developing intervention strategies for rheumatic diseases.


Introduction
Rheumatic diseases are painful conditions usually characterized by inflammation, swelling, and pain in joints or muscles. More than 100 types of diseases are classified as rheumatic diseases, including many types of arthritis. The pathologic mechanisms of rheumatic diseases are complex, under the influence of both genetic and environmental factors. Although the rheumatic diseases have been studied for decades, the underlying mechanisms are still poorly understood.
Rheumatic diseases share some common symptoms, which imply that they may share common pathologic factors. Though each rheumatic disease has specific genetic causes, previous studies identified common factors associated with different rheumatic diseases. For instance, the HLA region is known to be associated with several rheumatic diseases including rheumatoid arthritis (RA) [1], systemic lupus erythematosus (SLE) [2], ankylosing spondylitis (AS) [3], Sjögren's syndrome, as well as many others [4]. Tumour necrosis factor has been shown to play a dominant role in the pathogenesis of various immune-mediated inflammatory diseases such as RA, AS, osteoarthritis (OA) [5], and psoriatic arthritis [6]. In addition, hypovitaminosis D is commonly observed in rheumatic patients with and without autoimmune features [7]. Despite the above sporadic evidence, common factors shared by various rheumatic diseases have yet to be examined systematically.
Gene expression profiling with microarray is a powerful tool for the discovery of genes and biological pathways that are associated with various complex diseases [8,9]. Gene expression datasets, collected in rheumatic patients and normal controls, have been accumulated for a decade, leading to successful identification of gene expression signatures for each rheumatic disease, respectively. However, to the best of our knowledge, no study has been conducted to systematically identify factors in common for various rheumatic diseases.
In this study, based on the archived public high-throughput microarray gene expression datasets for rheumatic cohorts, we performed comprehensive statistical analyses to identify genetic factors commonly significant for different rheumatic diseases.

Ethics statement
The study did not made use of human or vertebrate animal subjects and tissue. The data were collected from an international publicly available database and analyzed anonymously. Therefore, no additional informed consent was required.

Collection and selection of data
We searched the NCBI PubMed and Gene Expression Omnibus (GEO) database (http://www. ncbi.nlm.nih.gov/geo/) [10]with key words "Rheumatic diseases", "Rheumatoid arthritis", "Adult onset still's disease", "Osteoarthritis", "Ankylosing spondylitis", "Sjogren's syndrome", "Systemic lupus erythematosus", "Microarray", and "Gene expression profile". By August 31, 2014, a total of 23 types of common rheumatic diseases were considered. A study was included in our analysis if: (1) it included patients diagnosed with rheumatic diseases and normal controls, (2) it contained gene expression profiling of blood, and (3) it provided data sufficient to our analysis. Finally, 6 datasets covering 4 types of rheumatic diseases(RA, SLE, OA, and AS) were retained for subsequent analysis ( Table 1). The process of data collection and selection was provided in Fig 1A.

Data preprocessing
We downloaded the 6 microarray gene expression datasets from GEO. The datasets were generated using four different platforms. The numbers of profiled genes in each dataset are different, and the probe IDs are different across platforms. To perform integrative analysis of datasets from different studies, we processed the datasets by using functions MetaDE.match, MetaDE.merge, MetaDE.filtering inan R package: MetaDE [11]. At the first step, when multiple probes matched to the same gene, we adopted the "IQR" method to select a probe with the largest interquartile range of gene expression values among all matched probes to represent the gene. At the second step, we extracted the commonly profiled genes across the six datasets. In the identification of differential expressed genes (DEGs), either un-expressed or un-informative genes contribute to false discoveries. Thus, at the third step, we performed gene filtering to sequentially remove un-expressed genes and un-informative genes. In each datasets, mean intensities and standard deviations of expression valuesfor each gene were ranked. The sum of ranks across all datasets was used to evaluate level of gene expression/information. To get the best balance between the false discovery rate (FDR)and the number of genes retained, we considered 30% genes with the smallest rank sum of mean intensity as un-expressed genes, and  considered 30% genes with the smallest rank sum of standard deviations as un-informative genes [11]. Finally, a total of 2600 genes were retained for further analysis.

Statistical analysis
We analyzed each dataset individually by ind.analysis function in MetaDE package to identify DEGs between rheumatic patients and normal controls. The moderatedt-statistic [12] was selected for significance analysis. The Benjamini & Hochberg FDR method was used to apply p-value adjustment for multiple-testing correction [13].
To get an overview of similarities of the gene expression profiles among different rheumatic diseases, we considered an approach introduced by Marina Sirota [14]. In a specific dataset, expression variation score for each gene is calculated as sign(t j ) log(p j ), where j is a gene number in a specific dataset and t j is the moderated-t statistic of gene j, p j is the moderatep-value of gene j. Thus it combines both the strength and the 'direction' of association. The gene expression variation profile of each dataset is a set of expression variation scores of all genes in the dataset. Correlations between the gene expression variation profiles can quantify the similarity of gene expression and regulation effects [14]. Herein, we considered the Kendall and Spearman correlation methods, as these two correlations are well-known methods for quantifying the degree of correlations between lists of ordinal data. The above correlation coefficients were computed by using the cor.test function in R. Based on the correlation coefficients, the hierarchical clustering of datasets was conducted by using the gplots package in R.
To investigate the common DEGs across the four types of rheumatic diseases, we firstly ranked the moderate p-values from the smallest to the largest for each dataset. Then, we examined the overlaps of the top 100 ranked genes across the six datasets. Genes with significantly differential expression in at least 3 types of rheumatic diseases were selected as common genes. Under the same criterion, we also detected common genes from DEGs with FDR adjusted pvalue less than 0.01.
To evaluate the reliability of the above detected common markers, we performed a metaanalysis of the 6 gene expression microarray datasets. We used MetaDE package for identification of DEGs by the Fisher method and the maximum P-value method. The moderated t-statistic was used to calculate p-values in each datasets, and the Benjamini & Hochberg FDR method was used to apply meta-p-value adjustment. In the Fisher method, strong statistical significance of a gene could result from an extremely small p-value of one study, thus it detects genes that are differentially expressed in one or more datasets. In contrast, the maximum Pvalue method detects genes with small p-values of all studies [15].

Functional annotation analysis
To understand the functions of the identified genes, we performed Gene Ontology (GO) enrichment analysis (http://www.geneontology.org/). Herein, we only considered the biological process. To explore the functional associations between identified genes, the proteins encoded by the identified genes were analyzed according to the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) 9.1 Server (http://www.string-db.org/) [16].
The above data processing and analyses workflow was presented in Fig 1B.

Study characteristics
A total of 6 GEO datasets (accession number: GSE15573, GSE1402, GSE12374, GSE20864, GSE48556, and GSE25101) were included in our analysis, covering 4 types of rheumatic diseases (2, 2, 1, 1, datasets on RA, SLE, OA, and AS, respectively). These datasets were generated with total RNA extracted from peripheral blood of rheumatic patients and normal controls [17][18][19][20][21][22]. Detailed information of each dataset was described in Table 1. To be noted, four different microarray platforms were used in generating the six datasets (GPL6102 for GSE15573, GPL8300 for GSE1402, GPL1291 for GSE12374 and GSE20864, GPL6947 for GSE48556 and GSE25101). After preprocessing the datasets, a total of 2600 genes, profiled in all the 6 datasets, were extracted for analysis in the present study.

Correlation of gene expression variation profiles between rheumatic diseases
The Kendall and Spearman correlation analyses, in Fig 2A and 2B respectively, presented similar results. Correlation matrices between the gene expression variation profilesof datasets were shown in the heat maps. Positive correlations between pairs of diseases were shown in blue, and negative correlations were shown in pink. The first level of clustering put RA1 and AS in the same cluster; SLE1 and SLE2 were put in the same cluster in the second level of clustering. The two clusters then converged and clustered together with RA2. However, OA was negatively correlated with the other five datasets, and did not cluster with others until at the last level of clustering. These results were consistent with previous studies [14,23]that suggest similarities and differences between rheumatic diseases.

Identification of common DEGs in rheumatic diseases
Based on the moderate p-values, top100 ranked genes in each datasets were listed in S1 Table. Genes with adjusted-p<0.01 in each dataset were listed in S2 Table. Throughout S1 and S2 Tables, none of the genes was identified in all the four studied rheumatic diseases. Corresponding to S1 and S2 Tables, genes identified in three of the four diseases were presented in Table 2. Each ranking method identified five genes. Notably, TXN and CX3CR1 were identified by both ranking methods. Together, a total of eight genes were identified, includingTNFSF10, CX3CR1, LY96, TLR5, TXN, TIA1, PRKCH, and PRF1. Significance of the above eight genes for rheumatic diseases was further evidenced by meta-analyses with the Fisher's method ( Table 2). Four of the DEGs were further validated by meta-analyses using the max-P method ( Table 2).
To ascertain the importance of the above eight genes for rheumatic diseases, we conducted a clustering analysis based on the eight detected genes. Consistent with the clustering analysis based on whole gene expression variation profiles, clustering analysis based on the above eight genes (Fig 2C and 2D) successfully separated OA from the other three autoimmune rheumatic diseases. Although clusters within autoimmune rheumatic diseases changed slightly, the global trend was similar and the correlations within the autoimmune rheumatic diseases were enhanced. This finding indicates that the identified genes may play a principle role in the molecular pathological mechanism of these diseases.

Functional annotation
PPI analysis of the eight identified genes provided reference information for their potential association (Fig 3). Each gene had text mining associations with others. In addition, there were co-expression relations between three pairs of genes: CX3CR1/PRF1, TLR5/TNFSF10, and TNFSF10/PRKCH. Four genes, i.e., CX3CR1, PRF1, TNFSF10, and TLR5, were the nodes of the network.
The top 20 enriched GO terms of biological process related to the above eight identified genes were shown in Table 3. The most significantly enriched function was "immune response" (GO:0006955, p = 6.70E-06). Related with immune regulation are GO terms "immune system

Discussion
The inflammation and abnormal immune process are two important pathologic characteristics for rheumatic diseases. Joint pain is the main symptom in common. We speculate that these diseases may share similar pathological factors and mechanisms. In the last decade, microarray analysis of gene expression profiles has been widely used to identify genes and biological pathways associated with various complex diseases [8,9,16,18]. However, previous such studies on rheumatic diseases usually focused on identifying factors specific to one disease and paid little attention to identifying genes important to various diseases. Therefore, in this study, we are attempted to identify common genes underlying multiple rheumatic diseases, with RA, SLE, OA, and AS as representatives. To the best of our knowledge, this is the first such endeavor in the research community of rheumatic diseases. By jointly analyzing 6 published microarray gene expression datasets about RA, SLE, OA and AS, we identified eight genes (TNFSF10, CX3CR1, LY96, TLR5, TXN, TIA1, PRKCH, PRF1) presenting general importance to rheumatic diseases. As evidenced by PPI and GO analyses, these eight genes interact with each other to exert functions related to immune response and immune regulation.
Consistent with our findings, evidences from previous studies support that four of the above eight identified genes, i.e., TNFSF10, CX3CR1, TLR5, and PRF1, are relevant to multiple rheumatic diseases. For example, it was reported that TNFSF10 is involved in pathogenesis of RA [24], SLE [25], AS [26], OA [27], and multiple sclerosis [28]. CX3CR1 was reported to involve in inflammation and autoimmune progresses in RA [29], MS [30] and SS [31]. In vivo experiments confirmed that the de novo CX3CL1-CX3CR1 axis plays a pivotal role in osteoclast recruitment and subsequent bone resorption [32], which provides a clue of molecular mechanism responsible for bone damage in rheumatic diseases. It is known that toll-like receptors (TLRs) are membrane receptors recognizing biotic inflammatory stimulus. Previous study showed that TLR5 was over-expressed in patients affected with AS [33], RA and OA [34]. Besides, TLR5 gene mutations are associated with resistance and susceptibility to SLE [35]. In addition, PRF1 was reported to be associated with SLE [36], RA [37] and AS Patients [38].
Besides the above four genes previously recognized to be involved in multiple rheumatic diseases, this study firstly point out that TXN, PRKCH, TIA1, and LY96 genes are significant for multiple rheumatic diseases, as well. Previous studies showed that TXN, PRKCH, and TIA1 genes are related to RA [39][40][41]. LY96 gene encodes a protein which associated with TLR4 on cell surface, and plays important roles in TLR signaling pathway, inflammatory response and innate immune response [42]. However, molecular function mechanisms of the above four genes in relation to other three types of rheumatic diseases are unclear, and have yet to be studied.
It iswell known that immune response is involved in the development of most rheumatic diseases. In this study, GO analyses showed that the eight identified genes are significantly enriched in biological processes of "immune response" and "defense response". Specifically, those genes are involved in "responses to lipoplysaccharide, molecule of bacterial origin, biotic stimulus, and bacterium", and are involved in "MyD88-dependent toll-like receptor signaling pathway". The findings point out that gene-environment interaction plays an important role in the development of rheumatic diseases. Understanding the molecular functions of these genes in biotic induced inflammation and cellular defense responses may shed new light on the pathogenesis of rheumatic diseases.
The purpose of this study is to identify common genes for various rheumatic diseases. Due to the following reasons, this study has some limitations. Firstly, for limited availability of gene expression datasets, this study only focused on four kinds of rheumatic diseases. Datasets to be generated in the future for more kinds of rheumatic diseases may contribute to validating and identifying novel genes with rheumatic-diseases-general effects. Secondly, the gene expression datasets utilized in the present study were generated from four different microarray platforms. Only 2600 genes profiled in all the 6 datasets were analyzed in this study, leaving a majority of the protein-coding genes across the human genome unexplored yet. Thirdly, since the 6 utilized datasets were generated from blood samples, we are unable to identify the specific disease-related functional cells.
To understand which immune cell subsets are contributing to the whole blood expression of the identified genes, we searched the Immunological Genome (ImmGen) [43], a 'road map' of gene-expression in all immunecells. According to the ImmGen, PRF1, TNFSF10 and LY96 show relatively higher expression level in innate lymphocytes than other cell subsets. Besides, TNFSF10 is also highly expressed in stromal cells and gd T cells; LY96 is also highly expressed in stromal cells and macrophages. CX3CR1 is relatively highly expressed in macrophages, monocytes, and ab T cells. TIAL shows a moderately high expression level in stem cells, B cells, and T cells. PRKCH is relatively highly expressed in T cells. Other two genes, TLR5 and TXN, are not in the database. For our identified genes, the above expression patterns across various immune cell subsets provide us interesting clues to better understand and explore their functions in particular immunological cell subsets.
In conclusion, the present study identified eight common genes underlying different types of rheumatic diseases. In-depth functional studies on these common genes may improve our understanding of the pathological processes of these diseases, which could have important implications for the prevention and treatment of rheumatic diseases in general.
Supporting Information S1 Table. The top 100 ranked genes that show a differential expression between rheumatic patients and normal controls. Ranking of genes is based on moderated p-values. Genes appeared in more than 2 types of rheumatic diseases are in red. (XLSX) S2 Table. Genes with adjusted p-values <0.01 in each dataset. P-values are adjusted by the Benjamini-Hochberg method. Genes appeared in more than 2 types of rheumatic diseases are in red. (XLSX)