Evaluation of the Psoriasis Transcriptome across Different Studies by Gene Set Enrichment Analysis (GSEA)

Background Our objective was to develop a consistent molecular definition of psoriasis. There have been several published microarray studies of psoriasis, and we compared disease-related genes identified across these different studies of psoriasis with our own in order to establish a consensus. Methodology/Principal Findings We present a psoriasis transcriptome from a group of 15 patients enrolled in a clinical study, and assessed its biological validity using a set of important pathways known to be involved in psoriasis. We also identified a key set of cytokines that are now strongly implicated in driving disease-related pathology, but which are not detected well on gene array platforms and require more sensitive methods to measure mRNA levels in skin tissues. Comparison of our transcriptome with three other published lists of psoriasis genes showed apparent inconsistencies based on the number of overlapping genes. We extended the well-established approach of Gene Set Enrichment Analysis (GSEA) to compare a new study with these other published list of differentially expressed genes (DEG) in a more comprehensive manner. We applied our method to these three published psoriasis transcriptomes and found them to be in good agreement with our study. Conclusions/Significance Due to wide variability in clinical protocols, platform and sample handling, and subtle disease-related signals, intersection of published DEG lists was unable to establish consensus between studies. In order to leverage the power of multiple transcriptomes reported by several laboratories using different patients and protocols, more sophisticated methods like the extension of GSEA presented here, should be used in order to overcome the shortcomings of overlapping individual DEG approach.


Introduction
The study of human diseases such as psoriasis has benefited significantly from analysis of the transcriptome, the global gene expression of a diseased tissue compared to its healthy counterpart. However, as more studies are carried out independently in multiple laboratories, effective methodology to leverage these multiple studies becomes necessary. Such methodologies have significant hurdles to overcome: first, multiple studies are likely to use different platforms, different sample dissection, handling and preparation, and, especially, different definition of the nondiseased counterpart, resulting in different physical samples being hybridized against different platforms [1,2,3]. Second, computational analysis and statistical treatment required to assess the transcriptome are just as likely to be considerably different.
In many instances, all that is available from published studies are lists of differentially expressed genes (DEG). It is tempting to evaluate the agreement between studies simply by evaluating the intersection between the published lists, the ''Venn diagram approach''. However, such an approach suffers serious methodological shortcomings [4,5,6]. Use of the original raw data of the studies has shown that studies which are apparently discordant in terms of their overlapping individual DEG lists are, in fact, both concordant and predictive [4,5]. However, most of the time the original raw data is unavailable, and furthermore a complete reanalysis of all data is needlessly laborious. In such cases use of the published lists of DEG is a necessity. Here we present an extension to the widely used Gene Set Enrichment Analysis (GSEA) method, where it suffices to have full access to the complete list of gene expression values for a single study, while the remaining studies only require the DEG list.
In the last few years, the use of Gene-Sets approach had emerged as a powerful tool to identify sets of functionally related genes or pathways that are associated with a disease phenotype [7,8]. Gene-Sets based methods were designed to address limitations of conventional single gene methods [6] by evaluating differential expression patterns of gene groups instead of individual genes. GSEA, introduced by Mootha et al [9] and further developed by Subramanian et al [10], was one of the first method using the Gene-Sets approach, and is arguably the most widely used of such methods. Here we use GSEA as a conventional approach to identify pathways related to the psoriatic phenotype. Furthermore, we propose to extend the use of GSEA as a tool to easily cross-compare prior lists of DEG genes.
We developed this method specifically to compare several highquality studies that defined the psoriasis transcriptome by identifying DEG between psoriatic lesions and non-lesional tissue from the same patients [11,12,13,14,15]. Those studies had identified key genes involved in psoriasis pathogenesis, using a non-biased approach. Because the genomic data for more recent studies is more comprehensive than in the earlier studies due to the larger number of genes represented in the latest Affymetrix chips, we chose to compare the transcriptomes for studies published since 2003 [12,14,15].
We recently conducted a clinical trial of 15 psoriasis patients with the TNF inhibitor etanercept [16], and performed a timecourse experiment using HGU 133 2.0 microarray chips [17]. By analyzing the baseline data from this experiment, we generated our psoriasis transcriptome comparing baseline-paired values of lesional versus non-lesional skin. We have compared our transcriptome with the three others described above, and introduce the concept of using GSEA as a more robust way of comparing genomic data.

Disease-modulated genes
The analysis of our data identified a psoriasis transcriptome composed of 732 up-regulated probesets (representing 579 genes with unique ENTREZ identifier) and 890 down-regulated probesets (703 genes) with fold change (FCH) greater than 2 and false discovery rate (FDR) less than 0.05 (Table 1, and Table S1). Certain genes with low expression on the Affymetrix chip were confirmed by RT-PCR, and will be discussed in the next section.
To further consider the biological significance of our data, we used GSEA in the classical manner, to identify pathways that correlate with the psoriatic phenotype [10,18]. GSEA evaluates how genes in queried pathways are distributed in the fold change (lesional versus non-lesional) ordered list generated by our data (all probesets included). This is quantified by using the Enrichment Score (ES), a weighted Kolmogorov-Smirnov-like statistic that evaluates if the members of the pathway are randomly distributed or found at the extremes (top or bottom) of the list. If genes in a pathway rank at the top of the new fold change list, ie. they are overrepresented at the top, then the enrichment score (ES) will be close to 1. Conversely if the ES = 21, then genes are overrepresented at the bottom of our fold change data. A perfect agreement is reached if ES = 1 for the up-regulated genes and ES = 21 for the down-regulated genes. A normalized enrichment score (NES) takes into account the number of genes in the pathway. A positive NES indicates that the list of genes is enriched at the ''top'' of the ordered fold change list, and a negative NES indicates that the list in question is enriched at the ''bottom'' of the list.
GSEA may be used with well known ''canonical'' pathways and Gene ontology categories, but also with sets that contain genes sharing the same transcription factor binding site, the same microRNA binding motif or the same cis-regulatory motif. It can also be used with curated collections such as GeneSigDb (http:// compbio.dfci.harvard.edu/genesigdb), which contain gene signatures of cancer, viral and stem cell biology, and the CGP collection of the MSigDb (http://www.broadinstitute.org/gsea/msigdb/) that contains gene expression signatures of genetic and chemical perturbations, or computational derived sets such as cancer modules (presented in [19]).
We queried our psoriasis transcriptome with a set of cytokinetreated keratinocyte pathways for IL-17, TNF, IL-17+TNF, IFNc, and IL-22, reported in [17,20] and IL-1a [21]. GSEA showed that those pathways were enriched in lesional skin of psoriasis patients ( Table 2). We used a list (''IL-17 Gaffen'') considered to be IL-17gene targets defined by Shen et al [22]. This was also significantly enriched in our psoriasis transcriptome. An IFNa-keratinocyte pathway [14] was also significantly enriched in lesional skin, supporting the potential role for IFNa discussed by Yao et al [14]. A list of genes representing the transcriptome of inflammatory myeloid DCs [23] was also significantly enriched in lesional skin. In addition, the cell cycle and TLR signaling from the collection of canonical pathways (C2 CP) available at the Molecular Signature Database (MSigDb) were enriched in psoriasis lesional skin. Gudjonsson et al reported lack of evidence of enrichment of the Hedghog Signalling Pathways in psoriasis [24]. GSEA analysis did not detect any significant enrichment of this pathway with the psoriasis phenotype. (ES = 0.37, NES = 1.05 and p = 0.39).

Comparison with other published studies
We next performed a comparative analysis of the DEG lists of these three studies [12,14,15] with our transcriptome. Table 1 summarizes the main characteristics of the four studies. Zhou's study which uses early hgu95 (a,b,c,d,e) chips, reported 397 upregulated and 613 down-regulated probesets representing 270 and 397 unique genes respectively [15]. Yao et al conducted a study on hgu133plus2 chips and reported 1408 up-regulated and 1465 down-regulated probesets (974 and 853 genes respectively) [14]. More recently, Gudjonsson et al also using hgu133plus2 chips with a large sample size [12] reported a set of 721 up-regulated and 364 down-regulated probesets (508 genes and 248 genes). Figure 1 shows a Venn diagram illustrating the intersection between the four studies. There were approximately 11,000 ENTREZ identifiers common to the 4 chip series, but only 126 genes were up-regulated ( Figure 1A) and 38 down-regulated ( Figure 1B) in all four studies. The numbers of upregulated genes in our transcriptome were similar to the Gudhjonsson group, but less than Yao's. The number of down-regulated genes were greater than the Gudhjonsson group, and again, less than Yao's.
The list of the DEG in common in the 4 studies (Table S2) contains many genes known to be upregulated in psoriasis, such as IFNc-regulated genes STAT-1, STAT-3 and MX1; antimicrobial peptides (beta defensin 4, lipocalin 2, S100A7); and proinflammatory proteins such as IL-8, CXCL1, IL-1F9, TNFSF10/TRAIL. However, some genes known to be upregulated in psoriasis were only identified in one study. For example STAT2 was only identified in our study, JAK3 only in Yao's, and IL-12RB1 only in Zhou's study. The set of genes that were downregulated in all 4 studies include CCL27, also called Cutaneous T cell -attracting chemokine (CTACK), which has a role in memory T cell homing to the skin [25], and Aquaporin 9, a member of a family of proteins that form water channels across membranes [26]. Some well-recognized inflammatory genes involved in psoriasis were not detected by most of the four studies, for example IFNc, IL-17, iNOS. This is due to the fact that the expression of these genes is usually low on the Affymetrix gene array platform (0-4 range of expression in log 2 -scale) and hence fold change is not accurately measured. Most analysis pipelines filter out low abundance genes so they may be excluded from the statistical analysis, or the resultant fold change is very low, albeit significant. This is a major limitation of the use of these arrays for the study of these genes.
To confirm the role of these genes in psoriasis, we analyzed RT-PCR data of many of these inflammatory genes. We used data from the same clinical trial of etanercept treatment of psoriasis [16], however, we compared only baseline lesional skin and non-lesional skin. The fold change and p-value for each gene by RT-PCR is presented in Table 3, and all of these genes except LTA and p35 were significantly different in lesional skin (p,0.05). We also present the fold change of any of these genes that were detectable in any of the microarray studies. It can be seen that IL-23p19, IL-12/23p40, IL-22, IFNc, IL-6 were not found to be differentially expressed by any of the four studies, but the fold change by RT-PCR was greater than 4. Furthermore, IL-17, IL-20, CCL4, iNOS and CCL3 were detected in only one study, and with a lower fold change than detected by RT-PCR (which was greater than 5.5). We also included the percentage of samples in our study with low intensity, as defined as expression values less than 4. For example, p19 has expression of less than 4. It can be seen that the genes that were not detected by any or only one microarray study had low abundance in more than 87% of the samples.
Correlation of other published psoriasis transcriptomes compared to our transcriptome using Gene Set Enrichment Analysis (GSEA) A more robust approach beyond a Venn diagram was required to overcome these limitations. We propose to use a Gene-Sets approach to analyze how published DEG rank in our fold change data. This reduces the bias due to preprocessing steps, statistical protocols and stringency of cut-offs. This approach is successfully used in the Connectivity Map [27], a pattern recognition instance that correlates disease signatures (based on gene expression of any platform) with drugs. The idea is to use the GSEA framework [10] by considering the published DEG as a pathway or gene set, and quantify how well the up (and down) regulated genes rank in the ordered fold change for all genes in our data. This will generate an ES for up-regulated genes and one for the down-regulated genes. The connectivity score (CS) can be used to give a measure of agreement between studies, by combining the two ES into one final value, as used in the connectivity map to rank drugs that better correlate with disease. A value of CS near 1 would indicate perfect agreement between a study list and our analysis, whereas 0 would indicate no agreement, and 21 a negative correlation.
GSEA showed that there was highly significant enrichment of psoriasis DEGs, both up-and down-regulated genes, from the three studies compared to our data ( Table 4). The GSEA plot for  In general, a better agreement was observed among up-regulated genes for all studies. In addition, Yao's transcriptome correlated better with our study than Zhou's, which is not surprising since the array series and the statistical protocols used in Yao's and ours were more similar than those in Zhou's. We used the same approach to compare two published DEGs of other skin diseases produced by own group: squamous cell carcinoma (SCC) [28] and basal cell carcinoma (BCC) [29]. The CS for SCC was 0.69 and the CS of BCC was 0.42, considerably lower than in psoriasis (Table 4). This degree of enrichment is most likely reflects the origin of this data from our own lab, and the cutaneous nature of the specimens, as well as epidermal hyperproliferation and inflammation in all these three diseases.

Discussion
Investigators might be surprised at the lack of overlap between DEG lists, as shown in the Venn diagram ( Figure 1). However, if one considers all the variables involved in the four studies, summarized in Table 1, it is not that surprising. Although all the studies were conducted using Affymetrix platform, they used different array series, which may contribute to variability in results. Besides the obvious laboratory effect due to sample preparation, technician experience, equipment calibration, and the use of different preprocessing algorithms [30], alternative statistical tests and stringent cut-offs also contribute to different results [3,31]. Measuring agreement of microarray studies by overlap of DEG lists generated by individual studies has been largely criticized [5] because it is highly inconsistent, even in the presence of small variation in the data as in the case of technical replicates [32,33]. A low overlap between DEG does not directly indicate low agreement between studies [3,31,33].
Here we present our new data on the psoriasis transcriptome from our patients, as well as a comparison of our data with three published DEG lists for psoriasis. We find only 164 genes in common for the four lists. However by changing the focus of the single-gene approach behind the intersection of DEG involved in psoriasis and using a gene set approach, a closer biological similarity between the studies is revealed. The extension of GSEA presented here enables us to see that cellular processes and molecular signature involved in psoriasis is very robust across the studies. In this paper, we extended the use of GSEA to compare new expression data with previously published DEG lists in order to validate psoriasis disease-related gene profiles. It is worth noting that this approach is applicable to expression data obtained through deep sequencing (potentially improving sensitivity for low abundance genes and cross-hybridization problems of current microarray technology). Moreover, this approach is easily extendable to other omics applications and more complex phenotypes. Since the method is based on ranking a list according to a phenotype, the ranked list can be derived from other measures besides gene expression fold changes from microarray chips or deep sequencing; such phenotype measures may include odd ratios of single nucleotide polymorphisms (SNPs), or a microRNA profile assay derived from microarray technologies or deep sequencing.
The use of GSEA as a gene set approach is not unique: extensions to GSEA [34] and other Gene Set methods and statistics have also been proposed, and could also be used to compare transcriptomes. Efron and Tibshirani [35] proposed the MaxMean statistic instead of the weighted Kolmogorov Smirnov statistics used in the classical GSEA. Dinu et al [36] extended the single gene analysis SAM and proposed SAM-GS. See [7,37] for a comparative study of different gene set enrichment methods.
Extensions other than the classical difference between two phenotypes have also been reported. For example, we used the time slope of a mixed effect model as a phenotype to evaluate the time-response of cytokines pathways to psoriasis treatment with TNF inhibitor [17].
In this report, we show an excellent and simple method for researchers seeking validation of their own expression data with published lists from different studies.

Patients
Twenty adult patients with moderate to severe psoriasis were treated with etanercept 50 mg subcutaneously twice weekly for 12 weeks (clinical trial no. NCT00116181). The clinical and histological response of patients in this trial was previously published [16]. The gene array was performed on samples from 15 sequential patients [17]. RT-PCR was performed on samples from all 20 patients.

Ethics Statement
The clinical trial (no. NCT00116181) was conducted according to the principles expressed in the Declaration of Helsinki and informed consent for their information to be stored in the hospital database and used for research was obtained from all patients in written form. This research was conducted under protocol JKR-0542 approved by the Rockefeller University Institutional Review Board.

Sample Preparation and Hybridization
RNA was extracted from skin biopsies and hybridized to Affymetrix hgu133a2 chips as described in [17].

Statistical Analysis
Expression values were obtained using gcrma algorithm. Samples were filtered for unreliable low expression value and low variability. To compare lesional with non-lesional values, the moderated t-test available at limma package was used. P-values were adjusted for multiple hypothesis correction using the Benjamini-Hochberg approach, which controls the false discovery rate (FDR). Probesets with FDR,0.05 and more than 2 fold change (FCH) were considered differentially expressed. All analysis was carried out using R programming language (www.R-project.org) and Bioconductor packages (www.bioconductor.org).

Data Repository
This data has been deposited at the public repository Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/geo/) with accession number GSE11903.  Figure 2. Rank of the published DEG lists in our fold change data. A. Rank of Zhou's genes in our data set ordered by fold change expression, with Enrichment plots for the up-regulated and downregulated genes data below. B. As A but for Yao's study. C. As A but for Gudjonsson's study. A detailed explanation of Enrichment plots can be found in Figure 1 of original Subramanian et al [10]. doi:10.1371/journal.pone.0010247.g002