Increased comparability between RNA-Seq and microarray data by utilization of gene sets

Frans M. van der Kloet; Jeroen Buurmans; Martijs J. Jonker; Age K. Smilde; Johan A. Westerhuis

doi:10.1371/journal.pcbi.1008295

Abstract

The field of transcriptomics uses and measures mRNA as a proxy of gene expression. There are currently two major platforms in use for quantifying mRNA, microarray and RNA-Seq. Many comparative studies have shown that their results are not always consistent. In this study we aim to find a robust method to increase comparability of both platforms enabling data analysis of merged data from both platforms. We transformed high dimensional transcriptomics data from two different platforms into a lower dimensional, and biologically relevant dataset by calculating enrichment scores based on gene set collections for all samples. We compared the similarity between data from both platforms based on the raw data and on the enrichment scores. We show that the performed transformation changes the data in a biologically relevant way and filters out noise, which leads to increased platform concordance. We validate the procedure using predictive models built with microarray based enrichment scores to predict subtypes of breast cancer using enrichment scores based on sequenced data. Although microarray and RNA-Seq expression levels might appear different, transforming them into biologically relevant gene set enrichment scores significantly increases their correlation, which is a step forward in data integration of the two platforms. The gene set collections were shown to contain biologically relevant gene sets. More in-depth investigation on the effect of the composition, size, and number of gene sets that are used for the transformation is suggested for future research.

Author summary

The field of transcriptomics uses and measures mRNA as a proxy of gene expression. There are currently two major platforms in use for quantifying mRNA, microarray and RNA-Seq. Many comparative studies have shown that their results are not always consistent. In this study we aim to find a robust method to increase comparability of both platforms enabling data analysis of merged data from both platforms. We transformed the high dimensional transcriptomics data from the two different platforms into lower dimensional, and biologically relevant gene set scores. These gene sets were defined a priori as specific combinations of genes (e.g. up-regulated in a certain pathway). We observed that although microarray and RNA-Seq expression levels might appear different, using these gene sets to transform the data significantly increases their correlation. This is a step forward in data integration of the two platforms. More in-depth investigation on the effect of the composition, size, and number of gene sets that are used for the transformation is suggested for future research.

Citation: van der Kloet FM, Buurmans J, Jonker MJ, Smilde AK, Westerhuis JA (2020) Increased comparability between RNA-Seq and microarray data by utilization of gene sets. PLoS Comput Biol 16(9): e1008295. https://doi.org/10.1371/journal.pcbi.1008295

Editor: Jason A. Papin, University of Virginia, UNITED STATES

Received: November 1, 2019; Accepted: August 27, 2020; Published: September 30, 2020

Copyright: © 2020 van der Kloet et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The microarray and sequencing data can be obtained from the Broad Institute website (last accessed Oct 1st, 2019) using the following URLs: microarray https://data.broadinstitute.org/ccle_legacy_data/mRNA_expression/CCLE_Expression_Entrez_2012-09-29.gct, sequencing: https://data.broadinstitute.org/ccle/CCLE_RNAseq_081117.reads.gct. The data described in the Zhao paper (doi: 10.1371/journal.pone.0078644) can be downloaded as supplemental data using the following URLs: microarray https://journals.plos.org/plosone/article/file?type=supplementary&id=info:doi/10.1371/journal.pone.0078644.s006, sequencing: https://journals.plos.org/plosone/article/file?type=supplementary&id=info:doi/10.1371/journal.pone.0078644.s009. The BRCA data from the Thompson paper can be downloaded via (DOI: 10.7717/peerj.1621/supp-2).

Funding: FK was financially supported by the Amsterdam Academic Alliance Data Science (https://amsterdamdatascience.nl/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

To determine cellular activity of a culture or tissue, the field of transcriptomics currently has two major platforms at its disposal, namely microarrays and RNA-Seq. As a proxy of gene expression both platforms can be used to quantify the full complement of protein-encoding transcripts, or mRNA, present in a sample. Because mRNA has a high rate of decay, its momentary composition can be considered a snapshot of gene activity. Analysis of the transcriptome of a cell in different conditions or at different time points can therefore give invaluable insights into the differences at a molecular level between healthy and diseased tissues or into the response to external stimuli like drugs or stress.

A microarray consists of probes that are composed of many specific single strand DNA fragments, which makes the probe sensitive to the complementary sequence. The microarray method relies on reverse transcriptases to convert mRNA, isolated from the sample, back into DNA (cDNA) while introducing fluorescently labeled nucleotides [1]. Quantification of the different transcripts is performed by allowing the fluorescently labeled fragment library to hybridize to the probes on the microarray slide. The level of fluorescence is measured per probe and is thereby a relative measure of the number of probe specific transcripts that were present in the sample. Microarrays have been in use as a transcriptomics platform since 1995 while the newer high throughput RNA-Seq method has only been used as such since 2008 [2]. RNA-Seq is a next generation sequencing (NGS) method that, contrary to what its name suggests, sequences DNA. To determine mRNA levels, a cDNA fragment library preparation step is needed, analogous to microarray analysis [3]. The clonal cDNA fragments are individually sequenced to the single stranded cDNA. This fragment is then sequenced by a process called sequencing by synthesis (Illumina). In this process the complementary strand is extended with a single nucleotide per cycle. The type of nucleotide (A, C, T, G) that was incorporated in the strand is determined by a fluorescent label which is then cleaved off; this in turn allows extension by a subsequent nucleotide. This cycle is repeated in a massively parallel fashion. The sequenced fragments are then mapped to a reference genome. If such a reference genome is not available, de novo transcriptome assembly is possible depending on adequate coverage and sequence depth.

RNA-Seq is associated with a higher dynamic range than microarrays [4, 5], making it suitable for detection of low-abundance transcripts. Furthermore, as RNA-Seq is not necessarily reliant on a reference genome, this platform allows for novel transcript detection and transcriptomics analysis for organisms for which no reference genome is available. These advantages have made RNA-Seq gain in popularity, leading to a reduction of the overall cost of the platform, which is expected to fully replace the microarray platform. Although most future gene expression studies will use an RNA-Seq platform, it would be a waste of effort and resources if all the available microarray datasets were ignored from now on. Meta-analysis of gene expression over multiple studies, independent of the type of platform, would increase sample size and power of such analyses. This, however, requires a high comparability between data from the two platforms.

The question of comparability between the microarray platform and the RNA-Seq platform has received attention in previous papers. Fu et al. [6] compared microarray and RNA-Seq platforms and found correlation values of 0.62 up to 0.75 for microarray and RNA-Seq measurements of groups of around 8,000 and 5,000 selected genes. Zhao et al. [5] focused on the differences between the two platforms and showed how background hybridisation and probe saturation in microarrays resulted in limited sensitivity at both low and high expression levels. Meta-analysis ([7, 8]) for combined microarray and RNA-Seq studies therefore seems to be more problematic. Jung et al. used rank products to combine RNA-Seq and microarray data for exploring carcinogenic risk [9]. Training Distribution Matching (TDM) was introduced by Thompson et al. [7], in which one of the data types (RNA-Seq or microarray) is transformed in such a way that it can be used with a model developed on the other data type.

Hänzelmann et al. [8] compare their method to other implementations of enrichment score based methods (PLAGE, ssGSEA and combined z-score), but use either only microarray or only sequencing data to discover subtle pathway activity changes. In this paper we explore single sample GSEA (ssGSEA), which is related to GSEA [10], to make gene expression data from microarray and RNA-Seq more comparable on an individual sample level. We combine single gene expression values for a specific set of genes (defined a priori) into a single enrichment score for that set. In this way multiple gene sets together (called gene set collections) are used to transform the genes into a smaller collection of gene set enrichment scores for every sample. The enrichment score represents the degree to which the genes within each set are expressed, i.e. all over or under expressed. We propose that these enrichment scores can be treated as a new “latent variable” and that the conversion to enrichment scores can be considered as a form of data transformation.

Fig 1 depicts the different types of data blocks we work with. We identify 4 different blocks; A through D. Because we are interested in whether or not we can improve the correspondence between the two platforms we only focus on differences between A and B on the one hand and C and D on the other hand.

Fig 1. The basic setup of datasets that are compared in this study.

https://doi.org/10.1371/journal.pcbi.1008295.g001

We implemented the enrichment score according to [11] and applied the transformation to three cases of publicly available microarray and RNA-Seq datasets of the same set of samples. The first set of samples was obtained from the Cancer Cell Line Encyclopedia (CCLE) [12], which has a large overlap in samples measured on both microarray and RNA-Seq. The second dataset resulted from a study on activated T-cells in humans [5] and was much smaller. We will show that the enrichment scores increase the similarities between the two platforms compared to the original gene (raw) expression values and characterize the role of the gene sets. To demonstrate that biological information is retained by using this gene set transformation we used a third dataset from Thompson et al. [7]. We developed a logistic regression model using the enrichment scores based on the microarray data (TCGA) to predict different subtypes of breast cancer. The sequence dataset contained overlapping samples but also other (breast cancer) samples. Using the model based on microarray data we show that meta-analysis across platforms becomes possible.

Materials and methods

In this paper we use 3 different datasets containing data from both the microarray and sequencing platforms. Below we describe the datasets. A schematic overview of the overlapping samples between the two platforms is included for each dataset in the supplemental S1, S2 and S3 Figs.

Dataset 1, cancer cell line data (CCLE)

Two publicly available datasets of expression data from cancer cell lines covering 22 different histologies have been obtained from the Cancer Cell Line Encyclopedia (CCLE) consortium [12]. The expression values were determined by Affymetrix U133+2 microarrays and by RNA-Seq. We used the Robust Multi-array Average (RMA) normalized microarray data and raw read count (transcriptomic) values from RNA-Seq. Both were downloaded from the Broad Institute website [13]. To evaluate if inter-species comparison is feasible based on the enrichment score we also downloaded Transcripts Per Million (TPM) normalized sequence data for the same cancer cell lines from this website.

The obtained microarray dataset from CCLE was annotated with HUGO gene symbols and Entrez IDs while the RNA-Seq dataset was annotated with HUGO gene symbols and Ensembl IDs. Although both datasets shared the HUGO gene symbols these were not unique in the datasets and were therefore discarded as the primary means of matching genes between datasets. We used the biomaRt R package [14, 15] to create an annotation table that maps the Ensembl IDs of the RNA-Seq dataset to the Entrez format. We first removed all genes from the datasets for which no mapping between formats could be obtained. Due to differences in gene definition between the Entrez and Ensembl format multiple genes from one dataset could in some cases be mapped to one gene of the other dataset. In such cases we used the HUGO gene symbol annotation column available in both datasets to determine which of the duplicate results should be matched. If gene mapping remained inconclusive after these steps, the genes in question were removed from both datasets. After removing samples that were in only one of the datasets we arrived at two datasets of 970 samples for which the expression values of 17,415 genes were determined.
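The matching steps described above can be illustrated with a short Python sketch. It is not the biomaRt-based R workflow used in the study: the function name `match_genes`, the mapping table and all identifiers in the usage lines are hypothetical, and the tie-breaking logic is only one plausible reading of the procedure (keep a pair when the HUGO symbols agree, drop it when the match stays inconclusive).

```python
from collections import defaultdict

def match_genes(mapping, array_symbols, seq_symbols):
    """Return a one-to-one {ensembl_id: entrez_id} dictionary.

    mapping       : iterable of (ensembl_id, entrez_id) pairs (hypothetical table)
    array_symbols : {entrez_id: gene symbol} from the microarray annotation
    seq_symbols   : {ensembl_id: gene symbol} from the RNA-Seq annotation
    """
    candidates_by_ensembl = defaultdict(set)
    for ens, ent in mapping:
        # Genes without a counterpart in the other dataset are removed first.
        if ens in seq_symbols and ent in array_symbols:
            candidates_by_ensembl[ens].add(ent)

    matched = {}
    for ens, candidates in candidates_by_ensembl.items():
        if len(candidates) > 1:
            # Ambiguous mapping: keep candidates whose symbol agrees.
            candidates = {c for c in candidates
                          if array_symbols[c] == seq_symbols[ens]}
        if len(candidates) == 1:
            matched[ens] = next(iter(candidates))

    # Reverse ambiguity: several Ensembl IDs pointing at one Entrez ID.
    claims = defaultdict(list)
    for ens, ent in matched.items():
        claims[ent].append(ens)

    result = {}
    for ent, ens_ids in claims.items():
        if len(ens_ids) > 1:
            ens_ids = [e for e in ens_ids if seq_symbols[e] == array_symbols[ent]]
        if len(ens_ids) == 1:  # still inconclusive otherwise: drop the gene
            result[ens_ids[0]] = ent
    return result

# Toy, made-up identifiers for illustration only.
toy_mapping = [("ENSG1", "10"), ("ENSG2", "20"), ("ENSG2", "21"),
               ("ENSG3", "30"), ("ENSG4", "30"), ("ENSG5", "50"),
               ("ENSG5", "51"), ("ENSG6", "99")]
toy_array = {"10": "A", "20": "B", "21": "X", "30": "C", "50": "Q", "51": "Q2"}
toy_seq = {"ENSG1": "A", "ENSG2": "B", "ENSG3": "C",
           "ENSG4": "D", "ENSG5": "E", "ENSG6": "F"}
toy_result = match_genes(toy_mapping, toy_array, toy_seq)
```

In the toy data, `ENSG2` and `ENSG3` are resolved through their symbols, while `ENSG4`, `ENSG5` and `ENSG6` are dropped because their mapping is missing or stays ambiguous.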

Dataset 2, in vivo data

We obtained an in vivo dataset from a comparative transcriptomics study on activated CD4⁺ human T-cells [5] to test the transformation in a second example, biologically very different from the CCLE dataset. The data from this study encompass expression values for samples obtained at six different time points from one individual. Each sample had a replicate, resulting in a total of 12 samples for which both Affymetrix GeneChip HT HG-U133+PM microarray and RNA-Seq data are available. We used the RMA normalized microarray dataset and the raw RNA-Seq counts data. Only those genes for which values were established on both platforms were retained. This resulted in two comparable datasets of 18,304 gene expression values measured for all 12 samples. The correlation between the replicates is very high (0.996 or higher) for each platform.

Dataset 3, breast cancer data (TCGA)

The breast cancer data is well described in the paper by Thompson et al. [7] and consisted of samples of different types of breast cancer from The Cancer Genome Atlas (TCGA). Both microarray and RNA-seq were used to measure gene expression. The microarray data consisted of 577 samples with 4 different subtypes of breast cancer and non-carcinoma samples. The sequence data contained the same 577 samples and contained an additional 379 samples.

Gene set collections

The Molecular Signatures Database v6.0 (MSigDB) [5] from Broad Institute provides 8 publicly available gene set collections that cover a range of biological functions. The number of genes and the number of sets in the different gene set collections vary widely: the collections cover from 4,386 to 30,012 unique coding and non-coding gene definitions and contain from 50 to 4,872 sets. Individual gene sets can vary from 5 to 2,940 genes (see Table 1). We calculated the enrichment scores for both the microarray and the RNA-Seq data for all the different datasets with gene set collections H, C6 and C7 of MSigDB v6.2. These gene set collections are expected to be in line with the biological origin of our datasets. We used collection C6, which represents ‘biological processes that are commonly dysregulated in cancer’, for the cancer cell line dataset (dataset 1). This collection consists of 189 individual gene sets and covers a total of 11,250 unique genes. As a result, the data is transformed/compressed from 17,415 genes to 189 enrichment score variables.

Table 1. The composition of the eight gene set collections of the MSigDB v6.0.

Shown are the number of sets, the smallest, largest and average set size by number of genes, the total number of genes and the number of unique genes. The last column represents the highest number of sets a single gene was encountered in.

https://doi.org/10.1371/journal.pcbi.1008295.t001

For the 12 samples in the in vivo dataset (dataset 2), in which the immune system is perturbed, we used the C7 gene set collection for the transformation. The C7 collection encompasses manually curated cell states and perturbations from immunology studies. The C7 collection is composed of 4,872 gene sets and will thus reduce the number of variables of this dataset (2) from 18,304 genes to 4,872 enrichment score variables.

As a control we use the hallmark (H) collection which is, with only 50 gene sets, the smallest of the available collections (see Table 1). This collection is constructed from different gene sets from the other collections ‘to represent well defined biological states and processes’. From Table 1 it is clear that the different collections use the genes in different configurations, but also that every collection has gene sets that contain genes that are used in multiple gene sets (e.g. in C5, a single gene is used in up to 659 gene sets). Conversely, collection C1 has a maximum occurrence of a gene in only 3 different gene sets and therefore has the most ‘distinctive’ gene sets.
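The per-collection statistics of the kind summarized in Table 1 can be derived directly from a gene set file. The sketch below assumes the tab-separated GMT layout in which MSigDB distributes its collections (set name, description, then the member genes); the function names and the three toy gene sets are hypothetical and are not taken from the study.

```python
from collections import Counter

def parse_gmt(lines):
    """Parse GMT lines: set name, description, then genes (tab-separated)."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip empty or malformed lines
        gene_sets[fields[0]] = set(fields[2:])
    return gene_sets

def collection_summary(gene_sets):
    """Summary statistics for one gene set collection."""
    sizes = [len(genes) for genes in gene_sets.values()]
    # Count in how many sets each gene occurs.
    occurrences = Counter(g for genes in gene_sets.values() for g in genes)
    return {
        "n_sets": len(gene_sets),
        "min_size": min(sizes),
        "max_size": max(sizes),
        "mean_size": sum(sizes) / len(sizes),
        "total_genes": sum(sizes),
        "unique_genes": len(occurrences),
        "max_sets_per_gene": max(occurrences.values()),
    }

# Toy collection for illustration only.
toy_gmt = [
    "SET_A\tna\tTP53\tMYC\tEGFR",
    "SET_B\tna\tTP53\tKRAS",
    "SET_C\tna\tTP53\tMYC\tBRCA1\tPTEN",
]
toy_summary = collection_summary(parse_gmt(toy_gmt))
```

Here `max_sets_per_gene` corresponds to the last column of Table 1: a low value indicates that the gene sets of a collection overlap little.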

Gene set transformation

Gene set enrichment scores were determined as proposed by [11]. The enrichment score ES(H) for gene set H is calculated for each sample. First, the gene expression values G of all R genes in that sample are sorted from high to low to determine the rank r (= 1…R) for each gene. Thus, r = 1 for the gene with the largest expression value, while r = R for the gene with the lowest expression value. ES(H) is calculated as the sum over all ranks of the difference between the cumulative probability sum (over all genes in the sample) that the genes belong to gene set H (P_H) and the cumulative probability sum that genes do not belong to gene set H (P_NH) (Eqs 1–3).

In these equations the counting variable r, representing the rank, goes from 1 to the total number of genes R. In Eq 2, the sum is only taken over the genes up to rank r that belong to gene set H, where I is an indicator variable, meaning that I(G_i∈H) equals 1 when gene G_i is a member of gene set H, while it is 0 otherwise. In Eq 3, the sum is taken over the first r genes that do not belong to gene set H, I(G_i∉H). R_H represents the number of genes that are in gene set H. The stabilizing power α was left at a value of 0.25 as suggested in [11].

Both Eqs 2 and 3 go from 0 to 1, in R_H and (R-R_H) steps respectively. If the average rank of genes belonging to gene set H is high (or low), then there is an early (or late) fast increase in P_H. If the genes in gene set H have no preference and are randomly spread over the ranked genes, P_H closely follows P_NH.

(1)

(2)

(3)
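For reference, Eqs 1–3 can be written out from the verbal description in this section. The form below is a reconstruction and may differ from the published typesetting. In particular, the weight w_i that is raised to the stabilizing power α in Eq 2 is an assumption: it stands for the rank-based statistic of gene G_i used in [11], which the text does not spell out.

```latex
\begin{align}
ES(H) &= \sum_{r=1}^{R} \left[ P_H(r) - P_{NH}(r) \right] \tag{1} \\
P_H(r) &= \frac{\sum_{i=1}^{r} |w_i|^{\alpha} \, I(G_i \in H)}
               {\sum_{i=1}^{R} |w_i|^{\alpha} \, I(G_i \in H)} \tag{2} \\
P_{NH}(r) &= \frac{\sum_{i=1}^{r} I(G_i \notin H)}{R - R_H} \tag{3}
\end{align}
```

With this form, P_H increases only at ranks occupied by one of the R_H members of H and P_NH only at the remaining R − R_H ranks, and both reach 1 at r = R.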

Fig 2 shows the application of the transformation on artificial data in which 250 genes and 3 gene sets (one with high ranks, one with low ranks and one with no preferred (rnd) ranks) were simulated. The top figure represents the ranked expression scores of the sample, in which the coloring of the bars denotes their membership of any of the three artificial gene sets (green, blue, red). The grey colored bars represent genes that are not present in any of the three gene sets. The second plot shows the calculated values for P_H (solid) and P_NH, colored in correspondence to the different gene sets. The enrichment score for each gene set is the area described by the summed difference between the two lines. For gene sets with members randomly dispersed throughout the ranked expression landscape of the sample (blue), P_H and P_NH will closely track each other, leading to an enrichment score close to zero. In contrast, gene sets of which the members show a clear tendency to aggregate at high (green) or low (red) expression values result in high and low enrichment scores respectively. With this approach genes in a single sample can be ‘scored’ on many different gene sets and can be tested for differential analysis later, whereas in the ‘normal’ (enrichment analysis) approach the gene set scores (pathway enrichment scores) are determined based on group averages.
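The calculation can be mimicked on artificial data with a few lines of Python. This is a minimal sketch and not the implementation used in the study: the function name `enrichment_score` and the toy data are hypothetical, and the weight given to each member gene (its rank counted from the bottom of the list, raised to the power α) is an assumption about how the weighting of [11] is applied.

```python
def enrichment_score(expression, gene_set, alpha=0.25):
    """Single-sample enrichment score of one gene set.

    expression : {gene: expression value} for one sample
    gene_set   : set of gene names
    """
    # Sort from high to low expression: list position 0 is rank r = 1.
    ordered = sorted(expression, key=expression.get, reverse=True)
    n_genes = len(ordered)
    in_set = [g in gene_set for g in ordered]
    n_hits = sum(in_set)
    if n_hits == 0 or n_hits == n_genes:
        raise ValueError("gene set must contain some, but not all, genes")

    # ASSUMPTION: rank-based weight, largest for the highest expressed gene.
    weights = [(n_genes - pos) ** alpha if hit else 0.0
               for pos, hit in enumerate(in_set)]
    total_weight = sum(weights)

    p_h = p_nh = score = 0.0
    for weight, hit in zip(weights, in_set):
        p_h += weight / total_weight          # cumulative sum for members of H
        if not hit:
            p_nh += 1.0 / (n_genes - n_hits)  # cumulative sum for non-members
        score += p_h - p_nh                   # summed difference of the two curves
    return score

# Artificial sample: 250 genes with strictly decreasing expression.
toy_expression = {f"g{i}": float(250 - i) for i in range(250)}
high_set = {f"g{i}" for i in range(25)}           # members at the top
low_set = {f"g{i}" for i in range(225, 250)}      # members at the bottom
spread_set = {f"g{i}" for i in range(5, 250, 10)} # members evenly spread

es_high = enrichment_score(toy_expression, high_set)
es_low = enrichment_score(toy_expression, low_set)
es_spread = enrichment_score(toy_expression, spread_set)
```

The set concentrated at the top of the ranking yields a positive score, the set at the bottom a negative score, and the evenly spread set a score that is much closer to zero. Because of the assumed weighting, the spread set does not give exactly zero.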

Fig 2. Demonstration of the enrichment score calculation with a simulated dataset.

Top: sorted expression levels of 250 genes in which coloring represents gene set membership. Middle: Left, the calculated P_H values for each gene set. Right, the P_NH values for each gene set. Bottom: The resulting enrichment scores for each of the three gene sets.

https://doi.org/10.1371/journal.pcbi.1008295.g002

Congruence between the two platforms

Microarray results are influenced by the hybridization efficiency of the probes and are not necessarily proportional to the molar concentrations of mRNA in the samples, while the RNA-seq read counts are influenced, for example, by the sequencing depth of the sample. Both platforms have their own biases. Discrepancies between the two transcriptomics platforms can clearly be visualized in a scatterplot in which the two axes represent the microarray values and the RNA-Seq values of the same sample. Ideally, all points in such a plot would lie on a line with intercept 0 and slope 1, which is the case when equal gene expression levels are found by both platforms. Because the platforms generate different gene expression levels, possibly of different magnitudes, the intercept and slope are expected to be different from 0 and 1 respectively. To quantify the difference between the two platforms we use the Spearman (rank) correlation and the residual variance (var(E)) value. This residual variance is calculated by performing a Principal Component Analysis (PCA) on the RNA-seq and microarray values of the same sample, combined into a single two-column data matrix. At most 2 principal components can be obtained, of which the first (PC1) explains the most variance and, if all points lie on a straight line, captures 100% of all variation. The remainder (PC2) therefore explains 100% minus the variance explained by PC1 and represents the deviation from the ideal situation. The larger var(E), the more points deviate from the straight line and the higher the dispersion and/or nonlinearity between the two platforms. Additionally, we determine the similarity between the original platform datasets and the gene set compressed datasets as a whole by means of the modified RV coefficient [16]. The modified RV coefficient is a matrix correlation measure based on the configuration of the samples, e.g. if the grouping of samples is similar in two matrices, then their modified RV coefficient is high (close to 1). The modified RV coefficient was calculated after mean centering and scaling to unit variance of each column in both datasets.
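The two per-sample measures can be sketched in plain Python. The function names are hypothetical. The residual variance is computed from the two eigenvalues of the 2×2 covariance matrix, which is equivalent to a two-component PCA. The text does not state how the two columns are preprocessed before the PCA, so mean centering without scaling is an assumption of this sketch.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def covariance(x, y):
    """Sample covariance of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    return covariance(rx, ry) / math.sqrt(covariance(rx, rx) * covariance(ry, ry))

def residual_variance(x, y):
    """Percentage of variance not captured by PC1 of the matrix [x, y].

    ASSUMPTION: columns are mean centered but not scaled.
    """
    a, b, c = covariance(x, x), covariance(x, y), covariance(y, y)
    half_gap = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    smallest_eigenvalue = (a + c) / 2 - half_gap  # variance along PC2
    return 100 * smallest_eigenvalue / (a + c)

# Toy values standing in for one sample measured on both platforms.
array_values = [1.0, 2.0, 3.0, 4.0, 5.0]
linear_seq = [2 * v + 3 for v in array_values]   # perfectly linear relation
cubic_seq = [v ** 3 for v in array_values]       # monotone but nonlinear
```

For `linear_seq` the residual variance is zero even though intercept and slope differ from 0 and 1. For `cubic_seq` the Spearman correlation is still 1 while the residual variance is positive, which shows that the two measures capture different aspects of the discrepancy.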

Gene set permutations

To assess whether the gene set collections capture real biological information by an appropriate selection of genes into the gene sets, we performed a permutation test. The permutation test should answer whether the defined gene sets are a better collection of genes to perform a specific task than any random selection of genes. We permute the original gene set collection C6 (oncogenic gene sets) by randomly selecting, for every gene set, genes from the whole pool of genes in the collection. As in the original gene sets, a permuted gene set can contain a gene only once. This effectively eliminates the biological relevance of the gene sets while leaving the specific structure, i.e. the number of genes per set and the total number of genes in the collection, intact. By removing the relationship between the selected genes in a gene set, we obtain enrichment score values unrelated to biological events. We will test whether the originally obtained enrichment score values are different from the ones obtained after permutation (H₀). After every permutation we recalculate the resulting gene set enrichment scores on a sample for both platforms and determine the Spearman correlations between the resulting vectors. This is done for all 970 samples of the CCLE datasets with 100 permutations. We thereby obtain an H₀ distribution of 97,000 values against which we compare the results that were found using the original gene set collection.
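The permutation step itself can be sketched as follows. The function name `permute_collection` and the toy collection are hypothetical. The sketch preserves the number of genes per set and draws only from the pool of genes in the collection, but it does not guarantee that every gene of the pool is used at least once, so it may differ in that detail from the procedure used in the study.

```python
import random

def permute_collection(gene_sets, rng):
    """Redraw every gene set at random from the pool of genes in the collection.

    gene_sets : {set name: set of genes}
    rng       : random.Random instance (seeded for reproducibility)
    Set sizes are preserved and a gene occurs at most once per set.
    """
    pool = sorted({g for genes in gene_sets.values() for g in genes})
    return {name: set(rng.sample(pool, len(genes)))
            for name, genes in gene_sets.items()}

# Toy collection for illustration only.
toy_collection = {
    "SET_A": {"TP53", "MYC", "EGFR"},
    "SET_B": {"TP53", "KRAS"},
    "SET_C": {"BRCA1", "PTEN", "MYC", "AKT1"},
}
permuted = permute_collection(toy_collection, random.Random(0))
```

Repeating this draw 100 times and recomputing, for each of the 970 samples, the correlation between the microarray and RNA-Seq enrichment scores gives the 97,000 values of the H₀ distribution described above.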

Results and discussion

Using the three datasets we will demonstrate the increase in similarity when gene set transformation is applied. Next we attempt to investigate why this approach works using permutations and a simpler rank score.