Global Meta-Analysis of Transcriptomics Studies

Transcriptomics meta-analysis aims at re-using existing data to derive novel biological hypotheses, and is motivated by the public availability of a large number of independent studies. Current methods are based on breaking down studies into multiple comparisons between phenotypes (e.g. disease vs. healthy), based on the studies' experimental designs, followed by computing the overlap between the resulting differential expression signatures. While useful, in this methodology each study yields multiple independent phenotype comparisons, and connections are established not between studies, but rather between subsets of the studies corresponding to phenotype comparisons. We propose a rank-based statistical meta-analysis framework that establishes global connections between transcriptomics studies without breaking down studies into sets of phenotype comparisons. By using a rank product method, our framework extracts global features from each study, corresponding to genes that are consistently among the most expressed or differentially expressed genes in that study. Those features are then statistically modelled via a term-frequency inverse-document frequency (TF-IDF) model, which is then used for connecting studies. Our framework is fast and parameter-free; when applied to large collections of Homo sapiens and Streptococcus pneumoniae transcriptomics studies, it performs better than similarity-based approaches in retrieving related studies, using a Medical Subject Headings gold standard. Finally, we highlight via case studies how the framework can be used to derive novel biological hypotheses regarding related studies and the genes that drive those connections. Our proposed statistical framework shows that it is possible to perform a meta-analysis of transcriptomics studies with arbitrary experimental designs by deriving global expression features rather than decomposing studies into multiple phenotype comparisons.

Supporting Text S1: Control Annotations As mentioned in the main text, we used a Gamma distribution approximation method to compute the rank product p-values, which is based on the null hypothesis that ranks are uniformly distributed [1]. Here, we describe the approach used by [1].
Succintly, if a given gene rank divided by the number of genes is approximately uniformly distributed (the rank is discrete, not continuous, so this is necessarily an approximation), then its negative logarithm is exponentially distributed, where G is the total number of genes. In the equation above, the normalizing factor G+1 instead of G was used, so that the expectation of the uniformly distributed normalized rank is K/2 instead of (K + 1)/2. The sum of log-ranks that constitutes the rank product score for a given gene i is therefore related to the sum of exponentially distributed variables, The sum of n exponentially distributed variables with unit expectation is equal to a Gamma-distributed variable with shape n and unit scale. Therefore, in order to compute the probability that a given log-rank sum is smaller than an observed value (we remind the reader that in the present context, smaller ranks are better), one needs to compute the lower tail of the corresponding Gamma distribution, = P (Gamma(shape = n, scale = 1) > −x + n log(G + 1)) .
In summary, to compute the p-value for a given gene log-score x, we compute the term −x+n log(G+1) and then compute the probability that a Gamma-distributed variable with scale n and unit shape exceeds the computed term.

Log-Score Interpretation
One relevant aspect of the rank product method is that the log-score assigned to each gene and used for determining significance consists of a sum of log-ranks, logscore(i) = n j=1 log(rank(i, j)). (9) For a large rank r, log(r) ≈ log(r + 1), implying that the terms in (9) corresponding to large ranks are mostly indistinguishable from one another. For instance, for ranks r 1 = 1000 and r 2 = 5000, we have that log r 1 ≈ 6.90 and log r 2 ≈ 8.52, i.e., while r 2 is five times larger than r 1 , its logarithm in only about 23% larger than the logarithm of r 1 . To the best of our knowledge, this interpretation of the rank product method has not been provided before.
Intuitively, the above discussion means that the terms in the log-score are softly divided into terms corresponding to lower ranks and terms corresponding to larger ranks (which possess similar log-values). The idea of providing a soft distinction between low-ranks and the rest has been previously used in statistical testing frameworks for gene expression [2].

Computing Exact p-values
A method for computing exact p-values has been recently proposed [3]. Here, we measure the method's running time requirements on Windows 7 R, using an Intel i7-3610QM CPU at 2.3Ghz.
We assume that a given study has n genes and k samples. For the human differential expression study collection, the median number of genes and samples is n = 9701 and k = 12. For the S. pneumoniae study collection, the median number of genes and samples is n = 1935 and k = 6. We perform our benchmarking using these values.
The exact method relies on explicitly computing the number of possible ways in which the product of a gene's ranks across a study's samples is exactly equal to a given rank product value x. This number is designated as H(x; k, n). The exact unnormalized p-value for a given rank product x can be obtained by computing the total number of ways in which the rank product is less than or equal to x, i.e., H(1; k, n) + H(2; k, n) + . . . + H(x; k.n). The normalized p-value is then obtained by dividing by n k , which is the total number of possible rank combinations of a gene across the k samples. The p-value computation pseudo-code is thus as follows: We chose as a benchmark task the time it takes for the code above to reach a p-value which is deemed significant using Bonferroni correction at an initial p-value of 0.05. For the human study collection, this p-value threshold is p = 0.05 9701 , while for the S. pneumoniae study collection it is p = 0.05 1935 . Intuitively, we recreate the time it takes for the exact method to return the p-value for a gene whose rank product score is barely significant using Bonferroni correction. For both the human and the S. pneumoniae study collection, the exact method takes over forty minutes to compute. Given the impractical running time for the exact p-value method, we have opted for using the Gamma approximation method, which runs instantly. However, in the future it may be possible to combine both approaches in order to obtain a fast and precise hybrid method.