The authors would like to formally state that the work leading to the manuscript was funded in part by Bayer Technologies Services GmbH and the Zentrum für Integrative Psychiatrie, which provided salary to Andreas Schuppert and Franz-Josef Müller, respectively. The work was also funded in part by a fellowship to Franz-Josef Müller from the Else Kröner Fresenius Stiftung. This does not alter the authors’ adherence to all the PLOS ONE policies on sharing data and materials, and there are no commercial or other interests (such as patents, products in development, consultancy etc.) tied to the manuscript. The funding sources had no influence on, and took no part in, the interpretation of the scientific data presented or the conclusions drawn.

Conceived and designed the experiments: AS BMS ML. Performed the experiments: BMS ML AS. Analyzed the data: ML BMS AS FJM. Wrote the manuscript: ML BMS AS FJM.

Relating expression signatures from different sources such as cell lines, in vitro cultures from primary cells and biopsy material is an important task in drug development and translational medicine as well as for tracking of cell fate and disease progression. Especially the comparison of large scale gene expression changes to tissue or cell type specific signatures is of high interest for the tracking of cell fate in (trans-) differentiation experiments and for cancer research, which increasingly focuses on shared processes and the involvement of the microenvironment. These signature relation approaches require robust statistical methods to account for the high biological heterogeneity in clinical data and must cope with small sample sizes in lab experiments and common patterns of co-expression in ubiquitous cellular processes. We describe a novel method, called PhysioSpace, to position dynamics of time series data derived from cellular differentiation and disease progression in a genome-wide expression space. The PhysioSpace is defined by a compendium of publicly available gene expression signatures representing a large set of biological phenotypes. The mapping of gene expression changes onto the PhysioSpace leads to a robust ranking of physiologically relevant signatures, as rigorously evaluated via sample-label permutations. A spherical transformation of the data improves the performance, leading to stable results even in case of small sample sizes. Using PhysioSpace with clinical cancer datasets reveals that such data exhibits large heterogeneity in the number of significant signature associations. This behavior was closely associated with the classification endpoint and cancer type under consideration, indicating shared biological functionalities in disease associated processes. 
Even though the time series data of cell line differentiation exhibited responses in larger clusters covering several biologically related patterns, top scoring patterns were highly consistent with a priori known biological information and separated from the rest of the response patterns.

In many biological and medical research fields, such as stem cell research, drug development or analysis of disease status, it is important to integrate data from different sources, such as cell lines, in vitro cultures from primary cells or clinical biopsies. Data integration makes it possible to combine the knowledge derived from different experiments, providing a bigger picture surrounding the new data and improving the interpretation of results [

Data integration approaches have been implemented on different levels using gene expression data. The classical analyses started with the integration on a single gene level, e.g. by interpreting differential gene expression in newly performed experiments using knowledge from gene annotation databases. These analyses were then extended to sets of genes, corresponding to specific biological functionalities, pathways or genomic locations [

This last step has been implemented by extension of gene set enrichment analyses to include signatures derived from high-throughput experiments [

The present article, in contrast, focuses on the relation of gene expression changes to various tissue or cell type specific expression patterns. This specific focus becomes increasingly relevant as outlined by the following two examples. First, differentiation of pluripotent stem cells towards neural cells or cardiomyocytes, for instance, is anticipated to bear enormous potential for drug screening and regenerative medicine [

Global analyses of gene expression patterns across diverse tissues and cell lines are typically performed in an unsupervised way, e.g. based on principal components analysis (PCA) [

The presented PhysioSpace method serves as an exploratory research tool that provides a large-scale overview of the data in terms of defined physiological coordinates. PhysioSpace complements single gene based analyses, gene set and pathway methods and unsupervised global methods like PCA.

The PhysioSpace algorithm defines directions (signatures) in a supervised way based on retrospective microarray data. These directions are directly associated with specific phenotypes defined by data postprocessing.

The directions are derived by comparing samples of a specific tissue with a reference via a t-test (

The differential expression between samples from a specific tissue and this reference is then used as a signature representing the characteristic expression pattern of this tissue. Due to the apparent similarity of different tissues, e.g. neural tissues from different regions of the brain or different tissues related to the immune system, some PhysioSpace signatures are highly correlated (Figure S1 in

The task of mapping gene expression changes into the PhysioSpace can be defined in the light of high dimensional gene expression spaces as follows: A phenotype is interpreted as a point or cloud in an expression space and phenotypical change is a vector connecting the centers of different clouds in the same expression space (

(A) Data from a new experiment are transformed to remove ellipticity and the resulting fold-change vector is compared to a compendium of signatures from prior experiments using a robust, rank-based scoring method. Graphical displays and the statistical validation allow the position of the new experiment in the global PhysioSpace to be evaluated. (B, C) Illustration of the influence of non-sphericity on sample permutations. (B) In the presence of a strong ellipticity, sample permutation does not randomize directions, in contrast to more spherically distributed samples as obtained through the spherical transformation approach (C).

Considering that the PhysioSpace method should be able to compare data from heterogeneous sources, derived from cell lines, in vitro cultures from primary cells, or primary patient biopsies, it is important to use robust and statistically sound techniques. In this article we follow practices from gene set analysis [_{10} p-values from the Wilcoxon test, termed PhysioScores, are then used for visualization purposes. If there are at least 9 samples in each group, sample-label permutation is performed to assess the significance of the PhysioScores.

The algorithm used in this article is similar to classical gene set enrichment algorithms. However, the use of signatures instead of gene sets allows the enrichment calculation to be performed in a backward direction, defining the gene sets on the new data and calculating the enrichment on the tissue specific signatures. This backward direction provides a different view on the data as evaluated and discussed below. A similar backward approach has been used previously [

We evaluate the performance of the PhysioSpace method and discuss the effect of the spherical transformation by analysis of simulated mixtures of embryonic stem cells (ESCs) with different tissues. We then apply the method to analyze tumor development comparing different breast cancer grades and prostate Gleason scores, as well as to investigate the effect of smoking on gene expression of lung cancer tissues. These examples show three principally different outcomes of the PhysioSpace method that are used to exemplify possible interpretations. Furthermore, the cancer data are utilized to investigate the relationship between PhysioScores and permutation p-values, providing useful information for the applicability of PhysioSpace in the case of low numbers of replicates. The PhysioSpace method is then applied to the tracking of induced pluripotent stem cell (iPSC) differentiation experiments towards neural cells, cardiomyocytes and trophoblast lineages in a physiological context. It detects the direction and dynamics of differentiation, uncovering interesting information from data with very small numbers of replicates, and matches biological expectations well. The comparison to a classical forward enrichment algorithm is performed on the cancer, differentiation, and simulated data, with overall slightly better results for the implemented algorithm. The robustness of the proposed method is demonstrated by the use of different PhysioSpaces (

Accession | Usage | Description |
---|---|---|
GSE7307 | PhysioSpace 1 | 677 samples corresponding to 93 different tissues or cell lines |
GSE23402 | PhysioSpace 1 | 17 ESC samples (the 25 hiPSC and Fibroblast samples are not used) |
GSE2361 | PhysioSpace 2 | 36 samples, each from a different tissue |
E-MTAB-62 | PhysioSpace 3 | 5372 samples divided into 369 different groups as annotated in [ |

Clinical datasets often suffer from large, non-phenotype associated variation, affecting the determination of fold change vectors. This can lead to spurious associations, especially in the case of relatively small sample sizes and large heterogeneities in the data. The spherical transformation (

In the opposite case, i.e. when the heterogeneity in the data is considerably lower than the effect of interest, the sample permutation approach does not generate a meaningful null distribution. Random re-sampling of sample labels generates new vectors connecting the centroids of the two sampled groups. In the case of elliptical data distributions the sampled vectors are highly correlated, resulting in a decreased significance (

In order to investigate the two described effects of the spherical transformation, two datasets of embryonic stem cells (GSE33789) and cancerous and normal lung tissues (GSE19804) were downloaded, normalized and merged (

Accession | Usage | Description |
---|---|---|
GSE33789 | All simulations | 10 ESC samples (2 Fibroblast samples are not used) |
GSE19804 | Effect of spherical transformation | 60 lung cancer and 60 adjacent normal lung samples |
GSE18676 | Mixture simulations | 24 samples from 22 different tissues and 2 cell lines |

Two different types of simulated data were produced for investigation of the above described effects. First, 40 samples were randomly drawn from the lung dataset GSE19804. No distinction of normal and cancerous lung tissue was made. Cancerogenicity was rather interpreted as an unknown confounding effect, increasing the heterogeneity of the dataset. The first 20 of these samples were subjected to a computational modification, simulating a mixing of lung tissue with ESCs with mixing factor

(A) In the case of relatively high heterogeneity and comparatively low signal strength, the spherical transformation increases sensitivity and specificity of the simulated effect, i.e. results in a strong and specific increase of the ESC score (left part). The results without spherical transformation (right part) are more heterogeneous. (B) The null-distribution obtained from re-sampling without spherical transformation is not meaningful in cases of high signal strength and low heterogeneity, leading to increasing p-values for increasing signal strength in simulated data. This effect does not occur when the spherical transformation is applied. Depicted are permutation p-values for the lung signature with (red dots) and without (black dots) spherical transformation. (C) The mean matching score of 21 simulated mixtures is compared between the implemented PhysioSpace algorithm and a classical GSEA based method. The matching score is defined as the quotient of the respective tissue (or ESC) score and the highest (lowest) score. It is truncated at a minimum of zero, avoiding negative values. While there are some differences between the two methods, especially for very low mixture values, the overall performance does not generally favor one or the other algorithm. (D) The PhysioScores of simulated mixtures of 97% ESCs with 3% of different tissues are visualized as an exemplary case, showing good agreement with biological expectations. Correlations between different signatures are represented by the dendrogram on the left-hand side as well as by simultaneously increasing PhysioScores in the columns of the heatmap-like representation.

In the second simulation, the low-heterogeneity ESC dataset was analyzed, where 10 samples were simulated as mixtures of ESC and adjacent normal lung tissue (Materials and Methods) and compared to unmodified ESC samples. In order to simulate a rather strong signal, the fraction of lung tissues in the mixture was set to 0.1, 0.2,…, 1. For the analysis without spherical transformation, the negative effect of increasing ellipticity in the data can be observed from

In order to evaluate the ability of the PhysioSpace method to detect changes in tissue composition, the analysis of simulated mixtures was extended to several different tissues. For this purpose, the dataset GSE18676 was considered, consisting of 22 different tissues and 2 cell lines. Each tissue or cell line is represented by a single sample only. The 24 samples were computationally mixed with one ESC sample from dataset GSE33789 with a mixture proportion of 97% ESC and 3% of the tissue sample and compared to the remaining 9 ESC samples (

Overall, these results reflect the simulated changes very well, indicating high robustness of the PhysioSpace method even in cases where data from different studies were combined as well as high sensitivity and specificity for detecting mixtures with mixture fractions of only 3%. Following the advice of a referee, we set out to test the dependence of the matching on the mixing fraction lambda and compared the results to those using a typical gene set enrichment algorithm as implemented in the geneSetTest method of the limma package [

For the performance assessment, two matching scores were calculated, a tissue matching score and an ESC matching score. They are defined as the ratio of the expected PhysioScore, i.e. of the matching tissue (or the ESCs), and the highest (or lowest) PhysioScore. For example, for the mixing of the bone marrow sample with the ESC sample, the tissue matching score is the ratio between the bone marrow PhysioScore and the highest PhysioScore. This matching score is 1 in case of a perfect match, e.g. if the bone marrow score has the highest value itself, and gradually decreases with the distance of the expected score to the actual highest (lowest) score. We truncated the score at zero to avoid negative values. In
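The matching score described above can be illustrated with a minimal Python sketch. The function name `matching_score`, the dictionary layout and the example values are hypothetical and chosen only for illustration; the sketch assumes that the extreme score has the expected sign (positive for the tissue score, negative for the ESC score).

```python
def matching_score(scores, expected, direction="highest"):
    """Ratio of the expected signature's score to the extreme score.

    scores:    dict mapping signature name -> PhysioScore (hypothetical layout)
    expected:  name of the signature that should match
    direction: "highest" for the tissue score, "lowest" for the ESC score
    """
    if direction == "highest":
        extreme = max(scores.values())
    else:
        extreme = min(scores.values())
    if extreme == 0:
        return 0.0
    # Truncate at zero so that scores of the wrong sign do not go negative.
    return max(0.0, scores[expected] / extreme)


# Illustrative values only, not taken from the study.
example = {"bone marrow": 8.0, "spleen": 10.0, "liver": -3.0, "ESC": -12.0}
tissue_match = matching_score(example, "bone marrow")    # 8 / 10
esc_match = matching_score(example, "ESC", "lowest")     # -12 / -12
```

A value of 1 means that the expected signature is itself the extreme one; smaller values indicate that another signature scores more strongly.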

Cancer is a highly heterogeneous disease consisting of different subtypes traditionally defined by specific histological markers like grade in breast cancer or Gleason score in prostate cancer [

In this context, the PhysioSpace method was applied to investigate differences in global gene expression between breast cancer grades. A dataset with 189 breast cancer samples (GSE2990) was analyzed, incorporating 64 breast cancers of grade 1, 48 of grade 2, 55 of grade 3, and 22 with missing information on breast cancer grade. The vectors of differential expression of grade 1 to that of grade 2 and grade 3 cancers were compared to the signatures in the PhysioSpace, resulting in a large number of significantly associated reference signatures (

Ranking of PhysioScores comparing breast cancer samples of grade 1 to grade 2 or 3 (A) and lung samples from never smokers to former or current smokers (B). Apparently, grading differences in breast cancer are associated with more signatures from the PhysioSpace than differences in gene expression of smokers and non-smokers. Blue (red) colors depict negative (positive) PhysioScores. Filled bars indicate significant scores according to a sample-permutation FDR (Benjamini-Hochberg) cutoff of 0.1.

In contrast to the breast cancer results, the influence of smoking on gene expression seems to be more phenotype specific. A comparison of the gene expression of (normal and cancerous) lung tissues from 31 never smokers with that of 36 former and 40 current smokers (dataset GSE10072,

Accession | Usage | Description |
---|---|---|
GSE2990 | Breast grade | Breast cancer samples of grade 1 (64), 2 (48), or 3 (55); the 22 samples with missing information on breast cancer grade are not used |
GSE10072 | Lung smoker | 107 samples from cancerous or adjacent normal lung tissue from never (31), former (36), or current (40) smokers |
GSE21034 | Prostate Gleason | 131 primary prostate cancer samples with a Gleason score of 5 (1), 6 (77), 7 (42), 8 (7), or 9 (4); 19 metastases; normal prostate samples and cell lines are not used |
GSE16560 | Prostate Gleason | 281 primary prostate cancer samples with a Gleason score of 6 (83), 7 (117), 8 (27), 9 (49), or 10 (5) |

The PhysioScores for all six investigated datasets are visualized in context for selected signatures. In order to evaluate the stability of the method, the signatures were derived from three different physiological databases resulting in PhysioSpaces 1-3. The results are depicted in a heatmap-like representation. The color scheme differs between datasets but is the same for the PhysioSpaces 1-3, ranging from negative values in blue and green to positive values in orange and red. The dendrogram represents a hierarchical clustering of the signatures according to a Pearson-correlation distance. Values within clusters are usually similar, e.g. clusters of neural or immune signatures. The results corresponding to PhysioSpaces 1-3 show similar dynamics and consistent dominating signatures, while the absolute values are only approximately comparable.

The third analyzed dataset consists of prostate tumors with differing Gleason scores. Dataset GSE21034 consists of primary prostate tumors, metastases, and prostate cancer cell lines. In order to concentrate on the differences associated with Gleason score, only primary tumor samples were considered for the analysis. The highest scoring PhysioScores are fetal liver and pancreas, increasing with Gleason score, as well as prostate, decreasing with Gleason score (

We applied the PhysioSpace method additionally to compare metastases and primary tumors in dataset GSE21034. The result of this analysis is dominated by a strongly negative prostate-signature (Figure S3 in

Furthermore, we analyzed dataset GSE16560, in order to investigate whether differences in prostate cancer samples of different Gleason scores have any physiological interpretation. In this analysis, a weak association between Gleason score and cell line signatures, including the ESC signature, can be found (Figure S3 in

Markert et al. [

In summary, the three presented application examples show three qualitatively differing outcomes of the PhysioSpace method. Breast cancer grade is significantly associated with many signatures corresponding to various cellular phenotypes, suggesting a common underlying mechanism. The effect of smoking on lung cancer is more specific, showing primarily an increase in immune signatures. Finally, Gleason scores in prostate cancer show no significant associations for dataset GSE21034.

In order to investigate the effect of the spherical transformation with real data, the permutation p-values were compared to the PhysioScores for all cancer data, with and without spherical transformation (Figure S4 in

The almost monotonic association between PhysioScores and permutation p-values suggests that the PhysioScore is a valid measure to rank the signatures according to their significance, even though it is not possible to determine a rigorous significance threshold. This result is very important for applications where the number of replicates is too low for sample label permutation.

In vitro differentiation of pluripotent stem cells into diverse somatic cell types is increasingly studied in order to obtain a molecular understanding of embryogenesis, to build disease specific in vitro models, and to develop new options for regenerative medicine and drug development [

We mapped the dynamic changes of three in vitro differentiation time series (

Accession | Usage | Description |
---|---|---|
GSE9940 | Neural differentiation | 18 samples from a differentiation of ESCs towards neural precursors at day 0 (3), day 6 (3), day 10 (6), or day 17 (6) |
GSE30915 | Trophoblast differentiation | 21 samples from a differentiation of ESCs towards trophoblasts at days 0, 2, 4, 6, 8, 10, and 12 with 3 samples per time point |
GSE28191 | Cardiac differentiation | 12 samples from a differentiation of ESCs towards cardiomyocytes at days 0, 2, 5, 7, 9, and 11 with 2 samples per time point |
GSE10469 | Additional trophoblast differentiation | 2 trophoblast differentiation time series with 4% or 20% O2 at 0, 3, 12, 24, 72, and 120 hours of differentiation |

In a global perspective, the a priori expected signatures dominate over time (

(A) Line plots of most relevant PhysioScores for the three differentiation time series comparing scores from PhysioSpaces 1 and 3. Lines with names ending with “Lukk” correspond to the third PhysioSpace (Lukk et al. 2010 [

For the neural differentiation, a consistently increasing PhysioScore was observed for the whole cluster of neural tissues (

Looking closer into the cardiomyocyte differentiation, a striking increase of the heart score can be observed from day 7 to day 9 of differentiation (

In

The three differentiation time series were also used to compare the presented algorithm to the GSEA-based implementation. The matching score, as used in the simulation results, was calculated for each time point (

The robustness of the PhysioSpace method was evaluated by constructing three different PhysioSpaces. For this, we used the same construction approach, yet employed three different gene expression data collections, which contained similar biological phenotypes (Materials and Methods). The embeddings of the three differentiation time series and three cancer datasets into these three separate PhysioSpaces are depicted in Figures S5, S6, and S7 in

For future extensions of the PhysioSpace, the possibility of combining signatures from different microarray platforms will increase the universal applicability and thus the utility of the PhysioSpace approach. In order to be comparable, the absolute values of the PhysioScores must be in a similar range. We tested whether such a combination is possible in principle by comparing the PhysioScores from the three different PhysioSpace constructions. A linear adjustment according to the number of matching probes between the signature and the dataset was applied to approximately account for the resulting different gene-set sizes (Materials and Methods). A selection of the most prominent signatures from all three PhysioSpace constructions is depicted in

In order to additionally test the robustness and platform independence of the PhysioSpace method, we analyzed an additional trophoblast differentiation time series (GSE10469, Agilent-014850 Whole Human Genome Microarray 4x44K G4112F microarray platform). The differentiation was performed in two settings with either 4% or 20% oxygen. In both settings, the Placenta-score and ESC-score dominate (Figure S3 in

A robust signature association method was developed that allows the linkage of data from cellular assays with clinical data and outcomes. This approach enables the contextual interpretation of newly performed experiments in a physiological “space” as represented by signatures generated from publicly available microarray data sets, allowing “big data” approaches. The signatures can be interpreted as directions in a high dimensional gene expression space. Following this geometrical interpretation, the PhysioSpace represents a low dimensional subspace of the space spanned by all gene expression patterns, having all a physiological interpretation. Differential expression vectors from newly performed experiments are compared to the reference signatures by robust methods extending gene set enrichment algorithms to account for the noise and heterogeneity in the data. The comparison is performed on differential expression patterns rather than absolute expression values in order to avoid the direct comparison of absolute gene expression values from heterogeneous protocols.

The combined use of gene permutation and sample-label permutation together with the robust signature ranking ability of the PhysioScores allows a detailed and valid statistical interpretation of the results. It is possible to use the PhysioScore values for ranking purposes in cases where sample permutation is not feasible, e.g. if a class contains fewer than 7 samples. In this case the relevance of the high ranking features has to be supported by other means, such as additional experiments and analyses, literature, or marker genes. The Prostate Cancer example shows that the sample-label permutation approach avoids false positives, helping to decide on further investigations and possibly saving resources.

A spherical transformation has been shown to improve the concordance of the PhysioScore with permutation derived significance assessment. Since the PhysioSpace method compares only directions of vectors rather than their lengths, sample label permutation for elliptically distributed data creates many vectors with similar directions and, hence, similar PhysioScores (

Classical gene set enrichment approaches use predefined gene sets and evaluate the enrichment of these sets on the differential expression in the data under investigation. In contrast, the method presented here defines the gene sets using the new data and calculates the enrichment on the PhysioSpace signatures, essentially implementing a backward gene set enrichment approach. This gives a slightly different perspective on the data since the strict cutoff of genes is performed on the new data instead of the retrospective data. In addition, there is also a difference from a statistical point of view. Many biological studies do not have more than three to five independent replicates, making sample permutation infeasible. Especially for time series experiments, the relatively large number of replicates (at least seven to nine samples) needed for the sample permutation approach can become very cost-intensive. Therefore, many studies rely on gene label permutation, which is not sufficient for rigorous assessment of statistical significance, bearing the risk of “heading down the wrong pathway” [

Three different types of cancer data were analyzed in order to show the performance of the PhysioSpace method. The results show different kinds of outcome with many, few, or no significant signature associations for breast, lung and prostate cancer datasets, respectively. It is already known that determination of biomarkers depends not only on the size of studies but primarily on the clinical phenotype [

The detailed characterization of human pluripotent stem cells as well as their differentiation dynamics is important for quality control and understanding of the mechanisms and dynamics of cell fate changes. Besides the analysis of single marker genes and proteins, a whole genome based characterization can provide more robust information and outcome measures [

The PhysioSpace compendium is derived from publicly available gene expression data in a straightforward and resource effective manner. More specific compendia for specific applications can be derived easily from the large amount of available datasets.

The robustness of the PhysioSpace method allows the use of very small numbers of replicates, reducing experimental effort at the same precision. The high biological relevance of the results confirms its usefulness for a wide range of applications in drug discovery and (trans-) differentiation approaches. It allows the tremendously growing repositories of existing data to be used for the interpretation of specific wet-lab experiments against the background of physiology, possibly establishing a quantitative link between lab experiments and clinical applications.

RNA-seq and related next-generation technologies have an enormous potential for high resolution measurements of small cell populations. Properties of specific measurement techniques must be considered when applying the PhysioSpace method to datasets created by different measurement techniques. Processing pipelines and normalization for next-generation sequencing data are an active field of development and all count based methods introduce bias through low counts and different transcript lengths [

All datasets were downloaded from Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) [

Probe identifiers of different microarray platforms were matched using the getBM method of the biomaRt R package [

The breast cancer analyses were performed on the GSE2990 dataset using all data that have information on breast cancer grade. No distinction according to estrogen receptor status was made. Analyses were performed for comparison of breast cancer grade 1 vs. 2 and grade 1 vs. 3.

Dataset GSE10072 was used for the analysis of lung gene expression of never vs. former and never vs. current smokers using all samples from cancerous and adjacent tissues in a single analysis. For the investigation of prostate cancer all primary tumors with Gleason-score 6, 7, 8, or 9 from dataset GSE21034 (transcript version) were used. Comparisons were made between samples of Gleason score 6 vs. 7, 6 vs. 8, and 6 vs. 9. In a separate analysis, primary tumors were compared to metastases. Additionally, all data from dataset GSE16560 were used to analyze differences associated with Gleason scores.

For the neural, trophoblast, and cardiomyocyte differentiation analyses, datasets GSE9940, GSE30915, and GSE28191 were used, respectively. Samples were grouped according to time of differentiation. No distinction according to treatment was made for the neural differentiation. An additional trophoblast differentiation dataset (GSE10469) was analyzed with separate analyses for the normoxic (20% oxygen) and the hypoxic (4% oxygen) conditions. In all comparisons, differentiating cells were compared to the starting pluripotent stem cell samples.

The computational mixing was achieved by a linear combination of tissue and ESC samples with mixing factor

The mixing was performed on non-transformed data, in contrast to all other calculations that were performed on log_{2}-transformed data. For the first simulation, X_{ESC} represents a dataset with 20 ESC samples, obtained from GSE33789 by taking each ESC sample twice, and X_{Lung} the first 20 of the 40 randomly drawn lung samples, obtained from GSE19804. Hence, the data simulate an infiltration of up to 5% ESCs (
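A computational mixing of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `mix_samples` is hypothetical, the input profiles are assumed to be log_{2}-transformed, and the mixing factor `lam` is assumed to weight the ESC sample (with the remainder assigned to the tissue sample). The linear combination is formed on the non-transformed scale and the result is transformed back to log_{2}.

```python
import math


def mix_samples(esc_log2, tissue_log2, lam):
    """Mix two log2 expression profiles on the linear (non-log) scale.

    lam is assumed here to be the fraction contributed by the ESC sample;
    1 - lam is contributed by the tissue sample.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie between 0 and 1")
    mixed = []
    for esc_value, tissue_value in zip(esc_log2, tissue_log2):
        # Undo the log2 transform, combine linearly, transform back.
        linear = lam * 2.0 ** esc_value + (1.0 - lam) * 2.0 ** tissue_value
        mixed.append(math.log2(linear))
    return mixed


# Illustrative values only: two genes, 97% ESC and 3% tissue.
mixture = mix_samples([8.0, 4.0], [4.0, 10.0], lam=0.97)
```

Mixing on the linear scale reflects that signal intensities, not their logarithms, add up when two cell populations contribute to one sample.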

The PhysioSpace consists of a compendium of gene expression signatures, representing vectors of differential expression. Differential expression is calculated using a Student’s t-test between samples from a specific tissue or cell line and a computationally built reference. The reference is chosen as the vector of mean expression values of all samples in the dataset, in order to simulate a common reference showing no tissue-specific expression. The standard error of the mean expression is used as standard deviation of the reference for calculation of t-tests. The signed log_{10}-p-values of the t-tests are used as PhysioSpace signatures.
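The construction described above can be written compactly as follows. The notation is introduced here only for illustration and is an assumption: x_{gj} denotes the expression of gene g in sample j, N the number of samples in the dataset, sd_g the standard deviation of gene g over all samples, and t_{gk} and p_{gk} the t-statistic and p-value for gene g when tissue k is compared to the reference. The sign convention (positive values for genes that are higher in the tissue than in the reference) is likewise assumed.

```latex
% Computationally built reference: mean over all samples, with its standard error
\mu_g = \frac{1}{N} \sum_{j=1}^{N} x_{gj},
\qquad
\mathrm{SE}_g = \frac{\mathrm{sd}_g}{\sqrt{N}}

% Signature value of gene g for tissue k: signed log10 p-value of the t-test
S_{gk} = \operatorname{sign}(t_{gk}) \cdot \left( -\log_{10} p_{gk} \right)
```

Under this reading, a signature is a vector over all genes whose magnitude reflects the strength of evidence for tissue-specific expression and whose sign reflects its direction.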

Three different PhysioSpaces were built for the analyses to show the robustness of the presented results (

The second PhysioSpace was built based on the GSE2361 dataset using all available probes. This dataset consists of 36 samples, each representing a different human tissue. Thus, the second PhysioSpace consists of 36 signatures, each representing the differential expression of a single sample to the computationally built common reference.

The 369 signatures of the third PhysioSpace were built according to the 369 Groups annotated in dataset E-MTAB-62. Again, all probes of the Affymetrix GeneChip Human Genome U133A array were used to calculate the signatures.

For visualization purposes the signatures were hierarchically clustered based on a Pearson-correlation distance using average-linkage, indicating some common gene expression features of signatures clustering closely together.

For a comparison of two phenotypes, e.g. different times of differentiation, or cancer stages, the mapping of the differences in gene expression onto the PhysioSpace is done by the following three-step procedure consisting of a spherical transformation, a data-based definition of gene sets and an enrichment calculation via a Wilcoxon rank-sum test.

The spherical transformation of the data is an essential step for the statistical validation. It allows the meaningful calculation of a null-distribution of the enrichment score via sample label permutation (

Starting from a gene-wise standardized (i.e. centered and scaled) data matrix ^{T}). The entries of the diagonal matrix

In the second step of the mapping procedure two sets of up- and down-regulated genes are determined, each containing the top 5% of up- and down-regulated genes, respectively. Genes are ranked corresponding to their mean fold changes between the two phenotypes.
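The gene-set definition of this second step can be sketched as follows. This is a minimal illustration, not the original implementation: the function name `top_gene_sets` and the dictionary-based data layout are hypothetical, and the expression values are assumed to be on a log scale so that the mean fold change is the difference of group means.

```python
from statistics import mean


def top_gene_sets(reference, condition, fraction=0.05):
    """Return (up, down) gene sets ranked by mean fold change.

    reference, condition: dicts mapping gene -> list of log-scale
    expression values for the two phenotypes (hypothetical layout).
    """
    fold_change = {
        gene: mean(condition[gene]) - mean(reference[gene])
        for gene in reference
    }
    ranked = sorted(fold_change, key=fold_change.get)  # ascending fold change
    k = max(1, int(len(ranked) * fraction))            # size of each gene set
    return set(ranked[-k:]), set(ranked[:k])


# Illustrative data: 40 genes whose fold change increases with the gene index.
ref = {f"g{i}": [0.0, 0.0] for i in range(40)}
cond = {f"g{i}": [i - 20.0, i - 20.0] for i in range(40)}
up_genes, down_genes = top_gene_sets(ref, cond)
```

With 40 genes and a cutoff of 5%, each set contains the two most strongly changed genes in the respective direction.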

The cutoff of 5% up- or down-regulated genes was chosen, on one hand, to focus on large scale patterns rather than a few genes and, on the other hand, to exclude genes that are only driven by noise. In the application examples it turned out that the results were quite robust with respect to a variation of the cutoff parameter between 1% and 10% (data not shown).

In the third step, a gene set enrichment score is calculated on the PhysioSpace signatures using the gene-sets of the previous step. The wilcox.test procedure of the R stats-package [_{10} p-value of the Wilcoxon-test.
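A rank-based score of this kind can be sketched in plain Python. Several points are assumptions made for illustration: the function names are hypothetical, the test is assumed to compare the signature values of the up-regulated genes with those of the down-regulated genes, and the p-value uses the normal approximation of the Wilcoxon rank-sum statistic without continuity or tie correction, so values will differ slightly from those of `wilcox.test` in R.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def signed_log10_p(signature, up_genes, down_genes):
    """Signed log10 p-value of a rank-sum comparison (normal approximation)."""
    x = [signature[g] for g in up_genes if g in signature]
    y = [signature[g] for g in down_genes if g in signature]
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both gene sets must overlap the signature")
    ranks = average_ranks(x + y)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2          # Mann-Whitney U of x
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = max(2 * (1 - NormalDist().cdf(abs(z))), 1e-300)  # two-sided p-value
    return math.copysign(-math.log10(p), z)


# Illustrative signature: up-regulated genes carry the larger values.
sig = {f"g{i}": float(i) for i in range(1, 9)}
score = signed_log10_p(sig, {"g5", "g6", "g7", "g8"}, {"g1", "g2", "g3", "g4"})
```

A positive score indicates that the genes up-regulated in the new data tend to have higher signature values than the down-regulated genes; swapping the two sets flips the sign.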

A sample-label permutation approach with B=1000 permutations is used to rigorously determine statistically significant signatures. The permutation p-value is defined as

where s_{0} is the absolute value of the observed PhysioScore, s_{b} is the absolute value of the PhysioScore obtained in the b-th permutation, and the indicator term is 0 if s_{b} is smaller than s_{0} and 1 otherwise. All sample permutation p-values were adjusted for multiple testing using the Benjamini-Hochberg correction [
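The permutation p-value and the subsequent multiple-testing adjustment can be sketched as follows. The function names are hypothetical, and the normalization of the p-value (the fraction of permutations whose absolute score is at least as large as the observed one) is an assumption, since only the indicator definition is given in the text. The Benjamini-Hochberg step follows the standard step-up procedure.

```python
def permutation_p_value(observed_score, permuted_scores):
    """Fraction of permuted scores at least as extreme as the observed one."""
    s0 = abs(observed_score)
    hits = sum(1 for s in permuted_scores if abs(s) >= s0)
    return hits / len(permuted_scores)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=p_values.__getitem__)
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Illustrative values only.
p_example = permutation_p_value(-3.0, [0.5, -3.5, 2.0, 3.0])
adjusted_example = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Signatures whose adjusted p-value falls below the chosen FDR cutoff would then be reported as significant.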

In _{10} p-value is approximately linear in the relevant range (data not shown). Therefore, the PhysioScore values were linearly transformed to simulate the same number of genes.

All analyses were conducted in the R programming language [

We thank S. Schneckener for data preprocessing and helpful discussion.