Conceived and designed the experiments: JTC. Performed the experiments: JLYC JTC. Analyzed the data: JEL CC MW. Wrote the paper: JEL MW.

The authors have declared that no competing interests exist.

Human disease studies using DNA microarrays in both clinical/observational and experimental/controlled settings are having an increasing impact on our understanding of the complexity of human diseases. A fundamental concept is the use of gene expression as a “common currency” that links the results of

Microarray technology allows the capture of diverse aspects of genetic, environmental, oncogenic and other factors as reflected in global mRNA expression and opens the possibility of personalizing treatment of disease

Though many such studies have shown promise in using

There are a number of possible approaches to this problem. One popular approach has been to compare the identities of the differentially expressed probes to databases of pre-defined pathways. Descriptions of such approaches can be found in

The inherent heterogeneity of environment and cell type in tissue samples means that the genes in a signature may potentially involve many additional activities not evident

Robust statistical modeling of both experimental gene expression and tissue sample expression.

Identification and correction of assay artifacts, which are known to be a significant issue associated with the use of microarray technologies.

A mapping from a single signature, generated

A model for imputing the values of factors in new collections of tissue samples even though these samples may originate from different groups and at different times.

We explore this analysis approach in translating a collection of gene signatures reflecting cellular response to five known tumor microenvironmental factors, discovered

We begin with five signatures defined by the transcriptional responses of cultured human mammary epithelial cells to five microenvironmental perturbations: hypoxia, lactic acidosis, hypoxia plus lactic acidosis, lactosis, and acidosis. Each of these is seen in human cancers and carries prognostic information with respect to clinical outcomes

We use Bayesian Factor Regression Modeling (BFRM)

We will focus, for now, on the ten lactic acidosis factors. Examining the genes in each of the factors (

(a) Connections between genes and the 10 lactic acidosis factors in the statistical factor analysis of the breast cancer data from

^{−4}. If we approximate the distribution of the resampled r-squared values with a beta distribution, the p-value is approximately 10^{−13}. Because only the list of highly differentially expressed genes from the lactic acidosis signature, and not the weights, is used in the factor discovery, and because the weights are critical for the computation of the lactic acidosis signature scores, the ability to recover signature scores from factors is strong evidence of the relationship between the two.

The three factors derived from the lactic acidosis signature that were not important in the prediction of signature scores may still represent activity relevant to the presence of lactic acid, but they are not strongly predictive of the original signature. They may also simply represent the activity of biological pathways that involve very large sets of genes, and are thus discovered from many different possible starting points. Nonetheless, they represent significant structure in expression of the expanded signature gene set in tumor data, and none of these factors would be detectable from studying the signature alone as a phenotype.

Factors can reflect distinct aspects of biological activity.

SFPA-derived factors can represent distinct aspects of biological processes associated with clinical phenotypes. To evaluate this, we explored subset regression models to predict a number of clinical phenotypes in the Miller data set

The analysis indicates that highly scoring regression models for the prediction of ER status utilize one of the factors – Acidosis 1, Hypoxia 4, Lactic Acidosis 2, or Lactosis 5. From

Each point in these plots represents a single patient from the dataset in

ER and PgR factors predict progesterone receptor status: (a) training data set

| Gene Ontology | # Genes | p-value | Bayes Factor |
|---|---|---|---|
| Cell Cycle | 34 | <.0001 | 28 |
| Cell Proliferation | 39 | <.0001 | 25 |
| Regulation of cell cycle | 21 | <.0001 | 17 |
| Mitotic cell cycle | 15 | <.0001 | 16 |

Estrogen and progesterone are known to be antagonists, so it is expected that ER factors can predict progesterone receptor status. Using SSS we find that the highly scoring regression models for PgR status involve the ER factor in addition to Lactic Acidosis factor 10 – we label this the PgR specific factor.

| Gene Ontology | # Genes | p-value | Bayes Factor |
|---|---|---|---|
| Nucleotide Metabolism | 6 | .0004 | 4 |
| RNA Processing | 8 | .0008 | 4 |
| RNA Splicing | 5 | .003 | 2 |
| Nuclear mRNA splicing | 5 | .003 | 2 |
| RNA metabolism | 8 | .003 | 2 |

The third binary phenotype, wild type versus mutant p53 gene, is present in only the data set from

We stress that, if we restrict ourselves to considering the original

| | LA Factors | | LA Signature | |
|---|---|---|---|---|
| | P53 Mutant | P53 Wild | P53 Mutant | P53 Wild |
| >50% | 63 | 5 | 23 | 10 |
| <50% | 9 | 174 | 49 | 169 |
| | ER+ | ER− | ER+ | ER− |
| >50% | 202 | 17 | 212 | 31 |
| <50% | 11 | 17 | 1 | 3 |
| | PgR+ | PgR− | PgR+ | PgR− |
| >50% | 180 | 33 | 185 | 54 |
| <50% | 10 | 28 | 5 | 7 |
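The counts in the table above can be summarized as overall classification accuracies. The following is a minimal sketch, not part of the original analysis. It assumes that the ">50%" row collects samples whose predicted probability favours the first-listed class (P53 mutant, ER+, PgR+) and the "<50%" row the second; the table itself does not state this, and the names `TABLES`, `accuracy`, and `summarize` are illustrative.

```python
# Counts copied from the table above.
# Layout per entry: ((row ">50%"), (row "<50%")), columns in table order.
# ASSUMPTION: ">50%" means the model favours the first-listed class
# (P53 mutant, ER+, PgR+); the source table does not say so explicitly.
TABLES = {
    ("p53", "LA factors"): ((63, 5), (9, 174)),
    ("p53", "LA signature"): ((23, 10), (49, 169)),
    ("ER", "LA factors"): ((202, 17), (11, 17)),
    ("ER", "LA signature"): ((212, 31), (1, 3)),
    ("PgR", "LA factors"): ((180, 33), (10, 28)),
    ("PgR", "LA signature"): ((185, 54), (5, 7)),
}


def accuracy(table):
    """Fraction of samples on the diagonal of a 2x2 table."""
    (true_pos, false_pos), (false_neg, true_neg) = table
    total = true_pos + false_pos + false_neg + true_neg
    return (true_pos + true_neg) / total


def summarize(tables=TABLES):
    """Accuracy for every (phenotype, predictor) pair, rounded to 3 places."""
    return {key: round(accuracy(t), 3) for key, t in tables.items()}


if __name__ == "__main__":
    for (phenotype, predictor), acc in summarize().items():
        print(f"{phenotype:4s} {predictor:13s} accuracy = {acc:.3f}")
```

Under the stated assumption, the factor-based model classifies about 94% of samples correctly for p53 status, against roughly 76% for the signature used directly; the gap is much smaller for ER status.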

SFPA offers a technique for interrogating a single independent tumor sample against any number of biologically determined signatures. The phenotypes subsequently linked to the resulting factors may include clinically relevant outcomes such as patient survival and drug response.

Subsets of the 67 factors were evaluated in Weibull survival regression models using the SSS method to identify and score models predicting survival. Each model in a resulting set of highly scoring models produces fitted survival curves and also may be used to predict survival for new samples. Bayesian analysis mandates averaging predictions from such a set of models, and this was done to result in

(a) Predicted survival times from an average of Weibull survival models were used to split the 251 samples from

Four of the breast cancer data sets have clinical annotation pertaining to treatment with Tamoxifen. Though the 67 factors are in no way specifically targeted at Tamoxifen, we do know they are associated with relevant biological pathways. From our 67 factors, we found that Lactic Acidosis 1 is predictive of Tamoxifen resistance. It differentiates metastasis-free survival in patients who received the drug and shows no predictive ability in patients who did not (

| Gene Ontology | # Genes | p-value | Bayes Factor |
|---|---|---|---|
| Phosphate transport | 6 | <.0001 | 8 |
| Inorganic anion transport | 6 | .0002 | 5 |
| Cell adhesion | 11 | .0002 | 5 |
| Anion transport | 6 | .0003 | 4 |
| Response to abiotic stimulus | 8 | .0008 | 4 |
| Response to external stimulus | 15 | .001 | 3 |
| Blood coagulation | 4 | .002 | 3 |

While the same biological processes may contribute to tumor phenotypes in different cancers, the process by which this happens may be entirely different given the particular cellular context, tissue-specific gene expression and epigenetic influences. Since SFPA can utilize

In the lung cancer data, the analysis discovered 20 factors associated with lactic acidosis. When we compared the expression levels of the 10 lactic acidosis factors in the breast cancer data with the 20 lactic acidosis factors discovered in the lung cancer data, we found that several factors are highly conserved, including the Tamoxifen factor, the p53 specific factor, and factors 7 and 8. In contrast, the ER and PgR factors are found only in breast cancers. If we look specifically at standardized raw expression levels for the genes in the ER factor in the breast data (

It is increasingly common for investigators to use gene expression signatures directly as phenotypes to link various biological processes and perturbations to disease phenotypes and chemical agents. Although these signatures derived

There are several possible explanations for the enhancement of the prognostic values achieved with SFPA. It is possible that certain genes or pathway components in the original gene signatures are simply noise or artifact due to their

Another opportunity this analysis raises is the ability to uncover the pathways which would be “hidden” in the

Tremendous resources continue to be expended on the discovery of biomarkers for drug susceptibility. The ability to predict susceptibility to a given drug has the potential to significantly increase efficacy while decreasing morbidity and mortality in the relevant patient population. Additionally, it opens the possibility of facilitating the process of bringing new drugs to market. We have demonstrated the efficacy of SFPA for translating signatures discovered

A total of five signatures were derived from two different experiments on Human Mammary Epithelial Cells (HMEC). The details of the collection of gene expression data from these cell lines are in

In designed experiments such as

BFRM is a Bayesian modeling framework. As such, we assume that all of the parameters of our model are random variables. In order to learn more about the values of these parameters, we specify prior distributions, which are subsequently updated based on the data. The result of fitting the model to data in this way is a joint posterior distribution for all of the model parameters. In our case, the parameters of interest are the coefficients of the regression.

The general model implemented in BFRM is as follows. Let

Or alternatively in matrix notation

We have used a prior distribution for the coefficients of the regression that has a point mass at zero. This reflects our belief that, for any particular intervention, there will be relatively few genes (of the over ten thousand that are measured in a microarray experiment) that are affected. For the case outlined in this paper, we argue that growing mammary epithelial cells in the presence of mild lactic acidosis has led to changes in the expression of some of the genes on the array, but that most remain unchanged. Thus our posterior distribution for each

The prior on

We define a signature to be a list of genes and associated weights. Using the posterior parameters from above we define the weight of gene

We use six cancer data sets with Affymetrix U133+ expression samples available on the Gene Expression Omnibus (GEO) web site. Details of the collection and measuring are contained in

Statistical factor analysis using BFRM estimates latent factors that represent common, underlying aspects of covariation among subsets of genes, typically representing the expression of each gene in terms of contributions from possibly several factors. The iterative analysis used here expands an initial set of signature genes by adding genes apparently associated with the estimated factors and then refitting the model. Full details of this algorithm are available in

Given a signature, we must choose a collection of tissue samples on which to train the factor model. Because of its relatively large size, the availability of CEL files, and the wealth of clinical and phenotypic information, we chose the data set from

We fit our binary regression and survival models using Shotgun Stochastic Search (SSS) routines from

Factor models are structured as in

Implicit in this formulation is the assumption that there is a set of vectors, equivalent to design vectors, which describe some part of the variation observed in the matrix of expression values,

Calculation of the activity of a set of factors,

Where

We computed the significance of the relationship between the lactic acidosis factors and the lactic acidosis signature by resampling the lactic acidosis signature weights and modeling the resulting scores with the factors. After 10,000 iterations, we fit the sampled r-squared values to a beta distribution. This figure shows a Q-Q plot of the distribution of resampled values versus the best fit beta distribution. Using this beta distribution, we find that the r-squared value from regressing the true signature scores on the factors is significant with p-value approximately 1e-13.
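The resampling procedure described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: "resampling the signature weights" is implemented here as a random permutation of the weights across genes, the signature score is taken to be a weighted sum of expression values, and the regression is ordinary least squares. The function names and the synthetic data in `demo` are hypothetical.

```python
import numpy as np
from scipy import stats


def r_squared(y, X):
    """R^2 from an ordinary least-squares fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def resampling_test(expr, weights, factors, n_iter=1000, seed=0):
    """expr: genes x samples, weights: genes, factors: samples x k.

    Returns (observed R^2, empirical p-value, beta-approximation p-value).
    """
    rng = np.random.default_rng(seed)
    observed = r_squared(weights @ expr, factors)
    # ASSUMPTION: the null is built by permuting weights across genes.
    null = np.array([
        r_squared(rng.permutation(weights) @ expr, factors)
        for _ in range(n_iter)
    ])
    # Fit a beta distribution on [0, 1] to the null R^2 values.
    a, b, _, _ = stats.beta.fit(null, floc=0, fscale=1)
    p_beta = stats.beta.sf(observed, a, b)
    p_empirical = (1 + np.sum(null >= observed)) / (n_iter + 1)
    return observed, p_empirical, p_beta


def demo(n_iter=500, seed=1):
    """Synthetic data: 20 of 200 genes load on 3 latent factors."""
    rng = np.random.default_rng(seed)
    n_genes, n_samples, k = 200, 100, 3
    factors = rng.normal(size=(n_samples, k))
    loadings = np.zeros((n_genes, k))
    loadings[:20] = np.abs(rng.normal(size=(20, k)))
    expr = loadings @ factors.T + rng.normal(size=(n_genes, n_samples))
    weights = np.zeros(n_genes)
    weights[:20] = rng.uniform(0.5, 1.5, size=20)
    return resampling_test(expr, weights, factors, n_iter=n_iter, seed=seed)


if __name__ == "__main__":
    print(demo())
```

The empirical p-value cannot fall below 1/(n_iter + 1), which is why a parametric approximation such as the beta fit is useful for quantifying significance far in the tail.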

Percent of variation across all discovered factors as a function of the number of principal components used.
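A curve of this kind can be computed from a matrix of factor scores with a singular value decomposition. The sketch below is illustrative only: it assumes the factors are arranged as rows and the samples as columns, centers each factor across samples, and uses synthetic data in place of the discovered factors. The names `cumulative_variance_explained` and `demo_curve` are hypothetical.

```python
import numpy as np


def cumulative_variance_explained(factor_scores):
    """Cumulative percent of variance captured by the leading principal components.

    factor_scores: array of shape (n_factors, n_samples).
    """
    # Center each factor across samples before the decomposition.
    centered = factor_scores - factor_scores.mean(axis=1, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    return 100.0 * np.cumsum(variance) / variance.sum()


def demo_curve(seed=0):
    """Synthetic stand-in: 67 factors driven by 5 latent components."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(5, 251))
    mixing = rng.normal(size=(67, 5))
    scores = mixing @ latent + 0.1 * rng.normal(size=(67, 251))
    return cumulative_variance_explained(scores)


if __name__ == "__main__":
    curve = demo_curve()
    for n_components in (1, 5, 10, 67):
        print(n_components, round(float(curve[n_components - 1]), 2))
```

A curve that rises steeply and then flattens indicates that many factors share a small number of underlying directions of variation.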

Figures (a) and (b) show the expression levels of the probes from the ER factor (discovered in breast tissue). (a) shows a conserved pattern of expression in the breast samples that is lost in the lung samples (b). (c) and (d) show the same figure, but for probes from the Tamoxifen susceptibility factor. For purposes of visualization, samples are sorted such that the first principal component is increasing. In figures (a) and (c) the rows are sorted according to increasing correlation with the first principal component. The ordering of the rows in figures (b) and (d) is forced to be the same as that in (a) and (c) respectively.
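The ordering used for these heat maps can be sketched in a few lines. This is a hedged illustration, assuming a probes-by-samples expression matrix and taking the first principal component from a singular value decomposition of the row-centered data; the function names are hypothetical, and the sign of a principal component is arbitrary, so the whole ordering may be reversed between runs on different data.

```python
import numpy as np


def first_pc_scores(expr):
    """Per-sample scores on the first principal component.

    expr: array of shape (n_probes, n_samples).
    """
    centered = expr - expr.mean(axis=1, keepdims=True)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    return s[0] * vt[0]


def heatmap_order(expr):
    """Return (row_order, col_order) for display.

    Columns (samples) are sorted by increasing first principal component;
    rows (probes) are sorted by increasing correlation with that component.
    """
    pc1 = first_pc_scores(expr)
    col_order = np.argsort(pc1)
    corr = np.array([np.corrcoef(row, pc1)[0, 1] for row in expr])
    row_order = np.argsort(corr)
    return row_order, col_order


def order_second_dataset(expr_other, row_order):
    """Reuse the reference row order; sort samples by their own first PC."""
    col_order = np.argsort(first_pc_scores(expr_other))
    return expr_other[row_order][:, col_order]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    breast = rng.normal(size=(12, 30))  # synthetic stand-in data
    lung = rng.normal(size=(12, 20))    # same probes, different samples
    rows, cols = heatmap_order(breast)
    print(breast[rows][:, cols].shape, order_second_dataset(lung, rows).shape)
```

Forcing the second data set to reuse the row order of the first is what makes a lost pattern visible: if the probes were co-regulated in both tissues, the same gradient would appear in both panels.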

Lactic acidosis factors discovered in lung cancer can distinguish between adenocarcinoma and squamous cell carcinoma (a) as well as stratify patients according to rates of recurrence (b). Factors discovered in ovarian cancer have similar prognostic ability (c).

As in

High dimensional sparse factor modeling: Applications in gene expression genomics. Reference 17 is currently in press, so we have included it as supplementary material.

In-vitro to in-vivo factor profiling in expression genomics. Reference 37 is currently in press, so we have included it as supplementary material.
