Conceived and designed the experiments: YY. Performed the experiments: RSS YY. Analyzed the data: YY RSS. Contributed reagents/materials/analysis tools: RSS YY. Wrote the paper: YY RSS FM.
The authors have declared that no competing interests exist.
Different data types can offer complementary perspectives on the same biological phenomenon. In cancer studies, for example, data on copy number alterations indicate losses and amplifications of genomic regions in tumours, while transcriptomic data point to the impact of genomic and environmental events on the internal wiring of the cell. Fusing different data provides a more comprehensive model of the cancer cell than that offered by any single type. However, biological signals in different patients exhibit diverse degrees of concordance due to cancer heterogeneity and inherent noise in the measurements. This is a particularly important issue in cancer subtype discovery, where personalised strategies to guide therapy are of vital importance. We present a nonparametric Bayesian model for discovering prognostic cancer subtypes by integrating gene expression and copy number variation data. Our model is constructed from a hierarchy of Dirichlet Processes and addresses three key challenges in data fusion: (i) to separate concordant from discordant signals, (ii) to select informative features, and (iii) to estimate the number of disease subtypes. Concordance of signals is assessed individually for each patient, giving us an additional level of insight into the underlying disease structure. We exemplify the power of our model in prostate cancer and breast cancer and show that it outperforms competing methods. In the prostate cancer data, we identify an entirely new subtype with extremely poor survival outcome and show how other analyses fail to detect it. In the breast cancer data, we find subtypes with superior prognostic value by using the concordant results. These discoveries were crucially dependent on our model's ability to distinguish concordant and discordant signals within each patient sample, and would otherwise have been missed. We therefore demonstrate the importance of taking a patient-specific approach, using highly flexible nonparametric Bayesian methods.
The goal of personalised medicine is to develop accurate diagnostic tests that identify patients who can benefit from targeted therapies. To achieve this goal it is necessary to stratify cancer patients into homogeneous subtypes according to which molecular aberrations their tumours exhibit. Prominent approaches for subtype definition combine information from different molecular levels, for example data on DNA copy number changes with data on mRNA expression changes. This is called data fusion. We contribute to this field by proposing a unified model that fuses different data types, finds informative features and estimates the number of subtypes in the data. The main strength of our model comes from the fact that we assess for each patient whether the different data agree on a subtype or not. Competing methods combine the data without checking for concordance of signals. On a breast cancer and a prostate cancer data set we show that concordance of signals has a strong influence on subtype definition and that our model allows us to define prognostic subtypes that would otherwise have been missed.
Molecular data show great promise to stratify patients into distinct subgroups that are indicative of disease development, response to medication and overall survival prospects
In addition to expression data there are also many other data types that can be informative about a patient's disease status. For example, somatic copy number alterations provide good biomarkers for cancer subtype classification
Data integration for subtype discovery poses several challenges that we address in this paper.
Challenge 1: Separating concordant from contradictory signals. While different molecular data are expected to share complementary information on common cellular processes, they can also contain contradictory signals because of the complexity of living cells and noise in the data. For example, genomic gains and losses may or may not be accompanied by concordant expression changes of the genes in the altered regions. The level of concordance may differ dramatically from patient to patient due to cancer heterogeneity. However, most existing integrative methods force different data types to be fused in all samples without reference to whether the data are concordant or contradictory in each patient.
Challenge 2: Selecting informative features. Identifying which measurements are informative about the underlying subtypes is particularly important when using genomic data because the number of measurements can be very large, e.g. in the tens of thousands or more in the case of microarrays. Because
Challenge 3: Estimating the number of subtypes. In many clustering algorithms this number is a parameter that needs to be set by the user
These three challenges are not independent of each other: whether or not the data show concordant signals for a subgroup of patients has a direct effect on which features should be selected as informative, which in turn has a direct influence on the estimate of the number of clusters. Thus, all three challenges need to be treated in a unified model.
Our approach is Patient-specific Data Fusion (PSDF) by Bayesian nonparametric modeling. In this paper, we propose a statistical model based on a two-level hierarchy of Dirichlet Process (infinite mixture) models (DPMs)
Different data types are fused (or not fused) on a sample-by-sample basis depending on the degree of concordance between two data types;
Input features are selected only if they are informative to clustering;
The most likely number of clusters is inferred automatically given the data.
Thus, the model not only identifies copy number alterations driving gene expression changes but simultaneously finds differences in regulation that distinguish one cancer subtype from another. In doing so it explores the basic scientific question of the extent to which copy number data can be fused with expression data in integrative cancer studies.
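To illustrate how a Dirichlet Process prior lets the number of clusters be inferred rather than fixed in advance, the following minimal sketch draws a partition from a Chinese restaurant process, a standard representation of the Dirichlet Process prior over partitions. It is not the PSDF sampler: the function name, the concentration parameter `alpha` and the seed are assumptions made purely for illustration.

```python
import random

def sample_crp_partition(n_items, alpha, seed=0):
    """Draw one partition of n_items from a Chinese restaurant process.

    alpha is a hypothetical concentration parameter: larger values make
    new clusters more likely. Returns one cluster label per item.
    """
    rng = random.Random(seed)
    cluster_sizes = []  # cluster_sizes[k] = items currently in cluster k
    labels = []
    for _ in range(n_items):
        # An existing cluster is chosen with weight equal to its size;
        # a brand-new cluster is opened with weight alpha.
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)
        else:
            cluster_sizes[k] += 1
        labels.append(k)
    return labels

labels = sample_crp_partition(n_items=106, alpha=1.0)
n_clusters = len(set(labels))  # not fixed beforehand; varies from draw to draw
```

The number of occupied clusters is an outcome of the draw, which is the property that a Dirichlet Process mixture model exploits when the cluster count is learned from the data.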
Several integrative clustering approaches have been proposed in the literature
We introduce PSDF as a unified model to address the above three key challenges in patient subtype discovery. To demonstrate the power of this patient-specific integrative method, we analyse a breast cancer data set and a prostate cancer data set. A high degree of concomitant change has been observed between copy number and expression in breast cancer
Bayesian nonparametric modeling provides a principled way to learn unknown structure in the data. Dirichlet Process (infinite mixture) models (DPMs)
PSDF groups patient samples on the basis of both gene expression and copy number alteration data. It also simultaneously distinguishes, on a sample-by-sample basis, between samples that can share concordant signal across the data types (
The patient sample belongs to one clustering partition, which is the same in both data sets. The clustering structure for this patient across the two data sets is said to be concordant.
The patient sample belongs to different clustering partitions in each data set. The clustering structure for this patient across the two data sets is said to be contradictory.
By introducing a binary indicator parameter (
By treating the data on a sample-by-sample basis, we can identify which samples are likely to belong in a fused state and which are likely to belong in an unfused state. This gives us a principled way of finding subgroups of samples with concordant or heterogeneous structure which, as we show below, leads to new insights about the disease and its subtypes.
Feature selection (biomarker discovery) is also built-in to PSDF, using two sets of binary indicator parameters,
Fuller details on this can be found in the
The breast cancer data from
First, copy number data are filtered based on whether there is a concomitant change between a locus's copy number and its own expression. This is to exclude passenger events without explicit downstream effects. Each expression probe is matched to its nearest copy number probe, allowing for multiple matches, i.e. a copy number probe can be matched to multiple expression probes. This results in 37,411 matched pairs of copy number and expression data annotated by expression probes. We then calculate the adjusted
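The probe-matching step described above can be sketched as a nearest-neighbour lookup on genomic position. This is only an illustration under stated assumptions: probes are represented by a single coordinate on the same chromosome, ties go to the upstream probe, and the function name is hypothetical rather than taken from the PSDF software.

```python
from bisect import bisect_left

def match_to_nearest(expr_positions, cn_positions):
    """For each expression probe, return the index of its nearest
    copy number probe. One copy number probe may be returned for
    several expression probes (multiple matches are allowed)."""
    # Sort copy number probes by position, remembering original indices.
    order = sorted(range(len(cn_positions)), key=lambda i: cn_positions[i])
    sorted_cn = [cn_positions[i] for i in order]
    matches = []
    for pos in expr_positions:
        j = bisect_left(sorted_cn, pos)  # first CN probe at or after pos
        # Only the neighbours on either side of pos can be the nearest.
        candidates = [c for c in (j - 1, j) if 0 <= c < len(sorted_cn)]
        best = min(candidates, key=lambda c: abs(sorted_cn[c] - pos))
        matches.append(order[best])
    return matches

# Toy coordinates (assumed): CN probe 1 is matched by two expression probes.
pairs = match_to_nearest([120, 480, 510, 1000, 50], [100, 500, 900])
```

Each returned index pairs one expression probe with one copy number probe, so the number of matched pairs equals the number of expression probes.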
PSDF yields 4 clusters for all 106 breast cancer samples and 3 fused clusters, containing only samples for which
Features are ranked by their probability of use in the MCMC sampling, from high to low, separately for copy number and expression features, as indicated on the left. (b–d) Posterior similarity matrices (red: high posterior probability between patient samples; blue: low posterior probability).
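A posterior similarity matrix of this kind is commonly estimated as the fraction of MCMC draws in which two samples are assigned to the same cluster. The sketch below follows that standard definition; the input layout (one list of cluster labels per MCMC draw) and the function name are assumptions for illustration.

```python
def posterior_similarity(label_draws):
    """Estimate pairwise co-clustering probabilities.

    label_draws: list of MCMC draws, each a list holding one cluster
    label per patient sample. Entry [i][j] of the result is the
    fraction of draws in which samples i and j share a cluster.
    """
    n_draws = len(label_draws)
    n_items = len(label_draws[0])
    counts = [[0] * n_items for _ in range(n_items)]
    for labels in label_draws:
        for i in range(n_items):
            for j in range(n_items):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / n_draws for c in row] for row in counts]

# Four toy draws over three samples (assumed data).
psm = posterior_similarity([[0, 0, 1], [0, 0, 1], [0, 1, 1], [1, 1, 0]])
```

Because only co-membership is counted, the estimate is unaffected by the arbitrary relabelling of clusters between draws.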
The case study results show the power of patient-specific data fusion. The similarity matrix for all items (
The unfused samples are also interesting. Parts of Clusters 1, 2 and 4, as well as the entire Cluster 3, are unfused, and for these samples there is considerable ambiguity in the similarity matrix (
The case study results also demonstrate the power of feature selection. For the informative features selected by PSDF, there are 60% of copy number and 40% of expression features. Copy number features from 8q (Chromosome 8 q arm), 17p (Chromosome 17 p arm), 17q, 20q are among the most frequently used. These regions harbor some of the most well known genes in breast cancer. For example, 8q contains
Clinical follow-up for this data set facilitates the assessment of data-driven subtype discovery with respect to prognostic outcome. For PSDF, the Kaplan-Meier breast cancer specific survival curves for all samples reveal a low survival group (PSDF 1), a good outcome group (PSDF 4), and two intermediate groups (PSDF 2 and 3), as shown in
Fused subtypes are prognostic in both events and timing. For the three fused clusters in
Subtype-specific features reveal functional implications. With respect to the genetic features that characterise these subtypes, the poor prognosis subtype (dark blue) has 8q copy number gains and over-expressions (see
For each cluster/subtype, we extract its cluster/subtype-specific genes based on both copy number and expression data. Limma
The subtype-specific genes are combined with a Protein-Protein Interaction (PPI) network to extract functional network modules. The PPI network is downloaded from HPRD, release 9, April 2010
The network module of PSDF 1 in
The node color in the network modules indicates the type of alteration relative to this cluster: red - copy number gain or over-expression, green - copy number loss or under-expression. The shape of nodes indicates the type of data: square - copy number, round - expression. (B) KEGG pathway enrichment maps for PSDF 1 and 2. The node colors indicate the significance of the enrichment result and the thickness of the edges indicates the amount of overlap between pathways.
The subtype-specific module 2 in
Meanwhile, KEGG
The PSDF-specific pathways for PSDF 1 include Cell Cycle, Oxidative Phosphorylation and Pyrimidine metabolism, which are known to be deregulated in breast cancer
PSDF 2 is characterised by deregulations in the Apoptosis pathway which includes several important genes such as TP53. Combined with the network module in
For the prostate cancer data set, there are 150 tumour samples with both copy number and expression data
To extract features, we use a slightly different approach since the scale of this data set is much larger than that of the breast cancer data. The substantially larger number of probes compared to the breast cancer study means that the probe-centric method is not suitable, hence we take a gene-centric approach, aggregating copy number and expression data to 12,718 genes based on array annotation. For copy number data, the aggregation is done by taking the median over the probes within a gene. For the expression data, the probe most highly correlated with the copy number profile of a gene is chosen to represent this gene. Even so, only modest correlations are observed between the two data types. Finally, 286 genes with highly correlated copy number and expression (adjusted
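The gene-centric aggregation described above can be sketched for a single gene as follows. The sketch assumes Pearson correlation and a signed (not absolute) comparison, neither of which is specified in the text, and the function names are hypothetical.

```python
from math import sqrt
from statistics import median

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def aggregate_gene(cn_probe_profiles, expr_probe_profiles):
    """Summarise one gene across patient samples.

    Each profile is a list with one value per patient sample.
    Copy number: per-sample median over the gene's probes.
    Expression: the probe best correlated with that median profile.
    """
    n_samples = len(cn_probe_profiles[0])
    cn_profile = [median(p[s] for p in cn_probe_profiles)
                  for s in range(n_samples)]
    expr_profile = max(expr_probe_profiles,
                       key=lambda p: pearson(p, cn_profile))
    return cn_profile, expr_profile

# Toy gene with three probes of each type over four samples (assumed data).
cn, expr = aggregate_gene(
    [[0, 1, 2, 3], [0, 1, 4, 3], [2, 1, 2, 5]],
    [[5, 4, 3, 2], [1, 2, 3, 5], [2, 2, 2, 2]],
)
```

Applying this per gene yields one copy number profile and one expression profile for each annotated gene, which can then be screened for correlated pairs.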
To compare with PSDF outcome, we take the original subtype classification for this data set
Features are ranked by their probability of use in the MCMC sampling, from high to low, separately for copy number and expression features, as indicated on the left. Color codes for the heatmap are the same as in
Significant differences in recurrence outcome were found among the PSDF clusters (log-rank test
Interestingly, although PSDF and iCluster share two clusters, PSDF/iCluster 2 and 3, the poor outcome cluster PSDF 7 is lost among the iCluster clusters. PSDF 7 is also not identified by the original TS subtypes. This is because if only copy number data are used, PSDF 4 and PSDF 7 would be clustered together. If only expression data are used, PSDF 5 and PSDF 7 are likely to be joined. Thus, clustering on a single data type is not able to recover this subtype, highlighting the strength of data fusion. Additionally, integrative clustering methods that force all samples to be fused, such as iCluster, will tend not to recover PSDF 7, instead dividing those samples between PSDF-4- and PSDF-5-like clusters. This is evidenced by the fact that PSDF 7 is largely unfused (Fusion status in
We focus on the two worst outcome groups PSDF 7 and PSDF 5 and examine their subtype-specific genes in the same manner as done before for the breast cancer data set. Interestingly, PSDF 7 is characterised by the under-expression of many functionally-related growth factors, such as
The node color in the network modules indicates the type of alteration relative to this cluster: red - copy number gain or over-expression, green - copy number loss or under-expression. The shape of nodes indicates the type of data: square - copy number, round - expression. (c–d) KEGG pathway enrichment maps for PSDF 7 and 5 module genes. The node colors indicate the enrichment significance and the thickness of the edges indicates the amount of overlap between pathways.
On the other hand, PSDF 5 features copy number losses of the functional network module centered at
This paper explores the potential of patient-specific data fusion to enhance predictive power in cancer subtype discovery. Cancer subtype discovery combining both genomics and transcriptomics leads to a more comprehensive understanding of the heterogeneous cellular contexts. By using a flexible, nonparametric model such as the one presented in this paper, we can learn both the concordant and contradictory structures underlying those multiple data types. This structure leads to an improved understanding of the functional components and pathway regulations for each cancer subtype, something that is essential for the future development of targeted therapeutics.
We propose a model that is able to separate concordant and discordant signals and find sub-structures based on either one data type or both. This is in contrast to most previous approaches, where samples are typically forced to cluster together based on both data types
We demonstrate that by identifying the concordant/
Functional analysis on subtype-specific genes reveals the genetic components that may lead to the poor outcome cancer subtypes. These are worthy of future investigation and may lead to therapeutic benefits.
With both breast cancer and prostate cancer data, PSDF is able to discover poor outcome subtypes with early-stage, highly frequent recurrences/deaths. These subtypes are not identified by other methods, which either force the data to be fused in all samples or cluster patients based on a single data type. We show that there exist both concordant and contradictory signals in these data, which, when forced to cluster together, can result in inferior subtype identification. Moreover, data fusion is necessary for predicting both the events and the timing of cancer survival/recurrence. Hence, taking this approach is vital in the discovery of new disease subtypes characterised by early-stage events.
A promising aspect of studying cancer subtypes is the identification of key pathways that are altered uniquely in a given subtype. Our network analyses show functionally interacting genes in the subtype-specific network modules whose deregulation may contribute to the poor outcome of a cancer subtype. The pathway enrichment analysis facilitates functional interpretation of the new clusters/subtypes in a manner coherent with the network modules. Underlying driver events for poor outcome may be revealed during this process, such as the over-expression of the Cell Cycle pathway in breast cancer, and the under-expression of the Endocytosis and Chemokine signaling pathways in prostate cancer. Further exploration of these results may lead to the discovery of new genes participating in cancer-related pathways, as well as the identification of treatment targets and the development of pathway inhibitors.
Our analysis results also highlight the difference between different cancer types. Previously, relatively low concordance between prostate cancer copy number and expression has been reported
PSDF extends the model of
PSDF is constructed from a two-level hierarchy of Dirichlet Processes, as shown in
The
Within any given mixture component from the Dirichlet Processes, we model the (discretised) data as being drawn from a multinomial distribution with a weakly informative multinomial prior. The features are assumed to be independent, giving rise to a naive Bayes data model for each data set. We use this data model for both gene expression and copy number data sets. Since our method uses discretised data as input, copy number calls are made with the R package CGHcall
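Under the naive Bayes assumption stated above, the likelihood of one discretised sample within a mixture component factorises over features. The sketch below evaluates that log-likelihood for given class probabilities; it deliberately leaves out the prior and any marginalisation, and the data layout and function name are assumptions for illustration only.

```python
from math import log

def naive_bayes_loglik(sample, class_probs):
    """Log-likelihood of one discretised sample in one mixture component.

    sample[f]              discrete level of feature f (e.g. 0, 1, 2)
    class_probs[f][level]  probability of that level for feature f
                           within this component (assumed known here)
    Features are treated as independent, so the log-likelihood is a sum.
    """
    return sum(log(class_probs[f][level]) for f, level in enumerate(sample))

# Two features with three levels each (toy probabilities).
probs = [[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]]
value = naive_bayes_loglik([0, 1], probs)  # log(0.5) + log(0.8)
```

The same function can be applied to either data type, since both are supplied to the model as discrete levels.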
We note that, in principle, this model could be extended to three or more data sources. In practice, however, this will become unwieldy, and so we restrict ourselves in this paper to considering fusion between two data sources. We are currently developing a related model that will scale much better with increasing numbers of data sources.
The naive Bayes data model used in
To perform feature selection, we will consider two different likelihoods for a given feature, corresponding to the feature being
This has the effect of defining an ‘indifference’ likelihood, where it makes no difference to the overall posterior (for the given feature) to which cluster any given sample is assigned. It is straightforward to write down the conditional distribution for a single indicator variable
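The idea of an indifference likelihood can be illustrated with a small sketch: when a feature is switched off, its likelihood term uses probabilities shared by all clusters, so it contributes the same amount whichever cluster a sample is assigned to. The shared background probabilities, the argument names and the function name are assumptions for illustration and are not taken from the PSDF implementation.

```python
from math import log

def feature_loglik(level, cluster_probs, background_probs, selected):
    """Log-likelihood contribution of one feature for one sample.

    selected=True  : use the cluster-specific probabilities.
    selected=False : use background probabilities common to all
                     clusters (the hypothetical indifference case).
    """
    probs = cluster_probs if selected else background_probs
    return log(probs[level])

background = [0.3, 0.4, 0.3]   # assumed, shared by all clusters
cluster_a = [0.7, 0.2, 0.1]    # assumed cluster-specific probabilities
cluster_b = [0.1, 0.2, 0.7]

# Switched off: identical contribution for both candidate clusters.
off_a = feature_loglik(0, cluster_a, background, selected=False)
off_b = feature_loglik(0, cluster_b, background, selected=False)
# Switched on: the feature discriminates between the clusters.
on_a = feature_loglik(0, cluster_a, background, selected=True)
on_b = feature_loglik(0, cluster_b, background, selected=True)
```

Since the switched-off term is constant across clusters, it cancels when cluster assignment probabilities are normalised.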
The switching on/off of a given feature can be regarded as a kind of model selection. Considering the limit of many samples (and hence negligible uncertainty in the value of the class probabilities for
To give improved mixing, we run 50 MCMC chains for each analysis. The chains are
All chains are examined using the R package
The multiple MCMC chains are used to compute uncertainties in statistics of interest (for example, the probability that a given feature is selected). This gives us a direct measure of chain mixing quality.
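One simple way to obtain such an uncertainty is to estimate the statistic separately in each chain and then summarise the spread across chains. The sketch below uses the across-chain mean and its standard error; this particular summary, the function name and the toy values are assumptions, since the text does not state which uncertainty measure was used.

```python
from math import sqrt
from statistics import mean, stdev

def across_chain_summary(per_chain_estimates):
    """Combine one estimate per MCMC chain into (mean, standard error).

    per_chain_estimates: e.g. the fraction of draws in each chain in
    which a given feature was selected. Needs at least two chains.
    """
    m = mean(per_chain_estimates)
    se = stdev(per_chain_estimates) / sqrt(len(per_chain_estimates))
    return m, se

# Selection probability of one feature in four chains (toy values).
estimate, std_error = across_chain_summary([0.8, 0.9, 1.0, 0.9])
```

A standard error that is large relative to the estimate indicates that the chains disagree, which is one sign of poor mixing.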
Each chain runs to completion in less than 48 hours on nodes of the University of Warwick's high performance computer cluster.
In order to validate our model, we performed a simulation study. We constructed a pair of synthetic data sets. For each synthetic data set, we started with the 106
To each synthetic data set, we then added 50
Finally, we added to each synthetic data set 200
| | signal items | noise items |
| --- | --- | --- |
| fused | 105 | 17 |
| unfused | 1 | 33 |

| | signal features | noise features |
| --- | --- | --- |
| selected | 392 | 0 |
| rejected | 8 | 400 |
Shown are the fused/unfused items (top) and the selected/rejected features (bottom). The fusion threshold is set at
The method successfully finds 105 of the 106 fused items. It also identifies 17 of the noise items as being fused. We note that we expect some level of coincidental fusion for the noise items, where they happen to have been drawn from the same cluster. For example, if we assume there are 5 (equally-sized) underlying clusters in the copy number data, we expect
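The counts reported in the simulation tables can be converted into simple rates, which makes the fusion and feature-selection performance easier to compare. The sketch below uses only the counts given in the tables; the function name and dictionary keys are hypothetical.

```python
def simulation_rates(fused_signal, unfused_signal, fused_noise, unfused_noise,
                     selected_signal, rejected_signal,
                     selected_noise, rejected_noise):
    """Turn the simulation-study counts into proportions."""
    return {
        # Fraction of signal items correctly identified as fused.
        "signal_items_fused": fused_signal / (fused_signal + unfused_signal),
        # Fraction of noise items identified as fused.
        "noise_items_fused": fused_noise / (fused_noise + unfused_noise),
        # Fraction of signal features that were selected.
        "signal_features_selected":
            selected_signal / (selected_signal + rejected_signal),
        # Fraction of noise features that were selected.
        "noise_features_selected":
            selected_noise / (selected_noise + rejected_noise),
    }

# Counts taken from the two tables of the simulation study.
rates = simulation_rates(105, 1, 17, 33, 392, 8, 0, 400)
```

With these counts, 105 of 106 signal items are fused, 17 of 50 noise items are fused, 392 of 400 signal features are selected and none of the 400 noise features is selected.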