Browse Subject Areas

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Gene expression based survival prediction for cancer patients—A topic modeling approach

  • Luke Kumar ,

    Contributed equally to this work with: Luke Kumar, Russell Greiner

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada, Alberta Machine Intelligence Institute (Amii), Edmonton, Alberta, Canada

  • Russell Greiner

    Contributed equally to this work with: Luke Kumar, Russell Greiner

    Roles Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada, Alberta Machine Intelligence Institute (Amii), Edmonton, Alberta, Canada

Gene expression based survival prediction for cancer patients—A topic modeling approach

  • Luke Kumar, 
  • Russell Greiner


Cancer is one of the leading cause of death, worldwide. Many believe that genomic data will enable us to better predict the survival time of these patients, which will lead to better, more personalized treatment options and patient care. As standard survival prediction models have a hard time coping with the high-dimensionality of such gene expression data, many projects use some dimensionality reduction techniques to overcome this hurdle. We introduce a novel methodology, inspired by topic modeling from the natural language domain, to derive expressive features from the high-dimensional gene expression data. There, a document is represented as a mixture over a relatively small number of topics, where each topic corresponds to a distribution over the words; here, to accommodate the heterogeneity of a patient’s cancer, we represent each patient (≈ document) as a mixture over cancer-topics, where each cancer-topic is a mixture over gene expression values (≈ words). This required some extensions to the standard LDA model—e.g., to accommodate the real-valued expression values—leading to our novel discretized Latent Dirichlet Allocation (dLDA) procedure. After using this dLDA to learn these cancer-topics, we can then express each patient as a distribution over a small number of cancer-topics, then use this low-dimensional “distribution vector” as input to a learning algorithm—here, we ran the recent survival prediction algorithm, MTLR, on this representation of the cancer dataset. We initially focus on the METABRIC dataset, which describes each of n = 1,981 breast cancer patients using the r = 49,576 gene expression values, from microarrays. Our results show that our approach (dLDA followed by MTLR) provides survival estimates that are more accurate than standard models, in terms of the standard Concordance measure. We then validate this “dLDA+MTLR” approach by running it on the n = 883 Pan-kidney (KIPAN) dataset, over r = 15,529 gene expression values—here using the mRNAseq modality—and find that it again achieves excellent results. In both cases, we also show that the resulting model is calibrated, using the recent “D-calibrated” measure. These successes, in two different cancer types and expression modalities, demonstrates the generality, and the effectiveness, of this approach. The dLDA+MTLR source code is available at

1 Introduction

The World Health Organization reports that cancer has become the second leading cause of death globally, as approximately 1 in 6 deaths are caused by some form of cancer [1]. Moreover, cancers are very heterogeneous, in that the outcomes can vary widely for patients with similar diagnoses, who receive the same treatment regimen. This has motivated researchers to seek other features to help predict individual outcomes. Many such analyses use just clinical features. Unfortunately, features such as lymph node status and histological grade, while predictive of metastases, do not appear to be sufficient to reliably categorize clinical outcome [2]. This has led to many efforts to improve the prognosis for cancer, based on genomics data (e.g., gene expression (GE) or copy number variation (CNV)), possibly along with the clinical data [26]. Focusing for now on breast cancer, van’t Veer et al. [2] used the expression of 70 genes to distinguish high vs low risk of distant metastases within five years. Parker et al. [4] identified five subtypes of breast cancer, based on a panel of 50 genes (PAM50): luminal A, luminal B, HER2-enriched, basal-like, and normal-like. Later, Curtis et al. [7] examined ≈2000 patients from a wide study combining clinical and genomic data, and identified around ten subtypes. All three of these studies showed that their respective subtypes produce significantly different Kaplan-Meier survival curves [8], suggesting such molecular variation does influence the disease progression. There are also many other systems that use such expression information to divide the patients into two categories: high- vs low-risk; cf., [2, 6].

More recently, many survival prediction models have been applied to cancer cohorts, with the goal of estimating survival times for individual patients; some are based on standard statistical survival analysis techniques, and others based on classic regression algorithms—e.g., random survival forests [9] or support vector regression for censored data (SVRc) [10]. With the growing number of gene expression experiments being cataloged for analysis, we need to develop survival prediction models that can utilize such high dimensional data. Our work describes such a system that can learn effective survival prediction models from high-dimensional gene expression data.

The 2012 DREAM Breast Cancer Challenge (BCC) was designed to focus the community’s efforts to improve breast cancer survival prediction [3]. Its organizers made available clinical and genomic data (GE and CNV) of ≈2000 patients from the [7] study (mentioned above). Each submission to the BCC challenge mapped each patient to a single real value (called “risk”), which is predicting that patients with higher risk should die earlier than those with lower risk. The entries were therefore evaluated based on the concordance measure: basically, the percentage of these pairwise predictions that were correct [11]. This is standard, in that many survival prediction tasks use the concordance as the primary measure to assess the performance of the survival predictors, here and in other challenges [12]. The winning model [13] performed statistically better than the state-of-the-art benchmark models [3].

This paper explores several dimensionality reduction technique, including a novel approach based on topic modeling, “discretized Latent Dirichlet Allocation” (dLDA), seeking one that can produce highly predictive features from the high-dimensional gene expression data. We explored several ways to apply this topic-modeling approach to gene expression data, to identify the best ways to use it to map the gene expression description into a much lower dimensional description (from ≈50K features to 30 in this METABRIC dataset). We then gave the resulting transformed data as input to a recently-developed non-parametric learning algorithm, multi-task logistic regression (MTLR), which produced a model that can then predict an individual’s survival distribution [14]. We show that this predictor performs better than other standard survival analysis tools in terms of concordance. We also found that it was “D-calibrated” [15, 16]; see Appendix B.2.

To test the generality of our learning approach (dLDA + MTLR), we then applied the same learning algorithm—the one that worked for the METABRIC microarray gene expression dataset—to the Pan-Kidney dataset, which is a different type of cancer (kidney, not breast), and is described using a different type of features (mRNAseq, not microarray). We found that the resulting predictor was also extremely effective, in terms of both concordance and D-calibration.

This paper provides the following three contributions: (1) We produce an extension to LDA, called “dLDA”, needed to handle continuous data; (2) we use this as input to a survival prediction tool, MTLR—introducing that tool to this bioinformatics community; and (3) we demonstrate that this dLDA+MTLR combination works robustly, in two different datasets, using two different modalities—working better than some other standard approaches, in survival prediction.

Section 2 introduces the basic concepts, related to the survival prediction task in general and latent dirichlet allocation; Section 3 then describes the datasets used in this study; and Section 4 presents an overview of learning and performance tasks, at a high level. Section 5 (resp., 6, 7) then presents our results (resp., discussions, contributions). The supplementary appendices provide additional figures, tables, and other and material—e.g., defining some of the terms, and introducing “D-calibration”.

2 Foundations

This section provides the foundations: Section 2.1 overviews the survival prediction task in general then Section 2.2 describes Latent Dirichlet Allocation (LDA), first showing its original natural language context, then discussing how we need to extend it for our gene expression context. These significant modifications lead to a discretized variant, dLDA. We also contrast this approach with other survival analysis of gene expressions.

2.1 Survival prediction

Survival prediction is similar to regression as both involve learning a model that regresses the covariates of an individual to estimate the value of a dependent real-valued response variable—here, that variable is “time to event” (where the standard event is “death”). But survival prediction differs from the standard regression task as its response variable is not fully observed in all training instances—this tasks allows many of the instances to be “right censored”, in that we only see a lower bound of the response value. This might happen if a subject was alive when the study ended, meaning we only know that she lived at least (say) 5 years after the starting time, but do not know whether she actually lived 5 years and a day, or 30 years. This also happens if a subject drops out of a study, after say 2.3 years, and is then lost to follow-up; etc. Moreover, one cannot simply ignore such instances as it is common for many (or often, most) of the training instances to be right-censored; see Table 1. Such “partial label information” is problematic for standard regression techniques, which assume the label is completely specified for each training instance. Fortunately, there are survival prediction algorithms that can learn an effective model, from a cohort that includes such censored data. Each such dataset contains descriptions of a set of instances (e.g., patients), as well as two “labels” for each: one is the time, corresponding to the time from diagnosis to a final date (either death, or time of last follow-up) and the other is the status bit, which indicates whether the patient was alive at that final date (Fig 1).

Fig 1. Survival prediction training and performance tasks.

Training-Task: Historical data with event times and censor-status along with covariates are used to train a model—top-to-bottom. Performance-Task: new patient covariates are input to the learned model to produce a prediction of survival time—bottom, left-to-right. (Included picture designed by Freepik.).

2.1.1 Patient specific survival prediction using the MTLR model.

This project considered 3 ways to learn a survival model: The standard approaches—Cox and Regularized Cox (RCox)—are overviewed in Appendix A.5. This subsection describes the relatively-new MTLR [14] system, which learns a model (from survival data) that, given a description of a patient x ∈ ℜr, produces a survival curve, which specifies the probability of death D, P(Dt | x) vs t for all times t ≥ 0. This survival curve is similar to a Kaplan–Meier curve [8], but incorporates all of the patient specific features x. In more detail: MTLR first identifies m time points {ti}i=1‥m and then learns a variant of a logistic regression function, parameterized by W = {[wi, bi]}i=1‥m over these m time points, a different such function for each time ti—meaning W is a matrix of size m × (r + 1). Using the random variable D for the time of death for the patient described by x: (1) The MTLR model then combines the values of the PMF (probability mass function) into a CMF (cumulative mass function), by adding them in the reverse order—hence from the probability value of 1 at t0 = 0—i.e., PrW(D ≥ 0 | x) = 1—down to smaller values as the time t increases. Fig 2 shows the individual survival curves of several patients. Here, we view a patient’s descretized survival time d ∈ ℜ≥0 as a binary vector (of classification labels) y(d) = [y1(d), y2(d), …, ym(d)], where each yj(d) ∈ {0, 1} encodes that patient’s survival status at each time interval [tj, tj+1]: yj(d) = 0 (no death yet) for all j with tj < d and yj(d) = 1 (death) for all tjd. The learning system attempts to optimize (2) (This formula applies to uncensored patients; we apply the obvious extension to deal with censored instances.) This overall equation includes a L2 regularization term to reduce the risk of overfitting. The MTLR parameter m (the number of time points) is set to the square-root of the number of instances in all our experiments.

Fig 2. The dashed line is the Kaplan-Meier plot for this dataset.

Each of the other 5 curves is a patient-specific survival curve, for 5 different METABRIC patients, from a learned MTLR model. The curves show that there are very different prognoses for the different patients, even though they all are breast cancer patients from the same cohort: here, the patient with the orange survival curve (near the top) has a very good prognosis, especially compared to the patient with the blue survival curve.

Given the learned parameters W, we can then use Eq 1 to produce a curve for each patient; we can then use the (negative of) the mean of the patient’s specific predicted survival distribution as her risk score. Yu et al. [14] presents more detailed explanations of model formulation, parameter learning (W), and the prediction task. MTLR differs from many other models (such as the standard Cox model) as: (1) MTLR produces a survival function, rather than just a risk score; and (2) MTLR does not make the proportional hazards assumption—i.e., it allows effect of each covariate to change with time. See also Haider et al. [16]. Note this is the learning process of LearnSurvivalModel (LSM[Ψ=MTLR]) appearing below in Fig 3, and Section 4.1.

Fig 3. Overview of learning process.

LSM = LearnSurvivalModel Uses Ψ ∈ {MTLR, Cox, RCox} for type of learner; ρ ∈ {dLDA, PCA} for the type of basis; XGE are the gene expression values, XCF are clinical features for a set of patients, “Lbl” is the set of their (survival prediction) labels, is the basis, of type ρ based on the instance XGE; W is the learned survival model, of type Ψ; and Ω is information about the preprocessing. Each (unrounded) box corresponds to data, whose dimensions appear around it. Each row is a patient, and each column, a feature. Each rounded box is a subroutine; here we show the input and output of each.

2.2 Discretized Latent Dirichlet Allocation (dLDA)

Latent Dirichlet Allocation (LDA) is a widely used generative model [17], with many successful applications in natural language (NL) processing. LDA views each document as a distribution over multiple topics (document-topics distribution), where each topic is a distribution over a set of words (topic-words distribution)—that is, LDA assumes that each word in a document is generated by first sampling a topic from the document’s document-topics distribution and then sampling a word from the selected topic’s topic-words distribution. Given the set of topics (each corresponding to a specific topic-word distribution), we can view each document as its distribution over topics, which is very low dimensional.

The LDA learning process first identifies the latent topics—that is, the topic-words distributions corresponding to each latent topic—based on the words that frequently co-occur across multiple documents; n.b., it just uses the documents themselves, but not the labels. For example, it might find that many documents with the word “ball” also included “opponent” and “score”; and vice versa. Similarly, “finances”, “transaction”, and “bank” often co-occur, as do “saint”, “belief” and “pray”. Speaking loosely, the topic-model-learner might then form one topic, , that gives high probabilities to the first set of words (and relatively low probabilities to the remaining words)—perhaps (Technically, each topic is specified as a Dirichlet distribution over the set of words, βi. To simplify the presentation, here we are showing their expected values, these values are based on the priors; see Appendix A.2(3).)

This corresponds to an n-tuple over the n words; we call this . It would similarly identify a second topic with the n-tuple that gives high probabilities to the different set of words, etc. (While we might view the first topic as related to sports, the second related to finances, and third to religion, that is simply our interpretation, and is not needed by the learning algorithm. Other topics might not be so obvious to interpret.) This produces the topic-words distribution over K topics.

The learner would then map each document into a “distribution” over this set of K topics—perhaps document f1 would be decomposed as Θ(f1) = [θ1(f1), θ2(f1) …, θK(f1)] = [0.01, 13.02, 50.01 …, 0.03]—these are parameters for a Dirichlet distribution, which are non-negative, but do not add up to 1. These are different for different documents—e.g., perhaps f2 is expressed as Θ(f2) = [12.03, 0.001, 3.1, …, 2.4], etc. This is the document-topic distribution {Θ(fj)}j=1‥m over the m documents.

The specific learning process depends on the distributional form of the document-topics and topic-words distributions (here, we use Dirichlet for both) and also the number of latent topics, K. Given this, the LDA learning process finds the inherent structure present in the data—i.e., a model (topic-words distributions for each of the K topics ) that maximizes the likelihood of the training data.

The same way certain sets of words often co-occur in a document, similarly sets of genes are known to be co-regulated: under some condition (corresponding to a “c_topic”), every gene in that set will have some additional regulation—some will be over-expressed, each by its own amount, and the others will be under-expressed. Moreover, just as a natural language (NL) topic typically involves relatively few words, most c_topics effectively involve relatively few genes. Also, just like a document may involve a mixture of many topics, each to its own degree, so a patient’s cancer often involves multiple c_topics; see work on cancer subclones [18]. This has motivated many researchers to use some version of topic modeling to model gene expression values, under various (sets of) conditions.

For example, Rogers et al. [19] proposed Latent Process Decomposition (LPD), a probabilistic graphical model that was inspired by LDA, for microarray data, and presented clustering of genes that led to results comparable to those produced by hierarchical clustering. (Their results are descriptive; they do not use the results in any downstream evaluation). Later Masada et al. [20] proposed improvements to the original LPD approach and showed similar results. Bicego et al. [21] report topic modeling approaches (including LPD) were useful in classification tasks with gene expression data. They applied several topic models as dimensionality reduction tools to 10 different gene expression data sets, and found that the features from the topic models led to better predictors.

Further, Lin et al. [22] reviewed various different topic models applied to gene expression data, including LDA and probabilistic latent semantic analysis (PLSA) [23], as well as the topic model approaches described above, for gene classification and clustering. They note that the topic model approaches improve over other models as one can easily interpret the topic-words distributions and the mixed membership nature of the document-topic distribution.

However, none of these tasks were survival analysis. They used shifting and scaling to convert the continuous gene expression values to discrete values; we considered this approach for our data, but found that it was not able to learn distinct topics for our data. Moreover, this gave all patients very similar document-topic distributions. Dawson et al. [24] proposed a survival supervised LDA model, called survLDA, as an extension of supervised LDA [25]. survLDA uses a Cox model [26] to model the response variable (survival time) instead of the generalized linear model [27] used in supervised LDA [25]. But Dawson et al. [24] reported that the topics learned from survLDA were very similar to the ones learned from the general (unsupervised) LDA model.

Here, we apply the “standard” topic-modeling approach to gene expression data, for the survival prediction task. While previous systems applied topic modeling techniques to gene expression data, very few have applied topic models to predict a patient’s survival times (and none to our knowledge have used mRNAseq expression data). Our work presents a more direct analogue to the NL topic modeling that can be applied to our cohort of patients with gene expression data, where each patient corresponds to a document and the genes/probes in the expression data correspond to the words that form the document. This requires making some significant modifications to the standard LDA model, which assumes the observations are frequencies of words, which are non-negative integers that generally follows a monotonically decreasing distribution. By contrast, gene expression values are arbitrary real values, believed to follow a skewed Gaussian distribution [28]; see also Fig 4. (This is also true for mRNAseq, as we need to normalize the expression counts to be comparable, from patient to patient).

Fig 4. Histogram of the normalized Gene Expression values , from METABRIC, showing how we descretized them into essentially equal-width bins.

Note the heights are on a log-scale. The material under the histogram—involving x and y—compare two ways to compute the discretized Gene Expression Values (dGEV): The top Enc_A discretizes the GEVs into a single count-feature A(⋅), and the bottom Enc_B discretizes the GEVs into two count-features B(⋅) representing over-expression and under-expression, respectively.

We follow the approach of explicitly discretizing the expression values in a preprocessing step, so the resulting values basically, approximate a Zipf distribution. There are still some subtleties here—e.g., while the NL situation involves only non-negative integers, an affected gene can be either over-expressed, or under-expressed—i.e., we need to deal with two directions of “deviation”, while NL’s LDA just deals with one direction; see Section 4.1.1. We refer to our model as dLDA and the discretized gene expression values as dGEVs. The same way the standard LDA approach reduces the description of a document from a ≈ 105-dimensional vector (corresponding to the words used in that document) to a few dozen values (the “distribution” of the topics), this dLDA approach reduces the ≈ 50K-dimensional gene expression tuple to a few dozen values—here the “distribution” of the c_topics. Fig 5 summarizes this process: using the subroutines defined in Section 4 below, at learning time, ComputeBasis[ρ=dLDA] first identifies the set of relevant c_topics from the set of gene expression values , then later (at performance time), UseBasis[ρ=dLDA] uses those learned c_topics to transform a new patient’s high-dimensional gene expression profile to a low-dimensional c_topic-profile, xGE—here going from 50K values to 30.

Fig 5. The ComputeBasis[ρ=dLDA] process (shown top-to-bottom, on left) uses a set of high-dimensional gene expression vectors from many patients, to produce a set of basis vectors (corresponding the parameters of the LDA) .

The UseBasis[ρ=dLDA] process (shown left-to-right horizontally) uses the set of dLDA “basis vectors” to transform a gene expression vector from a novel patient, into a low-dimensional description . (Note ComputeBasis[ρ=dLDA] also uses other information, and Lbl, to determine the number K of basis vectors; to simplify the figure, we did not show this.).

Section 5 presents empirical evidence that this method works effectively for our survival prediction task; Appendix C.1 shows that it performs better than the LPD technique.

3 Datasets used

We apply our methods to two large gene expression datasets: the METABRIC breast cancer cohort [7] (mircroarray) and the Pan-kidney cohort KIPAN (mRNAseq) [29, 30]. We initially focus on the METABRIC dataset [40] which is one of the largest available survival studies that includes genomic information. In 2012, the Breast Cancer Prognostic Challenge (BCC) organizers released the METABRIC (Molecular Taxonomy of Breast Cancer International Consortium) dataset for training [7]. While they subsequently released a second dataset (OSLO) for final testing [7], we are not using it for several reasons: (1) METABRIC provided disease-specific survival (DS), which considers only breast cancer death (BC-based death), rather than all causes of death [13]. By contrast, OSLO provides “overall survival”, which does not distinguish BC-based deaths from others. As DS is clearly better for our purpose, it is better to evaluate on the METABRIC dataset. (2a) OSLO and METABRIC contained different sets of probes—and in particular, OSLO contains only ∼ 80% of the METABRIC probes. (2b) Similarly, the OSLO dataset is also missing some of the clinical covariates that are present in the METABRIC dataset—e.g., menopausal status, group, stage, lymph nodes removed, etc.; see [3, Table 1]. This means a “METABRIC-OSLO study” would need to exclude some METABRIC features and some METABRIC probes.

We then used a second independent dataset, to verify the effectiveness of our “dLDA+MTLR” approach. Here, we did not use OSLO, as we wanted to explore a different type of cancer, and also use a different platform, to show that our system could still identify an appropriate (and necessarily different) set of cancer-topics (c_topics). We therefore used the KIPAN dataset from TCGA (The Cancer Genome Atlas), as it (also) contains a large number of patients and provides survival information.

Table 1 lists some of the important characteristics of these datasets. Note that KIPAN contains 15,529 genes, while METABRIC has 49,576 probes. This is because many METABRIC probes may correspond to the same gene each targeting a different DNA segment of the gene. As different probes for the same gene might behave differently, we gave our learning algorithm the complete set of probes. Our results on the KIPAN dataset show that our approach also works when dealing with gene expression data from a totally different cancer and platform (here kidney not breast, and mRNAseq rather than Microarray)—demonstrating the generality of our approach.

3.1 Training vs test data

We apply the same experimental procedure to both datasets (METABRIC and KIPAN): We partition each dataset into two subsets, and use 80% of the data for training and the remaining 20% for testing. Both partitions contain instances with comparable ranges of survival times and comparable censored-versus-uncensored ratio. When necessary, we ran internal cross-validation, within the training set, to find good settings for parameters, etc.

4 Overview of learning and performance processes

As typical for Supervised Machine Learning systems, we need to define two processes:

  • The learning algorithm, LearnSurvivalModel
    LSM[ρ=dLDA; Ψ=MTLR]([XGE, XCF], Lbl) takes a labeled dataset, involving both gene expression data XGE and clinical features XCF (and survival-labels Lbl) for many patients, and computes a Ψ = MTLR survival model W. (Many subroutines are parameterized by a dimensionality reduction technique ρ ∈ {dLDA, PCA}, and/or by a survival learning algorithm Ψ ∈ {MTLR, Cox, RCox}. We use notation “Alg[ρ; Ψ](⋅)” to identify the specific parameters; hence LSM[ρ=dLDA; Ψ=MTLR]([XGE, XCF], Lbl) is dealing with the ρ = dLDA encoding and Ψ = MTLR survival learning algorithm).
    It also returns the ρ = dLDA “basis set” (here, think of a set of c_topic distributions), and some information about the pre-processing performed, Ω. See Fig 3.
  • The performance algorithm, UseSurvivalModel
    USM[ρ=dLDA; Ψ=MTLR]([xGE, xCF], , W, Ω), takes a description of an individual (both gene expression xGE, and clinical features xCF), as well as the ρ = dLDA basis set and the Ψ = MTLR survival model W (and pre-processing information Ω), and returns a specific survival prediction for this individual, from which we can compute that person’s risk score. See Fig 6.
Fig 6. Overview of performance process, USM = UseSurvivalModel.

xGE and xCF are the gene expression values and clinical feature values, for a single patient; see also terms ρ, Ψ, Ω, , W from Fig 3.

To simplify the presentation, the main text will describe the process at a high-level, skipping most of the details. Notice these functions are parameterized by the type of dimensionality reduction ρ and the survival learner Ψ. This section will especially focus on the novel aspects here, which are the ρ = dLDA transformation (Section 2.2), which complicates the ComputeBasis[ρ=dLDA](⋯) function (Section 4.1.1); and the Ψ = MTLR algorithm for learning the survival model (Section 2.1.1). Appendix A summarizes the more standard ρ = PCA approach to reducing the number of features, and the more standard survival models Ψ ∈ {Cox, RCox}, as well as other details about the learning, and performance models, in general.

4.1 Learning system LSM

Here, LSM[ρ=dLDA; Ψ=MTLR] ([XGE, XCF], Lbl) first calls PreProcess, which fills-in the missing values in the XCF clinical features (producing ), and normalizes the real-valued XGE genetic features, which is basically computing the z-scores , over all of the values. It then calls ComputeBasis[ρ=dLDA](⋯) to compute a set of c_topics from the gene expression data (as well as the other inputs), then calls UseBasis[ρ = dLDA](), which “projects” onto this to find a low dimensional description of the genetic information; see Fig 5. These projected values, together with and Lbl, form the labeled training set given to the Ψ = MTLR learning system, which computes a survival model W. Here, the LSM process returns the dLDA “basis” and the MTLR-model W. (Further details appear in Appendix A.)

4.1.1 ComputeBasis[ρ=dLDA](⋯) function.

As noted above, the ≈50,000 expression values for each patient is so large that most standard learning algorithms would overfit. We consider two ways to reduce the dimensionality. One standard approach, Principal Component Analysis (PCA), is discussed in Appendix A.4. Here, we discuss a different approach, dLDA, that uses the Latent Dirichlet Analysis.

The PreProcess routine computes z-scores for the gene expression values XGE; the ComputeBasis[ρ=dLDA] subroutine then has to transform those real values to the non-negative integers required by LDA—moreover, it was designed to deal with word counts in documents where, in any given document, most words appear 0 times, then many fewer words appear once, then yet fewer words appear twice, etc. We therefore need a method for converting the real values into non-negative integers.

This process therefore discretizes the standardized gene expression values (in ) into the integers {-10, -9, …, -1, 0, 1, …, 9, 10}, by mapping each real number to the integer indexing some essentially equal-sized bins; see Fig 4, and Appendix A.2 for details.

This does map each gene expression to an integer, but this includes both positive and negative values. Given that over-expression is different from under-expression, an obvious encoding uses two non-negative integer values for each gene: mapping +2 to [2, 0], and −3 to [0, 3], etc. Note that the range of each component of the encoding will be non-negative integers, and that most of the values will be 0, then fewer will be 1, etc.—as desired. However, this does double the dimensionality of representation; i.e., we now have twice the number of genes: UNDER-‘gene_name’ and OVER-‘gene_name’.

(Below we call this the Enc_B encoding; see also B(⋅) at the bottom of Fig 4.) Given that very few values are < −1 (in METABRIC, over 14% (normalized) expression values were > 1, but less than less than 0.04% were < −1; recall that heights in Fig 4 are on a log scale), we considered another option: collapsing the +values and −values to a single value—so both +4 and −4 would be encoded as 4. This would mean only half as many features (which would reduce the chance of overfitting), and would continue to note when a gene had an exceptional value. (This is the Enc_A encoding, which corresponds to the A(⋅) at the bottom of Fig 4.) As it was not clear which approach would work better, our implementation explicitly considered both options—and used the training set to decide which worked best; see below.

The standard LDA algorithm also needs to know the number of topics (here c_topics) K to produce. ComputeBasis uses (internal) cross-validation to find the best value for K, over the range K ∈ {5, 10, 15, …, 150}, as well as encoding technique t ∈ {Enc_A, Enc_B}—seeking the setting leading to the Cox model with the best concordance (on each held-out portion). See Appendix A.2 for details. After finding the best K* and encoding t*, ComputeBasis then finds the K* c_topics on the t*-encoded (preprocessed) training gene expression data ; this is the c_topic distribution, .

The vertical left-side of Fig 5 gives a high-level description of the ComputeBasis[ρ=dLDA] process: given a large set of (preprocessed) high-dimensional gene expression profiles, produce a small set of c_topics (each corresponding to a mapping from the gene expression profiles). We will later describe the UseBasis[ρ=dLDA] process that uses those c_topics to transform the high-dimensional gene expression profile of a novel instance, into a small dimensional set of values—see the left-to-right “Performance Process” part here. At this abstract level, it is easy to see that it nicely matches the ρ = PCA process, where ComputeBasis[ρ=PCA] would find the top principle components of the datasets (here, the box would be those components), which UseBasis[ρ=PCA] could then use to transform a new gene expression profile into that low-dimensional “PC-space”.

4.2 Performance system, USM

As shown in Fig 6, the USM[ρ=dLDA;Ψ=MTLR]([xGE, xCF], W, Ω) system applies the learned Ψ = MTLR model W, to a PreProcess’ed description of a novel patient, , whose gene expression values have been “projected” into the relevant basis by UseBasis[ρ=dLDA]. This produces a survival curve, which it then uses to produce that patient’s predicted risk score: the negative of the expected time for this distribution, which corresponds to the area under its survival curve.

Each of the various subroutines are described in an appendix: PreProcess’, UseBasis[ρ=dLDA] and UseBasis[ρ=PCA] are described in Appendices A.1, A.3 and A.4, respectively. The UseModel[Ψ=Cox] and UseModel[Ψ=RCox] produce standard risk scores for each patient, obtained by applying the learned Cox (resp., RCox) model to the patient’s clinical and gene expression features; see Appendix A.5.

5 Experimental results

As noted above, we intentionally designed our learning and performance systems (Figs 3 and 6) to be very general—to allow two types of basis ρ ∈ {dLDA, PCA} and three different survival prediction algorithms Ψ ∈ {MTLR, Cox, RCox}. This allows us to explore 2 × 3 frameworks, on the two different datasets (METABRIC and KIPAN). For each, the learner uses internal internal cross-validation to find the optimal parameters. Below we report the results of each optimized model on the held-out set, focusing on the Concordance Index (CI)—a discriminator measure. We also discuss a calibration measure of these results; see Appendix B.2.

We also present our experimental results from the BCC Dream Challenge winner’s model [13]. As discussed in Section 4.1.1, ComputeBasis[ρ=dLDA] ran internal cross-validation on the training set to determine the appropriate encoding t* ∈ {Enc_A, Enc_B} and the optimal number of c_topics for the dLDA model K* from a large potential values (see Algorithm 1 in Appendix A.2). Our experiments found that the discretization t* = Enc_B, along with K* = 30 c_topics, produced the best dLDA algorithm for survival prediction in METABRIC; after fixing the encoding scheme as Enc_B, we used the same technique on the KIPAN dataset and found K = 50 c_topics to be the best. We used the C implementation from Blei et al. [17, 41] to compute the c_topics. On a single 2.66GHz processor with on 16Gb memory, a single fold takes around ∼20–30 hours (more time for larger K). We, of course, parallelized each CV fold.

We experimented with different combinations of the features from three groups: (1) clinical features, (2) SuperPC+ principle components (ρ = PCA), and/or (3) the dLDA c_topic (ρ = dLDA); and with three different survival prediction algorithms Ψ ∈ {Cox, Cox, MTLR}. Our goal in these experiments is to empirically evaluate the performance of the survival models that use various types of features. Given this goal, we evaluate the performance using different GE basis methods (ρ) by comparing their performance to a baseline model that only uses the clinical features with Cox [26]. The other combinations include clinical features as well as various different GE features; each is trained using each of the three aforementioned survival prediction algorithms (Ψ).

As an additional feature selection step, we removed the covariate “Site” from the METABRIC clinical covariates, based on our experimental results (on the training data) that shows its inclusion led to worse concordance. We experimented with a large, but selective, set of model combinations, to answer our major queries:

  1. (i). does adding GE features improve survival prediction?
  2. (ii). which is the best feature combination for survival prediction?
  3. (iii). which is better representation of the GE features: dLDA or SuperPC+?
  4. (iv). are we deriving GE features that are redundant with PAM50?

Our results appear in Table 2, shown visually in Fig 7(left). Note the baseline is “A-Cox”, where the ‘A’ refers to the feature set used, which here is the far left triplet of blocks in Fig 7(left), and the ‘Cox’ refers to the learning algorithm, which appears left-most in each triplet. These results lead us to claim:

  1. (i). Comparing the baseline, A-Cox, to the other models, we immediately see that adding GE features (using any of the dimensionality reduction technique) leads to better predictive models;—i.e., all of the results are better than A-Cox’s CI of 0.6810 (the left-most light-shaded bar in Fig 7(left)).
  2. (ii). The best model for METABRIC is the one that includes all of the types of features derived from the gene expression—here E-MTLR, which is the right-most bar of Fig 7(left).
    We also performed student’s t-tests on random bootstrap samples from the test data to validate the significance of our results. When we compare this best model, E-MTLR, against models B-RCox (which is the best model using only PCA GE features) and C-MTLR (the best model using only dLDA GE features), we find statistically significant difference between them (respective pairwise p-value: 4.8e-16, 1e-3), showing that the E-MTLR model is significantly better than its closest counterparts.
  3. (iii). These empirical results show that, if you are pick only a single GE feature set, the dLDA c_topics perform better than the principle components—that is, the C-χ has a higher score than B-χ, for χ ∈ {Cox, RCox, MTLR}; moreover, a model using both sets of features performs yet better (i.e., D-χ is better that C-χ).
  4. (iv). Comparing the D-χ to E-χ, we see that adding PAM50 subtypes as features to the METABRIC database improves the held-out test concordance.
Fig 7. Test CI: METABRIC (left) and KIPAN (right).

Note higher values are better. The labels on the x-axis correspond to Table 2.

Table 2. Concordance results of various models from METABRIC (over 395 test instances) and KIPAN (over 176 test instances).

+ = used these features and − = did not use these features As PAM50 is specific to breast cancer, it is not applicable to the kidney (KIPAN) data. The first row, with the ID “A(*)”, is the baseline.

Indeed, we see that the performance of models that include PAM50 are marginally better than similar models that do not (row D), suggesting that the information added by these different representations of GE data are not redundant. Moreover, we see that, in all feature groups, both RCox and MTLR clearly outperform Cox—i.e., ν-RCox and ν-MTLR are better than ν-Cox, for ν ∈ {A,B,C,D,E}. We then tested the first three claims on the KIPAN dataset; see Table 2 (right-most column) and Fig 7(right). (As KIPAN does not deal with breast cancer, the PAM50 features are not relevant, so we could not test claim (iv).)

  1. (i). As before, we found that adding expression information improves over the baseline A-Cox –i.e., essentially all values are better than 0.7656.
  2. (ii). We again found that the best model was the one that included all of the features; here D-MTLR. Moreover, a t-test on bootstrap replicas show that this model D-MTLR was significantly better than the top model that does not include dLDA features, B-MTLR.
  3. (iii). We again see that C-χ has a higher score than B-χ, meaning (again) that models trained with only the c_topics performed much better than PCA-features; but that including both features was yet better (D-χ).

These sets of experiments support our claim that

a model learned by running MTLR on all GE features, gives very good concordance scores

–statistically better than other options in two different datasets, using different platforms, related to different cancer types.

In addition to these evaluations using the discriminative concordance measure, we also applied a calibration measure: “D-Calibration” (“D” for “Distribution”) [15, 16], which measures how well a individual survival distribution model is calibrated, using the Hosmer-Lemeshow (HL) [30] goodness-of-fit test; see Appendix B.2. We found that all of our models, for both datasets (METABRIC and KIPAN), passed this calibration test; see Appendix C.2, especially Table 3. But we have found that this is not universal. For example, we experimented with another breast cancer dataset BRCA (results not shown here), and found that the Cox model failed for all configurations (of ρ), showing that the Cox model does not always produce calibrated results—here, for situations where RCox and MTLR produced D-calibrated predictors. See also Haider et al. [16].

Table 3. D-calibration results from METABRIC, KIPAN, and BRCA, on the held-out test data.

p-values greater than 0.05 suggest the model is good (“D-calibrated”). Note that Cox models fail D-calibration test for all feature combinations, for BRCA dataset.

5.1 Other comparisons

In 2012, Cheng et al. [13] won the BCC Dream Challenge (which was based on the METABRIC data) by (i) leveraging prior knowledge of cancer biology to form Meta-Genes and (ii) training an ensemble of multiple learners, fueled by the continuous insights from the challenge competitors via open sharing of code and trained models. To compare our performances with this BCC winning program, we reproduced their models (using the DreamBox7 package), then re-trained their ensemble learners on our training split of the METABRIC data and tested on the held-out test set. Table 2[Row F] shows that the resulting ensemble model achieved a CI of 0.7293 on the test data. While that score is slightly better than the performance of our best model (Table 2[Row E]), note that all of our tuning was performed solely on the training (n = 1586) data, while their team made major design choices for their model using the entire METABRIC cohort (all n = 1981 instances), on which it was then evaluated.

Recently, Yousefi et al. [31] trained a deep neural network on this KIPAN data—including this gene expression data, as well as other features: Mutation, CNV and Protein. They reported concordance scores around 0.73−0.79, which are lower than our best, 0.8495.

Finally, while we focused on the LDA approach, we also explored another topic modeling technique, Latent Semantic Indexing (LSI) [23]. Running this on both datasets (using the same discretization approach, the same t* = Enc_B encoding and the same number of c_topics, K* = 30), we found essentially the same Concordance values, and confirmed that all four claims (i) through (iv) still hold, just replacing dLDA with the “discretized LSI” (dLSI) encoding.

6 Discussion

Given the growing number of gene expression datasets as part of survival analysis studies, it is clearly important to develop survival prediction models that can utilize such high-dimensional GE data. This motivated us to propose a novel survival prediction methodology that can learn predictive features from such GE data—exploring ways to learn and use c_topics as features for models that can effectively predict survival. N.b., this paper focuses exclusively on this predictive task, as this can lead to clinically relevant patient-specific information; indeed, this motivated the BCC Dream challenge, which provided the METABRIC dataset. We anticipate future work will explore the possible interpretation of these c_topics.

We included Cox as one of our learning modules for this task as it is known to be effective at optimizing concordance, both empirically and theoretically [32]. We included RCox as this algorithm recently won Prostate Cancer Dream Challenge 9.5 [12]. Finally, we included the MTLR survival prediction model as its performance, there, was competitive with the best, as well as based on the empirical evidence in Haider et al. [16]. Our evaluations on these two datasets show that MTLR’s performance was often better than RCox and Cox. Moreover, while the basic RCox and Cox functions produce only a risk score for each patient, MTLR provides a survival distribution for each, mapping each time to a probability; see Fig 2. Such models, which produce an individual survival distribution, can be used to compute a risk score, allowing them to be used for concordance-tasks; they can also be used to predict single time probabilities (e.g., probability of a patient living at least 3 years), and also can be visualized. (We used the Kalbfleisch-Prentice approach to estimate the base hazard function, allowing Cox and RCox to similarly produce individual survival curves. Appendix B.2 describes a way to evaluate such “individual survival distribution” models, D-Calibration. Appendix C.2 then shows that, for these datasets, these models all pass this test).

To summarize the main disadvantages and advantages of our approach, versus more standard approaches (e.g., PCA for dimensionality reduction, and (R)Cox for survival prediction):

  • Disadvantages:
    • Topic models are not simple to describe.
    • This approach requires a fairly long training time (∼20 hours on a 16GB, 2.66GHz processor for a single model)—to first find the parameters (encoding, number of c_topics), then the c_topics themselves, and finally, to learn the model that has the best performance. (However, using the trained dLDA model to predict c_topic contributions for a new patient is very fast—under a second on a general purpose laptop computer.)
  • Advantages:
    • An effective process to learn representation from gene expression data, as a meaningful probability distribution over the genes.
    • The learned representation from the gene expression data improves survival prediction, over standard methods, in:
      • Different cancer types: Breast and Kidney.
      • Different gene expression data types: Microarray and mRNASeq.
      • Different survival prediction algorithms: Cox, Regularized-Cox and MTLR.
    • Our combined approach for feature learning and survival prediction (dLDA + MTLR) archives strong concordance scores compared to standard survival models across different cancer types.

7 Conclusion

Table 2 shows that our proposed model, which uses MTLR to learn a model involving various types of derived GE features (dLDA c_topics and/or SuperPC+), has the best concordance, in two datasets representing different types of cancer, and two different gene expression platforms (micro-array and mRNAseq). That table shows that adding GE features improves survival prediction and that including both dLDA c_topics and SuperPC+ principle components gives the most improvements across held-out datasets. We also found that the “framework” that produced the best model in METABRIC, was also the best in the Pan-kidney KIPAN dataset, which shows the robustness of our proposed prediction framework. Moreover the c_topics extracted by our dLDA procedure (inspired by topic modeling) can be interpreted as collections of over-expressed or under-expressed gene sets; further analysis is needed to discover and validate the biological insights from these c_topics. Our results show that our novel survival prediction model—learning a MTLR survival model based on our derived GE features (dLDA c_topics and SuperPC+ components)—leads to survival prediction models that can be better than standard survival models. We anticipate that others will find this dLDA+MTLR approach (and code at helpful for their future tasks.

A Details about the algorithms

Section 4 gave a high-level overview of the important parts of the learning, and performance, systems; see also Fig 8. This appendix completes that description. In particular, it summarizes the components of the learning and performance systems—each shown as a rounded-rectangle in Figs 3 or 6—roughly in a top-to-bottom fashion. Appendix A.1 describes the PreProcess(⋯) routine that preprocesses the training data (both gene expression and clinical features), and the related PreProcess’(⋯) routine, used by USM, to preprocess a novel instance. Appendix A.2 then gives many details about ComputeBasis[ρ=dLDA](⋯) that computes the set of “c_topic − genes” distributions, given gene expression values (and some additional information)—extending the high-level description in routine in Section 4.1.1. Appendix A.3 describes the UseBasis[ρ=dLDA] routine that uses these c_topic-genes distributions to map each patient’s gene expression profile into that patient’s specific “c_topic—distribution”; see Fig 5. Appendix A.4 presents ComputeBasis[ρ=PCA] and UseBasis[ρ=PCA] techniques, to deal with the other approaches for reducing the dimensionality, PCA. Finally Appendix A.5 describes two related standard survival analysis methods: Cox [26], and Ridge-Cox (RCox) [33]. (Section 2.1.1 presented another approach, based on the more recent MTLR approach to survival analysis.)

Fig 8. Simplified flow diagram describing the overall prediction process, including both the learning task and the performance task.

A.1 PreProcess and PreProcess

The PreProcess process (used by LSM in Fig 3) applies various standard “normalizations” and simple “corrections” to the training data—both raw clinical features, and gene expression values. For the clinical features XCF, PreProcess produces a normalized dataset without any missing values, ready for the subsequent steps in the pipeline—see the orange-lines in Fig 3. This uses the standard steps: (1) impute missing real (resp., categorical) values for a feature with the mean (resp., mode) of the observed values for that feature; and (2) binarizing each categorical variable (aka “one-hot encoding”)—e.g., we encoded the 12-valued “Histological type” using twelve bits: e.g., Invasive Tumor is [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]. For the gene expression data XGE, PreProcess applies the following steps: (1) As we want to deal with the log of the initial gene expression value, we first log2-transformed the data, if necessary. (Below we use “gene expression” to refer to this transformed value.) (2) Then translate all expression values into their “common z-scores”. It first computes the (common) mean and standard deviation over all the genes from the entire XGE dataset: Let be the expression value of probe/gene gi of patient j, then compute the common mean (where n = 1,981×49,576 is the total number of entries for METABRIC), and the variance . We then use the Z-score transformation of each entry: . Notes: (a) this standardization is done prior to dividing the data into train and validation sets. (b) Using z-scores based on only a single gene would not be able to identify which genes did not vary much, as (after this transform) all genes would vary the same amount. (3) PreProcess then removes the genes that do not vary much, removing a gene i iff its values are all within the first standard deviation—i.e., if .

This filtering process is motivated by the assumption that any gene whose expressions does not change much across multiple patients, is unlikely to be directly related to the disease, while the genes that contribute, typically have significant variations in their expression levels across patients. While this filtering procedure is unsupervised, we anticipated that it would retain the genes that have the most prognostic ability. This was confirmed as we found that this process does not eliminate any of the “top” 100 probes in the METABRIC data (these are the probes with the 100 highest concordance values); see [13, Table 1]. In METABRIC, this filtering procedure eliminates 27,131 of the original 49,576 probes, leaving only 22,445 probes—i.e., a ≈54.7% reduction in the number of features.

Later, the performance system USM will need to apply these pre-processing steps to a novel instance—in particular, for each clinical feature, it will need to know the mean (or median) value, for imputation. Similarly, it will need to transform each gene expression values into an integer; this requires knowing the global mean and values to produce the z-values values, We include all of these values in the Ω term, which is output by the PreProcess subroutine. This Ω is one of the inputs to the PreProcess’ process, within USM, which applies these pre-processing steps to a novel instance encoded by its xGE and xCF features. Note finally that neither PreProcess nor PreProcess’ use the labels (survival times).

A.2 ComputeBasis[ρ=dLDA]

As shown by the blue lines in Fig 3, the ComputeBasis process takes as input a pre-processed version of the labeled dataset that was input to LSM: the PreProcessed gene expression data and clinical features , with their associated labels (Lbl). This process produces the “basis” set, of type ρ. This subappendix will focus on ρ = dLDA.

Algorithm 1 ComputeBasis[ρ=dLDA] algorithm

1: function ComputeBasis[ρ=dLDA]()     ⊳ Returns a set of c_topics

2:   ≔ Discretize()

3:  for t in {Enc_A, Enc_B} do

4:   GEt ≔ Encode-GE(t, )

5:   [GEt,1, GEt,2, …, GEt,5] ≔ Partition(GEt)

6:   % Notation: GEt,−i = GEtGEt,i

7:   % = clinical features;

8:   for K in (5, 10, 15, …, 150) do

9:    for i = 1:5 do

10:     % Find LDA “basis set” (set of c_topics)

11:      ≔ Compute_dLDA(GEt,−i, K)


13:     % Project the hold-out set onto this basis set,

14:     % encoding each patient as a K-tuple of values

15:     GE_Topict,K,i ≔ Use_dLDA(GEt,i, )


17:     % Learn a Cox model for this encoding, value of K, and fold i

18:     % using both c_topics and the clinical features

19:     % LearnCox & PredictCox are based on [26]

20:     wt,K,i ≔ LearnCox()


22:     % Evaluate model on the hold out set

23:     % Using evaluation measures concordance and likelihood

24:     ct,K,i ≔ Concordance(PredictCox(), Lbli)

25:     lt,K,i ≔ Average{Likelihood(, GEt,i)}


27:     ≔ Average{ct,K,i}

28:     ≔ Average{lt,K,i}


30:   % Find t* (encoding scheme), with the highest concordance

31:   t* ≔

32:   % Selecting K*


34:    % Break ties giving priority to small K’s


36:   return Compute_dLDA(GEt*, K*)

As shown at the bottom of Algorithm 4, ComputeBasis returns the results of Compute_dLDA(GEt*, K*), which are a set of K* c_topic–distributions, based on its input GEt*, which encodes the gene expression values () as non-negative integers. This means ComputeBasis must first (1) transform its input real-valued gene expression values into non-negative integers GEt*, and (2) determine the appropriate number of c_topics . Task (1) has two parts: (1a) Line 2 first discretizes the real-valued XGE into (positive and negative) integers . (1b) The next part of the subroutine determines the best way to transform those integers into non-negative integers . Below we describe these three steps, followed by (3) a description of Compute_dLDA.

(1a) Discretize subroutine.

Recall first that the PreProcess routine already translated the real-valued XGE gene expression values into z-scores , and excluded every genes whose values here all were in (−1, +1). To simplify the notation, view . The Discretize routine first assigns each to 0. For the remaining “non-trivial” standardized gene expression values ’s (outside the first standard deviation) of each gene: Letting be the non-trivial positive values, we divide into 10 regions, of size each and identify each positive with the index ∈{ 1, 2, …, 10} of the appropriate bin. We similarly divide the non-trivial negative expression values into their 10 bins, based on and each negative is identified with the index ∈{ -1, -2, …, -10} of the appropriate bin; see Fig 4. (Of course, the actual divisions are specific to the different genes; this figure just shows a generic split.) In general, we let be the integer bin index associated with gene gi for subject j.

Notes: (1) We initially tried to discretize the values into the bins associated with the standard deviation, in general. However, we found this did not work well. (2) ComputeBasis also returns these values, as part of the encoding –i.e., along with —and UseBasis will later use this information to discretize its real-valued gene expression input. We did not show this detail, to avoid overcluttering the text and images.

(1b) Transform to non-negative integers.

While Discretize mapped each gene expression value to an integer , the Compute_dLDA routine requires non-negative values. Section 4.1.1 discussed two ways to deal with this: using either encoding Enc_A versus Enc_B; see bottom of Fig 4. ComputeBasis uses internal cross-validation to determine which of these is best, t*, along with the number K* of c_topics; see below. (In general, we will let GEt refer to the t-encoding of the gene expression values.)

(2) Finding optimal K*, t*.

As noted, the Compute_dLDA algorithm also needs to know the number of c_topics K* to produce. Rather than guess an arbitrary value, ComputeBasis instead uses (internal) cross-validation to find the best value for K, over the range K ∈ {5, 10, 15, …, 150}. For each technique t ∈ {Enc_A, Enc_B} and each of the 30 values of K, ComputeBasis first computes the dLDA model over the training set, using Compute_dLDA (for that encoding and number of c_topics); it then used these and the (preprocessed) clinical features () as covariates, along with the survival labels Lbl, to learn a Cox model [26]—see Algorithm 1, lines 9–20. Note it does this in-fold—using 4/5 of the training set to learn the dLDA c_topics and the Cox model, which is evaluated by computing the concordance (based on this learned model) on the remaining 1/5 (line 24).

As noted above, we need to determine (1b) which is the best discretization t*, Enc_A or Enc_B, and (2) what is the appropriate K* for that technique. To answer the first question, ComputeBasis picked the encoding technique t* that gave the highest cross-validation concordance from all the (30 × 2) combinations (see Algorithm 1, line 31). Secondly, after deciding on a encoding scheme, it sets to be the value with the largest (cross-validation) likelihood, then selects the set of K’s that are smaller than and whose cross-validation likelihood scores are within the first standard deviation of the ’s; see Algorithm 1, line 33. From these candidates, it selected the K* that gives essentially the highest concordance (see Algorithm 1, line 34). Empirically, we found that the internal cross-validation concordance scores was fairly flat over the critical region—e.g., K ∈ {20, ‥, 35} for METABRIC—before dipping to smaller values for larger value of K, presumably due to overfitting. This is why we are confident that the upper limit, of 150 topics, is sufficient. Once it finds the best K* and the encoding technique t*, ComputeBasis then runs Compute_dLDA on the t*-encoded (preprocessed) training gene expression data GEt*, seeking K* c_topics; this is “basis”. This routine also returns the values used to produce the discretized values, GEt.

(3) Compute_dLDA.

The Compute_dLDA(GEt, K) process, based on Blei et al. [17], computes K c_topics, based on the preprocessed, discretized gene expression data GEt, as well as the number of latent c_topics K; it then returns K c_topics–distributions, each ≈50,000-parameters of the Dirichlet distribution (for METABRIC), corresponding to a line of the shown in Fig 5. (Each point here corresponds to its estimate of the posterior βGE, conditioned on the observed gene expression values.) This routine also uses the Dirichlet prior for the patient–c_topics distribution; here we used the symmetric Dirichlet(α, …, α) for some α ∈ ℜ>0. (As there are K c_topics; we view this as a vector α1K.) We experimented with several values α ∈ {0.01, 0.1, 0.5, 1.0}, but found that the prior did not make much difference, since we allowed the model to estimate the prior internally. We therefore set α = 0.1.

This routines also needs to set the priors for the K different c_topic–gene_expression distributions , for i = 1‥K, each sweeping over the N genes. Here, we use the prior where δU[0, 1/N2]—i.e., δ is sampled from the uniform distribution over the interval [0, 1/N2]. This LDA learning process [17] uses the data in GEt to compute the posterior distribution {βGE[i,:]}i for each of these K c_topics—revealing GEt’s intrinsic structure. Recall these are just the parameters for Dirichlet distribution; note they must be positive, but do not add up to 1. The Compute_dLDA returns the expected values of the gene expression values drawn from this posterior distribution: . Here, the probability values for each c_topic add up to 1. We will let refer to the entire “matrix”.

A.3 UseBasis[ρ=dLDA]

Once LSM has learned the c_topics () for the best K* and best encoding technique t*, we can then compute the c_topic distribution for a new patient (based on her gene expression xGE); see Figs 3 and 6. This will call UseBasis, which in turn runs Use_dLDA (the LDA inference procedure) on the preprocessed gene expression data of the current patient to compute the individual topic contributions for this patient [17]. The inference procedure determines the posterior distribution of the patient-c_topic Dirichlet distribution , where each quantifies how much of this patient’s gene expression is from the jth c_topic (using the posterior mean probabilities of the c_topics, ).

This process reduces the ≈20 000-dimension gene expression values to a very small K*-dimensional c_topics representation—e.g., L* = 30. These low-dimensional feature vectors are then used in the survival prediction algorithms to predict survival times/risk.

A.4 ComputeBasis[ρ=PCA] and UseBasis[ρ=PCA]

The previous subappendix described one way to reduce the dimensionality of the data—to transform each patient’s 20,000-tuple to a more manageable K-tuple—there based on topic modeling ideas. There have been many other feature selection methods proposed for survival prediction using gene expression data, such as hierarchical clustering, univariate gene selection, supervised PCA, penalized Cox regression and tree-based ensemble methods [34]. Some of these techniques first apply a procedure to reduce the dimensionality of the data, based on feature selection, feature extraction or a combination of both, while others, such as random survival forests [9] and L1-penalized Cox [35], include internal feature selection. As we wanted to compare our dLDA approach to other dimensionality reduction techniques, we chose an extension to the principle component analysis called supervised principal component analysis (SuperPC) [36], instead of other regularization techniques.

This algorithm first calculates the univariate Cox score statistic of each individual gene against the survival time, then retains just the subset of genes whose score exceeds a threshold, determined by internal cross-validation. Then it computes PCA on the dataset containing only those selected genes, then projects each patient onto the first one (or two) components. The main disadvantage of the SuperPC algorithm is that the individual genes selected from the univariate selection process might not perform the best in a multivariate (final) model, perhaps because many of these top-ranked genes may be highly correlated with one another –i.e., it would be better having a more “diverse” set of genes [34, 37]. Instead, we use a variant, called SuperPC+, that initially applies PCA on the normalized gene expression data after the constant genes are removed; see PreProcessin Appendix A.1). The PCA transformation projects the initial “raw” features into a different space, which then can be used to select the top components based on the univariate Cox regression. Note this SuperPC+ is (still) computationally efficient, as it is based on PCA, which is efficient: Even though gene expression data is high dimensional (pn, where p is the number of genes and n is the number of instances), the rank of the GE matrix will be (at most) min{p, n} = n. Therefore, PCA can be performed without many computational restraints on the whole gene expression dataset, as here the PCA time complexity is O(n3). After performing PCA on the GE dataset, we can then identify the most important principal components by computing a Cox score statistic for the univariate association between each principal component and the survival time. In our experiments, we select the threshold η for the p-value of the Cox score by internal cross-validation (wrt concordance), and retained all PCs having a p-value lower than this η—finding η = 5e-4 for the METABRIC dataset and η = 5e-2 for KIPAN. These selected PC components form the basis set BGE[ρ = PCA].

UseBasis[ρ=PCA] is simply the projection of the gene expression data into the chosen PC components. This gives us a low dimensional feature representation of the original gene expression data to feed into the survival prediction algorithms.

A.5 Cox models: LearnModel[Ψ=Cox], LearnModel[Ψ=RCox]

The Cox regression model’s [26] hazard function over time t, for an individual described by x, is the product of two components: (3) where the baseline hazard h0(t) is independent of the covariates x and the covariates are (independently) multiplicatively related to the hazard, based on a (learned) W. This formulation simplifies modeling of the hazard function by limiting the contribution of the “time” variable t to the baseline hazard h0(⋅), which means the hazard ratio (HR) between two patients does not depend on time and is linear (proportional) in the exponent. To estimate the coefficients of the model, Cox [26] proposed a partial likelihood technique that eliminates the need to estimate the baseline hazard. This procedure allows the Cox proportional hazards model to be semi-parametric, by only using the survival times to rank the patients [32]. We can compute partial likelihood with all patients—both censored and uncensored: (4)

  • is the risk set at time yj, which are the indices of individuals who are alive and not censored before time yj
  • [xi, yi, δi] describes the ith subject, where
    xi = vector of covariates
    yi = (survival or censor) time
    δi = censor bit (0 for censored; otherwise 1)
  • N—total number of patients in the cohort
  • W—coefficients (to be learned)

Note that only the uncensored likelihoods contribute directly, since for censored instances δi = 0. Therefore the censored observations are only utilized in the denominator when summed over the instances in a risk set. In essence, the partial likelihood only uses the patient’s death times to rank them in the ascending order to find the risk sets and does not use the exact times explicitly [32]. Hence, the coefficients estimated by maximizing the partial likelihood depend only on the ordering of the patient’s death times and the covariates, allowing for an implicit optimization for good concordance of the risk score. An in-depth study on the Cox proportional hazard model has revealed that the partial likelihood proposed by [26] is approximately equivalent to optimizing concordance [32].

There are several extensions of the basic Cox proportional hazards model: some extend the initial model estimating the baseline hazard and others are based on the regularization methods imposed on the coefficients (W). Generally, regularization based on LASSO, ridge penalty or the elastic-net regularization (which allows both L1 and L2 penalties) are adopted to reduce overfitting. In our work, we use the glmnet R package [33] with ridge penalty (by setting α = 0 in the glmnet function); here called RCox. We selected ridge penalty based on the internal cross validation. We found that concordance results using models with ridge penalty were better than those having no regularization (LASSO, elastic-net).

B Foundations

B.1 Evaluation: Concordance Index (CI)

This “CI” evaluation applies to any model that assigns a real number—a “risk score”—to each instance f(⋅). It considers all pairs of “comparable” instances, and determines which is predicted (by the risk model f(⋅)) to die first, and also who actually died first. CI is the proportion (probability) of these pairs of instances whose actual pair-wise survival ordering, matches the predicted ordering, with respect to f(⋅): (5) where I[ϕ] is the indicator function, which is 1 if the proposition ϕ is true, and 0 otherwise. A pair of patients is “comparable” if we can determine which died first –i.e., if both are uncensored, or when one patient is censored after the observed death time of the other; this corresponds to the set of ordered pairs of indices Ψ. This CI(f) score is a real value between 0 to 1, where 1 means all comparable pairs are predicted correctly. CI can be viewed as a general form of the Mann–Whitne–Wilcoxon statistic and is similar to the Area Under the ROC (Receiver Operator Curve), AUC, of classification problems [32].

B.2 Evaluation: D-calibration

The concordance index is a discriminatory measure, which is relevant, for example, when deciding which patient with liver failure will die first without a transplant. By contrast, calibration measures the deviation between the observed and the predicted event time distributions. While this is not meaningful if we only have a risk score (e.g., as produced by the basic Cox Proportional Hazard function), this deviation can be computed for a survival distribution, like ones produced by the MTLR survival prediction tool, or the Cox+KF system—which extends the standard Cox model by using the Kalbfleisch-Prentice estimator to produce the baseline hazard function h0(x) in Eq 3; see [11]. In general, this calibration involves computing the difference between the predicted versus observed probabilities in various subgroups—e.g., if the predicted probability of surviving at least t = 2576 days is 0.75 for some subgroup, then we expect to observe around 75% of these patients to be alive at this time t.

We consider a novel measure of the calibration of such survival curves, called D-calibration (“D” for “Distribution”) [16]. To motivate this, consider a standard Kaplan-Meier (KM) [8] plot shown in Fig 9, which plots the set of points (t, KM(t))—i.e., it predicts that the KM(t) fraction of patients will be alive at each time t ≥ 0. Hence, the point (6184 days, 0.50) means the median survival time of the cohort is 6184 days; see Fig 9(solid line). We will use KM−1(p) to be the time associated with the probability p—technically, KM−1(p) is the earliest time when the KM curve hits p; hence KM−1(p) (0.5) = 6184 days. If this plot is D-calibrated, then around 50% of the patients (from a hold-out set, not used to produce the KM curve) will be alive at this median time. So if we (for now) ignore censored patients, and let di be the time when the ith patient died, consider the n values of {KM(di)}i=1‥n. Here, we expect KM(di) > 0.5 for 1/2 of the patients. Similarly, as the curve includes (2576 days, 0.75) and (8941 days, 0.25), then we expect 75% to be alive at 2576 days, and 25% at 8941 days; see Fig 9. Collectively, this means we expect 25% of the patients to die between KM−1 (1.0) = 0 days and KM−1 (0.75) = 2576 days, and another 25% between KM−1 (0.75) and KM−1 (0.5), etc. These are the predictions; we can also check, to see how many people actually died in each interval: in the first quartile (between 0 and 2576 days), in the second (between 2576 and 6184 days), in the third (between 6184 and 8941 days), and the fourth (after 8941 days). If the KM plot is “correct”—i.e., is D-calibrated—then we expect 1/4 of the patients will die in each of these 4 intervals. The argument above means we expect 1/4 of the {KM(di)} values to be in the interval [0, 0.25], and another quarter to be in [0.25, 0.5], etc. Stated more precisely, (6)

Fig 9. Kaplan–Meier survival function from METABRIC (training) data.

(We only use the quartiles for pedagogic purposes.).

A single KM curve is designed to represent a cohort of many patients. The MTLR system, however, computes a different survival curve for each patient—call it Pri(·) = PrW (· | xi) (from Eq 1). But the same ideas still apply: Each of these patients has a median predicted survival time—the time where its Pri(⋅) curve crosses 0.50.

By the same argument suggested above, we expect (for a good model W) that 1/2 of patients will die before their respective median survival time—; that is, |{i: Pri(di) ≤ 0.5}| ≈ n/2. Continuing the arguments from above, we therefore expect the obvious analogue to Eq 6: (7)

We can now test whether a model is D-calibrated by using the Hosmer-Lemeshow (HL) [30] goodness-of-fit test, which compares the difference between the predicted and observed events in the event subgroups: (8) where G is the number of subgroups (here 4), where the gth subgroup has events, with the empirical number of events (which here is N/g, for N = ∑g Ng total patients), the corresponding predicted number of events in each group , and (which here is ) is the proportion of the gth subgroup. Under the null hypothesis (Eq 7), this HL statistics follows a Chi-Square distribution with G − 2 degrees of freedom. If the predicted and empirical event rates are similar for the subgroups, the test statistic will fail to reject the null hypothesis, providing evidence that the model’s predictions are well D-calibrated—i.e., large p-values from the test statistic suggest not rejecting the null hypothesis).

Notes: (1) This evaluation criterion only applies to models that produce survival distributions, which means it directly applies to the MTLR models. For the Cox and RCox models, we used the Kalbfleisch-Prentice baseline hazard estimator [11] to produce personalized survival curves. (2) To provide more precise evaluation, rather than using 4 bins (quantiles), we mapped the Pri(di) probabilities into 20 bins: [0, 0.05); [0.05, 0.1), …, [0.95, 1.0]. (3) This analysis deals only with uncensored data; Haider et al. [16] discusses how to cope with censored data.

C Additional results

This appendix presents additional results: First, Appendix C.1 evaluates the Latent process decomposition (LPD) method, then Appendix C.2 provides D-calibration results of our various models.

C.1 Latent process decomposition (LPD) for microarray feature extraction

Rogers et al. [19] introduced LPD as a topic model adaptation for microarray data. We experimented with LPD (on METABRIC data) to derive genetic features and used them along with the clinical features for comparison. We used internal cross-validation for LPD to find the optimal number of latent processes for the METABRIC data—and found that 10 was best.

We then used the model based on these 10 latent process; the resulting concordance results, on the hold-out dataset, was 0.6915 (Cox), 0.6077 (RCox) and 0.6995 (MTLR). Comparing this to the “B” and “C” rows of Table 2, we see that our dLDA approach performs better than this complex adaptation of the LDA model for microarray data, for the survival prediction task—i.e., dLDA produces better features from the gene expression data.

There are two other reasons to prefer our dLDA-approach: (1) LPD has large time and memory requirements. (2) Moreover as our dLDA directly uses the LDA model, it can utilize all available off-the-shelf implementations, across several technology platforms with efficient and scalable implementation [38].

C.2 D-calibration results

Table 3 shows the D-calibration results for all of the domain-independent experiments we ran—i.e., excluding the “E” and “F” rows from Table 2, which used features that were specific to breast cancer. We see that the results were D-calibrated (i.e., had a HL p-value > 0.05) in all 12 situations, for METABRIC and KIPAN—for all feature groups {A, B, C, D}, and all 3 learning algorithms {Cox, RCox, MTLR}. We note that we found that Cox failed this test on other datasets, including BRCA.


We greatly acknowledge Compute Canada for providing us with computing resources. We also thank Dream Challenges and Broad Institute for making the METABRIC and KIPAN cancer data sets publicly available.


  1. 1. Stewart B, Wild CP, et al. World cancer report 2014. Health. 2017.
  2. 2. Van’t Veer LJ, Dai H, Van De Vijver MJ, He YD, Hart AA, Mao M, et al. Gene expression profiling predicts clinical outcome of breast cancer. nature. 2002;415(6871):530–536.
  3. 3. Margolin AA, Bilal E, Huang E, Norman TC, Ottestad L, Mecham BH, et al. Systematic analysis of challenge-driven improvements in molecular prognostic models for breast cancer. Science translational medicine. 2013;5(181):181re1–181re1. pmid:23596205
  4. 4. Parker JS, Mullins M, Cheang MC, Leung S, Voduc D, Vickery T, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. Journal of clinical oncology. 2009;27(8):1160–1167. pmid:19204204
  5. 5. Naderi A, Teschendorff A, Barbosa-Morais N, Pinder S, Green A, Powe D, et al. A gene-expression signature to predict survival in breast cancer across independent data sets. Oncogene. 2007;26(10):1507–1516. pmid:16936776
  6. 6. Beer DG, Kardia SL, Huang CC, Giordano TJ, Levin AM, Misek DE, et al. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nature medicine. 2002;8(8):816–824. pmid:12118244
  7. 7. Curtis C, Shah SP, Chin SF, Turashvili G, Rueda OM, Dunning MJ, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature. 2012;486(7403):346–352. pmid:22522925
  8. 8. Altman DG. Practical statistics for medical research. CRC; 1990.
  9. 9. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random Survival Forests. The Annals of Applied Statistics. 2008;2:841–860.
  10. 10. Khan FM, Zubek VB. Support vector regression for censored data (SVRc): a novel tool for survival analysis. In: 2008 Eighth IEEE International Conference on Data Mining. IEEE; 2008. p. 863–868.
  11. 11. Kalbfleisch JD, Prentice RL. The statistical analysis of failure time data. vol. 360. John Wiley & Sons; 2011.
  12. 12. Guinney J, Wang T, Laajala TD, Winner KK, Bare JC, Neto EC, et al. Prediction of overall survival for patients with metastatic castration-resistant prostate cancer: development of a prognostic model through a crowdsourced challenge with open clinical trial data. The Lancet Oncology. 2016. pmid:27864015
  13. 13. Cheng WY, Yang THO, Anastassiou D. Development of a prognostic model for breast cancer survival in an open challenge environment. Science translational medicine. 2013;5(181):181ra50–181ra50. pmid:23596202
  14. 14. Yu CN, Greiner R, Lin HC, Baracos V. Learning Patient-Specific Cancer Survival Distributions as a Sequence of Dependent Regressors. In: Neural Information Processing Systems (NIPS); 2011. p. 1845–1853.
  15. 15. Andres A, Montano-Loza A, Greiner R, Uhlich M, Jin P, Hoehn B, et al. A novel learning algorithm to predict individual survival after liver transplantation for primary sclerosing cholangitis. PLoS One. 2018.
  16. 16. Haider H, Hoehn B, Davis S, Greiner R. Effective Ways to Build and Evaluate Individual Survival Distributions. arXiv preprint arXiv:181111347. 2018.
  17. 17. Blei DM, Ng AY, Jordan MI. Latent dirichlet allocation. the Journal of machine Learning research. 2003;3:993–1022.
  18. 18. Deshwar AG, Vembu S, Yung CK, Jang GH, Stein L, Morris Q, et al. PhyloWGS: reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biol. 2015;16:35. pmid:25786235
  19. 19. Rogers S, Girolami M, Campbell C, Breitling R. The latent process decomposition of cDNA microarray data sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB). 2005;2(2):143–156.
  20. 20. Masada T, Hamada T, Shibata Y, Oguri K. Bayesian multi-topic microarray analysis with hyperparameter reestimation. In: International Conference on Advanced Data Mining and Applications. Springer; 2009. p. 253–264.
  21. 21. Bicego M, Lovato P, Perina A, Fasoli M, Delledonne M, Pezzotti M, et al. Investigating topic models’ capabilities in expression microarray data classification. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB). 2012;9(6):1831–1836.
  22. 22. Liu L, Tang L, Dong W, Yao S, Zhou W. An overview of topic modeling and its current applications in bioinformatics. SpringerPlus. 2016;5(1):1608. pmid:27652181
  23. 23. Hofmann T. Unsupervised learning by probabilistic latent semantic analysis. Machine learning. 2001;42(1-2):177–196.
  24. 24. Dawson JA, Kendziorski C. Survival-supervised latent Dirichlet allocation models for genomic analysis of time-to-event outcomes. arXiv preprint arXiv:12025999. 2012.
  25. 25. McAuliffe JD, Blei DM. Supervised topic models. In: Advances in neural information processing systems; 2008. p. 121–128.
  26. 26. Cox DR. Regression Models and Life-Tables. Journal of the Royal Statistical Society Series B (Methodological). 1972;34(2):187–220.
  27. 27. McCullagh P, Nelder JA. Generalized linear models. vol. 37. CRC; 1989.
  28. 28. Wolfinger RD, Gibson G, Wolfinger ED, Bennett L, Hamadeh H, Bushel P, et al. Assessing gene significance from cDNA microarray expression data via mixed models. Journal of computational biology. 2001;8(6):625–637. pmid:11747616
  29. 29. Analysis Overview for Pan-kidney cohort (KICH+KIRC+KIRP) (Primary solid tumor cohort). Broad Institute TCGA Genome Data Analysis Center (2016). 28 January 2016.
  30. 30. Hosmer DW Jr, Lemeshow S, Sturdivant RX. Applied logistic regression. vol. 398. John Wiley & Sons; 2013.
  31. 31. Yousefi S, Amrollahi F, Amgad M, Dong C, Lewis JE, Song C, et al. Predicting clinical outcomes from large scale cancer genomic profiles with deep survival models. Scientific Reports. 2017;7(1):11707. pmid:28916782
  32. 32. Steck H, Krishnapuram B, Dehing-oberije C, Lambin P, Raykar VC. On ranking in survival analysis: Bounds on the concordance index. In: Advances in neural information processing systems; 2008. p. 1209–1216.
  33. 33. Simon N, Friedman J, Hastie T, Tibshirani R. Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent. Journal of Statistical Software. 2011;39(5):1–13. pmid:27065756
  34. 34. Van Wieringen WN, Kun D, Hampel R, Boulesteix AL. Survival prediction using gene expression data: a review and comparison. Computational statistics & data analysis. 2009;53(5):1590–1603.
  35. 35. Goeman JJ. L1 penalized estimation in the Cox proportional hazards model. Biometrical journal. 2010;52(1):70–84. pmid:19937997
  36. 36. Bair E, Tibshirani R. Semi-supervised methods to predict patient survival from gene expression data. PLoS Biol. 2004;2(4):e108. pmid:15094809
  37. 37. Ding C, Peng H. Minimum redundancy feature selection from microarray gene expression data. Journal of bioinformatics and computational biology. 2005;3(02):185–205. pmid:15852500
  38. 38. Hoffman M, Bach FR, Blei DM. Online learning for latent dirichlet allocation. In: advances in neural information processing systems; 2010. p. 856–864.
  39. 39.
  40. 40.!Synapse:syn1688369/wiki/27311
  41. 41.