Visual analytics framework for survival analysis and biomarker discovery from gene expression data

doi:10.1371/journal.pone.0325399

Fig 1.

An example of a Kaplan-Meier survival plot for two gene expression-dependent conditions associated with patient survival.

(a) The difference between the survival curves for gene CD14 is not evident. (b) The survival curve is substantially higher for the group of patients with highly expressed gene MAP7, which suggests that MAP7 is the better biomarker for survival. In biomarker discovery, one of the tasks is to rank features, e.g., gene mutation status, gene expression values, and protein levels, according to the degree of separation between the survival curves of the patient groups they define.
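The curves in such a plot are Kaplan-Meier estimates, which can be sketched in a few lines. The following is a minimal illustration, not the implementation used by the framework; the function name `kaplan_meier` and the toy follow-up data are hypothetical.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  -- follow-up time of each patient
    events -- 1 if the event (e.g., death) was observed, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)  # patients leaving the risk set at time t
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical toy group of five patients (not taken from the paper).
toy_times = [5, 8, 12, 12, 20]
toy_events = [1, 0, 1, 1, 0]
curve = kaplan_meier(toy_times, toy_events)
for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}")
```

Computing one such curve per expression group (for example, high versus low expression of a gene) gives the two lines that are compared in each panel.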


Fig 2.

An example workflow to explore cohort formation based on clustering and biomarker discovery.

The workflow starts by loading the data (Datasets widget). It progresses from (1) stratification of patients based on specified cutoff values of continuous features (using Distributions and Kaplan-Meier Plot) to the more complex (2) cohort formation based on several data features (the branch with Rank Survival Features and Cohorts). For comparison, (3) we plot a map of patients (t-SNE), in which patients are grouped by the similarity of their feature profiles. The survival curve for the cluster of patients selected in t-SNE is then compared to that of all other patients (Kaplan-Meier Plot (1)).


Fig 3.

A closer look at the top branch of the workflow from Fig 2.

We use the Distributions widget to plot the distribution of the tumor size feature. The distribution plot is interactive; for example, we select data instances where the tumor size is less than 35 mm (b). Patient subgroups defined in the Distributions widget (b) are linked to survival curves in the Kaplan–Meier Plot (c) through brush-and-link interaction, providing an interactive browser to characterize the survival of the selected group. Different analysis options are available to the user, for example, plotting the confidence interval (a).


Fig 4.

A closer look at the middle branch of the workflow from Fig 2.

We use the Rank Survival Features widget to prioritize the most relevant features (a). The data passed to the Cohorts widget contains only the two selected features, which are used to construct a risk-based model and stratify patients into low- and high-risk cohorts (b). This information is then passed to the Kaplan-Meier Plot, which estimates and plots the two survival curves and assesses how significantly they differ (c).
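The significance of the difference between two survival curves is commonly assessed with a log-rank test. The caption does not name the test used by the Kaplan-Meier Plot, so the sketch below is an assumption about the kind of computation involved rather than a description of the widget; the function name `logrank_test` and the two toy cohorts are hypothetical.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    times = list(times_a) + list(times_b)
    events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(times, events) if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Patients still under observation just before time t.
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        # Events observed exactly at time t.
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical cohorts (not taken from the paper); 1 = event observed.
high_risk_times, high_risk_events = [1, 2, 3, 4, 5], [1, 1, 1, 1, 1]
low_risk_times, low_risk_events = [10, 11, 12, 13, 14], [1, 1, 1, 1, 1]

chi2, p_value = logrank_test(
    high_risk_times, high_risk_events, low_risk_times, low_risk_events
)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")
```

The same statistic could plausibly serve as a score for ranking features: a feature whose split yields a smaller p-value separates the cohorts more clearly.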


Fig 5.

A small sample of the METABRIC dataset available in Orange, where each row represents a primary breast cancer patient characterized by (1) clinical features (such as age at diagnosis and tumor stage), (2) gene expression values (e.g., KRAS, RERE, PHF7), and (3) clinical outcomes (overall patient survival).


Fig 6.

Evaluation of the prognostic potential of individual genes.

This simple workflow effectively reproduces the study of Hwang et al. [51]. The METABRIC dataset is passed through the Distributions widget to allow interactive selection of a gene expression threshold, and the survival probability is visualized in the Kaplan-Meier Plot widget. Different gene expression thresholds can be explored at the click of a button.


Fig 7.

(a) Using the Gene Sets widget, we reduce the initial data to include only those genes that participate in the Ras signaling pathway from the KEGG pathway database. (b) We conduct standard survival analysis tasks on those genes through Rank Survival Features and Cohorts. Visualizing survival curves in the Kaplan-Meier Plot allows us to explore the newly defined cohorts.


Fig 8.

Identifying groups of patients on the METABRIC dataset.

The workflow computes distances between patient profiles (Distances widget) and clusters the patients accordingly in Hierarchical Clustering. The user can use the dendrogram in Hierarchical Clustering to choose the clusters by brushing or defining a cut-off distance. We display survival curves of identified clusters of patients in the Kaplan-Meier Plot. The workflow uses Box Plot to characterize given clusters by providing a sorted list of features, which can help in the formation of new hypotheses. In the particular case displayed in the figure, we observe that expression levels of gene PSAT1 are highest in cluster C2.
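The distance, clustering, and cut-off steps can be illustrated with a small agglomerative procedure. This is only a sketch under stated assumptions: the caption does not specify the distance metric or the linkage, so Euclidean distance and average linkage are chosen here for illustration, and the function names and toy profiles are hypothetical. It does not reflect how the Distances or Hierarchical Clustering widgets are implemented.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles, cutoff):
    """Merge the closest clusters until the smallest average
    inter-cluster distance exceeds `cutoff`.

    Returns a list of clusters, each a list of row indices.
    """
    clusters = [[i] for i in range(len(profiles))]

    def linkage(c1, c2):
        total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > cutoff:
            break  # equivalent to cutting the dendrogram at `cutoff`
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a stays valid
    return clusters


# Hypothetical two-feature patient profiles (not taken from METABRIC).
toy_profiles = [
    [0.1, 0.2],
    [0.0, 0.3],
    [5.0, 5.1],
    [5.2, 4.9],
    [9.8, 0.1],
]
clusters = average_linkage(toy_profiles, cutoff=1.0)
print(clusters)
```

This naive loop recomputes all pairwise linkages at every step, so it is suitable only for small examples; each resulting cluster would then define one survival curve in the Kaplan-Meier plot.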


Fig 9.

Workflow for gene set–based biomarker scoring and exploration.

(a) Expression of ESR1 alone does not sufficiently separate patient survival. (b) We load a set of expert-curated gene groups capturing major breast cancer programs. (c) A t-SNE projection of data from the gene sets reveals three clusters with different proportions of ER-positive patients. (d) Gene sets are ranked by prognostic value, showing Estrogen and Proliferation as the most informative. (e) A Proliferation- and Estrogen-based marker constructed from these sets yields a clear separation of the Kaplan–Meier survival curves.
