Cancer molecular subtyping using limited multi-omics data with missingness

doi:10.1371/journal.pcbi.1012710

Fig 1.

The overview of CancerSD.

(a) The incomplete data imputation module uses contrastive learning to extract cross-omics consistent features from available patient data, then feeds these features into the generator to impute the omics missing from each sample. (b) The cancer subtype diagnosis module uses the available and imputed omics of each sample to diagnose cancer subtypes. (c) The knowledge transfer module follows the meta-learning paradigm: it develops a meta-learner and a category-level contrastive loss to mine domain-specific knowledge from external datasets and to initialize a backbone network composed of the representation and diagnosis modules. (d) A series of comparison experiments and downstream analyses is conducted to evaluate the performance and application value of CancerSD.
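As an illustration of the cross-omics contrastive idea in panel (a), the following is a minimal sketch of an InfoNCE-style loss in which the embeddings of the same patient from two omics form the positive pair and all other patients act as negatives. InfoNCE, cosine similarity, the temperature value, and the function names are assumptions chosen for illustration; the caption does not state which contrastive loss CancerSD uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def cross_omics_info_nce(view_a, view_b, temperature=0.5):
    """Mean InfoNCE loss over patients (hypothetical helper, not CancerSD's code).

    Row i of view_a and row i of view_b are assumed to be embeddings of the
    same patient from two different omics. The loss is small when each
    patient's two embeddings are more similar to each other than to the
    embeddings of other patients.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidate partners.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n
```

With toy embeddings, the loss is lower when the two views agree patient by patient than when the rows of one view are shuffled, which is the property a cross-omics consistency objective relies on.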

Fig 2.

Diagnostic performance on the GC dataset in the standard supervised learning setting.

(a) The cancer subtype diagnosis performance of CancerSD vs. comparison methods, including kNN, RFC, AE-XGBoost, MOMA, MOGONET, DCP, and APADC, on the STAD dataset. (b) Sample clustering using original data and embedded representations given by CancerSD and other methods. (c) The diagnosis performance of the tested methods under different degrees of omics missingness. (d) The diagnostic accuracy of CancerSD for different GC subtypes. (e) The diagnostic probability of CancerSD for different GC subtypes.

Fig 3.

Diagnostic performance on the GC datasets in the few-sample scenario.

(a) The cancer subtype diagnosis performance of CancerSD vs. comparison methods on the GSE62254 dataset under the standard supervised learning setting. (b) The cancer subtype diagnosis performance (Accuracy and F1 Score) of different methods under the multi2mRNA (upper figure) and mRNA2mRNA (lower figure) settings. Here, the red dash-dotted line represents the performance obtained by optimizing CancerSD with the entire training set, while the gray dotted line indicates the performance obtained by optimizing CancerSD with a 4-way 10-shot set. (c) The similarity of representations for samples from different datasets.
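The "4-way 10-shot set" in panel (b) denotes a training set with 4 classes and 10 labeled samples per class. A minimal sketch of how such a set could be drawn from a labeled cohort is given below; the function name, the seeding, and the eligibility rule are illustrative assumptions, not the sampling procedure used in the paper.

```python
import random
from collections import defaultdict


def sample_episode(labels, n_way=4, k_shot=10, seed=None):
    """Draw an N-way K-shot set of sample indices (hypothetical helper).

    labels: one subtype label per sample.
    Returns a dict mapping each chosen class to k_shot distinct sample indices.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Only classes with at least k_shot samples can contribute (assumption).
    eligible = sorted(c for c, idxs in by_class.items() if len(idxs) >= k_shot)
    if len(eligible) < n_way:
        raise ValueError("fewer than n_way classes have k_shot samples")
    chosen = rng.sample(eligible, n_way)
    return {c: rng.sample(by_class[c], k_shot) for c in chosen}
```

Calling `sample_episode(labels, n_way=4, k_shot=10)` on a cohort with four subtypes yields 40 training indices, 10 per subtype; the remaining samples can then be held out for evaluation.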

Fig 4.

Diagnostic performance under different omics data types.

(a) Performance comparison for subtype diagnosis using different types of omics. Here, methylation, miRNA, and mRNA refer to diagnosis via CancerSD using DNA methylation data, miRNA expression data, and mRNA expression data, respectively; meth+miRNA, miRNA+mRNA, and meth+mRNA refer to diagnosis with two types of omics; meth+miRNA+mRNA refers to diagnosis with three types of omics. (b) Sample similarity heatmaps obtained from representations at different levels. (c) Diagnostic performance of CancerSD using only a single type of omics under different training strategies. (d) Sample clustering based on omics and fusion representations output by CancerSD under the multi-omics joint training strategy. (e) Sample clustering based on omics and fusion representations output by CancerSD under the single-omics independent training strategy.

Fig 5.

Important molecules identified by CancerSD.

(a) Importance scores of the top-ranked molecules identified by CancerSD in various omics. (b) Differences in the importance scores of molecules across various omics. (c) Clustering of samples of different subtypes using omics embeddings and the fusion representation output by CancerSD, respectively. (d) The methylation levels of the top 100 CpG sites ranked by importance, where the CpG sites are secondarily sorted by their average values across all samples. (e) The expression levels of mRNA characteristics across different GC subtypes. (f) The expression levels of mRNA and miRNA characteristics across different GC subtypes, where the expression values are log2-transformed and normalized. The Wilcoxon rank-sum test is used to evaluate differences in the expression levels of specific molecules among patients of distinct subtypes. (g) Gene co-expression analysis result for the EBV subtype. (h) KEGG pathway enrichment results for module-2 (ME-2, top) and module-4 (ME-4, bottom).
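The Wilcoxon rank-sum test mentioned in panel (f) compares one molecule's expression between two groups of patients using ranks rather than raw values. The sketch below is a plain-Python version under stated simplifications: tied values receive average ranks, the p-value comes from the normal approximation, and neither a tie correction nor a continuity correction is applied. The function name is hypothetical, and the caption does not say which implementation the authors used.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test, normal approximation (sketch).

    x, y: expression values of one molecule in two patient groups.
    Returns (U statistic for x, z score, two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    # Pool both groups, remembering which group each value came from.
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of N(0, 1)
    return u, z, p
```

For small groups or heavily tied data, an exact or tie-corrected test from a statistics library would be the safer choice; the approximation here is only meant to show the mechanics.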

Table 1.

Important molecules identified by CancerSD.

Fig 6.

The relationship among CancerSD outcomes, mRNAsi scores, and patient clinical characteristics.

(a) An overview of the association between the mRNAsi and clinical features. The median mRNAsi score is used to categorize mRNAsi levels. (b) mRNAsi scores across different molecular subtypes. (c) Kaplan-Meier survival curves for different mRNAsi levels. HR and 95CI are abbreviations of Hazard Ratio and 95% Confidence Interval, respectively. (d) The relationship between normalized CancerSD scores and mRNAsi scores in GC patients. The former is derived from the output of CancerSD before the softmax layer, while the latter is obtained through the mRNAsi model. (e) The integrated Sankey diagram portrays the underlying correlations across the mRNAsi, molecular subtypes, and Lauren classification. (f) mRNAsi scores across different Lauren subtypes. (g) The relationship between the Integrated CancerSD Score (ICS) and mRNAsi scores in GC patients. (h) Kaplan-Meier survival curves for different ICS levels. (i) Correlation between mRNAsi and the expression levels of important genes identified by CancerSD. The regression lines in the figures are fitted to the corresponding data. Significance in the figure is estimated using the Pearson correlation coefficient.
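Two of the computations named in this caption are simple enough to sketch: the median split that categorizes mRNAsi levels in panel (a), and the Pearson correlation coefficient used in panels (d), (g), and (i). The function names are hypothetical, and assigning scores equal to the median to the "low" group is an assumption, since the caption does not specify how ties with the median are handled.

```python
import math
from statistics import median


def median_split(scores):
    """Label each score 'high' or 'low' relative to the cohort median.

    Scores equal to the median are labeled 'low' (assumption).
    """
    cut = median(scores)
    return ["high" if s > cut else "low" for s in scores]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

The "high" and "low" labels from `median_split` are the kind of grouping that Kaplan-Meier curves such as those in panels (c) and (h) compare, and `pearson_r` gives the coefficient that accompanies a fitted regression line.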
