Exploring the interplay between colorectal cancer subtypes genomic variants and cellular morphology: A deep-learning approach

doi:10.1371/journal.pone.0309380

Fig 1.

Summary of how the TCGA COAD and READ datasets were used: The total cohort comprises n = 632 patients.

Some patients were excluded for technical reasons, leaving n = 430 patients. Of these, Kather et al. [7] pre-processed and published data for n = 360 patients, split into a training set and a test set. The training set was balanced at the patch (p) level. For our research, we used stratified cross-validation folds at the patient level. The partitioning into these folds was informed by the novel sub-labels based on SNP rates and CIMP classifications.
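
As an illustration of patient-level stratified partitioning, the following minimal sketch assigns each patient to exactly one fold while spreading every sub-label evenly across folds. The patient IDs, the sub-label names, and the round-robin assignment are assumptions made for this example, not the procedure used in the study.

```python
from collections import defaultdict

def stratified_patient_folds(patient_labels, n_folds=5):
    """Assign each patient to one fold, spreading every label evenly.

    patient_labels maps a patient ID to its (sub-)label. All patches of a
    patient are meant to follow the patient into the same fold.
    """
    by_label = defaultdict(list)
    for patient_id, label in sorted(patient_labels.items()):
        by_label[label].append(patient_id)

    folds = [[] for _ in range(n_folds)]
    offset = 0  # continue the round-robin across labels to balance fold sizes
    for label in sorted(by_label):
        for i, patient_id in enumerate(by_label[label]):
            folds[(offset + i) % n_folds].append(patient_id)
        offset += len(by_label[label])
    return folds

# Hypothetical patient IDs and sub-labels, for illustration only.
patient_labels = {}
for i in range(30):
    if i < 5:
        patient_labels[f"P{i:02d}"] = "MSI_high_SNP"
    elif i < 10:
        patient_labels[f"P{i:02d}"] = "MSI_low_SNP"
    else:
        patient_labels[f"P{i:02d}"] = "MSS"

folds = stratified_patient_folds(patient_labels, n_folds=5)
```

Splitting by patient rather than by patch keeps patches from the same patient out of both the training and the evaluation side of a fold.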

Fig 2.

Experimental flow for our exploration of the interplay between CRC subtypes genomic variants and cellular morphology.

The TCGA-CRC dataset, pre-processed by Kather et al. [7] (N = 360), is split into different sets for analysis. A baseline model is trained, and a molecular feature analysis is performed on its results. Based on this analysis, we define our data classes using the ranges and categories of SNP, CIMP, and CNV (the BP class definitions step). We then divide the classes into five stratified folds. Next, three models are trained to evaluate the interplay between genomic variations and cellular morphology: BP-CNN_CIMP, BP-CNN_SNP, and BP-CNN_CNV, which classify the data based on CIMP, SNP, and CNV features, respectively. Based on their results, BP-CNN_CIMP and BP-CNN_SNP are further combined into BP-CNN_Combined to incorporate the entire set of genomic variations identified as influencing cellular morphology.

Fig 3.

Model architectures.

(a) Baseline Model Architecture: Patches are input into the Inception-Net [24] for feature extraction, with the last two layers acting as fully connected classifier layers. Outputs are propagated to a softmax layer for determining probabilities. N represents the number of patient patches, while P_i denotes the MSI probability for each patch. The MSI score for each patient, P_w, is the average of its corresponding MSI probabilities. (b) Biologically-Primed Model Architecture: Similar to the baseline model, the softmax layer outputs class probabilities at the patch level. However, the MSI probability here is calculated as the maximum value between MSI₁ and MSI₂ outputs. The calculation of P_w remains the same as in the baseline model.
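
The two aggregation rules described in this caption can be sketched as follows. The function names and the example probabilities are hypothetical; only the rules themselves (averaging P_i over a patient's N patches, and taking the maximum of the MSI₁ and MSI₂ outputs in the biologically-primed case) come from the caption.

```python
from statistics import mean

def baseline_patient_score(patch_msi_probs):
    """P_w for the baseline model: the mean of the per-patch MSI probabilities P_i."""
    return mean(patch_msi_probs)

def primed_patient_score(patch_msi_pairs):
    """P_w for the biologically-primed model.

    patch_msi_pairs holds one (MSI_1, MSI_2) softmax output pair per patch.
    The patch-level MSI probability is the larger of the two; P_w is the mean.
    """
    return mean(max(msi_1, msi_2) for msi_1, msi_2 in patch_msi_pairs)

# Hypothetical softmax outputs for a patient with N = 3 patches.
baseline_patches = [0.1, 0.6, 0.8]
primed_patches = [(0.2, 0.1), (0.5, 0.3), (0.1, 0.8)]

p_w_baseline = baseline_patient_score(baseline_patches)
p_w_primed = primed_patient_score(primed_patches)
```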

Fig 4.

Our BP-CNN_Combined model.

Models A and B represent biologically-primed models informed by two distinct genomic variations. The network outputs from trained and fixed models A and B are concatenated, fed into a linear layer, and then propagated to a softmax layer to yield probabilities. N represents the number of patches for each patient, and P_i indicates the corresponding MSI probabilities for these patches. The MSI score for each patient, denoted P_w, is the average of its respective MSI probabilities.
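
A rough sketch of the combination head for a single patch is given below: the outputs of the two fixed models are concatenated, passed through one linear layer, and normalized with a softmax. The output sizes, the two-class (MSS, MSI) head, and the weight values are placeholders assumed for illustration; in the actual model the linear layer's weights would be learned.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combined_patch_probs(out_a, out_b, weights, bias):
    """Concatenate the outputs of models A and B, apply a linear layer, then softmax."""
    features = list(out_a) + list(out_b)
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)

# Hypothetical per-patch outputs of the two fixed models (three values each).
out_a = [0.7, 0.2, 0.1]
out_b = [0.6, 0.1, 0.3]

# Placeholder weights for a 6-input, 2-output linear layer (row 0: MSS, row 1: MSI).
weights = [
    [1.0, -0.5, -0.5, 1.0, -0.5, -0.5],
    [-1.0, 0.5, 0.5, -1.0, 0.5, 0.5],
]
bias = [0.0, 0.0]

probs = combined_patch_probs(out_a, out_b, weights, bias)
p_i = probs[1]  # MSI probability for this patch
```

Averaging `p_i` over all N patches of a patient then gives the patient-level score P_w, as in the other architectures.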

Fig 5.

Baseline model results for per-patient classification of the test set, validated over 5 folds.

Average and 95% CI curves: (a) ROC curve, (b) PR curve.
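
One way such average and confidence-band curves could be computed is sketched below. It assumes the per-fold curves have already been interpolated onto a shared x-grid and uses a normal approximation (z = 1.96) for the 95% interval; the caption does not state which interval method was used, so both choices are assumptions, as are the example values.

```python
from statistics import mean, stdev

def mean_and_ci(fold_curves, z=1.96):
    """Pointwise mean and confidence band across folds.

    fold_curves: one list of y-values per fold, all sampled on the same x-grid.
    Returns (means, lower, upper), with the band clipped to [0, 1].
    """
    n = len(fold_curves)
    means, lower, upper = [], [], []
    for values in zip(*fold_curves):
        m = mean(values)
        half_width = z * stdev(values) / n ** 0.5
        means.append(m)
        lower.append(max(0.0, m - half_width))
        upper.append(min(1.0, m + half_width))
    return means, lower, upper

# Hypothetical TPR values from 5 folds on a shared 4-point FPR grid.
fold_tpr = [
    [0.0, 0.60, 0.85, 1.0],
    [0.0, 0.55, 0.80, 1.0],
    [0.0, 0.65, 0.90, 1.0],
    [0.0, 0.50, 0.80, 1.0],
    [0.0, 0.70, 0.90, 1.0],
]
mean_tpr, lower_tpr, upper_tpr = mean_and_ci(fold_tpr)
```

With only five folds, a t-distribution critical value would give a wider band than the normal approximation used here.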

Fig 6.

The distribution of patient-level molecular features in the test set, categorized based on the patch-level classification by the baseline model.

The x-axis indicates the classification of patches, while the y-axis denotes the molecular level determined at the patient level. Here, MSI serves as the positive class and MSS as the negative class: (a) A boxplot illustrating SNP rates for each patch. The y-axis quantifies the cumulative count of SNPs throughout the DNA sample. (b) A bar plot depicting the methylation types for each patch. The y-axis showcases the distribution of various methylation types across classification categories. (c) A boxplot highlighting the CNV rates for patches, with the y-axis measuring the proportion of the DNA sample that manifests CNV.

Fig 7.

Average and 95% CI ROC and PR curves for per-patient classification using: (a) the BP-CNN_SNP model compared to its corresponding baseline model, (b) the BP-CNN_CIMP model compared to its corresponding baseline model, and (c) the BP-CNN_CNV model compared to its corresponding baseline model.

Fig 8.

Box-plot visualization of (a) AUROC results, (b) AP results and (c) F1-scores for per-patient classification, comparing the biologically primed models with their corresponding baseline model on the test set over different training sessions.

Because the stratified k-fold approach partitions the training data differently across sessions, the performance of the baseline model can vary between experiments.

Fig 9.

Confusion matrices of the patient-level predictions for the different models.

Each matrix represents an average from the test set over various training sessions. The threshold for MSI prediction is determined by the best F1 score over the folds. (a) Baseline model corresponding to the BP-CNN_SNP folds. (b) Baseline model corresponding to the BP-CNN_CIMP folds. (c) BP-CNN_SNP model. (d) BP-CNN_CIMP model.
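
Selecting a decision threshold by best F1 score and deriving a confusion matrix from it can be sketched as below. The scores and labels are made up, and searching over the observed scores as candidate thresholds is an assumption; the caption does not specify how the search was run or how results were pooled over folds.

```python
def confusion(scores, labels, threshold):
    """Return (tp, fp, fn, tn), with MSI (label 1) as the positive class."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted_msi = score >= threshold
        if predicted_msi and label == 1:
            tp += 1
        elif predicted_msi:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_f1_threshold(scores, labels):
    """Pick the observed score that maximizes F1 when used as the threshold."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_score(*confusion(scores, labels, t)[:3]))

# Hypothetical patient-level MSI scores (P_w) and reference labels (1 = MSI, 0 = MSS).
scores = [0.10, 0.30, 0.35, 0.60, 0.80, 0.90]
labels = [0, 0, 1, 0, 1, 1]

threshold = best_f1_threshold(scores, labels)
matrix = confusion(scores, labels, threshold)
```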

Fig 10.

Average and 95% CI ROC and PR curves for per-patient classification using the BP-CNN_Combined model compared to the baseline model.

(a) ROC curve. (b) PR curve. (c), (d), and (e) compare the 5-fold AUROC, AP, and F1 results, respectively.

Fig 11.

A histogram showcasing the MSI scores for patches from selected patients, misclassified by the baseline model but accurately classified by our proposed models.

The x-axis represents the patch MSI probabilities given by the CNN, while the y-axis denotes the count of patches, normalized to the total number of patches for each patient. The comparisons are between (a) the Baseline and BP-CNN_SNP model, (b) the Baseline and BP-CNN_CIMP model, and (c) the Baseline and BP-CNN_Combined model.
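
The normalization described here, patch counts per probability bin divided by the patient's total number of patches, can be sketched as follows. The number of bins and the example probabilities are assumptions for illustration.

```python
def normalized_histogram(patch_probs, n_bins=10):
    """Fraction of a patient's patches falling into each MSI-probability bin on [0, 1]."""
    if not patch_probs:
        raise ValueError("patch_probs must contain at least one patch")
    counts = [0] * n_bins
    for p in patch_probs:
        # A probability of exactly 1.0 is placed in the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        counts[index] += 1
    total = len(patch_probs)
    return [count / total for count in counts]

# Hypothetical patch-level MSI probabilities for one patient.
patch_probs = [0.05, 0.15, 0.15, 0.95, 1.0]
hist = normalized_histogram(patch_probs)
```

Normalizing by the patient's own patch count makes histograms comparable between patients with different numbers of patches.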

Fig 12.

Patches of patients that were misclassified by our models.

Top row: patches of patients that were misclassified by the Baseline model and correctly classified by the BP-CNN_Combined model. (a) TCGA-AA-3833, Baseline: MSS, BP-CNN_Combined: MSI, reference: MSI (SNP<1200), (b) TCGA-AY-6197, Baseline: MSS, BP-CNN_Combined: MSI, reference: MSI (CIMP-low), (c) TCGA-A6-2685, Baseline: MSI, BP-CNN_Combined: MSS, reference: MSS, (d) TCGA-NH-A6GC, Baseline: MSI, BP-CNN_Combined: MSS, reference: MSS. Bottom row: patches of patients that were misclassified by both the Baseline model and the BP-CNN_Combined model. (e) TCGA-A6-2686, Baseline: MSS, BP-CNN_Combined: MSS, reference: MSI, (f) TCGA-AG-A02N, Baseline: MSS, BP-CNN_Combined: MSS, reference: MSI, (g) TCGA-AG-3881, Baseline: MSI, BP-CNN_Combined: MSI, reference: MSS, (h) TCGA-DC-6682, Baseline: MSI, BP-CNN_Combined: MSI, reference: MSS.
