
Fig 1.

Summary of how the TCGA COAD and READ datasets were used: the total cohort comprises n = 632 patients.

Some patients were excluded for technical reasons, leaving n = 430 patients. Of these, Kather et al. [7] pre-processed and published data for n = 360 patients, split into a training set and a testing set. The training set was balanced at the patch (p) level. For our research, we used stratified cross-validation folds at the patient level. The partitioning into these folds was informed by the novel sub-labels based on SNP rates and CIMP classifications.
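The patient-level stratification described above can be illustrated with a minimal sketch. It assumes each patient carries a single sub-label; the function name, the sub-label strings, and the patient IDs are hypothetical and not taken from the paper. Because folds are assigned per patient, all patches of a patient land in the same fold.

```python
from collections import defaultdict
import random


def stratified_patient_folds(patient_labels, n_folds=5, seed=0):
    """Assign each patient to exactly one fold, spreading every sub-label evenly.

    patient_labels: dict mapping patient ID -> sub-label (hypothetical format).
    Returns a list of n_folds lists of patient IDs.
    """
    by_label = defaultdict(list)
    for patient_id, label in sorted(patient_labels.items()):
        by_label[label].append(patient_id)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    offset = 0  # continue the round-robin across labels to keep fold sizes even
    for label in sorted(by_label):
        patients = by_label[label]
        rng.shuffle(patients)
        for i, patient_id in enumerate(patients):
            folds[(offset + i) % n_folds].append(patient_id)
        offset += len(patients)
    return folds


# Hypothetical cohort: 5 + 5 MSI patients with two sub-labels, 20 MSS patients.
patient_labels = {}
for i in range(30):
    if i < 5:
        patient_labels[f"P{i:02d}"] = "MSI_high_SNP"
    elif i < 10:
        patient_labels[f"P{i:02d}"] = "MSI_low_SNP"
    else:
        patient_labels[f"P{i:02d}"] = "MSS"

folds = stratified_patient_folds(patient_labels)
```

In this toy cohort every fold receives one patient of each MSI sub-label and four MSS patients.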


Fig 2.

Experimental flow for our exploration of the interplay between the genomic variants of CRC subtypes and cellular morphology.

The TCGA-CRC dataset, pre-processed by Kather et al. [7] (N = 360), is split into different sets for analysis. A baseline model is trained, and a molecular feature analysis is performed on its results. Based on this analysis, we define our data classes by the ranges and categories of SNP, CIMP, and CNV (the BP class definitions step). We then divide the classes into five stratified folds. Next, three models are trained to evaluate the interplay between genomic variations and cellular morphology: BP-CNNCIMP, BP-CNNSNP, and BP-CNNCNV, which classify the data based on CIMP, SNP, and CNV features, respectively. Based on their results, BP-CNNCIMP and BP-CNNSNP are then combined into BP-CNNCombined to incorporate the full set of genomic variations identified as influencing cellular morphology.


Fig 3.

Model architectures.

(a) Baseline Model Architecture: Patches are input into the Inception-Net [24] for feature extraction, with the last two layers acting as fully connected classifier layers. Outputs are propagated to a softmax layer that produces probabilities. N represents the number of patches for a patient, while Pi denotes the MSI probability for each patch. The MSI score for each patient, Pw, is the average of its corresponding MSI probabilities. (b) Biologically-Primed Model Architecture: As in the baseline model, the softmax layer outputs class probabilities at the patch level. Here, however, the MSI probability is calculated as the maximum of the MSI1 and MSI2 outputs. The calculation of Pw remains the same as in the baseline model.
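The two scoring rules in this caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the assumption that the biologically-primed softmax emits three values ordered (MSS, MSI1, MSI2) is ours.

```python
def patient_msi_score(patch_probs):
    """Pw: average of the patch-level MSI probabilities Pi for one patient."""
    return sum(patch_probs) / len(patch_probs)


def bp_patch_msi_prob(softmax_out):
    """Patch-level MSI probability for a biologically-primed model.

    softmax_out is assumed to be ordered (p_mss, p_msi1, p_msi2);
    this ordering is an assumption made for the sketch.
    """
    _, p_msi1, p_msi2 = softmax_out
    return max(p_msi1, p_msi2)


# Baseline: one MSI probability per patch (toy values).
baseline_patches = [0.25, 0.5, 0.75]
baseline_score = patient_msi_score(baseline_patches)

# Biologically primed: three class probabilities per patch (toy values).
bp_patches = [(0.5, 0.25, 0.25), (0.25, 0.75, 0.0), (0.25, 0.0, 0.75)]
bp_score = patient_msi_score([bp_patch_msi_prob(p) for p in bp_patches])
```

Taking the maximum rather than the sum of MSI1 and MSI2 follows the caption; the patient-level averaging is identical for both architectures.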


Fig 4.

Our BP-CNNCombined model.

Models A and B represent biologically-primed models informed by two distinct genomic variations. The network outputs from the trained and fixed models A and B are concatenated, fed into a linear layer, and then propagated to a softmax layer to yield probabilities. N represents the number of patches for each patient, and Pi indicates the corresponding MSI probabilities for these patches. The MSI score for each patient, denoted Pw, is the average of its respective MSI probabilities.
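The forward pass of the combination head might look like the sketch below. It is only illustrative: the output sizes of models A and B, the two-class (MSS, MSI) output, and the weight values are assumptions, and in practice the linear layer would be trained while A and B stay fixed.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combined_head(out_a, out_b, weights, bias):
    """Concatenate the outputs of fixed models A and B, apply a linear layer, then softmax."""
    features = list(out_a) + list(out_b)
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Hypothetical per-patch outputs of models A and B (three values each).
out_a = [0.2, 0.7, 0.1]
out_b = [0.3, 0.1, 0.6]

# Hypothetical linear layer: 6 inputs -> 2 outputs, assumed order (MSS, MSI).
weights = [
    [1.0, -0.5, -0.5, 1.0, -0.5, -0.5],
    [-1.0, 0.5, 0.5, -1.0, 0.5, 0.5],
]
bias = [0.0, 0.0]

probs = combined_head(out_a, out_b, weights, bias)
patch_msi_prob = probs[1]  # Pi for this patch under the assumed class order
```

The per-patient score Pw is then the mean of `patch_msi_prob` over that patient's patches, as in the other models.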


Fig 5.

Baseline model results for per-patient classification of the test set, validated over 5 folds.

Average and 95% CI curves: (a) ROC curve, (b) PR curve.


Fig 6.

The distribution of patient-level molecular features in the test set, categorized based on the patch-level classification by the baseline model.

The x-axis indicates the classification of patches, while the y-axis denotes the molecular level determined at the patient level. Here, MSI serves as the positive class and MSS as the negative class: (a) A boxplot illustrating SNP rates for each patch. The y-axis quantifies the cumulative count of SNPs throughout the DNA sample. (b) A bar plot depicting the methylation types for each patch. The y-axis showcases the distribution of various methylation types across classification categories. (c) A boxplot highlighting the CNV rates for patches, with the y-axis measuring the proportion of the DNA sample that manifests CNV.


Fig 7.

Average and 95% CI ROC and PR curves for per-patient classification using: (a) the BP-CNNSNP model compared to its corresponding baseline model, (b) the BP-CNNCIMP model compared to its corresponding baseline model, and (c) the BP-CNNCNV model compared to its corresponding baseline model.


Fig 8.

Box-plot visualization of (a) AUROC results, (b) AP results and (c) F1-scores for per-patient classification, comparing the biologically primed models with their corresponding baseline model on the test set over different training sessions.

Note that because a stratified k-fold approach was used to partition the training data across sessions, the baseline model’s performance can vary between experiments.


Fig 9.

Confusion matrices of the patient-level predictions for the different models.

Each matrix represents an average from the test set over various training sessions. The threshold for MSI prediction is determined by the best F1 score over the folds. (a) Baseline model corresponding to the BP-CNNSNP folds. (b) Baseline model corresponding to the BP-CNNCIMP folds. (c) BP-CNNSNP model. (d) BP-CNNCIMP model.
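Choosing a decision threshold by best F1 score can be sketched as below. The caption does not say how scores are aggregated across folds, so this sketch assumes a single pooled set of patient-level MSI scores; the function names and toy values are hypothetical.

```python
def f1_at_threshold(labels, scores, threshold):
    """F1 score when patients with score >= threshold are predicted MSI (label 1)."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(labels, scores):
    """Return the observed score that maximizes F1; ties go to the lowest threshold."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(labels, scores, t))


# Toy patient-level data: 1 = MSI, 0 = MSS.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]

threshold = best_f1_threshold(labels, scores)
predictions = [1 if s >= threshold else 0 for s in scores]
```

The resulting predictions are what a confusion matrix would tabulate against the reference labels.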


Fig 10.

Average and 95% CI ROC and PR curves for per-patient classification using the BP-CNNCombined model compared to the baseline model.

(a) ROC curve. (b) PR curve. (c), (d), and (e) compare the 5-fold AUROC, AP, and F1 results, respectively.


Fig 11.

A histogram showcasing the MSI scores for patches from selected patients, misclassified by the baseline model but accurately classified by our proposed models.

The x-axis represents the patch MSI probabilities given by the CNN, while the y-axis denotes the count of patches, normalized to the total number of patches for each patient. The comparisons are between (a) the Baseline and BP-CNNSNP model, (b) the Baseline and BP-CNNCIMP model, and (c) the Baseline and BP-CNNCombined model.
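The per-patient normalization of the histogram can be sketched as follows. The bin count of 10 and the function name are assumptions made for illustration; the caption only states that counts are divided by each patient's total number of patches.

```python
def normalized_histogram(patch_probs, n_bins=10):
    """Fraction of a patient's patches falling in each MSI-probability bin over [0, 1]."""
    counts = [0] * n_bins
    for p in patch_probs:
        # A probability of exactly 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
    total = len(patch_probs)
    return [c / total for c in counts]


# Toy patch-level MSI probabilities for one patient.
hist = normalized_histogram([0.05, 0.15, 0.95, 1.0])
```

Normalizing by the patient's patch count makes histograms comparable between patients with very different numbers of patches.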


Fig 12.

Patches of patients that were misclassified by our models.

Top row: patches of patients that were misclassified by the Baseline model and correctly classified by the BP-CNNCombined model. (a) TCGA-AA-3833, Baseline: MSS, BP-CNNCombined: MSI, reference: MSI (SNP<1200), (b) TCGA-AY-6197, Baseline: MSS, BP-CNNCombined: MSI, reference: MSI (CIMP-low), (c) TCGA-A6-2685, Baseline: MSI, BP-CNNCombined: MSS, reference: MSS, (d) TCGA-NH-A6GC, Baseline: MSI, BP-CNNCombined: MSS, reference: MSS. Bottom row: patches of patients that were misclassified by both the Baseline model and the BP-CNNCombined model. (e) TCGA-A6-2686, Baseline: MSS, BP-CNNCombined: MSS, reference: MSI, (f) TCGA-AG-A02N, Baseline: MSS, BP-CNNCombined: MSS, reference: MSI, (g) TCGA-AG-3881, Baseline: MSI, BP-CNNCombined: MSI, reference: MSS, (h) TCGA-DC-6682, Baseline: MSI, BP-CNNCombined: MSI, reference: MSS.
