Abstract
Background and objective
Glioblastoma (GBM) is one of the most aggressive and lethal human cancers. Intra-tumoral genetic heterogeneity poses a significant challenge for treatment. Biopsy is invasive, which motivates the development of non-invasive, MRI-based machine learning (ML) models to quantify intra-tumoral genetic heterogeneity for each patient. This capability holds great promise for enabling better therapeutic selection to improve patient outcome.
Methods
We proposed a novel Weakly Supervised Ordinal Support Vector Machine (WSO-SVM) to predict regional genetic alteration status within each GBM tumor using MRI. WSO-SVM was applied to a unique dataset of 318 image-localized biopsies with spatially matched multiparametric MRI from 74 GBM patients. The model was trained to predict the regional genetic alteration of three GBM driver genes (EGFR, PDGFRA and PTEN) based on features extracted from the corresponding region of five MRI contrast images. For comparison, a variety of existing ML algorithms were also applied. The classification accuracy for each gene was compared across the algorithms. The SHapley Additive exPlanations (SHAP) method was further applied to compute contribution scores of different contrast images. Finally, the trained WSO-SVM was used to generate prediction maps within the tumoral area of each patient to help visualize the intra-tumoral genetic heterogeneity.
Results
WSO-SVM achieved 0.80 accuracy, 0.79 sensitivity, and 0.81 specificity for classifying EGFR; 0.71 accuracy, 0.70 sensitivity, and 0.72 specificity for classifying PDGFRA; 0.80 accuracy, 0.78 sensitivity, and 0.83 specificity for classifying PTEN; these results significantly outperformed the existing ML algorithms. Using SHAP, we found that the relative contributions of the five contrast images differ between genes, which is consistent with findings in the literature. The prediction maps revealed extensive intra-tumoral region-to-region heterogeneity within each individual tumor in terms of the alteration status of the three genes.
Citation: Wang L, Wang H, D’Angelo F, Curtin L, Sereduk CP, Leon GD, et al. (2024) Quantifying intra-tumoral genetic heterogeneity of glioblastoma toward precision medicine using MRI and a data-inclusive machine learning algorithm. PLoS ONE 19(4): e0299267. https://doi.org/10.1371/journal.pone.0299267
Editor: Bardia Yousefi, University of Maryland College Park, UNITED STATES
Received: July 31, 2023; Accepted: February 6, 2024; Published: April 3, 2024
Copyright: © 2024 Wang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The dataset is publicly available on Figshare. The project page can be accessed at: https://figshare.com/projects/Quantifying_intra-tumoral_genetic_heterogeneity_of_glioblastoma_toward_precision_medicine_using_MRI_and_a_data-inclusive_machine_learning_algorithm/195419.
Funding: This work was supported by NIH grants CA220378 and CA250481, and NSF grant DMS-2053170.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
Glioblastoma (GBM) is one of the most aggressive and lethal human cancers, with a median overall survival of only about 15 months despite best available standard therapy [1]. Intra-tumoral genetic heterogeneity is a major contributor to poor clinical outcomes [2]. Each tumor is comprised of genetically distinct subpopulations with different sensitivities to treatment, and genetic targets from one biopsy location may not accurately reflect those from other parts of the same tumor [3]. Moreover, due to the invasive nature of the disease, diffusely invaded GBM cells are always left behind in the brain after resection, and these residual regions may be genetically distinct from the biopsy samples collected during surgery [4, 5]. The region-to-region genetic variability within a single tumor provides potential mechanisms for therapeutic escape and makes single targeted therapies less effective [6].
There are substantial challenges for quantifying intra-tumoral genetic heterogeneity of GBM. Ideally, one would want to take biopsy samples from many different regions of a tumor and perform genetic analysis of each sample. This, however, is infeasible due to the invasive nature of biopsy. Although the central tumor mass can often be surgically removed, the invasive portions of the tumor are often left unresected and unbiopsied given the risk to adjacent neurologic structures. Thus, biopsy alone is insufficient to characterize the full landscape of the intra-tumoral heterogeneity [2, 7].
Neuroimaging techniques, such as MRI, provide data of the entire tumor and even the whole brain in a non-invasive manner. The emerging field of radiogenomics has demonstrated the feasibility of using MRI features to predict genetic characteristics of GBM via machine learning (ML). For example, Kha et al. [8] proposed an eXtreme Gradient Boosting (XGBoost)-based model to predict the 1p/19q codeletion status in a binary classification task for lower-grade gliomas. Lam et al. [9] developed a hybrid machine learning-based radiomics approach by incorporating a genetic algorithm and an XGBoost classifier to classify low-grade glioma molecular subtypes. Akbari et al. [10] used a Support Vector Machine (SVM) to predict Epidermal Growth Factor Receptor (EGFR)-vIII mutation based on multiparametric MRI features extracted from tumor regions. Tykocinski et al. [11] predicted EGFR-vIII mutation based on features extracted from perfusion-weighted MRI using multivariable logistic regression. Kickingereder et al. [12] utilized a stochastic gradient boosting machine, random forest, and logistic regression to predict the copy number variants (CNVs) of several GBM driver genes such as EGFR, Platelet-Derived Growth Factor Receptor Alpha (PDGFRA), and Phosphatase and Tensin Homolog (PTEN) based on multiparametric MRI. Chen et al. [13] developed a convolutional neural network to predict PTEN mutation using multiparametric MRI. However, these existing studies focus on predicting overall or average genetic status for the entire tumor, so they are suitable only for relatively homogeneous tumors where genetic status does not significantly vary region-to-region. Although these studies have demonstrated the predictive utility of MRI, they fall short of identifying intra-tumoral or regional genetic heterogeneity within each tumor.
This paper aims to develop an ML model that can predict the genetic status region-by-region within a tumoral area of interest (AOI) of each patient using MRI. The model, denoted as f:x→y, takes as input a vector x consisting of MRI features extracted from each region within a tumoral AOI and outputs the genetic status of that region, y, where y = 1 or 2 represents that the gene is non-altered or altered, respectively. The resulting regional predictions can be used to generate a prediction map that reveals the intra-tumoral heterogeneity across the AOI.
To train the ML model f, a binary classification approach can be considered by using a training set consisting of (xi, yi) for n biopsy samples. However, the biopsy sample size is often small, and a more robust approach is to use semi-supervised learning (SSL) [14]. SSL trains f by including both the biopsy/labeled samples (xi, yi) and unlabeled tumoral samples, (xj), i.e., samples from the unbiopsied regions of the tumoral AOI. Additionally, it is possible to leverage samples from outside the tumoral AOI (i.e., the normal brain area), (xk). To include these normal brain samples, one option is to treat them as a third class (class 0), in addition to non-altered gene (class 1) and altered gene (class 2) within the tumoral AOI, and train a three-class classifier. The other option, which may be more appropriate, is to train an ordinal classifier [15–17] by considering that class 0, 1, and 2 have an intrinsic order of increasing abnormality. Fig 1 illustrates the different modeling options. However, none of these models can include all available data. To address this gap, we propose a new model called Weakly-Supervised Ordinal SVM (WSO-SVM), which is designed to integrate unlabeled tumoral samples and normal brain samples beyond just biopsy (labeled) samples to enhance the model’s learning capacity.
WSO-SVM is a novel ordinal classifier based on SVM. Unlike the existing algorithms that only utilize labeled samples from each class (e.g., normal brain samples as class 0, and biopsy samples as classes 1 and 2), WSO-SVM introduces a unique optimization formulation to allow the incorporation of unlabeled tumoral samples (class 1 or 2, not 0). This helps identify accurate classification boundaries and improve prediction performance. The development of WSO-SVM is significant as it represents the first method capable of integrating multiple data sources, including biopsy samples, unlabeled tumoral samples, and normal brain samples, to train a robust classifier for predicting regional genetic status using MRI. In our case study, we demonstrate the superior performance of WSO-SVM compared to a variety of ML algorithms. The clinical utility of this work lies in the non-invasive quantification of intra-tumoral genetic heterogeneity using MRI for individual patients. WSO-SVM enables the generation of regional prediction maps for GBM driver genes such as EGFR, PDGFRA, and PTEN across the entire tumoral AOI for each patient. These maps have practical implications in guiding therapy selection and predicting response to targeted therapies, such as EGFR inhibitors [7]. Furthermore, the predictive maps reveal the co-existence of genomically distinct tumor subpopulations within individual tumors, which can enhance our understanding and support the development of new approaches, such as adaptive therapy, that leverage the interplay and competition between different molecular subpopulations for therapeutic benefit [18].
2. Method
Fig 2 shows a pipeline of the proposed method whose components are discussed in subsequent sections.
Left: model training; Right: model deployment.
2.1 Data collection
This study used data from a cohort of 74 GBM patients with IRB approval from Barrow Neurological Institute (BNI) and Mayo Clinic Arizona (MCA). These patients were prospectively recruited for the study. The recruitment period is from February 29, 2012, until present. All patients provided written informed consent. The data were accessed for research purposes from February 29, 2012, until present. A total of 318 biopsy samples were acquired from these patients (average: 4; range: 1–13). Each patient went through a pre-operative multiparametric MRI exam, from which five contrast images were obtained: T1-weighted contrast-enhanced image (T1+C), T2-weighted image (T2), mean diffusivity (MD), fractional anisotropy (FA), and relative cerebral blood volume (rCBV).
2.2 Biopsy sample analysis
Array CGH data were obtained for a subset of biopsy samples [19]. Whole exome sequencing (WES) was performed on the remaining biopsies and paired blood samples. Quality control was performed using the MultiQC toolkit. The aligned paired-end clean reads were processed using the Burrows-Wheeler Aligner and GATK to remove low-quality reads and realign around indels. Somatic SNVs and indels were detected using a combination of six variant calling algorithms: Freebayes, MuTect2, TNhaplotyper, TNscope, TNsnv, and VarScan2. Somatic copy number and tumor purity were estimated from WES data using PureCN. GISTIC2 analysis was performed to identify recurrently amplified or deleted genomic regions by integrating the results from individual patients.
We focused on three GBM driver genes: EGFR, PDGFRA, and PTEN. For each gene, we considered the gene altered (class 2) if it had an abnormal CNV or was mutated, and non-altered (class 1) otherwise. For EGFR and PDGFRA, we followed the literature [19] and considered amplification as abnormal CNV; for PTEN, deletion or loss was considered as abnormal CNV [20]. To maximize the sample size in ML training, we included all available samples for each gene. There are 130/171, 53/238, and 206/109 biopsy samples with altered/non-altered EGFR, PDGFRA, and PTEN, respectively.
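The labeling rule above can be sketched in a few lines of Python. This is only an illustration: the function name, the CNV state strings, and the boolean mutation flag are assumptions introduced here, not part of the study's pipeline.

```python
# Hypothetical encoding of the labeling rule; names and CNV state strings
# are illustrative assumptions, not taken from the study's code.
ABNORMAL_CNV = {
    "EGFR": {"amplification"},
    "PDGFRA": {"amplification"},
    "PTEN": {"deletion", "loss"},
}


def alteration_class(gene: str, cnv_state: str, mutated: bool) -> int:
    """Return 2 (altered) if the gene has an abnormal CNV or is mutated, else 1."""
    if gene not in ABNORMAL_CNV:
        raise ValueError(f"unsupported gene: {gene}")
    return 2 if (cnv_state in ABNORMAL_CNV[gene] or mutated) else 1
```

Under this encoding, an amplified PTEN without a mutation would still be labeled class 1, because only deletion or loss counts as an abnormal CNV for that gene.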
2.3 MRI preprocessing and feature extraction
Detailed MRI protocols and preprocessing approaches can be found in S1 Appendix. The same approaches have been used in our prior publications [2, 7, 21], which have shown robust performance.
The MRI features corresponding to each biopsy sample were extracted from a defined “region”, i.e., an 8×8 pixel² window centered at the sampling location. This specific window size was thoughtfully chosen due to its approximate equivalence to the physical size of biopsy samples, ensuring an alignment between the MRI features and the genetic status derived from the biopsy. Moreover, prior research findings have supported the suitability of this window size for effectively capturing the intra-tumoral heterogeneity of GBM [2, 7, 19].
From this window, we extracted 280 features from five aforementioned MRI contrast images, which included statistical features and texture features using two well-established texture analysis algorithms, Gray-Level Co-occurrence Matrix (GLCM) [22] and Gabor Filters (GF) [23]. Please find names of these features in S1 Appendix. Fig 3 depicts the biological connection between genetic alterations and these imaging-phenotypic features. These features have been widely used in the radiomics literature for GBM to aid in diagnosis, prognosis, and prediction of genetics-related tumor characteristics, such as genetic subtypes and copy number variations [7, 19, 24–27].
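To make the texture features more concrete, the following sketch computes one classic GLCM statistic, contrast, for a single window. It is a simplified illustration rather than the feature extractor used in the study: it assumes the window has already been quantized to integer gray levels, uses a single pixel offset, and builds a non-symmetric co-occurrence matrix. The function name and parameters are hypothetical.

```python
def glcm_contrast(window, levels, offset=(0, 1)):
    """Contrast of a normalized, non-symmetric GLCM for one pixel offset.

    `window` is a 2D list of integer gray levels in [0, levels).
    `offset` is the (row, column) displacement between paired pixels.
    """
    d_row, d_col = offset
    n_rows, n_cols = len(window), len(window[0])
    counts = [[0] * levels for _ in range(levels)]
    total = 0
    for r in range(n_rows):
        for c in range(n_cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < n_rows and 0 <= c2 < n_cols:
                counts[window[r][c]][window[r2][c2]] += 1
                total += 1
    if total == 0:
        raise ValueError("offset leaves no valid pixel pairs in the window")
    # Contrast weights each co-occurrence by the squared gray-level difference.
    return sum(
        (i - j) ** 2 * counts[i][j] / total
        for i in range(levels)
        for j in range(levels)
    )
```

A window with constant rows yields zero contrast for a horizontal offset, whereas alternating columns yield a high value, which is the kind of local heterogeneity such texture features are meant to capture.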
As shown in Fig 2, the training of WSO-SVM requires not only biopsy samples, but also unlabeled tumoral samples and normal brain samples. These are sampled from a pre-segmented tumoral AOI and the contralateral AOI based on the MRI of each patient, respectively. The tumoral AOI was segmented by following standard procedures [2, 19], which is the union of the contrast-enhancing portion (CE) and the non-enhancing portion (NE) of the tumor. The contralateral AOI is located on the opposite side of the brain from the tumor and is considered “normal”. To extract MRI features for these samples, the same approach as that used for biopsy samples was adopted.
The selection of unlabeled tumoral samples and normal brain samples was based on multi-fold considerations: (a) Representation of tumoral heterogeneity: Biologically, a GBM tumor includes a contrast-enhancing portion (CE) and a non-enhancing portion (NE). The former harbors proliferative tumor cells, while the latter harbors invading tumor to the surrounding brain tissue [28]. To ensure our unlabeled samples capture this biological heterogeneity of each tumor, an equal number of samples were taken from CE and NE. (b) Avoidance of outlier samples: We were careful to avoid selecting samples from areas that could be considered outliers. Notably, we excluded regions like necrosis, where the tissue characteristics significantly differ [28]. Additionally, for tumors located near fixed brain structures like the skull or cerebrospinal fluid, precautions were taken to prevent sample overlap with these structures. (c) Model accuracy and efficiency: Since unlabeled tumoral samples and normal brain samples are “auxiliary” samples to biopsies, their size should not be excessively larger even though acquiring these samples is much easier than biopsies. This is to prevent sample imbalance and potential dilution of the predominant influence of biopsy samples on model training. Therefore, we kept an equal number of unlabeled tumoral samples and normal brain samples, with their combined total aligning with that of biopsy samples. This choice also ensures the computational efficiency of model training.
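The sample-size considerations above can be sketched as follows. The function and argument names are hypothetical, the candidate locations are assumed to have already been filtered (for example, necrosis and regions overlapping fixed structures removed), and rounding down when the biopsy count is not divisible by four is an assumption made here for simplicity.

```python
import random


def select_auxiliary_samples(n_biopsy, ce_pixels, ne_pixels, normal_pixels, seed=0):
    """Draw unlabeled tumoral and normal brain sample locations.

    Half of the unlabeled samples come from CE and half from NE; the number of
    normal brain samples equals the number of unlabeled samples; the combined
    total is at most n_biopsy (rounded down to a multiple of four).
    """
    rng = random.Random(seed)
    per_region = n_biopsy // 4
    # random.Random.sample raises ValueError if a region has too few candidates.
    unlabeled = rng.sample(ce_pixels, per_region) + rng.sample(ne_pixels, per_region)
    normal = rng.sample(normal_pixels, 2 * per_region)
    return unlabeled, normal
```

For example, with 40 training biopsies this draws 10 CE and 10 NE locations as unlabeled tumoral samples and 20 contralateral locations as normal brain samples.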
Moreover, as depicted in Fig 2, when the trained WSO-SVM is applied to a patient, the goal is to generate a regional prediction map of the genetic status within the tumoral AOI. To accomplish this, an 8×8 pixel² sliding window with a stride size of one pixel was placed at each pixel within the tumoral AOI, and MRI features were extracted from each window.
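A minimal sketch of the sliding-window traversal is given below. It is illustrative only: images and masks are plain nested lists, the window placement for an even window size (starting four pixels above and to the left of the center pixel) and the skipping of windows that would extend past the image border are assumptions, and the mean intensity stands in for the full feature set.

```python
def sliding_windows(image, mask, size=8):
    """Yield ((row, col), window) for every AOI pixel, with a stride of one pixel.

    `image` and `mask` are 2D lists of equal shape; `mask` is truthy inside the AOI.
    """
    half = size // 2
    n_rows, n_cols = len(image), len(image[0])
    for r in range(n_rows):
        for c in range(n_cols):
            if not mask[r][c]:
                continue
            top, left = r - half, c - half
            # Assumption: windows that would leave the image are skipped.
            if top < 0 or left < 0 or top + size > n_rows or left + size > n_cols:
                continue
            yield (r, c), [row[left:left + size] for row in image[top:top + size]]


def mean_intensity(window):
    """A stand-in feature: the average pixel value in the window."""
    values = [v for row in window for v in row]
    return sum(values) / len(values)
```

Applying a trained classifier to the features of each yielded window, and writing the predicted class back to the center pixel, produces the kind of regional prediction map described above.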
2.4 Proposed WSO-SVM model
Let D denote a training set that consists of N patients. Assume there are $n_1$ and $n_2$ total biopsy samples from these patients with a gene of interest being non-altered (y = 1) and altered (y = 2), respectively. Let $x_i^{(1)}$ and $x_{i'}^{(2)}$ denote the MRI feature vectors for a biopsy sample in class 1 and 2, respectively; $i = 1,\dots,n_1$; $i' = 1,\dots,n_2$. Also, assume there are $m_{12}$ unlabeled tumoral samples (y = 1 or 2). Let $x_j^{(12)}$ denote the MRI feature vector for an unlabeled sample, $j = 1,\dots,m_{12}$. Additionally, assume there are $m_0$ normal brain samples (y = 0). Let $x_k^{(0)}$ denote the MRI feature vector for a normal brain sample, $k = 1,\dots,m_0$.
As illustrated in Fig 4, WSO-SVM maps the MRI feature vector of each sample, x, into a high-dimensional Reproducing Kernel Hilbert Space, ϕ(x), where a linear classifier $w^{T}\phi(x)$ is constructed to separate the three classes (y = 0,1,2) with the largest possible margin, 2/‖w‖, while also minimizing the empirical errors of samples that cannot be classified correctly. The goal of training WSO-SVM is to find the weight vector w and two classification boundaries, b0 and b1. Formally, we construct WSO-SVM as the following optimization:
Subject to:
(1)
(2)
(3)
(4)
(5)
(6)
(7)
The objective function seeks to maximize the margin that separates different classes. The constraints (1)-(2) are designed to classify biopsy samples into classes 1 and 2, while introducing slack variables to allow for some misclassification errors; these slack variables are bounded in (3). The constraints (4)-(5) are designed to classify normal brain samples into class 0, and to prevent unlabeled tumoral samples and biopsy samples from being classified as class 0, while introducing a second set of slack variables to allow for some misclassification errors; these are bounded in (6). The constraint in (7) is intended to retain the intrinsic order of the ordinal classes, 0, 1, and 2.
It is important to note that WSO-SVM is different from ordinal SVM in its ability to incorporate unlabeled tumoral samples. This is achieved by introducing a constraint in Eq (5) to prevent the classification of these samples as normal brain samples (class 0). The inclusion of unlabeled tumoral samples helps better identify the classification boundary b0, and also contributes to the estimation of the weight vector w, indirectly aiding in a better identification of b1.
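For reference, one formulation consistent with the verbal description of the objective and constraints (1)-(7) can be written as below. It should be read as a sketch under stated assumptions rather than the exact statement of the model: the feature-vector notation ($x_i^{(1)}$, $x_{i'}^{(2)}$, $x_k^{(0)}$ for non-altered biopsy, altered biopsy, and normal brain samples), the slack symbols $\xi$ and $\eta$, the unit margin offsets, and the index set $\mathcal{T}$ (all tumoral samples, i.e., the $m_{12}$ unlabeled samples together with the $n_1 + n_2$ biopsy samples) are notation introduced here. $C_1$ and $C_2$ are the tuning parameters that weight the two groups of slack variables.

```latex
\begin{aligned}
\min_{w,\,b_0,\,b_1,\,\xi,\,\eta}\quad
  & \tfrac{1}{2}\lVert w\rVert^{2}
    + C_1\Big(\sum_{i=1}^{n_1}\xi_i^{(1)} + \sum_{i'=1}^{n_2}\xi_{i'}^{(2)}\Big)
    + C_2\Big(\sum_{k=1}^{m_0}\eta_k^{(0)} + \sum_{t\in\mathcal{T}}\eta_t^{(12)}\Big) \\
\text{subject to}\quad
  & w^{T}\phi\big(x_i^{(1)}\big) - b_1 \le -1 + \xi_i^{(1)}, \quad i = 1,\dots,n_1 && (1)\\
  & w^{T}\phi\big(x_{i'}^{(2)}\big) - b_1 \ge 1 - \xi_{i'}^{(2)}, \quad i' = 1,\dots,n_2 && (2)\\
  & \xi_i^{(1)} \ge 0,\quad \xi_{i'}^{(2)} \ge 0 && (3)\\
  & w^{T}\phi\big(x_k^{(0)}\big) - b_0 \le -1 + \eta_k^{(0)}, \quad k = 1,\dots,m_0 && (4)\\
  & w^{T}\phi\big(x_t\big) - b_0 \ge 1 - \eta_t^{(12)}, \quad t \in \mathcal{T} && (5)\\
  & \eta_k^{(0)} \ge 0,\quad \eta_t^{(12)} \ge 0 && (6)\\
  & b_0 \le b_1 && (7)
\end{aligned}
```

In this reading, the unlabeled tumoral samples enter only through constraint (5): they are pushed to the tumoral side of $b_0$ without being assigned to class 1 or class 2, which is how they can inform $w$ and $b_0$ without requiring a genetic label.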
It is easier to solve the WSO-SVM optimization in its dual form which is given in Proposition 1.
Proposition 1: The dual form of the primal WSO-SVM optimization problem in Eqs (1)–(7) is:
subject to:
where
,
, and K is a covariance matrix with
that can be computed by a kernel function defined on the feature space. C1 and C2 are tuning parameters. (Proof in S1 Appendix.)
The dual problem is a convex quadratic programming problem, which can be solved by a standard quadratic optimization solver such as CPLEX.
Once the optimal solutions of α and β in the dual problem are obtained, we can obtain the optimal coefficients in the primal problem, w, and further get $h(x) = w^{T}\phi(x)$. Also, b0 and b1 can be estimated as: b0 = h(x)−y for any (x, y) belonging to normal brain samples (or biopsy and unlabeled tumoral samples) whose corresponding β(0) (or β(12)) satisfies 0<β(0) (or β(12))<C2; b1 = h(x)−y for any (x, y) belonging to non-altered biopsy samples (or altered biopsy samples) whose corresponding α(1) (or α(2)) satisfies 0<α(1) (or α(2))<C1. Then, we can obtain the discriminant functions for any new sample x*, i.e., $f_0(x^*) = h(x^*) - b_0$ and $f_1(x^*) = h(x^*) - b_1$. The decision rule for classifying the new sample x* is: it belongs to class 2 if f1(x*)≥0, to class 1 if f0(x*)≥0 and f1(x*)<0, and to class 0 if f0(x*)<0.
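The decision rule can be sketched as a small function. It assumes the discriminant functions take the form f0(x*) = h(x*) − b0 and f1(x*) = h(x*) − b1, with h(x*) already evaluated by the trained model; the function name and the explicit check of the ordering b0 ≤ b1 are illustrative additions.

```python
def classify(h_value: float, b0: float, b1: float) -> int:
    """Map the score h(x*) to an ordinal class using two boundaries b0 <= b1."""
    if b0 > b1:
        raise ValueError("boundaries must satisfy b0 <= b1")
    f0 = h_value - b0  # tumoral (class 1 or 2) versus normal brain (class 0)
    f1 = h_value - b1  # altered (class 2) versus non-altered (class 1)
    if f1 >= 0:
        return 2
    if f0 >= 0:
        return 1
    return 0
```

Because b0 ≤ b1, the three cases are mutually exclusive and exhaustive: a sample with f1(x*) ≥ 0 necessarily has f0(x*) ≥ 0 as well.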
2.4.1. Training and cross validation (CV).
We used 10-fold CV to mitigate the risk of overfitting. To further reduce potential bias in evaluating model performance due to the specific fold division in CV, we repeated the CV procedure 30 times. We reported the model’s average performance and the standard deviation across the 30 repetitions with the latter capturing uncertainty. Specifically, the biopsy samples were divided into 10 folds. In each iteration, WSO-SVM was trained based on 9 folds of the biopsy samples and randomly selected unlabeled tumoral samples and normal brain samples of the same size according to the considerations illustrated in Sec. 2.3.
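The repeated cross-validation protocol can be sketched as an index generator. This is a simplified illustration: the function name is hypothetical, and the folds are formed by a plain random shuffle of biopsy indices, with no stratification by class or grouping by patient assumed.

```python
import random


def repeated_kfold_indices(n_samples, n_folds=10, n_repeats=30, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for rep in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh fold division for every repetition
        folds = [order[f::n_folds] for f in range(n_folds)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield rep, f, train_idx, test_idx
```

In each iteration, the training indices would be combined with freshly drawn unlabeled tumoral and normal brain samples before fitting the model, and performance would be averaged within each repetition before computing the mean and standard deviation across the 30 repetitions.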
2.4.2. Choice of tuning parameters.
There are two key tuning parameters for WSO-SVM according to Proposition 1, C1 and C2. C1 affects the classification boundary between biopsy samples in class 1 (gene not altered) and class 2 (gene altered). C2 affects the classification boundary between class 1 or 2 (comprising tumoral samples, both unlabeled and labeled) and class 0 (normal brain samples). Our experiments found that distinguishing between class 1/2 and class 0 was relatively easy, which aligned with the intuition that discerning tumoral samples from normal brain samples should inherently be a less formidable task than discerning altered from non-altered tumoral samples. Therefore, we tuned C2 on a coarser grid within the range of 0.01 to 100 and kept multiple settings that yielded >80% accuracy in differentiating class 1/2 from class 0. At each setting, we tuned C1 on a finer grid between 0.01 and 100, and selected C1 with the highest accuracy to differentiate class 1 and 2.
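The two-stage search can be sketched as follows. Several details are assumptions made for illustration: the grids are log-spaced (one point per decade for the coarse grid, four for the fine grid), the C2 screening step uses a fixed default C1, the final choice is the best (C1, C2) pair across all retained C2 settings, and the two accuracy callables are hypothetical stand-ins for cross-validated evaluations of a trained model.

```python
def log_grid(low_exp, high_exp, points_per_decade):
    """Log-spaced grid from 10**low_exp to 10**high_exp, inclusive."""
    steps = (high_exp - low_exp) * points_per_decade
    return [10 ** (low_exp + i / points_per_decade) for i in range(steps + 1)]


def two_stage_search(acc_tumor_vs_normal, acc_altered_vs_not,
                     c1_default=1.0, threshold=0.80):
    """Screen C2 on a coarse grid, then tune C1 on a fine grid.

    Both callables take (c1, c2) and return an accuracy in [0, 1].
    """
    coarse = log_grid(-2, 2, 1)  # 0.01, 0.1, 1, 10, 100
    fine = log_grid(-2, 2, 4)    # 17 points between 0.01 and 100
    kept = [c2 for c2 in coarse
            if acc_tumor_vs_normal(c1_default, c2) > threshold]
    if not kept:
        raise ValueError("no C2 setting exceeded the accuracy threshold")
    best_acc, best_c1, best_c2 = max(
        ((acc_altered_vs_not(c1, c2), c1, c2) for c2 in kept for c1 in fine),
        key=lambda item: item[0],
    )
    return {"C1": best_c1, "C2": best_c2, "accuracy": best_acc}
```

Screening C2 first keeps the expensive fine-grid search over C1 confined to settings where the easier tumoral-versus-normal separation is already adequate.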
2.4.3. Generation of a regional predictive map of genetic status for each patient.
To personalize the model toward each patient’s data, we re-trained WSO-SVM under the previously found optimal tuning parameter setting but using randomly selected unlabeled tumoral samples and normal brain samples from the specific patient. Next, we applied the model to predict the gene status for each sliding window within the tumoral AOI of the patient, based on MRI features extracted from that window. The resulting predictions formed the predictive map for that patient.
2.4.4. Time complexity in training and deployment.
As WSO-SVM adopted SVM as its base model, its time complexity in model training is similar to that of SVM [29], which ranges between O(n²×d) and O(n³×d), where n is the sample size and d is the feature dimension. Currently, we used quadratic programming to solve the WSO-SVM optimization, which can be further expedited by using more advanced optimization algorithms such as sequential minimal optimization [30] and stochastic gradient descent [31]. While SVM-type models are not the most computationally efficient, the training time complexity is acceptable and the performance gain over more efficient methods has made it an appealing choice for large datasets in various applications. In our application, the model training is done offline, which makes it feasible to train WSO-SVM on large datasets. During deployment, the trained model generates regional genetic characteristics within the tumoral area on a patient-by-patient basis. The time required to produce the prediction map for an individual patient is less than 30 seconds when executed on a standard desktop computer. This level of efficiency aligns well with the clinical use case, ensuring that the model can be deployed in a timely and practical manner.
2.5 Model interpretation
It is important to understand the contribution of different MRI features to the prediction made by WSO-SVM. While WSO-SVM can use either a linear or non-linear kernel, we found that a non-linear kernel produced better performance. Also, previous studies have shown that the relationship between MRI features and genetic status is highly non-linear [32]. To interpret the non-linear WSO-SVM, we utilized a popular, model-agnostic method called SHapley Additive exPlanations (SHAP) [33]. Essentially, SHAP estimates the contribution of a feature, referred to as the SHAP value, by computing the difference in the model’s prediction when the feature is present versus absent. The higher the absolute SHAP value of a feature, the greater its impact on the prediction. In our study, we were more interested in the contribution of each MRI contrast image rather than individual features. Thus, we aggregated the feature-wise SHAP values to the contrast level.
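The contrast-level aggregation can be sketched as below. The feature-wise SHAP values are assumed to have been computed beforehand, and the specific aggregation shown (summing absolute SHAP values over the features of each contrast and averaging over samples) is an assumption, since the text does not spell out the exact operation. The function and argument names are hypothetical.

```python
from collections import defaultdict


def contrast_level_importance(shap_values, feature_contrast):
    """Aggregate feature-wise SHAP values to MRI contrasts.

    `shap_values` is a list of per-sample lists (one SHAP value per feature);
    `feature_contrast` names the contrast each feature was extracted from.
    Returns the mean over samples of the summed absolute SHAP values per contrast.
    """
    n_samples = len(shap_values)
    totals = defaultdict(float)
    for row in shap_values:
        for value, contrast in zip(row, feature_contrast):
            totals[contrast] += abs(value) / n_samples
    return dict(totals)
```

Using absolute values prevents positive and negative contributions within the same contrast from cancelling, which matches the use of absolute SHAP values as a measure of impact.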
2.6 Competing methods
We compared the performance of WSO-SVM with existing algorithms in several categories (using the same CV process):
- Binary classifiers: SVM, random forest (RF).
- Semi-supervised learning algorithms: transductive SVM (TSVM) [34], Laplacian SVM (LapSVM) [35], co-training [36], semi-supervised RF (semi-RF) [37].
- Multi-class classifiers: SVM, RF.
- Ordinal classifiers: ordinal SVM, ordinal RF.
- Multi-task learning (MTL): regularized MTL (regMTL) [38], MTL Gaussian Process (MTL-GP) [39], MTL RF (MTL-RF) [40]. These are multi-class classification algorithms by coupling the models of the three GBM driver genes together.
3. Results
Tables 1–3 summarize the average CV performance and standard deviation over 30 repeated experiments for each gene. Fig 5 compares WSO-SVM against the competing algorithm with the best accuracy in each category. WSO-SVM achieved the highest accuracy, sensitivity, and specificity for EGFR and PTEN. For PDGFRA, WSO-SVM achieved the highest accuracy and sensitivity, while its specificity is second highest after MTL-RF. However, the sensitivity of MTL-RF is very low (only 0.5). Due to the heavy class imbalance for PDGFRA, most existing algorithms struggle to achieve a reasonable sensitivity, whereas WSO-SVM did not have this issue. Among all the competing algorithms, random forest types of methods performed better in most cases. Moreover, the standard deviation of WSO-SVM is among the smallest over all the methods being compared. The magnitude of the standard deviation is also small, indicating that the model performance is quite stable (i.e., less uncertainty).
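For concreteness, the three reported metrics can be computed from held-out biopsy predictions as sketched below. The sketch assumes that the altered class (class 2) is treated as the positive class and that a biopsy sample predicted as class 0 simply counts as an error; the function name is hypothetical.

```python
def biopsy_metrics(y_true, y_pred, altered=2, non_altered=1):
    """Accuracy, sensitivity, and specificity with class 2 (altered) as positive."""
    n_pos = sum(t == altered for t in y_true)
    n_neg = sum(t == non_altered for t in y_true)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present in y_true")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == altered and p == altered for t, p in pairs)
    tn = sum(t == non_altered and p == non_altered for t, p in pairs)
    return {
        "accuracy": (tp + tn) / (n_pos + n_neg),
        "sensitivity": tp / n_pos,
        "specificity": tn / n_neg,
    }
```

With a heavily imbalanced gene such as PDGFRA, a classifier can reach high accuracy and specificity while missing most altered samples, which is why sensitivity is reported alongside accuracy.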
Notes to Tables 1–3: The overall best competing algorithm is highlighted by **.
EGFR: WSO-SVM performed significantly better than the overall best competing algorithm in accuracy (p<0.001), sensitivity (p<0.001), and specificity (p = 0.002) using a Wilcoxon rank-sum test.
PDGFRA: WSO-SVM performed significantly better than the overall best competing algorithm in accuracy (p = 0.04) and sensitivity (p<0.001) using a Wilcoxon rank-sum test.
PTEN: WSO-SVM performed significantly better than the overall best competing algorithm in accuracy (p<0.001), sensitivity (p<0.001), and specificity (p<0.001) using a Wilcoxon rank-sum test.
To assess the statistical significance of the performance gain for WSO-SVM, we performed a one-sided Wilcoxon rank-sum test to compare WSO-SVM against the competing algorithm with the overall best accuracy. For EGFR, WSO-SVM significantly outperformed multi-class RF in accuracy, sensitivity, and specificity (p<0.001, p<0.001, p = 0.002). For PTEN, WSO-SVM significantly outperformed binary RF in accuracy, sensitivity, and specificity (p<0.001, p<0.001, p<0.001). For PDGFRA, WSO-SVM had significantly higher accuracy and sensitivity than MTL-RF (p = 0.04, p<0.001), but its specificity was not significantly higher.
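A one-sided Wilcoxon rank-sum comparison of two sets of repeated-CV scores can be sketched with the standard library alone. This is a simplified illustration, not the routine used in the study: it relies on the normal approximation, assigns average ranks to ties, and applies neither a tie correction to the variance nor a continuity correction. The function name is hypothetical.

```python
from statistics import NormalDist


def rank_sum_one_sided(x, y):
    """p-value for the alternative that values in x tend to exceed values in y."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based; ties share the mean rank
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(rank for rank, (_, group) in zip(ranks, pooled) if group == 0)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    return 1 - NormalDist().cdf((w - mean) / sd)
```

Here `x` would hold, for example, the 30 repetition-level accuracies of WSO-SVM and `y` those of the best competing algorithm; a small p-value supports the claim that WSO-SVM scores tend to be higher.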
Furthermore, Fig 6 shows the absolute SHAP values of the five MRI contrast images. It is evident that all contrast images contribute to the classification of each gene, but their relative contributions vary between genes. Further discussion will be provided in the next section.
Contributions of MRI contrast images to the classification of (a) EGFR, (b) PDGFRA, and (c) PTEN, by WSO-SVM.
Finally, the trained WSO-SVM models were used to generate prediction maps of the three genes for each patient. For demonstration, Fig 7 shows the prediction maps for four different patients. The alterations in EGFR and PDGFRA promote tumor growth. Thus, we showed their co-alteration patterns in one map. PTEN is a tumor suppressor gene, whose alteration is shown in a separate map. Patient A demonstrates predominant regions with EGFR alteration, with scattered regions of PDGFRA co-alteration; the PTEN map shows largely non-alteration. For patient B, the PTEN map shows an opposite pattern, whereas the EGFR & PDGFRA map demonstrates a pattern similar to that of patient A. In contrast to patients A and B, patient C demonstrates predominant regions with PDGFRA alteration. For patient D, the regions with EGFR & PDGFRA co-alteration are relatively concentrated compared to the other patients. These examples demonstrate the great extent of intra-tumoral genetic heterogeneity for each patient.
EGFR & PDGFRA prediction map (left column) and PTEN prediction map (right column) in tumoral AOI for four patients (rows). Yellow dots represent biopsy samples whose predicted gene statuses by WSO-SVM are reported underneath the maps (all predictions are correct).
4. Discussion
Our results demonstrated that WSO-SVM surpasses a variety of existing ML algorithms for predicting the regional status of three GBM driver genes using MRI. To interpret WSO-SVM, the SHAP values in Fig 6 revealed the importance of each contrast in influencing WSO-SVM’s prediction for each gene. Specifically, the model’s predictions on EGFR were primarily influenced by T2 and rCBV, which aligns with prior research that found significant correlations of EGFR with T2 [19, 41] and rCBV [11, 19, 42]. T1+C demonstrated the highest contribution to PDGFRA prediction. This is consistent with previous studies indicating that PDGFRA subpopulations tend to localize in CE with relatively less infiltration into NE, in comparison to EGFR [43]. For PTEN, the model’s prediction received the greatest contribution from rCBV. Prior studies have highlighted the correlation between PTEN and rCBV, particularly when co-existing with EGFR alterations [44].
The prediction maps in Fig 7 and in S1 Fig for other patients in our dataset provided strong evidence of the extensive intra-tumoral genetic heterogeneity in each patient. While intra-tumoral genetic heterogeneity in GBM is well-documented in literature, practical methods for quantifying this heterogeneity are lacking. Biopsy samples, which can only be obtained from a few locations of the brain, leave many regions uncharacterized. This study introduces WSO-SVM as a non-invasive approach to predict regional genetic status across the entire tumoral AOI for each patient using MRI.
The clinical utility of the prediction maps for GBM driver genes, EGFR, PDGFRA, and PTEN, is multi-fold. First, these driver genes have been investigated as therapeutic targets for GBM. EGFR is one of the most commonly altered gene drivers in GBM and has been implicated in several pathogenic mechanisms. Targeted drug therapies, including those directed at EGFR and other receptor tyrosine kinases (RTKs) like PDGFRA, have been developed [11, 12]. However, the clinical outcomes of current therapies are unsatisfactory for most patients due to the limited information obtained from sparse biopsy samples, which cannot fully capture the genetic landscape of each patient’s tumor. With the capability provided by WSO-SVM, there is an opportunity to optimize therapy selection for each patient and provide better prognostic information regarding their response to treatment. This holds great potential for improving patient outcomes and tailoring therapies to individual genetic characteristics.
Moreover, this study goes beyond individual gene predictions and allows for the simultaneous prediction of multiple GBM driver genes. Interactions between tumor subpopulations within GBM tumors are increasingly acknowledged for their impact on biological behavior, therapeutic response, and local phenotypic expression. Although such interactions have been extensively studied in non-CNS tumors, their exploration in GBM remains limited. Existing studies have primarily focused on the heterogeneous expression of receptor tyrosine kinase (RTK) aberrations, such as EGFR and PDGFRA amplifications. For instance, Inda et al. [45] showed that a minority subpopulation expressing EGFR-vIII could potentiate a majority subpopulation expressing wild-type EGFR to enhance growth, survival, and drug resistance. Szerlip et al. [46] observed cooperation between subpopulations expressing EGFR or PDGFRA amplifications, requiring combined inhibition for pathway attenuation in vitro. Fiorenzo et al. [47] suggest that in vivo and human studies are needed to fully understand the impact of subpopulation interactions on tumor growth. These interactions between subpopulations pose significant challenges for current treatment strategies and clinical trials that focus on single drug targets, such as EGFR [48]. By providing the capability to predict multiple GBM driver genes simultaneously, our study offers insights into these complex interactions and addresses the need for a more comprehensive understanding of tumor heterogeneity in GBM to develop future advanced therapies [18, 49].
This study has several limitations. First, the biopsy sample size is relatively small, owing to the highly invasive nature of acquiring these samples from patients’ brains. In the literature on integrating MRI and brain biopsy data for machine learning models, the typical sample size falls within the range of 82–244 [2, 5, 7, 11–13]. While our study included 318 biopsies, more than these existing studies, the sample remains modest compared to domains where sample collection is more accessible. To alleviate this problem, the WSO-SVM model was designed to incorporate unlabeled tumoral samples and normal brain samples. However, further research is needed to validate the generalizability of WSO-SVM on a larger and more diverse population. A related issue is that our performance evaluation was based on cross-validation (CV); validation on external datasets is still needed. There is currently no publicly available dataset of the same nature as ours, due to the invasive nature of biopsy acquisition and the time-consuming process of patient consent, surgical procedures, genetic analysis, and image preprocessing. Nevertheless, our team is currently collecting more data and preparing for subsequent validations of the model. This paper serves as a starting point in addressing the critical issue of non-invasive quantification of intra-tumoral genetic heterogeneity using MRI and a novel machine learning model, WSO-SVM.
Second, it is important to acknowledge that while our study establishes correlations between genetic alterations and imaging-phenotypic features, it does not establish causal relationships. Experimental validation of causal relationships, which may involve creating specific genetic alterations in animal models and observing their effects on imaging phenotypes, remains a critical step to confirm and gain a deeper understanding of the underlying cancer mechanisms.
Third, while we have discussed the potential of the method developed in this paper to aid therapeutic selection and the development of advanced therapies to improve patient outcomes, this paper focused on the research phase of method development. Clinical validation in real-world settings is necessary to establish the actual utility and benefit of the proposed method. Such validation could encompass clinical trials designed to compare patient outcomes, such as treatment response and survival, between cohorts undergoing standard clinical protocols for therapeutic selection and those benefitting from the additional guidance provided by the regional genetic prediction maps generated by our method.
Finally, several aspects of the WSO-SVM model can be improved. For instance, WSO-SVM can incorporate unlabeled tumoral samples and normal brain samples. Currently, these samples are selected based on the considerations described in Sec. 2.3. This selection method can be refined by integrating more advanced computational strategies that take uncertainty and diversity into account [50] and by considering patient demographic information [51]. In addition, WSO-SVM relies on texture features extracted from MRI as input, which may be influenced by imaging quality. Quantifying the uncertainty of WSO-SVM predictions in the presence of input uncertainty is therefore important, and a Bayesian version of the model could address this issue. Likewise, robust predictive models that are insensitive to input uncertainty would have greater clinical utility.
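As an illustration of what a selection strategy that accounts for both uncertainty and diversity might look like, the following is a minimal sketch and not the procedure used in this study (which follows Sec. 2.3). The function name `select_unlabeled`, the scoring rule, and the weight `alpha` are hypothetical choices made for this example. It assumes that each candidate unlabeled sample has a decision value from some margin-based classifier and a feature vector, and it greedily picks samples that lie close to the decision boundary while being far from samples already picked.

```python
import numpy as np

def select_unlabeled(scores, features, k, alpha=0.5):
    """Greedily pick k samples, trading off uncertainty against diversity.

    scores   : decision values from a margin classifier (hypothetical input)
    features : (n, d) array of feature vectors for the same samples
    alpha    : illustrative weight; 1.0 = uncertainty only, 0.0 = diversity only
    """
    scores = np.asarray(scores, dtype=float)
    features = np.asarray(features, dtype=float)
    n = len(scores)
    k = min(k, n)
    # Uncertainty: samples near the decision boundary (small |score|) rank highest.
    uncertainty = 1.0 / (1.0 + np.abs(scores))
    chosen = [int(np.argmax(uncertainty))]
    # Distance from every sample to its nearest already-chosen sample.
    min_dist = np.linalg.norm(features - features[chosen[0]], axis=1)
    while len(chosen) < k:
        scale = min_dist.max()
        diversity = min_dist / scale if scale > 0 else np.zeros(n)
        utility = alpha * uncertainty + (1.0 - alpha) * diversity
        utility[chosen] = -np.inf  # never pick the same sample twice
        nxt = int(np.argmax(utility))
        chosen.append(nxt)
        new_dist = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return chosen

# Toy data: samples 0 and 1 are near the boundary and near each other.
toy_scores = [0.05, 0.06, 2.0, 2.0]
toy_features = [[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [5.0, 0.0]]
picked = select_unlabeled(toy_scores, toy_features, k=2, alpha=0.5)
```

In the toy data, an uncertainty-only rule (`alpha=1.0`) would pick the two near-duplicate samples, whereas a balanced weight favors a second sample from a different region of feature space. How such a rule would interact with the ordinal constraints of WSO-SVM is not addressed by this sketch.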
5. Conclusion
We developed a data-inclusive WSO-SVM model to predict regional genetic alteration status within each GBM tumor using MRI. This study demonstrated the feasibility of using MRI and WSO-SVM to enable non-invasive prediction of regional genetic alteration for each patient, which can inform future adaptive therapies for individualized oncology.
Supporting information
S1 Fig.
Patient-wise proportions of alteration vs. non-alteration for (a) EGFR, (b) PDGFRA, and (c) PTEN within tumoral AOI, aggregated from the prediction maps of these genes by WSO-SVM.
https://doi.org/10.1371/journal.pone.0299267.s002
(TIF)
Acknowledgments
We are grateful to all who have contributed to elements of this work, particularly the many surgeons who collected the biopsies, with special thanks to Kris Smith, Peter Nakaji, Bernard Bendok, Devi Patra, Richard Zimmerman, and Klimet Donev. We also thank the glioma biopsy protocol teams for aiding in the logistics, ensuring the integrity of the biopsy samples and screenshots, and aiding with clinical data abstraction, including Barrett Anderies, Jessica Bauer, Spencer Bayless, Hend Bcharach, Regina Becker, Sameer Channer, Brenden Doyle, Lysette Elsner, Lily Esaleh, Ashlyn Gonzales, Crystal Harris, Morgan Hatlestead, Ryan Hess, Sandra Johnston, Yvette Lassiter-Morris, Julia Lorence, Ashley Napier, Ashley Nespodzany, Sejal Shanbhag, Sarah Van Dijk, and Scott Whitmire. Finally, we thank other past and current members of the image analysis team, with special mention of Cassandra Rickertsen and Lisa Paulson for their leadership.
References
- 1. Stupp R, Mason WP, van den Bent MJ, Weller M, Fisher B, Taphoorn MJB, et al. Radiotherapy plus concomitant and adjuvant temozolomide for glioblastoma. New England journal of medicine. 2005;352: 987–996. pmid:15758009
- 2. Hu LS, Ning S, Eschbacher JM, Gaw N, Dueck AC, Smith KA, et al. Multi-parametric MRI and texture analysis to visualize spatial histologic heterogeneity and tumor extent in glioblastoma. PLoS One. 2015;10: e0141506. pmid:26599106
- 3. Hu LS, Hawkins-Daarud A, Wang L, Li J, Swanson KR. Imaging of intratumoral heterogeneity in high-grade glioma. Cancer Lett. 2020;477: 97–106. pmid:32112907
- 4. Swanson KR, Bridge C, Murray JD, Alvord EC Jr. Virtual and real brain tumors: using mathematical modeling to quantify glioma growth and invasion. J Neurol Sci. 2003;216: 1–10. pmid:14607296
- 5. Baldock AL, Ahn S, Rockne R, Johnston S, Neal M, Corwin D, et al. Patient-specific metrics of invasiveness reveal significant prognostic benefit of resection in a predictable subset of gliomas. PLoS One. 2014;9: e99057. pmid:25350742
- 6. Marusyk A, Almendro V, Polyak K. Intra-tumour heterogeneity: a looking glass for cancer? Nat Rev Cancer. 2012;12: 323–334. pmid:22513401
- 7. Hu LS, Wang L, Hawkins-Daarud A, Eschbacher JM, Singleton KW, Jackson PR, et al. Uncertainty quantification in the radiogenomics modeling of EGFR amplification in glioblastoma. Sci Rep. 2021;11: 1–14.
- 8. Kha Q-H, Le V-H, Hung TNK, Le NQK. Development and Validation of an Efficient MRI Radiomics Signature for Improving the Predictive Performance of 1p/19q Co-Deletion in Lower-Grade Gliomas. Cancers (Basel). 2021;13. pmid:34771562
- 9. Lam LHT, Do DT, Diep DTN, Nguyet DLN, Truong QD, Tri TT, et al. Molecular subtype classification of low-grade gliomas using magnetic resonance imaging-based radiomics and machine learning. NMR Biomed. 2022;35: e4792. pmid:35767281
- 10. Akbari H, Bakas S, Pisapia JM, Nasrallah MP, Rozycki M, Martinez-Lage M, et al. In vivo evaluation of EGFRvIII mutation in primary glioblastoma patients via complex multiparametric MRI signature. Neuro Oncol. 2018;20: 1068–1079. pmid:29617843
- 11. Tykocinski ES, Grant RA, Kapoor GS, Krejza J, Bohman L-E, Gocke TA, et al. Use of magnetic perfusion-weighted imaging to determine epidermal growth factor receptor variant III expression in glioblastoma. Neuro Oncol. 2012;14: 613–623. pmid:22492960
- 12. Kickingereder P, Bonekamp D, Nowosielski M, Kratz A, Sill M, Burth S, et al. Radiogenomics of glioblastoma: machine learning–based classification of molecular characteristics by using multiparametric and multiregional MR imaging features. Radiology. 2016;281: 907–918. pmid:27636026
- 13. Chen H, Lin F, Zhang J, Lv X, Zhou J, Li Z-C, et al. Deep Learning Radiomics to Predict PTEN Mutation Status From Magnetic Resonance Imaging in Patients With Glioma. Front Oncol. 2021;11. pmid:34671557
- 14. Zhu X, Goldberg AB. Introduction to Semi-Supervised Learning. Springer International Publishing; 2009. https://doi.org/10.1007/978-3-031-01548-9
- 15. Chu W, Keerthi SS. Support Vector Ordinal Regression. Neural Comput. 2007;19: 792–815. Available: http://direct.mit.edu/neco/article-pdf/19/3/792/816834/neco.2007.19.3.792.pdf. pmid:17298234
- 16. Chu W, Ghahramani Z. Gaussian Processes for Ordinal Regression. Journal of Machine Learning Research. 2005.
- 17. Shashua A, Levin A. Ranking with Large Margin Principle: Two Approaches.
- 18. Gatenby RA, Silva AS, Gillies RJ, Frieden BR. Adaptive therapy. Cancer Res. 2009;69: 4894–4903. pmid:19487300
- 19. Hu LS, Ning S, Eschbacher JM, Baxter LC, Gaw N, Ranjbar S, et al. Radiogenomics to characterize regional genetic heterogeneity in glioblastoma. Neuro Oncol. 2017;19: 128–137. pmid:27502248
- 20. Koul D. PTEN signaling pathways in glioblastoma. Cancer Biol Ther. 2008;7: 1321–1325. pmid:18836294
- 21. Gaw N, Hawkins-Daarud A, Hu LS, Yoon H, Wang L, Xu Y, et al. Integration of machine learning and mechanistic models accurately predicts variation in cell density of glioblastoma using multiparametric MRI. Sci Rep. 2019;9: 10063. pmid:31296889
- 22. Haralick RM, Shanmugam K, Dinstein IH. Textural features for image classification. IEEE Trans Syst Man Cybern. 1973; 610–621.
- 23. Feichtinger HG, Strohmer T. Gabor analysis and algorithms: Theory and applications. Springer Science & Business Media; 2012.
- 24. Lewis MA, Ganeshan B, Barnes A, Bisdas S, Jaunmuktane Z, Brandner S, et al. Filtration-histogram based magnetic resonance texture analysis (MRTA) for glioma IDH and 1p19q genotyping. Eur J Radiol. 2019;113: 116–123. pmid:30927935
- 25. Vamvakas A, Williams SC, Theodorou K, Kapsalaki E, Fountas K, Kappas C, et al. Imaging biomarker analysis of advanced multiparametric MRI for glioma grading. Phys Med. 2019;60: 188–198. pmid:30910431
- 26. Ryu YJ, Choi SH, Park SJ, Yun TJ, Kim J-H, Sohn C-H. Glioma: application of whole-tumor texture analysis of diffusion-weighted imaging for the evaluation of tumor heterogeneity. PLoS One. 2014;9: e108335. pmid:25268588
- 27. Alis D, Bagcilar O, Senli YD, Isler C, Yergin M, Kocer N, et al. The diagnostic value of quantitative texture analysis of conventional MRI sequences using artificial neural networks in grading gliomas. Clin Radiol. 2020;75: 351–357. pmid:31973941
- 28. Eidel O, Burth S, Neumann J-O, Kieslich PJ, Sahm F, Jungk C, et al. Tumor Infiltration in Enhancing and Non-Enhancing Parts of Glioblastoma: A Correlation with Histopathology. PLoS One. 2017;12: e0169292. pmid:28103256
- 29. Chapelle O. Training a support vector machine in the primal. Neural Comput. 2007;19: 1155–78. pmid:17381263
- 30. Platt J. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. 1998 Apr. Available: https://www.microsoft.com/en-us/research/publication/sequential-minimal-optimization-a-fast-algorithm-for-training-support-vector-machines/.
- 31. Bishop CM, Nasrabadi NM. Pattern recognition and machine learning. Springer; 2006.
- 32. Ahn SJ, Kwon H, Yang JJ, Park M, Cha YJ, Suh SH, et al. Contrast-enhanced T1-weighted image radiomics of brain metastases may predict EGFR mutation status in primary lung cancer. Sci Rep. 2020;10. pmid:32483122
- 33. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Proceedings of the 31st international conference on neural information processing systems. 2017. pp. 4768–4777.
- 34. Collobert R, Sinz F, Weston J, Bottou L, Joachims T. Large scale transductive SVMs. Journal of Machine Learning Research. 2006;7.
- 35. Belkin M, Niyogi P, Sindhwani V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of machine learning research. 2006;7.
- 36. Zhou Y, Goldman S. Democratic co-learning. 16th IEEE International Conference on Tools with Artificial Intelligence. IEEE; 2004. pp. 594–602.
- 37. Leistner C, Saffari A, Santner J, Bischof H. Semi-Supervised Random Forests. IEEE 12th international conference on computer vision. 2009. pp. 506–513.
- 38. Cao H, Zhou J, Schwarz E. RMTL: an R library for multi-task learning. Bioinformatics. 2019;35: 1797–1798. pmid:30256897
- 39. Williams C, Bonilla E v, Chai KM. Multi-task Gaussian process prediction. Adv Neural Inf Process Syst. 2007; 153–160.
- 40. Linusson H. Multi-output Random Forests. 2013.
- 41. Aghi M, Gaviani P, Henson JW, Batchelor TT, Louis DN, Barker FG. Magnetic resonance imaging characteristics predict epidermal growth factor receptor amplification status in glioblastoma. Clinical Cancer Research. 2005;11: 8600–8605. pmid:16361543
- 42. Gupta A, Young RJ, Shah AD, Schweitzer AD, Graber JJ, Shi W, et al. Pretreatment dynamic susceptibility contrast MRI perfusion in glioblastoma: prediction of EGFR gene amplification. Clin Neuroradiol. 2015;25: 143–150. pmid:24474262
- 43. Snuderl M, Fazlollahi L, Le LP, Nitta M, Zhelyazkova BH, Davidson CJ, et al. Mosaic amplification of multiple receptor tyrosine kinase genes in glioblastoma. Cancer Cell. 2011;20: 810–817. pmid:22137795
- 44. Ryoo I, Choi SH, Kim J-H, Sohn C-H, Kim SC, Shin HS, et al. Cerebral blood volume calculated by dynamic susceptibility contrast-enhanced perfusion MR imaging: preliminary correlation study with glioblastoma genetic profiles. PLoS One. 2013;8: e71704. pmid:23977117
- 45. Inda M-M, Bonavia R, Mukasa A, Narita Y, Sah DWY, Vandenberg S, et al. Tumor heterogeneity is an active process maintained by a mutant EGFR-induced cytokine circuit in glioblastoma. Genes Dev. 2010;24: 1731–1745. pmid:20713517
- 46. Szerlip NJ, Pedraza A, Chakravarty D, Azim M, McGuire J, Fang Y, et al. Intratumoral heterogeneity of receptor tyrosine kinases EGFR and PDGFRA amplification in glioblastoma defines subpopulations with distinct growth factor response. Proc Natl Acad Sci U S A. 2012;109: 3041–6. pmid:22323597
- 47. Fiorenzo P, Mongiardi MP, Dimitri D, Cozzolino M, Ferri A, Montano N, et al. HIF1-positive and HIF1-negative glioblastoma cells compete in vitro but cooperate in tumor growth in vivo. Int J Oncol. 2010;36: 785–791. pmid:20198320
- 48. Hegi ME, Rajakannu P, Weller M. Epidermal growth factor receptor: a re-emerging target in glioblastoma. Curr Opin Neurol. 2012;25: 774–779. pmid:23007009
- 49. Bonavia R, Inda M-M, Cavenee WK, Furnari FB. Heterogeneity Maintenance in Glioblastoma: A Social Network. Cancer Res. 2011;71: 4055–4060. pmid:21628493
- 50. Etikan I. Sampling and Sampling Methods. Biom Biostat Int J. 2017;5.
- 51. Elfil M, Negida A. Sampling methods in Clinical Research; an Educational Review. Emerg (Tehran). 2017;5: e52. pmid:28286859