Abstract
Objective
Gliomas are among the most common and heterogeneous primary tumours of the central nervous system. Accurate grading is essential for treatment planning and prognosis, yet conventional histopathological approaches are limited by subjectivity and poor reproducibility. This study aimed to develop a machine learning–based prediction model that integrates clinical and molecular characteristics to improve early glioma grading, thereby enhancing diagnostic accuracy and supporting individualized treatment strategies.
Methods
An efficient prediction model for low-grade gliomas (LGGs) and glioblastoma (grade IV, GBM) was developed by utilizing the clinical and molecular characteristics of gliomas from The Cancer Genome Atlas (TCGA) dataset. A novel integration of recursive feature elimination (RFE) with random forest (RF) and elastic net regression (ENR) was implemented to select features efficiently. Additionally, the synthetic minority oversampling technique (SMOTE) was applied to balance the training set, and K-nearest neighbours (KNN), support vector machine (SVM), and other algorithms were optimized through random-search hyper-parameter optimization (HPO) with five-fold cross-validation, yielding nine distinct machine learning (ML) models. Ultimately, by applying the voting and stacking algorithms, 34 ensemble learning models were constructed. Furthermore, all the models were externally validated using the Chinese Glioma Genome Atlas (CGGA) dataset. Finally, SHapley Additive exPlanations (SHAP) analysis was conducted to elucidate the prediction processes of the ensemble models.
Results
Feature selection revealed 11 key grading features, including Tumour Protein 53 (TP53) and Isocitrate Dehydrogenase 1 (IDH1). Among the nine base models constructed with optimization techniques such as SMOTE, the RF model performed best, with an area under the curve (AUC) of 0.916 for TCGA and 0.797 for CGGA. Among the 34 ensemble models constructed, the Voting25 model, which integrates RF, Extreme Gradient Boosting (XGBoost), and KNN, achieved AUC values of 0.928 and 0.794 on the TCGA and CGGA datasets, respectively, demonstrating overall optimal predictive performance.
Conclusion
Eleven key features have been identified that facilitate molecular detection and personalized targeted therapy for glioma. Nine models were developed and optimized, and the RF model was observed to provide the best performance, potentially guiding future ML-related research in glioma. Additionally, the voting ensemble method, which integrates RF, XGBoost, and KNN, was shown to achieve superior performance, thereby enhancing both accuracy and robustness. Finally, all the models were successfully validated on the CGGA dataset, indicating strong generalizability.
Citation: Liu S, Xie Y, Gong X, He J, Zou W (2025) Machine learning-based prediction of glioma grading. PLoS One 20(12): e0314831. https://doi.org/10.1371/journal.pone.0314831
Editor: Zhanzhan Li, Xiangya Hospital Central South University, CHINA
Received: November 17, 2024; Accepted: November 27, 2025; Published: December 26, 2025
Copyright: © 2025 Liu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Data analyzed in this study are obtained from TCGA (https://www.cancer.gov/tcga) and CGGA (http://www.cgga.org.cn/).
Funding: This study is supported by National Nature Science Foundation of China (Grant No.82260666). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Abbreviations: LGG, Low-grade glioma; HGG, High-grade gliomas; WHO, World Health Organization; GBM, Glioblastoma; ML, Machine Learning; TCGA, The Cancer Genome Atlas; RFE, Recursive Feature Elimination; RF, Random Forest; ENR, Elastic Net Regression; SMOTE, Synthetic Minority Over-sampling Technique; KNN, K-Nearest Neighbors; SVM, Support Vector Machine; HPO, Hyper-Parameter Optimization; CGGA, Chinese Glioma Genome Atlas; SHAP, SHapley Additive exPlanations; XGBoost, eXtreme Gradient Boosting; EL-APMC, Adaptive Power Mean Combiner; DTI, Diffusion tensor imaging; MLP, Multi-Layer Perceptron; DT, Decision Tree; GBDT, Gradient Boosting Decision Tree; LightGBM, Light Gradient Boosting Machine; LR, Logistic Regression; ROC, Receiver Operating Characteristic; AUC, Area Under the Curve; DCA, Decision Curve Analysis.
Introduction
Glioma is a highly aggressive tumour associated with an extremely poor prognosis. In recent years, the global incidence of glioma has increased, and the affected population has become progressively younger, thereby posing a health challenge across multiple age groups [1–3]. In accordance with the World Health Organization (WHO) criteria, gliomas are classified as LGG or high-grade gliomas (HGGs). Among HGGs, GBM occurs in approximately 4.03 cases per 100 000 individuals and accounts for more than half of malignant central nervous system tumours [4,5]. Despite comprehensive treatment options—including surgical resection, radiotherapy, and chemotherapy—the five-year recurrence rate remains as high as 90%, and the median survival is approximately 15 months [6,7]. Tumour grading plays an essential role in guiding treatment plans and monitoring disease progression [8]. For example, patients with LGG generally receive conservative therapy or local resection, whereas those with HGG often undergo surgery followed by adjuvant radiotherapy and chemotherapy [9]. Therefore, accurate and robust early grading of glioma is paramount for predictive diagnosis. Nevertheless, traditional grading relies primarily on histopathological observation, and its inherent subjectivity and limited reproducibility prevent it from meeting the demands of precision medicine [10]. In recent years, genetic testing has been increasingly recognized as pivotal in the diagnosis of glioma. In 2016, the WHO recommended incorporating molecular phenotypes into the diagnostic framework for central nervous system tumours [5]. Studies have demonstrated that detecting tumour-derived DNA in cerebrospinal fluid can be used to effectively monitor the progression of certain gliomas, and this liquid biopsy approach provides a minimally invasive method for disease grading [11]. Concurrently, significant advancements have been made in glioma research utilizing the TCGA and CGGA datasets [12–14]. 
These datasets provide extensive clinical information and molecular mutational signatures pertaining to glioma. Consequently, characterizing the clinical and molecular attributes of glioma is essential for early-grade predictive diagnosis. ML technologies offer a novel avenue for constructing glioma grading models [15]. For instance, Su et al. [16] thoroughly analysed glioma characteristics by integrating clinical and molecular mutation data and applying univariate, multivariate, and ML methods. Building on this work, Guo et al. [17] combined multimodal medical imaging data with diverse ML algorithms to develop a high-precision glioma prediction model.
In recent years, the rise of big data has markedly expanded the number of variables analysed, increasing typical feature counts from dozens to hundreds or more [18,19]. This growth presents both challenges and opportunities, underscoring the need for feature selection. A larger feature set can theoretically provide richer information, potentially improving the predictive accuracy and generalizability of a model. For example, in image recognition, additional features capture finer image details and characteristics, thereby increasing recognition accuracy [20,21]. Nevertheless, excessive features also increase computational complexity and increase the risk of model overfitting [22]. Consequently, feature selection is a critical preprocessing step that retains informative variables relevant to the prediction task and discards redundant or irrelevant variables, thereby enhancing model performance and interpretability [23–27]. Data imbalance further hinders performance improvement [28,29], particularly by decreasing the predictive accuracy for minority-class samples [30,31]. To mitigate this issue, researchers have continually explored innovative techniques. Among these methods, SMOTE—proposed by Chawla et al. [32]—is the most widely used oversampling algorithm. SMOTE balances the class distribution in the training set by synthesizing new minority class samples rather than duplicating existing samples. This strategy not only reduces the overfitting associated with traditional oversampling but also increases sample diversity, thereby improving the ability of a model to identify minority instances [32]. Notably, SMOTE should be applied exclusively to the training data to ensure fair and accurate evaluation on the test set [32].
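The interpolation step at the heart of SMOTE can be sketched as follows. This is a simplified NumPy illustration of the idea (a production pipeline would typically use a library implementation such as imbalanced-learn); the function name and parameters are ours, not from the study:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Synthesize n_new minority-class samples by interpolating between
    randomly chosen seed points and one of their k nearest minority-class
    neighbours, as in the SMOTE algorithm of Chawla et al."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances within the minority class only
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest neighbours per point
    seeds = rng.integers(0, n, size=n_new)      # random seed points
    neigh = nn[seeds, rng.integers(0, min(k, n - 1), size=n_new)]
    gap = rng.random((n_new, 1))                # interpolation factor in [0, 1)
    # each synthetic sample lies on the segment between a seed and a neighbour
    return X_min[seeds] + gap * (X_min[neigh] - X_min[seeds])
```

Because every synthetic point lies on a segment between two existing minority samples, the method adds diversity without duplicating observations; applying it only to the training fold keeps the test set untouched, as the text emphasizes.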
Ensemble learning (EL) is an ML approach that aggregates the predictions of multiple models, using supervised or unsupervised strategies, to produce improved outputs [33]. In their pioneering research, Hansen and Salamon [34] demonstrated that EL markedly improves model generalization by evaluating a series of neural network ensembles. In recent years, numerous studies have applied EL techniques to predictive modelling [35–37]. For instance, Hassan et al. [35] proposed EL-APMC, an EL algorithm that leverages novel magnetic resonance imaging (MRI) features for glioma grading. Vidyadharan et al. [36] combined diffusion tensor imaging (DTI)—which quantifies water diffusion in white-matter tissue—with several ML algorithms to classify low- and high-grade gliomas. Additionally, Joshi et al. [37] introduced a two-stage EL framework for glioma detection and grading. Their study evaluated five biomarkers—human telomerase reverse transcriptase (hTERT), chitinase-like protein (YKL-40), interleukin-6 (IL-6), tissue inhibitor of metalloproteinases-1 (TIMP-1), and the neutrophil-to-lymphocyte ratio (NLR)—and employed multiple EL classifiers and fusion strategies to construct a computer-aided diagnostic system.
Despite considerable advances in glioma grading research, several critical gaps remain. Most previous studies have relied primarily on radiomic features from imaging data and overlooked clinical characteristics (e.g., age, gender, and symptoms) and molecular markers (e.g., IDH and EGFR). Such omission may limit the generalizability and robustness of prediction models across diverse patient populations, as imaging features alone may not fully capture patient heterogeneity, molecular variations, or cohort-specific differences in tumour biology [38]. In addition, the systematic integration of advanced feature selection, class balancing, and ensemble learning approaches has not been fully explored [39], and external validation on independent datasets has been relatively limited [38]. These limitations underscore the need for an integrative approach that leverages both clinical and molecular features to improve glioma grading accuracy. We hypothesize that a machine learning–based model incorporating clinical and molecular features, combined with optimized feature selection, class balancing, and ensemble learning, can achieve higher predictive performance and better generalizability than models relying on a single data type or conventional methods can achieve.
To address these gaps, this study aimed to develop a machine learning–based glioma grading prediction framework that integrates clinical and molecular features to improve early grading. The proposed framework is designed to increase diagnostic accuracy and facilitate individualized treatment strategies.
A high-performance glioma grading prediction model was developed using clinical and molecular data from the TCGA and was externally validated with the CGGA. The principal contributions of this study are summarized as follows:
- Recursive feature elimination (RFE) was integrated with random forest (RF) and elastic net regression (ENR) to select features efficiently, reduce redundancy and highlight variables with superior predictive value.
- SMOTE was applied to balance the training set by synthesizing additional minority class samples instead of duplicating existing observations.
- Random-search hyper-parameter optimization (HPO) was performed with five-fold cross-validation, improving parameter tuning and performance estimation.
- Voting and stacking ensemble strategies were employed to aggregate multiple base learners and decrease the error associated with any single model.
- External validation was performed on the CGGA to confirm the generalizability of the model across independent datasets and patient populations.
The remainder of this paper is structured as follows: The Materials and Methods section describes the data sources, feature selection procedures, and model construction processes, including hyperparameter optimization, SMOTE balancing, and ensemble strategies. The Results section presents the outcomes of feature selection and the performance of both the base models and the ensemble methods, together with the results of the calibration, decision curve, and SHAP analyses. The Discussion section interprets these findings in the context of the literature, highlights their biological and clinical implications, and addresses the limitations of this study. Finally, the Conclusion section summarizes the main contributions and potential applications of this study.
Materials and methods
Data collection and overview
The publicly available TCGA glioma dataset was used for model training and testing, whereas the CGGA dataset served as an external validation set. Independence of the CGGA validation was ensured by isolating the data, sourcing independent cohorts, and applying unified clinical inclusion criteria. Both datasets excluded low-quality cases and entries with missing key information; their cohorts were nonoverlapping across population, region, and temporal dimensions. The TCGA dataset contains 839 instances with three clinical variables and 20 high-frequency molecular mutation variables, with no missing or duplicate entries. The CGGA dataset includes 195 instances but lacks the clinical variable “race.” The remaining variables match across datasets, and no additional missing or duplicate entries are observed. Although race is not explicitly recorded in the CGGA, all the cases originated from a Chinese cohort. Consequently, the “race” variable in the CGGA was set to Asian for every case, and its feature dimension was aligned with that of the TCGA. The qualitative variables included two clinical factors—gender and race—and 20 molecular mutation markers (e.g., IDH1 and TP53). The sole quantitative variable was age, and mutation status was categorized as wild-type or mutant. The outcome label comprised two classes: LGG and GBM. An overview of the data is presented in Table 1.
This study relies on data extracted from two publicly accessible repositories: The Cancer Genome Atlas (TCGA) and the Chinese Glioma Genome Atlas (CGGA). Both repositories provide fully anonymized data that comply with established ethical guidelines for data sharing. Prior to data extraction and analysis, our research team thoroughly reviewed and formally agreed on the data usage policies of both repositories, ensuring full compliance with all the terms governing data access, processing, and reporting. The datasets are available at https://www.cancer.gov/tcga (TCGA) and https://www.cgga.org.cn (CGGA). TCGA Ethics and Policies provides relevant ethical explanations regarding the data in the TCGA database (https://www.cancer.gov/ccg/research/genome-sequencing/tcga/history/ethics-policies). The CGGA database also complied with relevant ethical regulations during its establishment [13]. The detailed, preprocessed clinical and mutation data for the TCGA and CGGA cohorts used in this analysis are provided in S1–S4 Tables.
Feature selection
Description of the feature selection model.
RFE was applied in combination with RF and ENR to filter features.
RF involves the construction of multiple decision trees using the bagging technique. For each node in the trees, a subset of features is randomly selected to compute and accumulate the reduction in Gini impurity. The average reduction in Gini impurity for each feature across all the trees is then assigned as its importance score. The formula of the RF model can generally be expressed as follows:
$$\mathrm{Imp}(x_i) = \frac{1}{N}\sum_{t=1}^{N}\mathrm{Imp}_t(x_i)$$

In the formula, $\mathrm{Imp}(x_i)$ denotes the Gini-based feature importance of the $i$-th feature in the random forest model; $N$ represents the total number of decision trees in the random forest; and $\mathrm{Imp}_t(x_i)$ represents the Gini-based importance of the $i$-th feature within the $t$-th decision tree, which quantifies the reduction in Gini impurity achieved by the $i$-th feature during node splitting in that specific tree.
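In practice, this tree-averaged Gini importance is what scikit-learn exposes as `feature_importances_`. A minimal sketch on synthetic stand-in data (the toy dataset and hyperparameters below are illustrative, not those of the study):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# toy stand-in for the glioma feature matrix (illustrative only)
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# feature_importances_ is the tree-averaged Gini importance,
# normalized so that the scores sum to 1
importances = rf.feature_importances_
ranking = importances.argsort()[::-1]  # most important feature first
```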
ENR incorporates both L1 and L2 regularization techniques. L1 regularization introduces sparsity, allowing for the compression of feature regression coefficients and facilitating feature selection; features with zero coefficients are deemed unimportant. Conversely, L2 regularization enhances the ability of the model to manage feature collinearity by imposing a penalty on the sum of the squares of the regression coefficients, thereby decreasing their overall magnitude. The formula of the ENR model can generally be expressed as follows:
$$\min_{\beta}\ \sum_{j=1}^{n}\left(y_j - \mathbf{x}_j^{\top}\beta\right)^2 + \lambda_1\lVert\beta\rVert_1 + \lambda_2\lVert\beta\rVert_2^2$$

In the formula, $y_j$ represents the target value; $\mathbf{x}_j$ denotes the feature vector; $\beta$ represents the regression coefficients; $\lambda_1\lVert\beta\rVert_1$ is the L1 regularization term (lasso); $\lambda_2\lVert\beta\rVert_2^2$ is the L2 regularization term (ridge); and $\lambda_1$ and $\lambda_2$ are regularization parameters.
RFE is a model-based feature selection method. First, a machine learning model is specified or trained. Iterative training is subsequently performed on this designated model: in each training iteration, several of the least important features are eliminated, and the model is then retrained on the basis of the remaining feature subset. This process is repeated iteratively until a preset number of features or other stopping criteria are met. This method can gradually filter out noise among features, remove redundant features, reduce the feature dimension step by step, and retain the features that contribute the most to the model’s prediction performance.
Development of the feature selection model.
In this study, RF and ENR models were initially constructed for feature selection. The number of features with nonzero regression coefficients in the ENR is used as a benchmark, defined as the number of retained features for the RFE. The two models are subsequently retrained and subjected to feature selection again using RFE. Ultimately, the intersection of the feature subsets identified by both models is selected as the final feature set. The technical roadmap is illustrated in Fig 1.
This flowchart illustrates the process from data acquisition and preprocessing, through feature selection (RFE + RF/ENR), model construction and optimization, to ensemble model integration using Voting and Stacking strategies. External validation was conducted using CGGA to assess model generalizability.
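The two-model RFE procedure described above can be sketched with scikit-learn. Here an elastic-net-penalized logistic regression stands in for ENR (the outcome is binary), and the synthetic data and hyperparameters are illustrative assumptions rather than the study's actual settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=42)

# Elastic-net-penalized logistic regression as an ENR stand-in; the number
# of non-zero coefficients sets how many features RFE retains.
enr = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, C=0.1, max_iter=5000)
n_keep = max(1, int((abs(enr.fit(X, y).coef_) > 1e-6).sum()))

# Run RFE separately with each model, retaining n_keep features
rfe_rf = RFE(RandomForestClassifier(random_state=42),
             n_features_to_select=n_keep).fit(X, y)
rfe_enr = RFE(enr, n_features_to_select=n_keep).fit(X, y)

# Final feature set: the intersection of the two RFE subsets
selected = set(rfe_rf.get_support(indices=True)) & \
           set(rfe_enr.get_support(indices=True))
```

Taking the intersection keeps only features that both a tree-based and a regularized linear view of the data agree are informative, mirroring the 11-feature set reported later.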
Model construction
Predictive model description.
The multilayer perceptron (MLP) is a feedforward neural network model that consists of an input layer, several hidden layers, and an output layer. Each neuron processes the input data by applying a weighted sum followed by nonlinear activation functions, and updates to the model parameters are performed using a backpropagation algorithm based on gradient descent.
A support vector machine (SVM) projects data into a high-dimensional space through a kernel function and identifies a hyperplane that maximizes the margin between data points of different classes, allowing it to handle both linear and nonlinear classification problems.
The decision tree (DT) model employs a tree structure to partition the dataset by selecting attributes and thresholds that maximize information gain (in ID3), the gain ratio (in C4.5), or the Gini index (in CART). New nodes are generated after each division, and this process continues until the predefined stopping conditions are met.
Gradient boosting trees (GBDT) employ the gradient boosting algorithm to iteratively train multiple decision tree models, with each tree trained to correct the prediction residuals (errors) from the previous trees. The final prediction is obtained by aggregating the predictions from all the trees.
The random forest (RF) method involves the construction of multiple decision trees using the bagging algorithm, and in classification tasks, the final prediction is determined by majority voting on the predictions from each decision tree.
eXtreme Gradient Boosting (XGBoost) employs an optimized gradient boosting algorithm, utilizing an efficient weak learner iteration strategy and regularization techniques, which improve both speed and overall performance.
The Light Gradient Boosting Machine (LightGBM) algorithm is also an optimization algorithm based on the gradient boosting framework and employs a histogram-based decision tree algorithm. Through parallelization strategies and sparse optimization techniques, it significantly improves training speed and model performance.
Logistic regression (LR) is a well-established linear classification model that maps the output of a linear model to a probability value between 0 and 1 using the sigmoid function.
The K-nearest neighbours (KNN) algorithm computes the distance between the input sample and the training samples in the feature space, selects the nearest K samples, and performs classification or regression tasks on the basis of the categories of these K samples.
The soft voting ensemble method derives the predicted probabilities for each classification result from multiple classifiers, performs a weighted or averaging operation, and ultimately selects the classification result with the highest corresponding probability as the final prediction.
The stacking ensemble method employs the prediction results of multiple base learners as input features for training a meta-learner. Both ensemble strategies increase the prediction accuracy and robustness of the model while mitigating the risk of overfitting.
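Both ensemble strategies are available directly in scikit-learn; a minimal sketch using three of the study's base-learner types on synthetic stand-in data (dataset and settings are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=11, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42)

base = [("rf", RandomForestClassifier(random_state=42)),
        ("knn", KNeighborsClassifier()),
        ("lr", LogisticRegression(max_iter=1000))]

# Soft voting: average the predicted class probabilities
# (equal 1:1:1 weights, as in the study's voting ensembles)
voter = VotingClassifier(base, voting="soft").fit(X_tr, y_tr)

# Stacking: base-learner predictions become input features
# for a logistic-regression meta-learner
stacker = StackingClassifier(base,
                             final_estimator=LogisticRegression()
                             ).fit(X_tr, y_tr)
```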
Predictive model development.
In this study, nine classic prediction models, namely, the MLP, SVM, DT, GBDT, RF, XGBoost, LightGBM, LR, and KNN models, were constructed. On the basis of these nine models, 34 ensemble models were further developed using the voting and stacking ensemble methods. During the model construction process, with the random state set to 42, the dataset was split into a training set and a test set at an 8:2 ratio. The SMOTE algorithm was subsequently independently applied to the training set for oversampling to ensure the independence of the test dataset and the external validation dataset.
The random-search HPO method was employed to tune the parameters of nine base models, including the MLP and SVM, with iterative searches conducted within the specified parameter ranges. The parameter search ranges and configuration settings established prior to model training are provided in S7 Table, while the finalized parameter information of the optimized models is presented in S8 Table.
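Random-search HPO with five-fold cross-validation can be sketched as follows; the search space shown is illustrative only (the study's actual ranges are given in S7 Table):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=11, random_state=42)

# Illustrative search space, not the study's configuration
param_dist = {"n_estimators": randint(100, 500),
              "max_depth": randint(2, 12),
              "max_features": ["sqrt", "log2"]}

# Random search samples n_iter configurations; each is scored
# by five-fold cross-validated AUC
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_dist, n_iter=10, cv=5,
                            scoring="roc_auc", random_state=42)
search.fit(X, y)
best_params, best_auc = search.best_params_, search.best_score_
```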
When the 34 ensemble models in this study were constructed, factors such as differences in the algorithmic principles, performance, and combination effects of the models were considered. For heterogeneous ensemble models, when all base models exhibit relatively excellent performance, diversity is crucial for improving the performance of the ensemble model [40]. Therefore, after comprehensively evaluating the performance of the base models, this study first categorized the base models into two groups on the basis of whether they adopted homogeneous ensemble algorithms (e.g., bagging and boosting algorithms). Models with similar algorithms within each group were subsequently further screened. Finally, when models were selected for combination, models from both groups were included simultaneously. This approach strengthens the diversity and differences among the algorithms of the combined models, integrates the advantages of different algorithms, and thereby enhances the overall robustness and generalization ability of the ensemble models. Specifically, the models were divided into two groups: (i) Group A: MLP, SVM, DT, LR, and KNN; and (ii) Group B: GBDT, RF, XGBoost, and LightGBM. Among these, GBDT, XGBoost, and LightGBM are all built on the basis of gradient boosting algorithms and share similar algorithmic frameworks. On the basis of the comprehensive performance of these three models, the one with the most balanced performance was retained in this study. For both the voting and stacking ensemble strategies, we combined three models each, aiming to improve model performance while conserving computational resources.
For the voting ensemble strategy, models were selected from Group A and Group B simultaneously to construct 25 combination schemes. The specific strategies are as follows: (i) one model was selected from Group A, and two models (RF and the top-performing boosting algorithm model) were selected from Group B, resulting in the formation of 5 combination schemes; (ii) two models were selected from Group A, and one model (either RF or the top-performing boosting algorithm model, exclusively) was selected from Group B, resulting in 20 combination schemes. Additionally, the voting ensemble models in this study adopted an equal weighting strategy of 1:1:1 to integrate the output results of all the learners.
For the stacking ensemble strategy, the LR from Group A was designated the meta-learner. This streamlined binary classification linear model can effectively integrate the outputs of different base learners, enhance model performance, and ensure computational efficiency. The same meta-learner selection method has also been reported in multiple studies [41–50]. Two base learners were selected from the remaining models in Groups A and B, generating a total of 9 combination schemes. The specific strategies are as follows: (i) one model was selected from Group A and one from Group B simultaneously, producing 8 combination schemes; (ii) both models in Group B (RF and the top-performing boosting algorithm model) were selected, leading to 1 combination scheme. In total, 9 combination schemes for the stacking ensemble strategy were generated.
Evaluation metrics
This study focused on the classification prediction task. After comparing the predictions of the models on the test set and the external validation set with the true sample labels, we identified four key quantities: (1) true positive (TP), (2) false positive (FP), (3) true negative (TN), and (4) false negative (FN). The following evaluation metrics were then calculated:
- (1). Accuracy: the proportion of correctly predicted samples among all samples, calculated as $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$. It indicates the overall correctness of predictions.
- (2). Precision: the proportion of samples predicted as positive that are actually positive, calculated as $\mathrm{Precision} = \frac{TP}{TP + FP}$. It reflects the accuracy of positive predictions.
- (3). Recall: the proportion of actual positive samples that are correctly identified as positive, calculated as $\mathrm{Recall} = \frac{TP}{TP + FN}$. It reflects the ability of the model to identify positive cases.
- (4). F1 score: $F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$.
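These four metrics follow directly from the confusion-matrix counts; a minimal helper (function and argument names are ours):

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 score
    from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```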
The AUC was chosen as the primary performance metric because it summarises the classification ability across all the thresholds. AUC values were computed for all the models, and 95% confidence intervals were derived by bootstrapping with 1000 resamples. Because very high precision can underdiagnose, whereas very high recall can overdiagnose, the F1 score was also emphasized. This approach provides a balanced view of under- and overdiagnosis. Learning curves for the nine base models were plotted to assess the risk of overfitting. A calibration curve was used to assess probability calibration, and decision curve analysis (DCA) was used to quantify clinical utility. Finally, SHAP analysis elucidated the ensemble decision processes. The confusion matrices corresponding to the predictions of all models on both the TCGA test set and CGGA validation set are provided in S9 and S10 Files.
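The bootstrapped 95% confidence interval for the AUC can be sketched as a percentile bootstrap; this is one common construction consistent with the 1000-resample description above, not necessarily the study's exact procedure:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:
            continue  # a resample must contain both classes
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```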
Statistical analysis
Statistical analyses and model construction were performed in R 4.4.1 and Python 3.12.2. The complete analysis code is provided in S1 File (Python) and S2 File (R). Qualitative variables are summarized as frequencies and percentages (n (%)), whereas quantitative variables are reported as means ± standard deviations or medians (IQRs). Correlation hypotheses were tested for each variable against the label with α = 0.05. For qualitative variables, Fisher’s exact test was applied when any expected cell count was < 5; otherwise, the chi-square test was used. For quantitative variables, significant departures from normality detected by the Anderson–Darling test prompted the use of the Mann–Whitney U test. Cramér’s V was calculated to visualise correlations among categorical variables, and the point–biserial coefficient was computed to depict associations between quantitative and binary variables.
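The test-selection logic above can be sketched with SciPy. The branch taken when normality is not rejected (a t-test) is our assumption, as the text specifies only the non-normal case; function names are ours:

```python
import numpy as np
from scipy import stats

def qualitative_p(table):
    """Fisher's exact test when any expected cell count < 5,
    otherwise the chi-square test, for a qualitative 2x2 table."""
    table = np.asarray(table)
    if (stats.contingency.expected_freq(table) < 5).any():
        return stats.fisher_exact(table)[1]
    return stats.chi2_contingency(table)[1]

def quantitative_p(a, b):
    """Anderson-Darling normality check for both groups; fall back to
    the Mann-Whitney U test on departure from normality (t-test branch
    for the normal case is an assumption)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    normal = all(
        stats.anderson(g).statistic < stats.anderson(g).critical_values[2]
        for g in (a, b))  # critical_values[2] is the 5% level
    return (stats.ttest_ind(a, b)[1] if normal
            else stats.mannwhitneyu(a, b)[1])
```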
Results
Feature selection
This study assesses correlations between continuous–binary and categorical variable pairs with the point–biserial coefficient and Cramér’s V, respectively, and visualises the results in two heat maps. As illustrated in Fig 2A, markers such as PTEN, EGFR, IDH1, and ATRX show differing strengths of correlation with age. The data in Fig 2B show predominantly weak feature correlations, although TP53 and ATRX display a relatively strong association.
(A) Age correlations with binary traits (point-biserial). (B) Categorical feature correlations (Cramer’s V). (C) Feature importance ranking based on RF. (D) Feature importance ranking by ENR coefficients. These charts evaluate feature relationships and model contributions.
Additionally, correlation hypothesis tests were run for every variable against glioma grade (label). These tests revealed significant associations for most features with glioma grade (p < 0.05).
In summary, RFE was integrated with RF and ENR for feature screening. This strategy mitigates collinearity, reduces redundancy, and isolates predictors with high discriminative value. The data in Fig 2C indicate that the top three RF importance scores were assigned to age, IDH1, and PTEN, whereas Fig 2D shows that the largest absolute ENR coefficients belonged to IDH1, IDH2, and TP53. Fourteen predictors were retained after recursive elimination, and intersection analysis yielded 11 final inputs: EGFR, TP53, Race, IDH2, gender, NF1, MUC16, age, PTEN, IDH1, and CIC. Each model was externally validated on the CGGA under identical conditions.
Performance of the base model
Nine baseline models—MLP, SVM, DT, GBDT, RF, XGBoost, LightGBM, LR and KNN—were trained on the screened dataset. Learning curves were plotted for each model to evaluate potential overfitting; the curves suggested mild overfitting in several baselines (S1 Fig). The ROC curves for internal and external validation are shown in Fig 3A and 3B, respectively. Most curves lie near the upper-left corner, demonstrating high sensitivity, specificity and overall discriminative ability. The performance metrics are summarised in Table 2. During internal validation, the test-set AUC exceeded 0.900 for every model except DT, with LightGBM achieving the highest value (0.921); DT yielded an AUC of 0.884 and an F1 score of 0.847. With respect to external validation, RF produced the greatest AUC (0.797), followed by XGBoost (0.793) and SVM (0.792). RF also obtained a test-set AUC of 0.916 and an external F1 score of 0.860, confirming robust generalisation; its ROC curve lies close to the upper-left corner, and the other indicators are also satisfactory. Moreover, RF effectively controlled model complexity and diversity through parameters such as n_estimators, criterion, and max_features, achieving a better trade-off between bias and variance and highlighting its overall superior performance among the nine baselines. The hyperparameter search spaces and optimal configurations are listed in S7 and S8 Tables, respectively.
(A) ROC curves from the TCGA test set. (B) ROC curves from the CGGA validation set. These curves assess the predictive performance of base models across different cohorts.
Calibration curves were used to assess the accuracy of the predicted probabilities. In both the TCGA test set and the CGGA external validation set, the calibration curves of most base models fluctuated near the ideal diagonal, indicating reasonably well-calibrated predicted probabilities. Decision curve analysis (DCA) was used to quantify clinical utility. On the TCGA test set, the curves for MLP, RF, and several other models lie above both the treat-all and treat-none lines for threshold probabilities between 0.2 and 0.8; on the CGGA, a similar benefit is observed for thresholds between 0.3 and 0.7. The calibration curves for all 43 models are provided in S3 File (TCGA) and S4 File (CGGA); the corresponding DCA results appear in S5 File (TCGA) and S6 File (CGGA). The 95% confidence intervals for the AUC values of all models on both the TCGA test set and the CGGA validation set are provided in S5 and S6 Tables.
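DCA compares each model's net benefit with the treat-all and treat-none baselines across threshold probabilities. A minimal sketch of the net-benefit calculation underlying those curves, on toy labels and probabilities:

```python
import numpy as np

def net_benefit(y_true, y_prob, t):
    """Net benefit at threshold probability t: (TP - FP * t/(1-t)) / N.
    This is the quantity plotted on the y-axis of a DCA curve."""
    y_true = np.asarray(y_true)
    pred = np.asarray(y_prob) >= t
    n = len(y_true)
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    return (tp - fp * t / (1 - t)) / n

def net_benefit_treat_all(y_true, t):
    """Baseline strategy: classify every patient as high grade."""
    prev = np.mean(y_true)
    return prev - (1 - prev) * t / (1 - t)

# Toy example; the treat-none baseline is identically zero.
y_true = np.array([1, 1, 1, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.6, 0.4, 0.2, 0.1])
print(net_benefit(y_true, y_prob, 0.5), net_benefit_treat_all(y_true, 0.5))
```

A model whose curve lies above both baselines over a threshold range, as reported here for 0.2–0.8 on the TCGA, offers positive clinical utility in that range.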
Performance of the ensemble model
Considering algorithmic diversity and performance metrics, nine baseline learners were grouped and combined to construct multiple ensemble models. The test-set AUC of XGBoost (0.917) approximated that of LightGBM (0.921); however, its validation-set AUC reached 0.793, and its F1 score (0.864) exceeded those of all the individual baselines. Furthermore, the gamma parameter of XGBoost sets the minimum loss reduction required to split a leaf node, suppressing unprofitable splits and limiting overfitting. Owing to its strong discrimination and generalisability, XGBoost outperformed GBDT and LightGBM and ranked immediately below RF. Accordingly, XGBoost was retained as the sole gradient-boosting learner, yielding a second group composed of RF and XGBoost. Ensemble models were subsequently assembled on the basis of this grouping, and the combination schemes are detailed in Table 3.
The ROC curves for the voting ensembles in internal and external validation are summarised in Fig 4A and 4B, respectively, and the corresponding performance metrics are listed in Table 4. All the voting combinations show strong internal performance, with external validation only marginally lower. Notably, voting25 yields the highest internal test-set AUC (0.928), underscoring its superior predictive capability among the voting ensembles, whereas voting15 attains the greatest external AUC (0.808). Voting25 also achieves an F1 score of 0.860 on the test set and an external AUC of 0.794, confirming solid generalisation. A formal statistical comparison of the ROC curves between the top-performing voting25 model and the RF base model is presented in S9 Table. Overall, considering both accuracy and generalisability, voting25 provides the most balanced and robust predictive performance.
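A soft-voting ensemble in the spirit of voting25 (RF + XGBoost + KNN) can be sketched as follows. This is a sketch under stated assumptions: the data are synthetic, and scikit-learn's GradientBoostingClassifier is used as a stand-in for XGBoost so the example has no dependency on the xgboost package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Soft voting averages the members' predicted class probabilities;
# KNN is scaled first because it is distance-based.
voting = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),  # XGBoost stand-in
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ],
    voting="soft")
voting.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, voting.predict_proba(X_te)[:, 1])
print(round(auc, 3))
```

Averaging probabilities rather than hard labels lets a confident member outweigh uncertain ones, which is why soft voting often edges out majority voting.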
(A)-(B) Voting models’ ROC curves for TCGA test set and CGGA validation set, respectively. (C)-(D) Stacking models’ ROC curves for TCGA test and CGGA validation sets, evaluating ensemble performance.
The ROC curves for all stacking ensembles in internal and external validation are summarised in Fig 4C and 4D, respectively. Although the curves trend towards the upper-left corner, several crossings make the AUC a clearer basis for comparison. As shown in Table 5, stacking8 achieves the highest internal AUC (0.925), reflecting strong discriminative power on the test set, whereas stacking6 attains the highest external AUC (0.808), indicating good generalisation. Notably, stacking7 combines a test-set AUC of 0.923—comparable with stacking8—with an F1 score of 0.865 and a favourable external AUC. Its other metrics remain stable, suggesting that stacking7 provides consistently strong overall performance.
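The stacking construction can be sketched with scikit-learn's StackingClassifier, which trains a meta-learner on out-of-fold predictions of the base learners. The base-learner set and the logistic-regression meta-learner below are illustrative assumptions; the paper's actual combinations are those listed in Table 3.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# cv=5 means the meta-learner sees five-fold out-of-fold predictions,
# which guards against leaking the base learners' training fit.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5)
stack.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
print(round(auc, 3))
```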
In both the TCGA test set and the CGGA validation set, the calibration curves for most ensembles fluctuated only slightly around the ideal line. Decision curve analysis (DCA) revealed that in the TCGA cohort, several ensembles outperform the treat-all and treat-none strategies for threshold probabilities between 0.1 and 0.9, demonstrating clinical utility across this range. With respect to the CGGA, a similar benefit appears for thresholds between 0.3 and 0.7. The DCA curve for voting25, which highlights its clinical utility within the pertinent threshold range, is shown in Fig 5.
SHAP analysis was applied to every model to interpret its decision process. Feature-importance rankings for voting25 on the TCGA test set and the CGGA validation set are shown in Fig 6. IDH1, age at diagnosis, CIC and PTEN received the highest SHAP scores, indicating substantial contributions to the model output. The complete SHAP results for all ensemble models on both the TCGA and CGGA datasets are provided in S7 and S8 Files, respectively.
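SHAP scores are Shapley values of the model's prediction function. The core idea can be sketched with a Monte Carlo estimate over random feature orderings; this is a minimal illustration on a toy linear model, not the optimised TreeSHAP algorithm the `shap` package implements.

```python
import numpy as np

def shapley_values(predict, x, background, n_samples=500, seed=0):
    """Monte Carlo Shapley estimate for one instance x.
    predict: callable mapping an (n, d) array to (n,) scores.
    background: (m, d) reference rows used to 'switch off' features."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    phi = np.zeros(d)
    for _ in range(n_samples):
        z = background[rng.integers(len(background))].copy()
        prev = predict(z[None, :])[0]
        for j in rng.permutation(d):      # random feature ordering
            z[j] = x[j]                   # switch feature j on
            cur = predict(z[None, :])[0]
            phi[j] += cur - prev          # marginal contribution of j
            prev = cur
    return phi / n_samples

# Toy linear "model": attributions should recover w * (x - background)
w = np.array([2.0, -1.0, 0.5])
predict = lambda X: X @ w
x = np.array([1.0, 1.0, 1.0])
bg = np.zeros((1, 3))
phi = shapley_values(predict, x, bg)
print(np.round(phi, 3))
```

By the efficiency property, the attributions sum to the gap between the prediction for x and the background prediction, which is what makes per-feature SHAP rankings additive and comparable.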
Discussion
To circumvent the haemorrhage, infection, and sampling-error risks inherent in conventional pathology-based grading [51,52], a minimally invasive glioma-grading model was developed with the TCGA dataset. The model was built via feature selection, five-fold cross-validation, random HPO, SMOTE oversampling, and ensemble learning, and its robustness was validated on the CGGA dataset. By integrating molecular with morphological information, the model overcomes single-modality limits and, given its demonstrated stability and generalisability, provides reproducible, objective evidence for pathological grading. Potential clinical uses include (i) molecular pregrading to avoid unnecessary biopsy in contraindicated patients [53] and (ii) serving as an adjunct for borderline cases, as supported by decision curve analysis (DCA) [54], while final management still requires individualised clinical judgement.
The least absolute shrinkage and selection operator (LASSO) is an established feature-selection technique [55]. Strong collinearity was observed between several predictors, including TP53 and ATRX. The aggressive sparsity imposed by LASSO can bias coefficient estimates when predictors are highly correlated [56,57]. Elastic-net regression (ENR), which combines the L1 and L2 penalties, unites the benefits of ridge regression and LASSO [58,59]. ENR therefore controls both shrinkage and sparsity, a useful compromise when feature selection is needed under high collinearity. Accordingly, recursive feature elimination (RFE) coupled with RF and ENR was adopted to mitigate collinearity and remove redundancy. Previous studies have confirmed that the retained features are strongly linked to glioma grade. For instance, IDH1 mutations occur in > 70% of low-grade gliomas (LGGs) [60], EGFR amplification appears in > 50% of glioblastomas (GBMs) [61], and TP53 is an independent prognostic marker for LGG, although its mechanism remains unclear [62]. These observations underpin the biological and clinical rationale for the chosen feature set.
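The grouping effect that motivates ENR here can be seen on a toy example with two perfectly collinear predictors: the lasso typically puts all weight on one, whereas the elastic net splits it between both. The sketch below uses synthetic noise-free data and arbitrary penalty strengths for clarity.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([x, x])        # two perfectly collinear predictors
y = 3.0 * x                        # noise-free target

# Lasso's pure L1 penalty picks one of the duplicated columns arbitrarily;
# the elastic net's added L2 term shares the weight between them.
lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("lasso:", np.round(lasso.coef_, 3))
print("enet: ", np.round(enet.coef_, 3))
```

This grouping behaviour is why ENR-based ranking is less arbitrary than LASSO when correlated mutation markers such as TP53 and ATRX compete for the same coefficient mass.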
The 11-feature model—IDH1, PTEN, TP53, and eight others—yielded AUC values > 0.90, indicating strong predictive capability. It integrates key biomarkers and clinical variables to facilitate grade prediction. Zhang et al. [63] used multiparametric MRI with RF to distinguish LGG from HGG, achieving an AUC of 0.81 and an F1 score of 0.88. In the present work, RF achieved an AUC of 0.916 for the TCGA cohort and 0.797 for the CGGA cohort, confirming its advantage in grade prediction. Hao et al. [64] combined RF and XGBoost on cuprotosis-related lncRNAs/mRNAs to predict survival in LGG and GBM patients. Tasci et al. [65] reported that a soft-voting ensemble of RF, SVM and AdaBoost performed well on the TCGA (AUC 0.914; F1 score 0.842). Here, the optimal voting ensemble (RF + XGBoost + KNN) yielded an AUC of 0.928 and an F1 score of 0.860 for the TCGA cohort, further improving performance. Du et al. [66] constructed a multimodal MR-radiomics model; the best single classifier reached an AUC of 0.85, whereas our RF (AUC of 0.916) and voting25 (AUC of 0.928) models outperformed that model.
Ensemble learning is widely recognized for its ability to improve robustness. For example, it has been applied to deep Gaussian mixture models [67], mirroring the ensemble-based strategy adopted here. Waqas et al. [68,69] reported that stacking refines decision thresholds and bolsters robustness in complex multi-instance learning (MIL) settings. Accordingly, this work adopts a combined stacking + voting framework. The hybrid design strengthens base-model complementarity and enhances generalisation. Numerous studies report an ensemble performance that surpasses that of individual models [70–72]; however, the ensembles here provided only marginal gains over the already optimised single models. This outcome likely reflects the high baseline performance achieved after targeted optimisation, leaving limited room for ensemble uplift, a phenomenon also noted elsewhere [73,74]. Even when accuracy gains are minor, ensembles show greater resistance to overfitting and better generalisability than baselines do, benefits that remain valuable [33,75,76].
Limitations of our study
Several limitations should be acknowledged. First, the TCGA and CGGA datasets are retrospective public cohorts without prospective, multicentre, or real-world clinical data, limiting the generalisability and clinical applicability of the models [13,77]. Although independent external validation was performed, residual bias may persist because of source limitations and small sample sizes—especially the paucity of rare glioma subtypes in the TCGA—thereby reducing the predictive accuracy for these categories [70]. Moreover, the TCGA is heavily skewed towards Caucasian patients, further constraining its transferability to other ethnicities [78]. Future work should therefore recruit large, ethnically diverse, prospective multicentre cohorts to improve generalisability and translational potential.
Second, to correct class imbalance, SMOTE was applied. However, SMOTE can create overly smooth synthetic cases, inflating performance estimates and heightening overfitting risk [79]. Future studies might test advanced augmentations—for example, GAN-based synthesis [80]—or validate the model on real-world data to capture greater biological diversity and increase robustness.
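The interpolation step responsible for SMOTE's overly smooth synthetic cases can be sketched in a few lines; the full algorithm, as implemented in the imbalanced-learn package, adds bookkeeping around this core idea, and the minority data below are synthetic.

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples: pick a minority point,
    pick one of its k nearest minority neighbours, and interpolate a
    random fraction of the way between them (the core SMOTE step)."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)             # exclude self-neighbours
    neighbours = np.argsort(dist, axis=1)[:, :k]
    synth = np.empty((n_new, X_min.shape[1]))
    for m in range(n_new):
        i = rng.integers(n)
        j = neighbours[i, rng.integers(k)]
        lam = rng.random()                     # interpolation fraction
        synth[m] = X_min[i] + lam * (X_min[j] - X_min[i])
    return synth

X_min = np.random.default_rng(1).normal(size=(20, 4))  # minority class
new = smote_sample(X_min, n_new=30)
print(new.shape)
```

Because every synthetic point lies on a segment between two real minority samples, the generated data never leave the convex hull of the observed class, which illustrates the smoothness limitation noted above.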
Finally, the lack of comprehensive radiomics prevented head-to-head comparisons with emerging deep-learning or multimodal fusion models, partially limiting the assessment of novelty. Nevertheless, the present clinically and molecularly driven model retains strong biological interpretability and practical utility [80].
Integrating machine learning with medical imaging offers a new avenue for glioma grading. Multiple-instance learning (MIL), a weakly supervised paradigm, has achieved breakthroughs in imaging analysis [81,82] and serves as a methodological reference for the present work. Radiomic prediction of molecular traits—for example, 1p/19q codeletion [83,84]—confirms the clinical value of multimodal integration. Accordingly, future efforts will systematically merge imaging genomics with clinical data and apply advanced machine-learning techniques to increase performance and expedite clinical translation.
Conclusions
For models built with 11 key features—including IDH1, PTEN, and TP53—the AUC for the TCGA dataset exceeded 0.90 in nearly every case. During the external validation of the CGGA dataset, both the baseline and the ensemble models performed robustly, with RF and the soft-voting ensemble (RF + XGBoost + KNN) delivering the best overall results. These findings provide a novel strategy for glioma grading and diagnosis.
Supporting information
S1 Fig. AUC learning curves for base models on TCGA data.
https://doi.org/10.1371/journal.pone.0314831.s001
(TIF)
S1 Table. CGGA clinical and genetic mutation data (uncoded).
https://doi.org/10.1371/journal.pone.0314831.s002
(CSV)
S2 Table. CGGA clinical and genetic mutation data (coded).
https://doi.org/10.1371/journal.pone.0314831.s003
(CSV)
S3 Table. TCGA clinical and genetic mutation data (uncoded).
https://doi.org/10.1371/journal.pone.0314831.s004
(CSV)
S4 Table. TCGA clinical and genetic mutation data (coded).
https://doi.org/10.1371/journal.pone.0314831.s005
(CSV)
S5 Table. 95% confidence intervals of AUC values for models on the TCGA test set.
https://doi.org/10.1371/journal.pone.0314831.s006
(XLSX)
S6 Table. 95% confidence intervals of AUC values for models on the CGGA external validation dataset.
https://doi.org/10.1371/journal.pone.0314831.s007
(XLSX)
S7 Table. Random search hyperparameter ranges for models.
https://doi.org/10.1371/journal.pone.0314831.s008
(XLSX)
S8 Table. Model parameters after training completion.
https://doi.org/10.1371/journal.pone.0314831.s009
(XLSX)
S9 Table. DeLong test result for voting25 model and RF.
https://doi.org/10.1371/journal.pone.0314831.s010
(XLSX)
S1 File. Code for building predictive models.
https://doi.org/10.1371/journal.pone.0314831.s011
(IPYNB)
S3 File. Calibration curve for models on the TCGA test set.
https://doi.org/10.1371/journal.pone.0314831.s013
(ZIP)
S4 File. Calibration curve for models on the CGGA external validation dataset.
https://doi.org/10.1371/journal.pone.0314831.s014
(ZIP)
S5 File. Decision curve analysis for models on the TCGA test set.
https://doi.org/10.1371/journal.pone.0314831.s015
(ZIP)
S6 File. Decision curve analysis for models on the CGGA external validation dataset.
https://doi.org/10.1371/journal.pone.0314831.s016
(ZIP)
S7 File. SHAP results for models on the TCGA test set.
https://doi.org/10.1371/journal.pone.0314831.s017
(ZIP)
S8 File. SHAP results for models on the CGGA external validation dataset.
https://doi.org/10.1371/journal.pone.0314831.s018
(ZIP)
S9 File. The confusion matrix of models on the TCGA test set.
https://doi.org/10.1371/journal.pone.0314831.s019
(ZIP)
S10 File. The confusion matrix of models on the CGGA external validation dataset.
https://doi.org/10.1371/journal.pone.0314831.s020
(ZIP)
S11 File. The deployment program of the voting25 model.
https://doi.org/10.1371/journal.pone.0314831.s021
(PKL)
References
- 1. GBD 2016 Neurology Collaborators. Global, regional, and national burden of neurological disorders, 1990-2016: a systematic analysis for the global burden of disease study 2016. Lancet Neurol. 2019;18:459–80.
- 2. Steliarova-Foucher E, Colombet M, Ries LAG, Moreno F, Dolya A, Bray F. International incidence of childhood cancer, 2001-10: a population-based registry study. Lancet Oncol. 2017;18:719–31.
- 3. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2021;71(3):209–49. pmid:33538338
- 4. Grech N, Dalli T, Mizzi S, Meilak L, Calleja N, Zrinzo A. Rising incidence of glioblastoma multiforme in a well-defined population. Cureus. 2020;12(5):e8195. pmid:32572354
- 5. Louis DN, Perry A, Reifenberger G, von Deimling A, Figarella-Branger D, Cavenee WK, et al. The 2016 World Health Organization classification of tumors of the central nervous system: a summary. Acta Neuropathol. 2016;131(6):803–20. pmid:27157931
- 6. Louis DN, Perry A, Wesseling P, Brat DJ, Cree IA, Figarella-Branger D, et al. The 2021 WHO classification of tumors of the central nervous system: a summary. Neuro Oncol. 2021;23(8):1231–51. pmid:34185076
- 7. Minata M, Audia A, Shi J, Lu S, Bernstock J, Pavlyukov MS, et al. Phenotypic plasticity of invasive edge glioma stem-like cells in response to ionizing radiation. Cell Rep. 2019;26(7):1893-1905.e7. pmid:30759398
- 8. Drexler R, Khatri R, Sauvigny T, Mohme M, Maire CL, Ryba A, et al. A prognostic neural epigenetic signature in high-grade glioma. Nat Med. 2024;30(6):1622–35. pmid:38760585
- 9. van den Bent MJ. Interobserver variation of the histopathological diagnosis in clinical trials on glioma: a clinician’s perspective. Acta Neuropathol. 2010;120(3):297–304. pmid:20644945
- 10. Smits A, Bento MJ, Leemans CR. Interobserver variability in glioma grading. Cancer. 2005;104:1971–8.
- 11. Miller AM, Shah RH, Pentsova EI, Pourmaleki M, Briggs S, Distefano N, et al. Tracking tumour evolution in glioma through liquid biopsies of cerebrospinal fluid. Nature. 2019;565(7741):654–8. pmid:30675060
- 12. Zhao Z, Meng F, Wang W, Wang Z, Zhang C, Jiang T. Comprehensive RNA-seq transcriptomic profiling in the malignant progression of gliomas. Sci Data. 2017;4:170024. pmid:28291232
- 13. Zhao Z, Zhang K-N, Wang Q, Li G, Zeng F, Zhang Y, et al. Chinese Glioma Genome Atlas (CGGA): a comprehensive resource with functional genomic data from chinese glioma patients. Genom Proteom Bioinform. 2021;19(1):1–12. pmid:33662628
- 14. Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008;455(7216):1061–8. pmid:18772890
- 15. Akbari H, Torkamani A, Schumacher M, Verhaak RGW. Machine learning in glioma biology: current applications and future prospects. Neuro Oncol. 2019;21:743–53.
- 16. Su C, Jiang J, Zhang S, Shi J, Xu K, Shen N, et al. Radiomics based on multicontrast MRI can precisely differentiate among glioma subtypes and predict tumour-proliferative behaviour. Eur Radiol. 2019;29(4):1986–96. pmid:30315419
- 17. Guo J, Ren J, Shen J, Cheng R, He Y. Do the combination of multiparametric MRI-based radiomics and selected blood inflammatory markers predict the grade and proliferation in glioma patients? Diagn Interv Radiol. 2021;27(3):440–9. pmid:33769289
- 18. Bishop CM. Pattern recognition and machine learning. New York: Springer; 2006.
- 19. Jordan MI, Mitchell TM. Machine learning: trends, perspectives, and prospects. Science. 2015;349(6245):255–60. pmid:26185243
- 20. Kyal C, Poddar H, Reza M. Chapter 2-thermal biometric face recognition (TBFR): a noncontact face biometry. In: Sarangi PP, Panda M, Mishra S, Mishra BSP, Majhi B, editors. Machine learning for biometrics. Cambridge: Academic Press; 2022. p. 29–46.
- 21. Mohan G, Subashini MM. MRI based medical image analysis: survey on brain tumor grade classification. Biomed Signal Process Control. 2018;39:139–61.
- 22. Guyon I, Eliseeff A. An introduction to feature extraction. J Mach Learn Res. 2003;3:1157–82.
- 23. Cai J, Luo J, Wang S, Yang S. Feature selection in machine learning: a new perspective. Neurocomputing. 2018;300:70–9.
- 24. Dokeroglu T, Deniz A, Kiziloz HE. A comprehensive survey on recent metaheuristics for feature selection. Neurocomputing. 2022;494:269–96.
- 25. Blum AL, Langley P. Selection of relevant features and examples in machine learning. Artif Intell. 1997;97:245–71.
- 26. Batista GEAPA, Prati RC, Monard MC. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl. 2004;6(1):20–9.
- 27. Chandrashekar G, Sahin F. A survey on feature selection methods. Comput Electr Eng. 2014;40:16–28.
- 28. Guzmán-Ponce A, Sánchez JS, Valdovinos RM, Marcial-Romero JR. DBIG-US: a two-stage under-sampling algorithm to face the class imbalance problem. Expert Syst Appl. 2021;168:114301.
- 29. Thabtah F, Hammoud S, Kamalov F, Gonsalves A. Data imbalance in classification: experimental evaluation. Inf Sci. 2020;513:429–41.
- 30. Koziarski M, Krawczyk B, Woźniak M. Radial-Based oversampling for noisy imbalanced data classification. Neurocomputing. 2019;343:19–33.
- 31. Kovács G. Smote-variants: a python implementation of 85 minority oversampling techniques. Neurocomputing. 2019;366:352–4.
- 32. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57.
- 33. Dietterich TG. Ensemble methods in machine learning. In: Multiple classifier systems. Berlin, Heidelberg: Springer; 2000. p. 1–15.
- 34. Hansen LK, Salamon P. Neural network ensembles. IEEE Trans Pattern Anal Mach Intell. 1990;12(10):993–1001.
- 35. Hassan MF, Al-Zurfi AN, Alsalihi MH, Ahmed K. An effective ensemble learning approach for classification of glioma grades based on novel MRI features. Sci Rep. 2024;14(1):11977. pmid:38796531
- 36. Vidyadharan S, Rao BVVSNP, Yogeeswari P, Kesavadas C, Rajagopalan V. Accurate low and high grade glioma classification using free water eliminated diffusion tensor metrics and ensemble machine learning. Sci Rep. 2024;14(1):19844. pmid:39191905
- 37. Chandra Joshi R, Mishra R, Gandhi P, Pathak VK, Burget R, Dutta MK. Ensemble based machine learning approach for prediction of glioma and multi-grade classification. Comput Biol Med. 2021;137:104829. pmid:34508971
- 38. Tabatabaei M, Razaei A, Sarrami AH, Saadatpour Z, Singhal A, Sotoudeh H. Current status and quality of machine learning-based radiomics studies for glioma grading: a systematic review. Oncology. 2021;99(7):433–43. pmid:33849021
- 39. Perniciano A, Loddo A, Di Ruberto C, Pes B. Insights into radiomics: impact of feature selection and classification. Multimed Tools Appl. 2025;84:31695–721.
- 40. Zhou ZH. Ensemble methods: foundations and algorithms. Boca Raton: Chapman and Hall/CRC; 2012.
- 41. Layeghian Javan S, Sepehri MM, Layeghian Javan M, Khatibi T. An intelligent warning model for early prediction of cardiac arrest in sepsis patients. Comput Methods Programs Biomed. 2019;178:47–58. pmid:31416562
- 42. Niestroy J, Han J, Luo J, Zhao R, Lake DE, Flower A. Prediction of decompensation in patients in the cardiac ward. 2019 Systems and Information Engineering Design Symposium (SIEDS). Charlottesville (VA): IEEE; 2019. p. 1–6.
- 43. Williams ML, James WP, Rose MT. Variable segmentation and ensemble classifiers for predicting dairy cow behaviour. Biosyst Eng. 2019;178:156–67.
- 44. Lin S-C, Chang YI, Yang W-N. Meta-learning for imbalanced data and classification ensemble in binary classification. Neurocomputing. 2009;73(1–3):484–94.
- 45. He H, Zhang W, Zhang S. A novel ensemble method for credit scoring: adaption of different imbalance ratios. Expert Syst Appl. 2018;98:105–17.
- 46. Cockroft NT, Cheng X, Fuchs JR. STarFish: a stacked ensemble target fishing approach and its application to natural products. J Chem Inf Model. 2019;59(11):4906–20. pmid:31589422
- 47. El-Rashidy N, El-Sappagh S, Abuhmed T, Abdelrazek S, El-Bakry HM. Intensive care unit mortality prediction: an improved patient-specific stacking ensemble model. IEEE Access. 2020;8:133541–64.
- 48. Arora A, Srivastava A, Bansal S. Business competitive analysis using promoted post detection on social media. J Retail Consum Serv. 2020;54:101941.
- 49. Sobanadevi V, Ravi G. Handling data imbalance using a heterogeneous bagging-based stacked ensemble (HBSE) for credit card fraud detection. In: Peter JD, Fernandes SL, Alavi AH, editors. Intelligence in big data technologies—beyond the hype. Singapore: Springer; 2021. p. 517–25.
- 50. Cao C, Wang Z. IMCStacking: cost-sensitive stacking learning with feature inverse mapping for imbalanced problems. Knowl Based Syst. 2018;150:27–37.
- 51. Aneja S, Chang E, Omuro A. Applications of artificial intelligence in neuro-oncology. Curr Opin Neurol. 2019;32(6):850–6. pmid:31609739
- 52. Jackson RJ, Fuller GN, Abi-Said D, Lang FF, Gokaslan ZL, Shi WM, et al. Limitations of stereotactic biopsy in the initial management of gliomas. Neuro Oncol. 2001;3(3):193–200. pmid:11465400
- 53. Zhang Y-L, Liu Z-R, Liu Z, Bai Y, Chi H, Chen D-P, et al. Risk of cardiovascular death in patients with hepatocellular carcinoma based on the Fine-Gray model. World J Gastrointest Oncol. 2024;16(3):844–56. pmid:38577452
- 54. Feng G, Xu H, Wan S, Wang H, Chen X, Magari R. Twelve practical recommendations for developing and applying clinical predictive models. Innov Med. 2024;2:100105.
- 55. Yin P, Mao N, Zhao C, Wu J, Sun C, Chen L, et al. Comparison of radiomics machine-learning classifiers and feature selection for differentiation of sacral chordoma and sacral giant cell tumour based on 3D computed tomography features. Eur Radiol. 2019;29(4):1841–7. pmid:30280245
- 56. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B Stat Methodol. 2005;67(2):301–20.
- 57. James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning: with applications in R. New York: Springer; 2013.
- 58. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1–22. pmid:20808728
- 59. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B Stat Methodol. 1996;58(1):267–88.
- 60. Mellinghoff IK, Ellingson BM, Touat M, Maher E, De La Fuente MI, Holdhoff M, et al. Ivosidenib in isocitrate dehydrogenase 1-mutated advanced glioma. J Clin Oncol. 2020;38(29):3398–406. pmid:32530764
- 61. Lassman AB, Pugh SL, Wang TJC, Aldape K, Gan HK, Preusser M, et al. Depatuxizumab mafodotin in EGFR-amplified newly diagnosed glioblastoma: a phase III randomized clinical trial. Neuro Oncol. 2023;25(2):339–50. pmid:35849035
- 62. Murnyak B, Huang LE. Association of TP53 alteration with tissue specificity and patient outcome of IDH1-mutant glioma. Cells. 2021;10(8):2116. pmid:34440884
- 63. Zhang L, Zhang H, Rekik I, Gao Y, Wang Q, Shen D. Malignant brain tumor classification using the random forest method. In: Bai X, Hancock ER, Ho TK, Wilson RC, Biggio B, Robles-Kelly A, editors. Structural, syntactic, and statistical pattern recognition. Cham: Springer International Publishing; 2018. p. 14–21.
- 64. Hao S, Gao M, Li Q, Shu L, Wang P, Hao G. Machine learning predicts cuproptosis-related lncRNAs and survival in glioma patients. Sci Rep. 2024;14(1):22323. pmid:39333603
- 65. Tasci E, Zhuge Y, Kaur H, Camphausen K, Krauze AV. Hierarchical voting-based feature selection and ensemble learning model scheme for glioma grading with clinical and molecular characteristics. Int J Mol Sci. 2022;23(22):14155. pmid:36430631
- 66. Du P, Liu X, Wu X, Chen J, Cao A, Geng D. Predicting histopathological grading of adult gliomas based on preoperative conventional multimodal MRI radiomics: a machine learning model. Brain Sci. 2023;13(6):912. pmid:37371390
- 67. Waqas M, Tahir MA, Qureshi R. Deep Gaussian mixture model based instance relevance estimation for multiple instance learning applications. Appl Intell. 2022;53:10310–25.
- 68. Waqas M, Tahir MA, Khan SA. Robust bag classification approach for multi-instance learning via subspace fuzzy clustering. Expert Syst Appl. 2023;214:119113.
- 69. Waqas M, Tahir MA, Qureshi R. Ensemble-based instance relevance estimation in multiple-instance learning. 2021 9th European Workshop on Visual Information Processing (EUVIP). Paris, France: IEEE; 2021. p. 1–6.
- 70. Cawood P, Van Zyl T. Evaluating state of the art, forecasting ensembles-and meta-learning strategies for model fusion. arXiv:2203.03279 [Preprint]. 2022.
- 71. McGowan CJ, Biggerstaff M, Johansson M, Apfeldorf KM, Ben-Nun M, Brooks L. Collaborative efforts to forecast seasonal influenza in the United States, 2015–2016. Sci Rep. 2019;9:683.
- 72. Reich NG, Brooks LC, Fox SJ, Kandula S, McGowan CJ, Moore E, et al. A collaborative multiyear, multimodel assessment of seasonal influenza forecasting in the United States. Proc Natl Acad Sci U S A. 2019;116(8):3146–54. pmid:30647115
- 73. Huang J-C, Ko K-M, Shu M-H, Hsu B-M. Application and comparison of several machine learning algorithms and their integration models in regression problems. Neural Comput Appl. 2019;32(10):5461–9.
- 74. Ojha VK, Abraham A, Snášel V. Ensemble of heterogeneous flexible neural trees using multiobjective genetic programming. Appl Soft Comput. 2017;52:909–24.
- 75. Sagi O, Rokach L. Ensemble learning: a survey. WIREs Data Min Knowl Discov. 2018;8(4):e1249.
- 76. Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123–40.
- 77. Brennan CW, Verhaak RGW, McKenna A, Campos B, Noushmehr H, Salama SR, et al. The somatic genomic landscape of glioblastoma. Cell. 2013;155(2):462–77. pmid:24120142
- 78. Koo H, Choi SW, Cho HJ, Lee I-H, Kong D-S, Seol HJ, et al. Ethnic delineation of primary glioblastoma genome. Cancer Med. 2020;9(19):7352–9. pmid:32794373
- 79. Blagus R, Lusa L. SMOTE for high-dimensional class-imbalanced data. BMC Bioinform. 2013;14:106. pmid:23522326
- 80. Sun W, Song C, Tang C, Pan C, Xue P, Fan J, et al. Performance of deep learning algorithms to distinguish high-grade glioma from low-grade glioma: a systematic review and meta-analysis. iScience. 2023;26(6):106815. pmid:37250800
- 81. Waqas M, Ahmed SU, Tahir MA, Wu J, Qureshi R. Exploring multiple instance learning (MIL): a brief survey. Expert Syst Appl. 2024;250:123893.
- 82. Waqas M, Tahir MA, Author MD, Al-Maadeed S, Bouridane A, Wu J. Simultaneous instance pooling and bag representation selection approach for multiple-instance learning (MIL) using vision transformer. Neural Comput Appl. 2024;36(12):6659–80.
- 83. Kha Q-H, Le V-H, Hung TNK, Le NQK. Development and validation of an efficient MRI radiomics signature for improving the predictive performance of 1p/19q co-deletion in lower-grade gliomas. Cancers (Basel). 2021;13(21):5398. pmid:34771562
- 84. Lam LHT, Chu NT, Tran T-O, Do DT, Le NQK. A radiomics-based machine learning model for prediction of tumor mutational burden in lower-grade gliomas. Cancers (Basel). 2022;14(14):3492. pmid:35884551