
Machine learning driven biomarker selection for medical diagnosis

  • Divyagna Bavikadi ,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    dbavikad@asu.edu

    Affiliation Fulton Schools of Engineering, Arizona State University, Tempe, Arizona, United States of America

  • Ayushi Agarwal,

    Roles Methodology, Validation, Visualization

    Affiliation Fulton Schools of Engineering, Arizona State University, Tempe, Arizona, United States of America

  • Shashank Ganta,

    Roles Methodology, Software, Validation, Visualization

    Affiliation Fulton Schools of Engineering, Arizona State University, Tempe, Arizona, United States of America

  • Yunro Chung,

    Roles Data curation, Investigation

    Affiliations Biodesign Center for Personalized Diagnostics, Arizona State University, Tempe, Arizona, United States of America, College of Health Solutions, Arizona State University, Phoenix, Arizona, United States of America

  • Lusheng Song,

    Roles Data curation, Investigation

    Affiliation Biodesign Center for Personalized Diagnostics, Arizona State University, Tempe, Arizona, United States of America

  • Ji Qiu,

    Roles Data curation, Investigation

    Affiliation Biodesign Center for Personalized Diagnostics, Arizona State University, Tempe, Arizona, United States of America

  • Paulo Shakarian

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Writing – original draft

    Affiliation Fulton Schools of Engineering, Arizona State University, Tempe, Arizona, United States of America

Abstract

Recent advances in experimental methods have enabled researchers to collect data on thousands of analytes simultaneously. This has led to correlational studies that associate molecular measurements with diseases such as Alzheimer’s disease, liver cancer, and gastric cancer. However, using thousands of biomarkers selected from the analytes is not practical for real-world medical diagnosis and is likely undesirable due to potential spurious correlations. In this study, we evaluate 4 different methods for biomarker selection and 5 different machine learning (ML) classifiers for identifying correlations, evaluating 20 approaches in all. We found that contemporary methods outperform previously reported logistic regression in cases where 3 and 10 biomarkers are permitted. When specificity is fixed at 0.9, ML approaches produced a sensitivity of 0.240 (3 biomarkers) and 0.520 (10 biomarkers), while standard logistic regression provided a sensitivity of 0.000 (3 biomarkers) and 0.040 (10 biomarkers). We also noted that causal-based methods for biomarker selection proved the most performant when fewer biomarkers were permitted, while univariate feature selection was the most performant when a greater number of biomarkers were permitted.

Introduction

Recent advances in experimental methods have enabled researchers to collect data on thousands of biological analytes simultaneously [1,2]. This has led to correlational studies that associate these molecular measurements with diseases such as Alzheimer’s disease [3], liver cancer [4], and gastric cancer [5]. However, it is generally considered undesirable to use thousands of biomarkers selected from the analytes for medical diagnosis, for several reasons. First, large numbers of biomarkers increase the likelihood of spurious correlation. Second, the use of many biomarkers increases model complexity and hinders the interpretability of results. Further, from a practical standpoint, fewer biomarkers make for more cost-effective diagnostic products.

As a result, previous studies have conducted two operations in tandem: the selection of candidate biomarkers thought to be individually associated with a given disease, and the identification of correlations between the combination of selected candidate biomarkers and the target medical condition. The most commonly reported methodology in the literature has been logistic regression, often accompanied by a variant of univariate feature selection [6–8]. This paper augments existing work by studying the effect of the feature selection method and model type. In particular, we examine causal-based feature selection [9] and a variety of machine learning approaches, including gradient-boosted decision trees and neural networks. In all, we study 20 different combinations of feature selection and classification models in tests where the number of biomarkers K is restricted to values in {1, 3, 4, 10, 15, 30} on a gastric cancer dataset that includes measurements from 3440 biological analytes [10]. We perform a cross-validation study, report results on training and test sets, and examine hyperparameter sensitivity for the causal-based approaches. We found that contemporary machine learning methods outperform previously reported logistic regression in these experiments. When specificity is fixed at 0.9, ML approaches produced a sensitivity of 0.240 (3 biomarkers) and 0.520 (10 biomarkers), while standard logistic regression provided a sensitivity of 0.000 (3 biomarkers) and 0.040 (10 biomarkers).

The rest of the paper is organized as follows: We first provide a brief overview of related work, a description of the gastric cancer dataset, and machine learning methods. This is followed by reporting of the experimental results on the gastric cancer dataset and associated discussion. Finally, we conclude by discussing our findings.

Gastric cancer dataset

The dataset [10] used for biomarker discovery contains 100 samples, each labeled case or control to indicate the presence or absence of gastric cancer. The dataset is balanced, with 50 case and 50 control samples, and cases and controls are matched on age and gender. Each sample is represented by 3440 molecular measurement values, which are used to assess the risk of gastric cancer and provide insight into the disease. The measurement values range from 0.00 to 260.65 with a median of 1.00. Measurements were recorded for both IgG and IgA antibodies against the same set of proteins. The dataset covers clinical features, antibody reactions against Helicobacter pylori proteins, and demographic variables. Using Nucleic Acid Programmable Protein Array (NAPPA) technology, the original study assessed humoral responses to 1527 proteins, nearly the complete H. pylori proteome. Seropositivity was defined relative to the median normalized intensity on NAPPA. Table 1 shows the breakdown of the dataset.

Table 1. Breakdown of gastric cancer dataset.

https://doi.org/10.1371/journal.pone.0322620.t001

To our knowledge, this dataset [10] had not been explored with different biomarker selection methods at the time of writing.

For the training data, each sample consists of a vector of real-valued analyte measurements and a ground-truth label indicating the actual presence of the disease, distinguishing gastric cancer patients from healthy controls.

Machine learning and feature selection methods

Overview of approaches

We employ a two-step process for each method: feature selection followed by classification; we discuss each in turn. We use the symbol K to denote the maximum number of biomarkers permitted after the feature selection step. The best K biomarkers are then used to classify a sample. We also explore the effect of binarizing biomarker inputs: rather than considering the biomarker measurement directly, we only consider whether the biomarker exceeds some threshold, which is specified as a hyperparameter. Note that even though the measurements range up to 260.65, most values are around 1.00; due to this distribution, we chose candidate threshold values of 0.6, 1.0, 1.4, and 1.8.

The whole process can be viewed in Fig 1. Considering the size of the data, we use leave-one-out cross-validation (LOOCV). We compute the causal metric (replaced by a univariate selection method where noted) to select the top K biomarkers with the highest causality measures; for univariate selection, the chi-squared test selects the features with the strongest relationship to the target variable, i.e., the highest scores and correspondingly lowest p-values. We then use the K selected biomarker measures to train and test the ML models and report results for various hyperparameter settings. For runtime efficiency, we parallelize the causal computation of Equation 1 across analytes i.
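The selection-then-classification loop under LOOCV can be sketched as below. This is a minimal illustration, not the paper's tuned pipeline: the data is a synthetic stand-in (log-normal values with median near 1.00, mimicking the real matrix's distribution), the univariate (chi-squared) variant is shown, and the threshold and model choices are illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
# Synthetic stand-in for the real 100-sample x 3440-analyte matrix:
# log-normal values give a median near 1.00, as in the gastric cancer data.
X = rng.lognormal(mean=0.0, sigma=1.0, size=(40, 50))
y = rng.integers(0, 2, size=40)

K, THRESHOLD = 3, 1.4             # biomarker budget and binarization threshold
Xb = (X > THRESHOLD).astype(int)  # keep only "exceeds threshold" information

preds = []
for tr, te in LeaveOneOut().split(X):
    # Feature selection is fitted on the training fold only (no leakage).
    sel = SelectKBest(chi2, k=K).fit(Xb[tr], y[tr])
    clf = LogisticRegression().fit(sel.transform(Xb[tr]), y[tr])
    preds.append(int(clf.predict(sel.transform(Xb[te]))[0]))
```

Each of the 40 samples is held out once, yielding one out-of-fold prediction per sample, from which metrics such as AUC can then be computed.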

Feature selection methods

We consider two types of feature selection methods: univariate selection and the causal metric. Univariate feature selection evaluates the strength of the relationship between each feature and the response variable; in this paper, we use the chi-squared statistic. By contrast, the causal-based method examines the effect of a single analyte conditioned on other analytes that may have co-occurring measurements. A contribution of this work is an adaptation of the causal measure of Kleinberg et al. [11] for biomarker selection. While Kleinberg et al. [11] compute causality as the average increase in the probability of the effect when the cause is present, here we propose a new metric based on the intuition of Gardner et al. [12] but adapted for biomarker selection as follows:

causal(i) = (1 / |R_i|) · Σ_{j ∈ R_i} [ f(i ∧ j) − f(¬i ∧ j) ]    (1)

Here we still examine the average increase of a function when the biomarker is present, based on co-occurring biomarkers. However, unlike Kleinberg et al. [11], we do not use probability but a measure better tuned to our domain. In Equation 1, causal(i) is the causal metric for analyte i, R_i is the set of analytes related to analyte i, and f is the measure calculated as the product of sensitivity and specificity (also known as the s2 metric) for every pair of an analyte i and a related analyte j. This makes the metric better suited to the protein biomarkers in our dataset. We detail the derivation of this measure in the Supporting information.
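One plausible reading of Equation 1 can be sketched as follows. The construction of R_i (biomarkers sharing at least one above-threshold case sample with i, per the Table 2 example) and the application of the s2 measure f to the conjunctions i ∧ j and ¬i ∧ j are our interpretation of the description above, not the authors' released code.

```python
import numpy as np

def s2(pred, y):
    """Product of sensitivity and specificity of a binary predictor."""
    pos, neg = (y == 1), (y == 0)
    sens = pred[pos].mean() if pos.any() else 0.0
    spec = (1 - pred[neg]).mean() if neg.any() else 0.0
    return sens * spec

def causal_metric(Xb, y, i):
    """Sketch of Equation 1: average increase in the s2 measure when
    binarized biomarker i is present, over its related biomarkers R_i."""
    # R_i: biomarkers sharing at least one above-threshold case sample with i.
    case_hits_i = Xb[(y == 1) & (Xb[:, i] == 1)]
    related = [j for j in range(Xb.shape[1])
               if j != i and case_hits_i[:, j].any()]
    if not related:
        return float("nan")      # no related biomarkers (as for B2 in Table 2)
    diffs = [s2(Xb[:, i] & Xb[:, j], y) - s2((1 - Xb[:, i]) & Xb[:, j], y)
             for j in related]
    return float(np.mean(diffs))

# Tiny hypothetical example: biomarker 0 separates cases perfectly.
Xb_demo = np.array([[1, 1, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
y_demo = np.array([1, 1, 0, 0])
print(causal_metric(Xb_demo, y_demo, 0))   # -> 1.0
print(causal_metric(Xb_demo, y_demo, 2))   # -> nan (no related biomarkers)
```

Biomarkers are then ranked by this score and the top K retained.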

Table 2 gives an example of the causal metric computed on a sample dataset of 4 biomarkers over 4 instances. The data is binarized with a threshold, and the binarized values are shown in brackets in Table 2. The s2 metric is computed for all biomarkers, and those with a value greater than the average s2 metric are retained, as seen in the 6th row of Table 2. The related biomarkers for each biomarker are those with at least one overlapping case sample in which both biomarker values exceed the threshold. Finally, the causal metric is computed using Equation 1; here, B3 and then B1 would be picked during feature selection. Note that in this example a few biomarkers receive a NaN value, such as biomarker B2, because of a lack of related biomarkers; such cases were rarely observed on the complete gastric cancer dataset. Further details about the causal metric can be found in the Supporting information.

Table 2. Causal metric computation on sample dataset.

https://doi.org/10.1371/journal.pone.0322620.t002

Machine learning classification methods

We examine five machine learning methods: logistic regression (LR), random forest (RF), a deep multi-layer perceptron (MLP), gradient-boosted decision trees (GBT) [13], and XGBoost (XGB) [14]. Logistic regression serves as a baseline, as it was used in previous biomarker studies [7,15]; random forest is included for its ability to provide accurate results with minimal hyperparameter tuning; a deep neural network (DNN) for its state-of-the-art performance in a variety of other tasks; and two variants of boosted trees, which have been shown to provide state-of-the-art performance on tabular data. For the DNN, we employ a dense multi-layer perceptron with 4 layers, Rectified Linear Unit (ReLU) activations, and a softmax output layer, implemented with the PyTorch [16] software package. For the boosted decision trees, we use the Scikit-learn implementation of gradient-boosted trees and the standard implementation of XGBoost. Summaries of these methods, along with hyperparameter settings, can be found in the Supporting information.
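The five classifiers could be instantiated roughly as below. This is an illustrative sketch: sklearn's MLPClassifier stands in for the paper's 4-layer PyTorch MLP, XGBoost is treated as an optional dependency, and every hyperparameter shown here is a generic default, not the tuned setting of Table 3.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# The five classifiers compared in the paper; hyperparameters are
# illustrative defaults, not the tuned settings reported in Table 3.
models = {
    "LR":  LogisticRegression(max_iter=1000),
    "RF":  RandomForestClassifier(n_estimators=100, random_state=0),
    # Stand-in for the paper's 4-layer ReLU PyTorch MLP.
    "MLP": MLPClassifier(hidden_layer_sizes=(64, 64, 64, 64), random_state=0),
    "GBT": GradientBoostingClassifier(random_state=0),
}
try:  # xgboost is a separate package and may not be installed
    from xgboost import XGBClassifier
    models["XGB"] = XGBClassifier(eval_metric="logloss")
except ImportError:
    pass
```

Each model is then trained on the K selected biomarker columns within every LOOCV fold.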

Results

Setup

We conducted experiments using an NVIDIA GTX1080 (2560 CUDA cores, 10 Gbps memory speed). For evaluation, we used leave-one-out cross-validation (LOOCV) and examined the Area Under the Curve (AUC) for both training and test data, as well as sensitivity on the test data with specificity fixed at 0.8 and 0.9. These metrics are standard for assessing diagnostic biomarkers; they give an overall picture of performance across multiple confidence thresholds and of how well the model discriminates between case and control. We evaluate each experiment on these metrics across models and hyperparameter settings. Throughout the discussion, we treat logistic regression with univariate selection as the baseline, as logistic regression was employed in prior work [7,8].
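Sensitivity at a fixed specificity can be read directly off the ROC curve, since specificity equals one minus the false positive rate. A small helper, assuming sklearn's `roc_curve` (the helper name is ours):

```python
import numpy as np
from sklearn.metrics import roc_curve

def sensitivity_at_specificity(y_true, y_score, target_spec):
    """Largest sensitivity (TPR) achievable with specificity >= target."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    ok = fpr <= (1.0 - target_spec) + 1e-12   # specificity = 1 - FPR
    return float(tpr[ok].max()) if ok.any() else 0.0

# Toy example: four samples, two cases, two controls.
print(sensitivity_at_specificity([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8], 0.9))
# -> 0.5
```

The same function evaluated at 0.8 and 0.9 yields the two sensitivity columns reported in Tables 4 and 5.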

We use the xgboost Python package for the XGBoost model and the sklearn package for the other models in our experiments. The sklearn package also provides the leave-one-out cross-validation split. The model hyperparameters in Table 3 are used as the default setting unless otherwise specified.

Table 3. Model hyperparameters used for each model.

https://doi.org/10.1371/journal.pone.0322620.t003

Selection of 3 biomarkers

Overall, the best performance in terms of test AUC was observed for the deep multi-layer perceptron (MLP) classifier with the causal metric for biomarker selection, which outperformed the baseline by 0.114, as shown in Table 4. For sensitivity at a specificity of 0.9, XGB with the causal metric (Fig 2) outperformed the baseline (Fig 3) by 0.240. The error bands in the Receiver Operating Characteristic (ROC) curves are confidence bands. Notably, causal feature selection improved performance irrespective of classifier, providing a minimum improvement of 0.120 (binarized) over univariate feature selection for each classifier at K = 3 (Table 4). Comparable results were noted for sensitivity at a specificity of 0.8, as well as for test AUC.

Fig 2. ROC curve for XGB model with causality measure (3 biomarkers).

https://doi.org/10.1371/journal.pone.0322620.g002

Fig 3. ROC curve for the baseline (3 biomarkers).

https://doi.org/10.1371/journal.pone.0322620.g003

Table 4. Results for 3 biomarkers using 5 models with causal-based and univariate feature selection.

https://doi.org/10.1371/journal.pone.0322620.t004

We note that training AUC was strongest for random forest with univariate selection, with a value of 0.997; however, this drops to 0.558 on testing. This is surprising, as random forest generally does not overfit [17]; it may indicate that univariate feature selection can cause overfitting when used with more complex models, as we observed large discrepancies between training and testing AUCs whenever univariate feature selection was used, in all cases except logistic regression. By contrast, the average train-to-test drop for the causality measure is 0.118 (maximum 0.186), versus an average drop of 0.260 (maximum 0.439) for univariate feature selection, indicating that overfitting appears when the causality measure is ablated.

Selection of 10 biomarkers

For 10 biomarkers, the best-performing model with respect to test AUC was MLP with univariate feature selection, which outperformed MLP with the causality measure by 0.286, as shown in Table 5. Furthermore, GBT with univariate feature selection (Fig 4) reported the highest sensitivity at a specificity of 0.9, namely 0.520, while GBT with the causality measure reported 0.22. The baseline (Fig 5) gave a moderate test AUC of 0.599 but low sensitivity at fixed specificity. The error bands in the ROC curves are confidence bands. With a higher number of biomarkers, univariate feature selection outperforms the causality measure with respect to test AUC for all methods, by a minimum of 0.025 (binarized) and 0.029 (non-binarized).

Fig 4. ROC Curve for GBT model with univariate feature selection (10 Biomarkers).

https://doi.org/10.1371/journal.pone.0322620.g004

Fig 5. ROC curve for the baseline (10 biomarkers).

https://doi.org/10.1371/journal.pone.0322620.g005

Table 5. Results for 10 biomarkers using 5 models with causal-based and univariate feature selection.

https://doi.org/10.1371/journal.pone.0322620.t005

For a higher number of biomarkers, a more generic method like univariate selection seems to suffice. While more historical data might improve the performance of the other approaches, the less data-hungry causal approach already performs well, with consistent sensitivity at specificities of 0.9 and 0.8.

Hyperparameter study

As shown in Tables 4 and 5, some methods binarize the biomarker values before model training, indicated by B; for example, Causal(B) means the causality method with binarized inputs. We discretize all input measurements for a given sample based on a threshold. Tables 4 and 5 report the optimal hyperparameter settings among the chosen values. Notably, there is little variance in AUCs across most thresholds except 1.0, showing the stability of the selected biomarkers, as seen in Fig 6 for XGB with the causal metric for 3 biomarkers. The figure shows ROC curves across the threshold values 0.6, 1.0, 1.4, and 1.8. A similar trend is observed for the other methods, where a threshold of 1.0 typically results in lower test AUCs and the optimal threshold is 1.4. Consistency is also observed in the frequency of biomarker selection. Furthermore, raising K significantly yields diminishing returns, suggesting a saturation point for the number of biomarkers K.

Fig 6. Hyperparameter sensitivity.

ROC curves with multiple thresholds for the XGB model with causal-based biomarker selection.

https://doi.org/10.1371/journal.pone.0322620.g006

For the threshold of 1.4, among the most frequently selected biomarkers related to gastric cancer were: Epstein-Barr virus capsid protein BFRF3, Holliday junction resolvase-like protein HP0334, hypothetical proteins HP1029, HP0386, HP0273, and HP1065, hydrogenase expression/formation protein HP0898, and acyl-CoA thioesterase HP0496 IgA antibodies, as well as the Epstein-Barr nuclear C-terminal Glutathione S-Transferase EBNA cGST IgG antibody. Fig 7 shows the selection frequency of biomarkers for various threshold values. As seen in the figure, for low values of K most biomarkers appear in the large majority of LOOCV folds, supporting the stability of the model. These are the biomarkers consistently picked by the causality measure.
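The selection-frequency analysis above can be sketched as follows: re-run feature selection inside every LOOCV fold, for each threshold, and count how often each biomarker index is chosen. This is a synthetic-data sketch using the univariate (chi-squared) selector as a stand-in for the causal metric; the 90% stability cutoff is illustrative.

```python
import numpy as np
from collections import Counter
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(1)
# Synthetic stand-in data: 30 samples x 20 analytes, median near 1.00.
X = rng.lognormal(size=(30, 20))
y = rng.integers(0, 2, size=30)
K = 10

frequency = {}  # threshold -> Counter of selected biomarker indices
for threshold in (0.6, 1.0, 1.4, 1.8):
    Xb = (X > threshold).astype(int)
    counts = Counter()
    for tr, _ in LeaveOneOut().split(X):
        sel = SelectKBest(chi2, k=K).fit(Xb[tr], y[tr])
        counts.update(np.flatnonzero(sel.get_support()).tolist())
    frequency[threshold] = counts

# Biomarkers chosen in at least 90% of LOOCV folds at threshold 1.4.
stable = sorted(i for i, c in frequency[1.4].items() if c >= 0.9 * len(y))
```

Biomarkers that reappear across nearly all folds and thresholds are the "stable" selections discussed above.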

Fig 7. Hyperparameter sensitivity.

Frequency of selected biomarkers for K = 10 at multiple thresholds.

https://doi.org/10.1371/journal.pone.0322620.g007

Notably, the test AUC increases with K and saturates after K = 10, as seen in Fig 8. However, K had a limited impact on the biologically relevant measure. Initially, increasing K raised the test AUC by roughly 0.2; as we increased K further, the test AUC leveled out around 0.7 while the measure became sparser. Adding more biomarkers yields diminishing returns. This relation is relevant to the target application of inexpensive diagnostic kits.

Fig 8. Hyperparameter sensitivity.

Effect of K for threshold 1.4 for GBT model with univariate selection.

https://doi.org/10.1371/journal.pone.0322620.g008

Discussion and limitations

We use biomarker measures for cancer prediction and leverage the causality measure to select causal biomarkers. Table 4 shows the effect of ablating the causality measure in favor of univariate feature selection: the causal method yields higher AUC and consistent sensitivity values as the number of biomarkers decreases, and these benefits disappear otherwise. This is beneficial in practice, since methods that perform well with fewer biomarkers lead to cheaper and less computationally demanding diagnostics. The approach may also apply to the prediction of other diseases in similar domains. Additionally, the experiments with the causal metric could be extended with a combinatorial way of picking the ranked causal biomarkers.

For a secondary analysis, we recorded the increase in the probability of cases given the presence of the causal-based biomarkers as a function of K, as seen in Fig 9 for a threshold of 1.4. The error plots show confidence bands. Similar to test AUC as a function of K, the increase in probability grows until K = 10, then drops and saturates as K increases further. This also supports our finding that the causal-based method benefits settings with fewer biomarkers: for a small number of biomarkers, the causal-based method yielded an increase in the probability of cases, while beyond K = 10 the probability of cases given the presence of those biomarkers decreased. Note that the biomarkers considered here are the most frequent with the highest causal scores across all LOOCV folds. Future work of interest includes empirical experimentation to validate the ML models, as well as targeted biological testing of the selected causal biomarkers.

Fig 9. Increase in probability of cases (cancer) with the presence of causal-based biomarkers as a function of K.

https://doi.org/10.1371/journal.pone.0322620.g009

Our method for selecting a small number of biomarkers had the best performance; however, a limitation is that it did not perform as well as other baselines when over 3 times more biomarkers were allowed. We saw evidence of overfitting, with a drastic drop from train to test AUC, for most approaches using univariate selection in the 3-biomarker setting, and for MLP in the 10-biomarker setting as well. Given the dataset size, approaches like MLP are known to overfit; however, we did not observe this for the causal-based method with a low number of biomarkers.

Experiments with binarized inputs were conducted to obtain better causal explanations, since binarization localizes the sources of causality by making clear which biomarkers contribute to selection. For a given threshold, we can precisely extract the biomarkers that are causal among all potential biomarkers. However, on average, binarization performs about the same as the non-binarized experiments with respect to test AUC. Moreover, Fig 9 illustrates that we did obtain biomarkers that are causal.

Related work

Machine learning models, such as logistic regression, have been utilized with biological data for association purposes. In Islam et al. [8], the correlation coefficients of three biomarkers, body temperature, heart rate, and probable blood glucose level, were evaluated and associated with malaria detection using logistic regression. Similarly, in Direkvand-Moghadam et al. [7], univariate logistic regression demonstrated a substantial association between female sexual dysfunction and biomarkers such as age, gravidity, and menarche age. Additionally, in Bursac et al. [6], the application of feature selection prior to model training showed the potential to retain confounding variables, especially when dealing with large biological datasets. Note that none of this prior work analyzes various machine learning classifiers, such as gradient-boosted trees or neural networks, with causal-based and univariate feature selection methods.

More specifically, machine learning models paired with feature selection have proved significantly beneficial for disease detection. Various hybrid optimization methods have been used in combination with ML models such as decision trees, logistic regression, and random forests [18–20] and evaluated on cancer data using sensitivity, specificity, and ROC curves, akin to our evaluation; however, they were applied to datasets with at most 30 features, while our dataset has 100 times more analyte measures. Typically, for model-disentangled feature selection before classification, univariate feature selection and dimensionality reduction methods are used on high-dimensional datasets [21–23]. Gradient-boosted decision trees and logistic regression have also been used with multivariate analysis and other feature importance methods on gastric cancer data with a limited number of characteristics [24,25]; we use univariate analysis to be computationally less intensive over a large number of analytes. Note that recursive feature elimination is computationally expensive as data size grows, while model-entangled feature importance (GBM importance, Lasso) would not be a fair comparison to the causal-based method. Considering our dataset size, and to give a fair comparison while retaining the original feature measures, we set univariate selection as the baseline. In Sorino et al. [26], numerous machine learning techniques similar to ours, such as random forest and boosted tree classifiers with cross-validation, were used to diagnose non-alcoholic fatty liver disease. Similarly, in Díaz Álvarez et al. [27], chi-squared-based feature selection was paired with a Naive Bayes classifier to aid the diagnosis and classification of neurodegenerative disorders.
Moreover, vision-based machine learning techniques such as convolutional neural networks have been applied to a wide variety of medical diagnostic use cases [28–32], including gastric cancer image data [33]. Some techniques used k-fold cross-validation with the chi-squared test combined with other hybrid nature-inspired feature selection methods and found XGBoost effective on chest CT scan images [34], although the authors concentrate on radiological image features and converge on a minimum of 90 features. Such image-based diagnosis would be complementary to biomarker-based diagnosis. However, to our knowledge, the application of such techniques to biomarkers, specifically a large number of proteins, for the purposes of medical diagnosis has not been studied in the literature. We also allow a parameter K to limit the number of biomarkers and find that our approach performs well with as few as 3 biomarkers.

Causal-based methods, such as the one underlying our approach, have been used in a variety of medical applications [11]. For example, in Richens et al. [35], the application of causal machine learning effectively increased clinical diagnostic accuracy relative to physicians. However, to date, such methods have not been combined with recent advances in biomarker experimentation [36] for medical diagnosis based on biomarker measurements. We focus on the narrow area of leveraging biomarkers for cancer prediction.

Conclusion

In this paper, we pair a causality measure for biomarker selection with ML-based classifiers on a gastric cancer dataset for disease detection. We pre-select biomarkers to make the approach more practical, reduce overfitting, and understand the causal effect of the selected set. With respect to test AUC and sensitivity at fixed specificities of 0.8 and 0.9, the XGB model with the causality measure outperformed the baseline for 3 biomarkers, with an AUC improvement of 0.114. We found that approaches with the causal metric performed better with a smaller number of biomarkers, while conventional techniques like univariate feature selection performed better with a larger number. Because the causality measure compares co-occurring biomarkers, it could provide biological intuition enabling further empirical studies. We believe this approach may generalize to the prediction of other diseases from biomarker measurements.

Supporting information

S2 Table. Test AUC of various methods for feature selection.

https://doi.org/10.1371/journal.pone.0322620.s002

S3 Table. Univariate selection with min-max scaling.

https://doi.org/10.1371/journal.pone.0322620.s003

S4 Table. Most frequent top 3 biomarkers for univariate selection and causal measure.

https://doi.org/10.1371/journal.pone.0322620.s004

S5 Table. Most frequent top 10 biomarkers for univariate selection and causal measure.

https://doi.org/10.1371/journal.pone.0322620.s005

S1 Fig. Confusion matrix for XGB model for 3 biomarkers at threshold 1.4 with the causal metric.

https://doi.org/10.1371/journal.pone.0322620.s006

S2 Fig. Confusion matrix for GBT model for 10 biomarkers with the univariate method.

https://doi.org/10.1371/journal.pone.0322620.s007

S3 Fig. ROC Curve with 10% confidence interval for XGB model with causality measure (3 Biomarkers).

https://doi.org/10.1371/journal.pone.0322620.s008

S4 Fig. ROC Curve with 10% confidence interval for LR model with univariate selection (3 Biomarkers).

https://doi.org/10.1371/journal.pone.0322620.s009

S5 Fig. ROC Curve with 10% confidence interval for GBT model with univariate selection (10 Biomarkers).

https://doi.org/10.1371/journal.pone.0322620.s010

S6 Fig. ROC Curve with 10% confidence interval for LR model with univariate selection (10 Biomarkers).

https://doi.org/10.1371/journal.pone.0322620.s011

S7 Fig. Increase in probability of cases (cancer) with the presence of causal-based biomarkers as a function of K with 10% confidence interval.

https://doi.org/10.1371/journal.pone.0322620.s012

References

  1. Rosado M, Silva R, G Bexiga M, G Jones J, Manadas B, Anjo SI. Advances in biomarker detection: alternative approaches for blood-based biomarker detection. Adv Clin Chem. 2019;92:141–99. pmid:31472753
  2. Topkaya SN, Azimzadeh M, Ozsoz M. Electrochemical biosensors for cancer biomarkers detection: recent advances and challenges. Electroanalysis. 2016;28(7):1402–19.
  3. Blennow K, Zetterberg H. Biomarkers for Alzheimer’s disease: current status and prospects for the future. J Intern Med. 2018;284(6):643–63. pmid:30051512
  4. Ahn JC, Teng P-C, Chen P-J, Posadas E, Tseng H-R, Lu SC, et al. Detection of circulating tumor cells and their implications as a biomarker for diagnosis, prognostication, and therapeutic monitoring in hepatocellular carcinoma. Hepatology. 2021;73(1):422–36.
  5. Lin L-L, Huang H-C, Juan H-F. Discovery of biomarkers for gastric cancer: a proteomics approach. J Proteomics. 2012;75(11):3081–97. pmid:22498886
  6. Bursac Z, Gauss CH, Williams DK, Hosmer DW. Purposeful selection of variables in logistic regression. Source Code Biol Med. 2008;3:17. pmid:19087314
  7. Direkvand-Moghadam A, Suhrabi Z, Akbari M, Direkvand-Moghadam A. Prevalence and predictive factors of sexual dysfunction in Iranian women: univariate and multivariate logistic regression analyses. Korean J Fam Med. 2016;37(5):293–8. pmid:27688863
  8. Islam M, Islam R. Exploring the impact of univariate feature selection method on machine learning algorithms for heart disease prediction. In: 2023 International Conference on Next-Generation Computing, IoT and Machine Learning (NCIM), Gazipur, Bangladesh. 2023, pp. 1–5. https://doi.org/10.1109/NCIM59001.2023.10212832
  9. Kleinberg S, Hripcsak G. A review of causal inference for biomedical informatics. J Biomed Inform. 2011;44(6):1102–12. pmid:21782035
  10. Song L, Song M, Rabkin CS, Williams S, Chung Y, Van Duine J, et al. Helicobacter pylori immunoproteomic profiles in gastric cancer. J Proteome Res. 2021;20(1):409–19. pmid:33108201
  11. Kleinberg S, Mishra B. The temporal logic of causal structures. arXiv preprint, 2012.
  12. Gardner M, Dorling S. Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences. Atmosph Environ. 1998;32(14–15):2627–36.
  13. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12(1):2825–30.
  14. Chen T, He T, Benesty M, Khotilovich V, Tang Y, Cho H, et al. XGBoost: extreme gradient boosting. R package version 0.4-2; 2015, pp. 1–4.
  15. Ravi A, Gopal V, Preetha Roselyn J, Devaraj D, Chandran P, Sai Madhura R. Detection of infectious disease using non-invasive logistic regression technique. In: 2019 IEEE International Conference on Intelligent Techniques in Control, Optimization and Signal Processing (INCOS). IEEE; 2019, pp. 1–5. https://doi.org/10.1109/incos45849.2019.8951392
  16. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2019, pp. 8024–35.
  17. Breiman L. Random forests. Mach Learn. 2001;45:5–32.
  18. Khanna M, Singh LK, Shrivastava K, Singh R. An enhanced and efficient approach for feature selection for chronic human disease prediction: a breast cancer study. Heliyon. 2024;10(5):e26799. pmid:38463826
  19. 19. Singh LK, Khanna M, Singh R. An enhanced soft-computing based strategy for efficient feature selection for timely breast cancer prediction: Wisconsin Diagnostic Breast Cancer dataset case. Multimed Tools Appl. 2024;83(31):76607–72.
  20. 20. MunishKhanna, Singh LK, Garg H. A novel approach for human diseases prediction using nature inspired computing & machine learning approach. Multimed Tools Appl. 2023;83(6):17773–809.
  21. 21. Abellana DPM, Lao DM. A new univariate feature selection algorithm based on the best–worst multi-attribute decision-making method. Decis Anal J. 2023;7:100240.
  22. 22. McInnes L, Healy J, Melville J. UMAP: uniform manifold approximation and projection for dimension reduction. 2020. https://arxiv.org/abs/1802.03426
  23. 23. Dev S, Wang H, Nwosu C, Jain N, Veeravalli B, John D. A predictive analytics approach for stroke prediction using machine learning and neural networks. arXiv, preprint, 2022.
  24. 24. Zhu S-L, Dong J, Zhang C, Huang Y-B, Pan W. Application of machine learning in the diagnosis of gastric cancer based on noninvasive characteristics. PLoS One. 2020;15(12):e0244869. pmid:33382829
  25. 25. Du H, Yang Q, Ge A, Zhao C, Ma Y, Wang S. Explainable machine learning models for early gastric cancer diagnosis. Sci Rep. 2024;14(1):17457. pmid:39075116
  26. 26. Sorino P, Caruso M, Misciagna G, Bonfiglio C, Campanella A, Mirizzi A, et al. Selecting the best machine learning algorithm to support the diagnosis of non-alcoholic fatty liver disease: A meta learner study. PLoS One. 2020;15(10):e0240867. pmid:33079971
  27. 27. Álvarez JD, Matias-Guiu JA, Cabrera-Martín MN, Risco-Martín JL, Ayala JL. An application of machine learning with feature selection to improve diagnosis and classification of neurodegenerative disorders. BMC Bioinformatics. 2019;20(1):491. pmid:31601182
  28. 28. Yadav SS, Jadhav SM. Deep convolutional neural network based medical image classification for disease diagnosis. J Big Data. 2019;6(1).
  29. 29. Shaban M, Ogur Z, Mahmoud A, Switala A, Shalaby A, Abu Khalifeh H, et al. A convolutional neural network for the screening and staging of diabetic retinopathy. PLoS One. 2020;15(6):e0233514. pmid:32569310
  30. 30. Heenaye-Mamode Khan M, Boodoo-Jahangeer N, Dullull W, Nathire S, Gao X, Sinha GR, et al. Multi- class classification of breast cancer abnormalities using Deep Convolutional Neural Network (CNN). PLoS One. 2021;16(8):e0256500. pmid:34437623
  31. 31. Lopez-Garnier S, Sheen P, Zimic M. Automatic diagnostics of tuberculosis using convolutional neural networks analysis of MODS digital images. PLoS One. 2019;14(2):e0212094. pmid:30811445
  32. 32. Kundu R, Das R, Geem ZW, Han G-T, Sarkar R. Pneumonia detection in chest X-ray images using an ensemble of deep learning models. PLoS One. 2021;16(9):e0256630. pmid:34492046
  33. 33. Lee J, Lee H, Chung J-W. The role of artificial intelligence in gastric cancer: surgical and therapeutic perspectives: a comprehensive review. J Gastric Cancer. 2023;23(3):375–87. pmid:37553126
  34. 34. Singh L, Khanna M, Monga H, Singh D, Pandey G. Nature-inspired algorithms-based optimal features selection strategy for COVID-19 detection using medical images. New Gener Comput. 2024.
  35. 35. Richens JG, Lee CM, Johri S. Improving the accuracy of medical diagnosis with causal machine learning. Nat Commun. 2020;11(1):3923. pmid:32782264
  36. 36. Kleinbaum D, Dietz K, Gail M, Klein M, Klein M. Logistic regression. New York: Springer; 2002.