The efficacy of machine learning models in lung cancer risk prediction with explainability

Among many types of cancers, to date, lung cancer remains one of the deadliest cancers around the world. Many researchers, scientists, doctors, and people from other fields continuously contribute to this subject regarding early prediction and diagnosis. One of the significant problems in prediction is the black-box nature of machine learning models. Though the detection rate is comparatively satisfactory, people have yet to learn how a model came to that decision, causing trust issues among patients and healthcare workers. This work uses multiple machine learning models on a numerical dataset of lung cancer-relevant parameters and compares performance and accuracy. After comparison, each model has been explained using different methods. The main contribution of this research is to give logical explanations of why the model reached a particular decision to achieve trust. This research has also been compared with a previous study that worked with a similar dataset and took expert opinions regarding their proposed model. We also showed that our research achieved better results than their proposed model and specialist opinion using hyperparameter tuning, having an improved accuracy of almost 100% in all four models.


Introduction
Lung cancer is one of the most commonly diagnosed cancers worldwide to date.It's a severe disease that affects individuals from all aspects of life regardless of their lifestyle and surroundings environment.This cancer lies within the intricate network of the human respiratory system and arises as a silent menace, often remains unnoticed until it reaches an advanced stage [1].It can manifest in various forms, from the insidious creep of small cell lung cancer to more common non-small cell lung cancer, presenting its own challenges in diagnosing and treatment.The impact is profound, not just physically but emotionally and socially as well.One's journey through lung cancer is often fearful and uncertain, having to make some tough decisions.Starting from a strategic decision about the risk stratification of the screening program to decide eligibility [2,3].
In the United States, lung cancer is the second most common cancer in both men and women.The American Cancer Society estimates that there will be 234,580 new cases of lung cancer, including 116,310 in men and 118,270 in women, and about 125,070 deaths due to this in the year 2024 [4].They also reported that the proportion of developing lung cancer in men is 1:16, whereas in women, it is 1:17, including both smokers and non-smokers, but for smokers' people, the chances are much higher.Having a five-year relative duration to gather survival rate, the Surveillance, Epidemiology, and End Results (SEER) database shows 28% relative survival probabilities for non-small cell lung cancer and 7% survival probabilities for small cell lung cancer [5].Data collected by Cancer Research UK (2016-2018) shows that lung cancer is the third most common cancer in the UK, with around 48500 new cases in the UK every year, including 23,300 cases in women and 25,300 cases in men and only 10% survival rate is remoted in 2013-2017 [6].In Asian countries like China, a total of 19.2% of cancer-related deaths were reported due to lung cancer in 2017 [7]; India had 8.1% of cancer-related deaths [8]; Malaysia reports that lung cancer has a 10% contribution to all malignancies [7].Analysing these situations, the seriousness of the prevention and early diagnosis can easily be understood.
Lung cancer has many risk factors; smoking is number one, as reported by the Centers for Disease Control and Prevention (CDC) [8].According to the CDC, smoking is linked to 80-90% of lung cancer deaths in the USA.Secondhand smoking, which means people who don't smoke themselves but are exposed to smoking, are also highly prone to lung cancer.As reported by the CDC, around 14 million children were affected from 2013 to 2014.The second leading cause of lung cancer is Radon, which is a gas that forms in rocks, soil, and water and can enter buildings through cracks and cause lung cancer if breathed for a long time.Other factors include radiation therapy, abnormal diet, and family genetic history.Risk prediction is more important in lung cancer screening than clinical assessment, as demonstrated by many trials, including the Back model [9], the Lung Cancer Risk Assessment Tool (LCRAT) [10], the Lung Cancer Death Risk Assessment Tool (LCDRAT) [10], the Liverpool Lung Project (LLP) model [11,12], and the PLCOm2012 model [13,14].
This study addresses the relationship between lung cancer factors and early symptoms.Four machine learning models are tuned to detect the low, medium, or high lung cancer risk level.Most of the studies only do detection and provide no explanation due to the black-box nature of machine learning.This study overcame that limitation by explaining each model's interpretability using different explanation methods such as decision boundaries, Local Interpretable Model-agnostic Explanations (LIME), and tree extraction.The primary motivation of this study is to explain model results to non-technical people or patients so they can trust the process more.Significant contributions are mentioned in below points: 1. Exploring dataset to figure out relations between different features.
2. Tuning four machine learning algorithms to outperform previous best results.
3. Explaining the model behaviour and reasoning through explainable AI methods.
The rest of the paper is divided into multiple sections, including a literature review in section 3, materials and methods in section 4, result analysis in section 5, discussion in section 6, limitations in section 7 and conclusion in section 8.

Literature review
Though the development in medical science has advanced so far, cancer remains a highly critical and significant concern throughout medical aid, oncology, health care professionals, and AI-based medical science researchers.Diagnosis of lung cancer mainly relies on manual pathology screening, which is highly prone to error due to the human nature of manual film reading.A good number of algorithms and methods were developed using machine learning (ML) and deep learning (DL) to identify cancer from numerical datasets or image-based datasets [15].Early detection is still crucial for improving survival rates among patients.
Various ML and DL-based techniques have been applied to identify many kinds of cancer diagnosis, prognosis, and risk factors [16][17][18][19].Specifically, AI has been seen to be applied to lung cancer risk assessment, utilising diverse data sources such as medical imaging, genetic markers, clinical records, and environmental factors [20].ML models incorporating clinical data, such as patient demographics, smoking history, and symptoms, have demonstrated efficacy in predicting lung cancer risk [21].Among different models, the Support Vector Machine (SVM) showed the highest detection accuracy, but no reasoning has been provided, lacking model explainability.Four ML models, SVM, Naïve Bayes (NB), Decision Tree (DT), and Logistic Regression (LR), were applied to predict lung cancer from two datasets (collected from UCI and Data World), achieving the highest accuracy of 96.9% with LR on UCI and 99.2% with SVM on Data World [22].In both cases, no model explainability has been provided.Using five different data mining techniques: SVM, K-Nearest Neighbors (KNN), NB, DT, and Artificial Neural Network (ANN), lung cancer prediction has been done with three case scenarios [23].The best accuracy achieved was 93% using the ANN algorithm and SMOTE Upsampling technique on an unbalanced Kaggle dataset [24].Biomarkers are also used to identify early lung cancer by analysing the combination of metabolism factors with ML methods.Among the used models, Neural Network (NN) and NB achieved 100% classification accuracy [25].Among these machine learning models, none was found that explained the internal reasoning that resulted in the accuracy, causing the trustfulness of the application.A customised Lung Cancer Prediction Tool (LCPT) has been developed to predict lung cancer using the risk factor and compared with expert opinion to verify the result [26].They have shown an accuracy of 93.33% using LCPT, more significant than the specialist opinion of 86.66%.A Random Forest (RF) model was also used to generate ten random trees to compare the results with LCPT.They explained how the factors resulted in the decision using Degree of Importance (DOI).
Image processing of computed tomography (CT) screening has widely been used to diagnose lung cancer using different computer vision techniques [27].Previously, 2D images of CT were hugely popular for classification and segmentation.As computational power has increased, people are now exploring 3D images and have achieved excellent results.Researchers are using DL methods to identify high-risk smokers suitable for lung cancer screening CT using chest radiographs.Their model's performance was validated, showing promise in improving the selection process of lung cancer screening [28].A customised deep CNN has been proposed to classify interstitial lung diseases (ILDs) from CT image patches, achieving around 85.61% accuracy, higher than VGG-Net performance, which gained 78% [29].3D-VNet and 3D-ResNet architecture have been developed to train 3D CT image slices for segmentation and classification problems [30].They have achieved a 99.3% Dice Similarity Coefficient (DSC) for segmentation and 99.2% accuracy for classification on the LUNA16 dataset.Though their accuracy and segmentation results are promising, model explainability is very difficult to show due to the complexity of CNN architecture.Another study developed a cascade 3D-UNet to detect lung cancer bone metastases (LCBM) from CT images.Compared with five radiologists, their model outperformed in detecting LCBM, with higher AUROC (0.875 vs. 0.819) and sensitivity (0.894 vs. 0.892) in an observer-independent study [31].Combining with transformer and U-Net, an architecture has been developed named UNETR and used to segment 3D images of lung cancer using the Decathlon dataset [32], achieving an accuracy of 97.83% with DSC of 96.42% and a classification result of 98.77% [32].They have shown the different performances regarding different hyperparameters like optimiser, number of epochs, and activation functions.Still, the CNN explainability remains unnoticed.An ensemble multi-view 3D CNN model has been designed for risk stratification of invasive lung adenocarcinoma using thin slice CT scan images, with an AUC of 91.3% for benign/malignant diagnosis and 92.9% for pre-invasive/invasive nodule classification [33].It also outperforms senior doctors in risk assessment, having 77.6% accuracy, but lacks information on why it can outperform doctors.
Explainable AI, or XAI, is gaining significant attention because of its ability to understand the reasoning behind the model's prediction, classification, or segmentation.Various works are seen to explain the AI model to understand the importance of features; for example, in chronic wound images, LIME has been applied to understand ROI [34].Another custom XAI diagnostic model has been proposed to interpret the model using TabNet with causal graphs on mammography reports of breast cancer [35].Another ensemble learning framework with XAI has been developed by ref. [36] to determine breast cancer with explanations.The SHAP model has been used to explain lung cancer reasoning from biomarker values identified from CT scan reports by ref. [37], where they developed an AI CAD model using multiple ML methods.Many other works have used the XAI method to explain model behaviour on lung cancer detection on both ML and DL methods [38][39][40][41].
Regarding the risk factors, almost all image-based detection is the diagnosis of lung cancer, and most risk factors are considered in a numerical form of data with different vital factors and symptoms.This research is based on exploring the relationship between these factors and lung cancer chances using four popular machine learning models named SVM, KNN, DT, and RF.It also explains why the model behaved as it did.This work will improve a previous work [26] that addresses issues on the same dataset and proposes a custom LCPT model.In this work, we have also shown that higher classification accuracy can be achieved using parameter tuning.

Dataset description
The dataset is taken from Data World [42].There are 22 features: Age, Gender, Air Pollution, Alcohol use, Dust Allergy, Occupational Hazards, Genetic Risk, Chronic Lung Disease, Balanced Diet, Obesity, Smoking, Passive Smoker, Chest Pain, Coughing of Blood, Fatigue, Weight Loss, Shortness of Breath, Wheezing, Swallowing Difficulty, Clubbing of Fingernails, Frequent Cold, Dry Cough, Snoring.The details of the dataset have been shown in Table 1.The association of features with risk-level classes is significant.It has been seen that people aged around 35-40 are more likely to have lung cancer.Also, high air pollution levels cause a higher risk, along with alcohol use.People with less alcohol use are seen to have lower chances of getting lung cancer.A similar trend has been seen among other features; having high points generally leads to higher positive risk, except shortness of breath, wheezing, clubbing fingernails, and snoring.The exceptions are seen to have a range of values having higher risks.A total feature distribution histogram has been shown in Fig 1.
The dendrogram represents a hierarchical clustering of various features based on their similarity or dissimilarity.The y-axis measures the distance between clusters, with lower values indicating greater similarity and likeliness of occurring together.Each feature on the x-axis corresponds to a symptom or factor of lung cancer.Feature distance has been calculated using a dendrogram graph and is shown in Fig 2 .As we move up the hierarchy, clusters form by joining related symptoms and factors.For example, Fatigue and Snoring suggest that they are likely to occur together.Broader clusters emerge as we ascend the dendrogram, where Age and Gender show that they are unlikely to be related to each other.The red cluster indicates a close relation with risk level and other highly connected features.
Another correlation matrix has been generated to show the relationship between features in a better way in Fig 3 .Light color represents a higher correlation, and dark color means a lower correlation.It can be noticed that Age and Gender have the lowest correlation with Levels of 0.079 and -0.16, and Obesity and Coughing of blood have higher correlations of 0.82 and 0.77.Among features, a high (0.82-0.88) correlation value is seen among Occupational hazards, Genetic risk, Alcohol use, and Dust Allergy.Some more high correlations between 0.79-0.82are seen among Chest pain and Occupational hazard, genetic risk, lung disease, and balanced diet.The lowest correlation value is seen among smoking and weight loss as -0.27.

Model training and validation
Four widely known ML models (SVM, KNN, DT, RF) are selected to train and validate the data to predict the risk factors.The dataset has been split into train and test, with 70% for training and 30% for testing, which included 219 low, 235 medium, and 246 high-risk level datapoint in training and 84 low, 97 medium, and 119 high-risk level datapoint in test portions (details shown in Table 2).The model training speed is fast, and it took around 1-2 seconds to  train all four models.From the previous dataset analysis section, looking into the correlation matrix, it can be understood that Age and Gender might not have a major contribution to the risk factor.The Random Forest classifier is used to calculate the feature importance, having n_estimator as 20.It also shows that Gender has 0 feature importance and Age is the second lowest with a 0.0022 score.Hence, these two features have been deducted from the training dataset.The total feature importance is shown in Table 3. Parameter tuning has been applied to select the best parameters among given dictionaries using the Grid Search algorithm with cv = 5 and n_jobs = 5.The Fbeta scorer was selected with beta = 2, and the micro average was considered to create the scorer.For SVM, four parameters were given with suitable ranges as C = [0.1,1,10];Kernel = [rbf, poly, sigmoid, Though parameter tuning has also been tested on DT and RF, as they are already achieving 100% accuracy, after tuning, parameters are just the first value of the dictionary.Improved parameters after grid search algorithm that are applied: {'C': 0.1, 'class_weight': 'balanced', 'decision_function_shape': 'ovo', 'kernel': 'linear'} for SVM and {'algorithm': 'auto', 'leaf_size': 10, 'n_neighbors': 3, 'p': 1, 'weights': 'uniform'} for KNN.All four models have been trained again with selected parameters, and this time, SVM, KNN, and DT showed improved K-fold test accuracy.

Model explainability
Machine Learning model explainability refers to the understanding and interpreting how a model arrives at its predictions, in this case, risk levels.It is about demystifying the blackbox nature of complex scenarios and making their internal functionality more transparent and understandable to humans.Understanding why a model makes a specific prediction or classification can help build trust in its reliability and fairness and provide valuable insights for improving the model's performance or addressing biases.There are various techniques for enhancing model explainability, ranging from simple methods like feature importance analysis to more sophisticated approaches such as generating human readable explanations for individual predictions using popular algorithms like LIME or Shapley Additive Explanations (SHAP).

Support vector machine (SVM).
A decision boundary has been plotted and analyzed to explain the outcome of the SVM model and shown in Fig 4 .However, a decision boundary is a 2D plot, where the dataset has multiple features; hence, a principal component analysis (PCA) was done to reduce the dimensionality of the data.After doing 2D PCA, minimum and maximum values are extracted to create a mesh grid.SVM has been trained to get the Z-axis values using that grid data.A decision boundary has been drawn using the mesh grid and Z-axis values, and a scatter plot is used to plot the training and testing data points.

K-nearest neighbors (KNN).
The LIME method has been used to explain the KNN model.LIME begins by generating a dataset of perturbed instances around the instance of interest.For presenting a prediction made by a KNN model, LIME would generate new data points by perturbing the features of the instance being explained.The KNN model is then used to predict the output for these perturbed instances.Since KNN is a lazy learner, it directly uses the nearest neighbors from the training data to make predictions without explicitly building a model.Next, LIME fits an interpretable model to the generated dataset using the perturbed  Decision tree.A decision tree is a popular supervised learning algorithm for classification and regression tasks.It works by recursively partitioning the input space into smaller regions and assigning a label or value to each area.This process creates a tree-like structure where each internal node represents a decision based on a feature, each branch represents a possible outcome of that decision, and each leaf node represents the final prediction or value.The topmost node in the tree is the root node, representing the entire input space.Nodes that represent decisions based on features are called internal nodes.Each internal node splits the data into two or more subsets based on a feature value.The edges connecting nodes are called branches, representing the possible outcomes of decisions.Nodes at the end of the branches that do not split further are terminal nodes.They represent the final predictions or values.At each internal node, the decision tree algorithm selects the best feature and the corresponding threshold to split the data into subsets.The goal is to maximize the homogeneity or purity of the subsets regarding the target variable.After selecting the splitting criteria, the algorithm recursively applies the splitting process to each subgroup, creating a binary tree structure.The recursive partitioning process continues until a stopping criterion is met.Standard stopping criteria include reaching a maximum tree depth, achieving a minimum number of samples per leaf node, or no further improvement in homogeneity.Once the tree is constructed, it can be used to make predictions for new instances by traversing the tree from the root node to a leaf node based on the values of the input features.In our case, Frequent Cold is the root node that links to Air Pollution and Obesity for direct high-risk prediction.A complete tree explanation has been shown in Fig 6.
Random forest.Plotting a decision tree from a random forest ensemble can provide insights into how the individual decision trees within the random forest make predictions collectively.While each decision tree in a random forest is trained independently, understanding the structure of a single decision tree can help in understanding the overall behavior of the random forest algorithm.However, it's important to note that plotting a single decision tree from a random forest does not fully represent the complexity and diversity of the entire ensemble.Each decision tree in a random forest is trained on a bootstrapped subset of the original dataset and may use a random subset of features at each split.Plotting an individual decision tree can help understand the specific features and decision criteria used by that tree to make predictions.By analyzing the splits and decisions made by the particular tree, one can infer the importance of different features in making predictions.Features that appear higher in the tree and are used for multiple splits are likely more important in decision-making.While a single decision tree provides insights into the decision-making process, the strength of random forests lies in aggregating predictions from multiple trees.Plotting multiple decision trees from a random forest and analyzing their commonalities and differences can help understand how the ensemble combines diverse predictions to improve overall accuracy and generalization.Understanding the decision-making process of individual trees within a random forest can enhance the interpretability of the model, providing insights into why specific predictions are made and how different features contribute to those predictions.The first decision tree from the random forest model is shown in Fig 7.

Result analysis
Initially, all four models are trained with default parameters, having five k-fold cross-validation.SVM has achieved a cross-validation accuracy of 95% (+/-0.02),total accuracy of 96.33%, KNN achieved a cross-validation accuracy of 92% (+/-0.07),and total accuracy of 99.66%.On the other hand, the decision tree achieved 99% (+/-0.03)cross-validation accuracy, and the random forest achieved 100% accuracy in cross-validation and total accuracy.Detailed results, including precision, recall, and f1-score are shown in Table 4 for all four models.The confusion matrix of the test result is shown in Fig 8.
Though the classification accuracy with tree algorithms is highest at 100%, SVM seems to miss some values.For which tuned parameters are seen as an improvement in accuracy, achieving almost 100% test accuracy even with five k-folding tests.Similar improvements have been seen for KNN, which has come to 99% test accuracy from the previous 92%, improving almost 7%.The decision tree also improved on the 1 st K-fold test, increasing to 100% from 96%, but decreased 1% on the 3 rd K-fold test, but overall remains at 99% test accuracy.The random forest model remains unchanged on all K-folding.A detailed comparison of results is shown in Table 5.The learning curve has also been analyzed before and after parameter tuning.Before parameter tuning, SVM shows a slow learning curve in training and cross-validation.Similar but lower validation scores can also be seen with KNN.Comparatively, DT and RF show better training and cross-validation scores from the beginning.Learning curves for four models before parameter tuning have been shown in Fig 9.
After parameter tuning, much better learning curves are seen for SVM and KNN.Previously, reaching the peak accuracy took SVM around 400 training sets, but now it is achieved within 300 training sets.Similarly, KNN's previous peak performance was at 500 training sets, but now it is achieved within 400 training sets.Little performance disruption is seen for DT, but it can ultimately maintain its peak accuracy.No noticeable changes were seen for RF, as it has a stable learning curve and accuracy throughout the K-folding and fixed parameter tests.Details of learning curves after parameter tuning for four models are shown in Fig 10.

Discussion
Machine learning has always been one of the most popular techniques for analyzing numeric data.Specifically, it has been widely used in medical science to help healthcare professionals with prognosis, diagnosis, different factor analysis, and so on.In this paper, four ML models are used to predict lung cancer risk levels, as well as analyze the explainability of these models' outcomes.Previous studies that worked with similar datasets used some standard models but achieved lower performance and provided no logical explanation [22,26].Ahmad A. et al. even collected expert professional comments to compare with the ML predicted result, which shows the superiority of ML prediction.Even with CT scans, the CNN method shows better accuracy than senior doctors in the case of risk assessment [33].It refers to the importance of explainability in answering why these models can perform better than professionals.
This paper worked with a lung cancer risk assessment dataset, which has been trained with four machine learning models, achieving higher accuracy than previous work [26].Hyperparameter tuning has been done using the Grid Search algorithm to get the best parameters among given ranges for all four models.After training with those parameters, improvements were seen in the SVM, KNN, and DT models.Random Forest has remained the top scorer since the beginning due to its capability of generating multiple trees, each having split features based on feature weights.SVM, KNN, and DT are some of the most popular models for training numeric datasets, and some other Kaggle datasets were seen to be trained by using these models.One recent work achieved accuracy around 95.4% for SVM, 93.7% for DT, and 95.2% for KNN [21], another work [22] achieved 99.2% with SVM and 90% with DT on the same dataset.
In both cases, no hyperparameter tuning has been done, which might be able to increase their accuracy, and no model explanations were mentioned.On a custom-collected dataset, SVM, RF, and KNN were applied to train the data, and they gained 94.7%, 89.5%, and 89.5% model accuracy, respectively.A comparison of these results is shown in Table 6.This paper also explained the models' performance reasons using different explainability methods.A decision boundary has been drawn for both training and testing data for SVM.From the decision boundary, it can be understood that the data points are well-categorized and differentiable from one class to another.The LIME method has been used to extract the reasoning behind KNN's classification.For example, one "High" risk level category is selected as "High" due to having higher Fatigue, specifically having a value of more than 5, Passive Smoking value of more than 5, Obesity between 4 to 7, Alcohol use value of more than 7, and so on.Similarly, the reasons for other levels can also be determined.For the Decision Tree and Random Forest classifier, a single tree has been shown to explain why the model came to the particular decision of given feature values.

Limitations of the study
While machine learning (ML) models hold promise for lung cancer risk prediction, there are limitations to their efficacy, especially regarding explainability: Data Dependence: ML models are only as good as the data they are trained on.Biases in the data, such as underrepresentation of certain demographics, can lead to inaccurate predictions for those groups.
Focus on Established Risk Factors: Current models primarily focus on well-established risk factors like smoking history.They might miss subtle or emerging risk factors not yet incorporated into the training data.
False Positives and Negatives: ML models can generate false positives (identifying low-risk individuals as high-risk) and false negatives (missing high-risk individuals).This can lead to unnecessary procedures or missed opportunities for early detection.

Conclusion
Lung cancer remains the deadliest disease, with a high mortality rate throughout the world regardless of economic or social conditions.It is much better to care about early prevention, which could save not only one life but also a whole family.To understand the possibility of lung cancer, its factors are critical to understand and analyze and know early symptoms.In this work, a dataset having 22 such properties have been analyzed.The machine learning models used are very lightweight, easily reproducible, and usable in real life without having much technical knowledge.With parameter tuning, almost 99 to 100% test accuracy has been achieved for all four (SVM, KNN, DT, RF) models, even with 5 K-fold cross-validations.Later, each model's decisions were explained with valid and accessible reasoning through various methods such as decision boundary, LIME, and tree representation.This research has been done considering the low correlation between Age and Gender.This might not be true if any other dataset has a higher correlation.Hence, it can be regarded as a minor limitation.Future work can include multiple neural network-based model architectures to deal with more complex datasets and compare results and explainability with ML models.