
Novel ensemble learning approach with SVM-imputed ADASYN features for enhanced cervical cancer prediction

  • Raafat M. Munshi

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    rmonshi@kau.edu.sa

    Affiliation Department of Medical Laboratory Technology (MLT), Faculty of Applied Medical Sciences, King Abdulaziz University, Rabigh, Saudi Arabia

Correction

12 Feb 2024: Munshi RM (2024) Correction: Novel ensemble learning approach with SVM-imputed ADASYN features for enhanced cervical cancer prediction. PLOS ONE 19(2): e0298980. https://doi.org/10.1371/journal.pone.0298980

Abstract

Cervical cancer remains a leading cause of female mortality, particularly in developing regions, underscoring the critical need for early detection and intervention guided by skilled medical professionals. While Pap smear images serve as valuable diagnostic tools, many available datasets for automated cervical cancer detection contain missing data, posing challenges for machine learning models’ efficacy. To address these hurdles, this study presents an automated system adept at managing missing information using ADASYN characteristics, resulting in exceptional accuracy. The proposed methodology integrates a voting classifier model harnessing the predictive capacity of three distinct machine learning models. It further incorporates SVM Imputer and ADASYN up-sampled features to mitigate missing value concerns, while leveraging CNN-generated features to augment the model’s capabilities. Notably, this model achieves remarkable performance metrics, boasting a 99.99% accuracy, precision, recall, and F1 score. A comprehensive comparative analysis evaluates the proposed model against various machine learning algorithms across four scenarios: original dataset usage, SVM imputation, ADASYN feature utilization, and CNN-generated features. Results indicate the superior efficacy of the proposed model over existing state-of-the-art techniques. This research not only introduces a novel approach but also offers actionable suggestions for refining automated cervical cancer detection systems. Its impact extends to benefiting medical practitioners by enabling earlier detection and improved patient care. Furthermore, the study’s findings have substantial societal implications, potentially reducing the burden of cervical cancer through enhanced diagnostic accuracy and timely intervention.

Introduction

Cervical cancer begins in the cervix, the opening of the uterus that connects to the vagina. It frequently arises from a persistent infection with the sexually transmitted human papillomavirus (HPV). HPV can cause abnormal changes in cervical cell structure, and if these changes are not treated, they can eventually lead to cancer [1]. Cervical cancer ranks among the leading causes of cancer mortality in women, after lung and breast cancer [2]. It is widely considered difficult to treat once it reaches its advanced stages [3]. There have been notable recent advancements in enhancing this disease’s detection through imaging techniques. According to the World Health Organization (WHO), cervical cancer is the fourth most commonly diagnosed cancer worldwide. Around 570,000 new cases were recorded in 2018 alone, making up 7.5% of all female cancer mortality [4]. Roughly 85% of the projected 311,000 yearly cervical cancer-related fatalities are thought to occur in lower- and middle-income countries. Early identification of cervical cancer greatly aids in saving lives. Compared to women without HIV, women living with HIV have a six-fold higher risk of developing cervical cancer, and HIV is thought to be a factor in 5% of cases overall. The availability of essential equipment, consistent screening techniques, proper supervision, and the quick diagnosis and treatment of discovered lesions are among the factors that affect screening efficacy [5]. The two main kinds of cervical cancer are squamous cell carcinoma, which accounts for around 70–80% of cases, and adenocarcinoma, which starts from the mucus-secreting epithelial cells of the cervical canal [6]. Although squamous cell carcinoma is more prevalent, the incidence of adenocarcinoma has increased recently, and it currently accounts for 10 to 15% of uterine malignancies. Because adenocarcinoma grows within the cervical canal, it is more difficult to detect during screening. Depending on the kind and stage of the disease at diagnosis, several treatments and prognoses are available for cervical cancer. Early detection through regular screenings and HPV vaccination can significantly improve outcomes for people at risk.

Cervical cancer encompasses several types, including the most common, squamous cell carcinoma, and the second most common, adenocarcinoma [7]. Each type has unique characteristics, aggressiveness, and treatment considerations, underscoring the importance of regular screenings for early detection and appropriate management. High-risk strains of the human papillomavirus (HPV) in particular are the main cause of cervical cancer [8]. In HPV-infected women, several risk factors can raise the probability of cervical cancer [9]. These include smoking, genital herpes, having several sexual partners, a weakened immune system, lower socioeconomic status, insufficient genital hygiene, and a larger number of childbirths. Cervical cancer symptoms can vary with the size of the tumour and the stage of the disease. The biggest issue, however, lies in the early stage, which frequently shows no visible signs and is generally identified incidentally during routine yearly check-ups. Approximately 90% of patients have evident symptoms in the later stages [10]. The key symptom associated with cervical cancer is irregular vaginal bleeding.

Early screening for cervical cancer is essential for detecting precancerous or cancerous changes in the cervix. Common screening methods include Pap smears and HPV tests [11]. However, these screenings have limitations, such as false positives and negatives, age-related factors, limited access to screening in some areas, and a focus on certain types of cervical cancer. Cervical cancer screening typically entails a gynaecological examination, which can cause discomfort and pain for patients [12]. The unease associated with this examination may lead to delays or avoidance, impeding timely diagnosis. Partly for these reasons, death rates are substantially higher in developing regions, with low-income countries accounting for nearly nine out of every ten cervical cancer-related fatalities [6]. Increasing cervical cancer screening rates is crucial, since early-stage disease has relatively good 5-year survival rates of up to 90% [13]. Nevertheless, screening rates vary among countries, with developed nations exhibiting higher rates, while developing countries face alarmingly low screening participation [14]. Cervical cancer prevention strategies differ, but relying on screening tests alone is inadequate. Early diagnosis is imperative to prevent fatalities from invasive forms of the disease.

Presently, ML, DL, and computer vision are emerging as valuable approaches for diagnosing medical conditions [15]. ML models have significantly improved the results of medical diagnosis [16]. Bhavani and Govardhan [17] proposed a stacked ensemble model using SMOTE for predicting cervical cancer. Karamti et al. [18] applied KNN-imputed SMOTE features together with a multi-model ensemble approach for cervical cancer detection. Li et al. [19] used CNNs for screening cervical cancer patients. While recent studies have made significant strides in applying machine learning techniques, such as ensemble models and advanced sampling methods, to predict cervical cancer, there remains a need to comprehensively assess the combined impact of these approaches on improving accuracy, handling missing data, addressing class imbalances, and extracting complex features. Moreover, limited research has specifically evaluated the effectiveness of integrating Convolutional Neural Networks with traditional ML methods for cervical cancer prediction. ML algorithms achieve accurate and reliable results by applying diverse preprocessing methods, such as data cleansing and feature engineering, to the medical dataset. These insights can help medical practitioners diagnose illnesses quickly and give patients the best care possible. This paper employs ML techniques to develop a computer-aided diagnosis (CAD) system for the accurate and prompt identification of cancer. This study makes the following significant contributions:

  • An ML-driven framework is proposed for predicting cervical cancer in patients. A voting classifier is incorporated into the suggested model to improve prediction accuracy.
  • The SVM imputation method is employed to generate artificial missing values, aiming to mitigate the challenge posed by incomplete data.
  • ADASYN (Adaptive Synthetic Sampling), an oversampling technique that generates synthetic minority class samples, is applied to address class imbalance issues.
  • To handle complex features, this research employs a Convolutional Neural Network (CNN).
  • The effectiveness of the suggested method is assessed using four scenarios: the original dataset, the dataset with SVM imputation alone, the dataset with SVM imputation followed by ADASYN, and the dataset with CNN-generated features.

The paper is organised as follows: the Related work section gives an extensive analysis of current classification methods for cervical cancer diagnosis. The Materials and methods section provides further detail on the dataset and the suggested cervical cancer detection methodology, which employs several classification algorithms and up-sampling methods. The Results and Discussions sections present the findings of the research and promote discussion. Finally, the Conclusions section concludes the work and suggests possible lines of inquiry for further study.

Related work

In recent years, there has been a surge in the development and application of ML models to accelerate research and innovation in various domains. Numerous investigations have addressed the classification of cervical cancer [20, 21]. Various studies and their findings are summarized in this section.

ML and DL models are extensively used in medical diagnostics, including breast cancer diagnosis [22], lung cancer detection [23], colorectal cancer [24], and numerous other healthcare applications [25–27]. Some research works applied DL approaches to tasks such as Alzheimer’s diagnosis [28], medical imaging [29], and pathology image segmentation [30]. Medical diagnosis has also been advanced by different tools and techniques, including surgical analysis [31], EEG decoding [32], CT imaging [33, 34], and surgical navigation [35]. The discrete wavelet and cosine transforms were used by Kalbhor and colleagues [36] to extract characteristics, and the fractional coefficient technique was applied to effectively reduce the dimensionality of these characteristics. The reduced characteristics were then fed into seven different machine learning classifiers in an effort to discriminate between various cervical cancer subgroups. In another investigation, Devi and Thirumurugan [37] used the C-means clustering technique to segment cervical cells. They then extracted texture information and used principal component analysis (PCA) for dimensionality reduction of the collected data. Following that, they used the K-nearest neighbours (KNN) method to categorise the cervical cells, attaining an excellent accuracy rate.

Alquran et al. [38] concentrated on cervical cancer classification. They integrated DL with a cascading SVM to attain precise outcomes. By combining methodologies, they effectively categorized cancer into seven groups, achieving a remarkable accuracy score. In another investigation, Kalbhor et al. [39] developed a novel hybrid approach that combined DL and ML models with a fuzzy min-max neural network. The technique focused on feature engineering and Pap-smear image categorization. They used transfer learning models such as AlexNet, GoogleNet, and ResNet. The experimental evaluation was done utilising well-known datasets. Notably, the greatest classification accuracy was achieved by the ResNet-50 architecture. Various other domains have also applied such advanced techniques [40, 41].

Radiation enteritis (RE) causes treatment intolerance or radiotherapy cessation, which severely lowers the patient’s quality of life. The adverse effects of radiation therapy in patients with cervical cancer can be greatly decreased if RE can be anticipated in advance and targeted preventative treatment can be implemented. Additionally, the optimisation of the radiotherapy strategy and the choice of a customised radiation dose depend on the precise prediction of RE. Ma et al. [42] investigated the relationship between RE and dose volume in cervical cancer patients. Cancer progression involves invasion [43] and migration of cancer cells [44–46]. The quality of life of cervical cancer survivors has been analyzed in [47].

Tanimu et al. [48] conducted a research study with a primary focus on identifying risk factors linked to cervical cancer. They employed the decision tree (DT) classification algorithm and leveraged LASSO (least absolute shrinkage and selection operator) feature engineering and feature reduction techniques. These methods were used to pinpoint the characteristics critical to identifying cervical cancer. The dataset utilized in their research presented challenges such as missing values and significant class imbalance. To tackle these issues, the research team adopted a technique known as SMOTETomek. The outcomes revealed that their proposed approach achieved outstanding accuracy. In a comparative analysis, Quinlan and colleagues [49] evaluated several ML classifiers for cervical cancer categorization. The dataset utilized in their analysis also showed class imbalance, necessitating a method to address this problem. To do so, the researchers used SMOTE-Tomek in conjunction with a highly tuned RF. The findings showed that, when combined with SMOTE-Tomek, the RF classifier attained an extraordinary accuracy rate.

Abdoh and colleagues [50] introduced a system for cervical cancer classification, employing RF in conjunction with SMOTE. They also incorporated two feature reduction techniques. Their experiment used a dataset with thirty characteristics. The study examined the effect of changing the feature size and found that utilizing SMOTE in conjunction with RF and the selected characteristics produced an excellent accuracy rate. Ijaz and colleagues [51] presented a data-driven approach for the detection of cervical cancer. The solution included both outlier identification and SMOTE. The task was carried out utilizing the RF method with DBSCAN. According to their findings, when applied to a dataset of various features, their proposed method achieved a reasonable accuracy score.

In another work, Jahan et al. [52] developed an approach for detecting cervical cancer. Their study emphasized comparing the effectiveness of various classifiers in detecting the illness. The study involved selecting multiple feature sets from the dataset and addressing missing data values using a mix of feature selection approaches such as SelectKBest, Chi-square, and RF. When applied to the top characteristics, the MLP algorithm achieved a remarkable accuracy rate. Mudawi et al. [53] proposed a complete research method for the prediction of cervical cancer that consists of four phases. They used a variety of ML classifiers in their research. According to the results, SVM achieved a phenomenal accuracy score in the cancer prediction task.

Following a thorough review of the literature, it is clear that multiple existing approaches have shown promising results in determining cervical cancer using diverse datasets. However, researchers have used a variety of optimisation methodologies to improve performance indicators including accuracy, precision, and recall. The major purpose of this work is to compare several ML algorithms in order to determine the best way to predict cervical cancer.

Materials and methods

This section provides a quick yet thorough overview of some essential areas of cervical cancer screening. It covers the fundamentals, such as an introduction to the dataset, the painstaking processes used for data preparation, the ML algorithms used to predict cervical cancer, and the approaches used to address the difficulty of class imbalance in this context.

Dataset

This study made use of a publicly accessible dataset from Venezuela’s Hospital Universitario de Caracas [54]. This is the only publicly accessible dataset appropriate for a thorough study of cervical cancer screening utilising questionnaires and AI techniques. On this dataset, the researchers assessed the suitability and efficacy of AI models and data-balancing strategies.

Fig 1 presents a description of the dataset, which includes 858 instances and 36 attributes: 35 input variables and 1 output variable. Fig 1 also provides a full explanation of each input variable. Notably, the dataset includes a variable called “Biopsy,” which serves as the target. There is a large class imbalance in the dataset. Recognising the inherent difficulties of classifying unbalanced data, missing values were resolved using the SVM imputer approach and the minority class was oversampled using the ADASYN technique.

Data preparation

Data preparation is critical for optimising the performance of machine learning models. It entails deleting unneeded or superfluous data, which can confound the models and reduce their efficiency. The data provided in Fig 1 highlights two primary concerns within the dataset.

  • Missing Values
  • Class Imbalance

Handling missing values.

During data preprocessing in this study, it was observed that the dataset contained numerous missing values. Fig 1 shows how the missing values are distributed across different classes. Missing values can be handled either by removing the affected records or by applying an imputation technique. This study employs an SVM Imputer to handle missing values.

SVM imputer [55] is an ML-based model that can be utilized to impute missing values in a dataset. It works by training an SVM classifier to estimate the absent values using the available data within the dataset. SVM imputer is particularly well-suited for imputing missing values in categorical data. This is because SVMs are able to learn complex relationships between categorical features. Here is a step-by-step overview of how SVM imputer works:

  • Split the dataset into two parts: training and testing.
  • Train an SVM classifier on the training set, using the known values to predict the missing data values.
  • Use the trained SVM classifier to forecast the missing data values in the testing set.

SVM imputer can also be used to impute missing values in numerical data. However, it is important to note that SVM imputer is not as good at imputing missing values in numerical data as other imputation methods, such as mean imputation or median imputation.
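
A minimal sketch of this column-wise SVM imputation procedure is given below. It is illustrative only: the helper name svm_impute_column, the use of scikit-learn’s SVC/SVR, and the simple zero-filling of the remaining gaps are assumptions rather than the study’s exact implementation.

import pandas as pd
from sklearn.svm import SVC, SVR

def svm_impute_column(df, target_col):
    # Hypothetical helper: fill one column's missing entries with an SVM trained on complete rows
    known = df[df[target_col].notna()]
    missing = df[df[target_col].isna()]
    if missing.empty:
        return df
    feature_cols = df.columns.drop(target_col)
    X_train = known[feature_cols].fillna(0)      # simple placeholder for gaps in the other columns
    X_missing = missing[feature_cols].fillna(0)
    y_train = known[target_col]
    # Assumption: SVC for categorical/low-cardinality targets, SVR for continuous ones
    model = SVC() if y_train.nunique() <= 10 else SVR()
    model.fit(X_train, y_train)
    df.loc[df[target_col].isna(), target_col] = model.predict(X_missing)
    return df

# Illustrative usage: impute every attribute that has missing values, one column at a time
# for col in data.columns[data.isna().any()]:
#     data = svm_impute_column(data, col)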

Handling class imbalance.

Class imbalance in datasets emerges when one group considerably outnumbers the others, posing problems for ML models by biasing them towards the majority class [56]. This imbalance can lead to poorer detection of minority groups, such as when identifying illnesses like cancer, affecting model accuracy and patient outcomes. Addressing this imbalance is vital for ensuring fair and accurate learning, mitigating biases towards majority classes, and boosting the model’s capacity to recognise all classes successfully, especially in critical areas like medical diagnosis.

To address the dataset’s class imbalance issue, ADASYN [57] is implemented. ADASYN (Adaptive Synthetic Sampling) is an ML technique used to address class imbalance in datasets. Class imbalance occurs when there is a significant difference in the number of samples from different classes in a dataset. This can be a problem for machine learning models, as they may learn to focus on the majority class and neglect the minority class.

ADASYN works by generating synthetic samples for the minority class. The number of synthetic samples created for each minority-class instance is determined by the difficulty of learning that instance: instances that are more difficult to learn are assigned more synthetic samples. This helps to balance the dataset and enhance the efficacy of ML models on the minority class. ADASYN is a relatively simple method that has been demonstrated to improve the performance of ML models on unbalanced datasets.
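
As a concrete illustration, ADASYN is commonly applied via the imbalanced-learn library; whether the study used this exact implementation is not stated, and the variable names below are assumptions.

from collections import Counter
from imblearn.over_sampling import ADASYN

# X_imputed and y are the feature matrix and Biopsy labels after SVM imputation (assumed names)
adasyn = ADASYN(random_state=42)
X_resampled, y_resampled = adasyn.fit_resample(X_imputed, y)
print("before:", Counter(y), "after:", Counter(y_resampled))  # minority class is synthetically expanded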

Feature extraction

The CNN (Convolutional Neural Network) model is used in this study for feature engineering in the detection of cervical cancer. Like other deep learning models, the CNN comprises several layers: an embedding layer, a convolutional layer, and a max-pooling layer.

The initial layer, referred to as the embedding layer, utilizes all attributes from the cervical cancer dataset, employing an embedding size of 25,000 and producing an output with a dimensionality of 300. Following the embedding layer is a Conv-1D layer containing 4,000 filters. This layer incorporates the ReLU (Rectified Linear Unit) activation function and employs a 2x2 kernel size. To capture pertinent features, the output of the 1D convolution is subjected to a 2x2 max-pooling layer. Lastly, a flatten layer is applied to convert the output back into a 1D array, ensuring compatibility with the ML model.

The cervical cancer dataset is structured as a set of tuples (fsi, tci), with fs representing the feature set, tc indicating the target class column, and i denoting the tuple index. The embedding layer is employed to transform the training set into the intended input format in the following manner:

EL = embedding(Vs, Os, I)   (1)

EOs = EL(fsi)   (2)

EOs represents the output generated by the embedding layer, serving as the input for the subsequent convolutional layer. The embedding layer’s parameters encompass the input lengths (I), vocabulary size (Vs), and output dimensions (Os).
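
A minimal Keras sketch of the feature extractor described above is shown below. The layer sizes follow the values reported in the text; the input length of 35 (one entry per input attribute), the interpretation of the stated kernel as a kernel size of 2 for the 1D convolution, and the variable names are assumptions, so the study’s exact configuration may differ.

from tensorflow.keras import layers, models

# Sketch of the described feature extractor: embedding -> 1D convolution -> max-pooling -> flatten
feature_extractor = models.Sequential([
    layers.Input(shape=(35,)),                               # 35 input attributes per record (assumed)
    layers.Embedding(input_dim=25000, output_dim=300),       # embedding size and output dimension from the text
    layers.Conv1D(filters=4000, kernel_size=2, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
])

# The flattened output is then passed to the ML classifiers as the CNN-generated feature set (assumed usage)
# cnn_features = feature_extractor.predict(X_resampled)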

Machine learning classifiers

This section gives an in-depth look at the machine learning methods utilised in this study, including implementation details. This work employs supervised ML techniques that are typically used for classification and regression problems; the algorithms are written in Python and implemented using the scikit-learn and NLTK libraries.

XGBoost [58], or Extreme Gradient Boosting, is a powerful ensemble machine learning model based on gradient boosting. It excels in predictive modelling tasks by combining multiple decision trees in a way that corrects errors sequentially, leading to a robust ensemble model. XGBoost uses regularized tree-building algorithms to prevent overfitting and optimize model complexity. It is known for its efficiency, scalability, and versatility, making it suitable for various ML tasks. Features like built-in handling of missing values, feature importance analysis, and a supportive community contribute to its popularity. Careful hyperparameter tuning is essential to maximize its performance in specific applications. XGBoost has acquired robust results in ML competitions and is widely used in real-world applications.

Random Forest [59] is a versatile ensemble ML algorithm that combines multiple decision trees to make predictions. It employs a bagging technique to reduce overfitting and bias, making it robust and effective. Random Forest is known for its feature importance analysis, scalability, and versatility in handling various types of data. It’s widely used in classification and regression tasks due to its strong performance and ability to handle complex datasets.

The Stochastic Gradient Descent (SGD) [60] is an optimization algorithm commonly used for linear classification tasks. It incorporates randomness by selecting one training example at a time during each iteration, making it computationally efficient, especially for large datasets and online learning scenarios. SGD is versatile and can be employed with various regularization techniques. It’s valued for its scalability and suitability for real-time applications but requires careful tuning of hyperparameters for optimal performance.

K-Nearest Neighbors (KNN) [61] is an ML algorithm used for classification and regression tasks. It operates on the principle of finding the K nearest data points in the training dataset to make predictions for new data points. Key characteristics include its non-parametric nature, reliance on instance-based learning, and sensitivity to the choice of the hyperparameter K. KNN is versatile, used in various applications, and robust to outliers, but it can be affected by the curse of dimensionality in high-dimensional spaces. It’s a simple yet effective algorithm for both linear and non-linear data distributions.

Logistic Regression [62] is a versatile ML algorithm primarily used for binary categorization tasks. It predicts the probability of an input belonging to one of two classes (typically 0 and 1) by applying a logistic (sigmoid) function to ensure output values between 0 and 1. This algorithm is known for its simplicity, transparency, and interpretability, making it valuable in scenarios where understanding the impact of each input feature is crucial. However, it may not perform well with highly nonlinear relationships between features and outcomes, and it assumes independence among features.

The Extra Tree Classifier [63] is an ensemble machine learning algorithm that combines multiple decision trees. It stands out due to its high level of randomization during tree construction, reducing the risk of overfitting and making it robust against noisy data. Like Random Forest, it employs bootstrapping and provides feature importance scores. The Extra Tree Classifier is versatile, scalable, and applicable to both classification and regression tasks. Careful hyperparameter tuning is necessary for optimal performance, though it’s generally less sensitive to hyperparameters compared to some other algorithms.

Proposed approach.

The study made use of a dataset obtained from Kaggle, a recognised platform for publicly available datasets. A range of preprocessing steps was carried out to improve the efficacy of the ML models and handle missing data. The SVM imputer was used to deal with missing data values. Following that, the dataset was divided in a 70:30 ratio, with 70% allotted to the train set and 30% to the test set. The suggested system uses an ensemble technique, RF + KNN + LR, to identify cervical cancer. Voting classifiers are powerful strategies that aggregate results from several models to improve accuracy and robustness. The models in the voting classifier have their own strengths and shortcomings, and their combined use results in greater overall performance. In this scenario, the suggested method for detecting cervical cancer incorporates three commonly used algorithms: RF, KNN, and LR. Fig 2 shows a process diagram explaining this strategy.

The voting classifier works by combining predictions from these three different ML methods. The typical strategy for building an ensemble/voting classifier is to train numerous classifiers on the dataset and then combine their results. In this case, the RF, KNN, and LR models were each trained individually on the same dataset. Each model predicts the probability of each class of the target variable. The estimated probabilities are then averaged to obtain a final prediction for each instance in the dataset; a common variant is to compute a weighted average of the predicted probabilities, with weights determined by the performance of each model on a validation dataset.
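
A minimal scikit-learn sketch of this soft-voting ensemble, including the 70:30 split described above, is shown below; the variable names (X_features, y_labels) and hyperparameters are illustrative assumptions rather than the study’s exact settings.

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 70:30 split of the prepared features and labels (assumed names)
X_train, X_test, y_train, y_test = train_test_split(
    X_features, y_labels, test_size=0.3, random_state=42)

# Soft voting averages the class probabilities predicted by RF, KNN, and LR
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier()),
        ("knn", KNeighborsClassifier()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)
print("test accuracy:", ensemble.score(X_test, y_test))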

Algorithm 1 Ensembling of SV-CNN model.

Input: input data

MRF = Trained_RF

MKNN = Trained_KNN

MLR = Trained_LR

1: for i = 1 to M do

2:  if MRF ≠ 0 & MKNN ≠ 0 & MLR ≠ 0 & training_set ≠ 0 then

3:   ProbRF − 1 = MRF.probability(1 − class)

4:   ProbRF − 2 = MRF.probability(2 − class)

5:   ProbKNN − 1 = MKNN.probability(1 − class)

6:   ProbKNN − 2 = MKNN.probability(2 − class)

7:   ProbLR − 1 = MLR.probability(1 − class)

8:   ProbLR − 2 = MLR.probability(2 − class)

9:   Decision function = (Avg(ProbRF−1, ProbKNN−1, ProbLR−1), Avg(ProbRF−2, ProbKNN−2, ProbLR−2))

10:  end if

11:  Return final label

12: end for

Lines 3 to 8 of Algorithm 1 compute the probability scores of classes 1 and 2 from the RF, KNN, and LR models, respectively. The probability score of a classifier, often used in classification tasks, represents the likelihood or confidence that a given data sample belongs to a particular class. It is calculated from the output of the classifier model, in this case RF, KNN, and LR. This probability is not the final prediction of a specific target class; it is a raw score (prediction confidence). To convert raw scores into probabilities, some classifiers employ a probability calibration step, which ensures that the calculated scores are well-calibrated and can be interpreted as probabilities. Based on the probability scores and the chosen threshold (0.5 in this case), the classifier assigns a final predicted class label to the data sample. The decision function on line 9 of Algorithm 1 decides the final class based on which class has a probability score above the assigned threshold. A working example of the decision function is added below for further clarification.

The working of this ensemble can be explained using an example. Each sample processed by the RF, KNN, and LR models is assigned a probability score for each class. Consider a scenario where the RF model assigns probabilities of 0.4 and 0.7 for class 1 and class 2, the KNN model assigns probabilities of 0.5 and 0.8 for class 1 and class 2, respectively, and the LR model assigns probability scores of 0.5 and 0.4 for the same two classes. Denoting the probability value of class x as P(x), where x ranges from 1 to 2, the final probability is calculated as follows:

  1. P(1) = (0.4 + 0.5 + 0.5)/3 ≈ 0.47
  2. P(2) = (0.7 + 0.8 + 0.4)/3 ≈ 0.63

This ensemble approach derives the final class label from the probability scores for each class produced by all three models used for voting. The final label is decided based on the highest average probability, as in line 9 of Algorithm 1.
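
The averaging above can be reproduced directly in a few lines; the values mirror the worked example and are illustrative only.

import numpy as np

# Class probabilities for one sample from RF, KNN, and LR (values from the example above)
probs = np.array([[0.4, 0.7],    # RF:  class 1, class 2
                  [0.5, 0.8],    # KNN: class 1, class 2
                  [0.5, 0.4]])   # LR:  class 1, class 2
avg = probs.mean(axis=0)               # approximately [0.47, 0.63]
final_class = int(np.argmax(avg)) + 1  # class 2 has the higher average probability
print(avg, final_class)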

The proposed model uses the distinct capabilities of three separate ML models to provide results that are both accurate and resilient. The aim is to generalize the model while decreasing overfitting, by training the models on the cervical cancer dataset and fusing their prediction results. The suggested ensemble model’s fundamental functionality may be summarised as follows:

pfinal(c) = (pRF(c) + pKNN(c) + pLR(c)) / 3   (3)

where pRF, pKNN, and pLR indicate the prediction probabilities for each test sample from RF, KNN, and LR, respectively. After that, each test case’s probabilities acquired from RF, KNN, and LR are subjected to the soft voting criterion, as shown in Fig 3.

Fig 3. Structure of the proposed ensemble voting classifier.

https://doi.org/10.1371/journal.pone.0296107.g003

The proposed voting classifier identifies the optimal class prediction by evaluating the class with the highest average probability across all classes. This is achieved by amalgamating the projected probabilities from all the models. The final prediction is made by selecting the class with the highest probability score, as illustrated below:

final class = argmax over c of pfinal(c)   (4)

Evaluation metrics

The proposed system generates four crucial indices for evaluating its performance: True Negative (TN), True Positive (TP), False Negative (FN), and False Positive (FP). The evaluation metrics used in this study to evaluate the models are presented in Table 1.
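
For reference, these indices combine into the standard metrics reported throughout the paper as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 score = 2 × (Precision × Recall) / (Precision + Recall)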

Results

This section discusses the results of the experiments and their implications, with an emphasis on determining the efficacy of the suggested methodology in comparison to existing methodologies. The assessment includes a variety of test parameters used for the cervical cancer dataset, and the results are contrasted with alternative ML approaches. The tests are conducted using the original dataset, the SVM-imputed dataset, the dataset upsampled using ADASYN and imputed with SVM, and the dataset containing CNN features.

Classifier performance with the original dataset

Initially, the experiments are conducted using the original dataset extracted from the cervical cancer dataset. The outcomes of all ML models utilizing these original features are presented in Table 2.

Table 2. Results of the machine learning models obtained using the original dataset.

https://doi.org/10.1371/journal.pone.0296107.t002

The findings demonstrate that LR and KNN had the best levels of accuracy among the classifiers, with rates of 73.41% and 72.98%, respectively. The precision of RF was 78.35%, the recall was 79.95%, and the F1 score was 79.91%. KNN exhibited precision and recall of 81.45%, resulting in an F1 score of 81.45%. Similarly, LR attained a precision of 80.15%, recall of 80.12%, and F1 score of 80.17%. XGB, on the other hand, performed the least successfully, with an accuracy rate of 64.57%, precision of 77.54%, recall of 79.64%, and F1 score of 78.51%.

The proposed Voting Classifier (VC), which merged RF, KNN, and LR, demonstrated superiority in terms of performance when compared to all individual models. It achieved an accuracy of 80.13%, precision of 84.56%, recall of 86.31%, and an F1 score of 85.71%. Nevertheless, when the individual machine learning models were assessed using the dataset without any missing values, their performance frequently lagged behind.

Classifier performance with SVM imputed dataset

The SVM imputer was used in the next phase of the experiments to handle missing values in the dataset. Some values were missing during the data preparation step, necessitating the use of the SVM imputer to bridge these gaps. Following the imputation procedure, the amended dataset was used to train and evaluate several ML models. Table 3 describes the findings of these models.

Table 3. Results of the learning models using SVM imputer.

https://doi.org/10.1371/journal.pone.0296107.t003

According to the results, KNN and LR attained accuracy rates of 83.10% and 84.72%, respectively. However, the suggested VC (RF+KNN+LR) greatly outperformed them all, with an outstanding accuracy rate of 97.41%.

Classifier performance with ADASYN upsampled

The ADASYN approach was used in the third round of experiments to address the class imbalance problem in the dataset. In the data preprocessing step, it became clear that only 58 of the total 858 samples belonged to the malignant class. To address this imbalance, ADASYN was utilized as an oversampling strategy. The enhanced dataset was then used to evaluate the performance of several ML models. The results of these models are summarised in Table 4.

Table 4. Results of the learning models using ADASYN upsampled.

https://doi.org/10.1371/journal.pone.0296107.t004

The results show that the suggested voting ensemble model VC (RF + KNN + LR) surpasses all other models with an impressive accuracy of 94.24%. The individual classifiers LR, RF, and KNN earned strong accuracy ratings of 85.37%, 83.48%, and 84.19%, respectively. Nonetheless, the VC ensemble (RF + KNN + LR) outperformed the other models on the up-sampled dataset.

Classifier performance with CNN generated features

The results from the fourth series of experiments, which involved utilizing the CNN-generated features together with the SVM imputer for missing value handling and ADASYN for addressing class imbalance, can be found in Table 5. By leveraging the CNN and ADASYN together, the objective was to simultaneously address missing values and class imbalance and thereby improve the accuracy of the learning models. Following the application of the SVM imputer and ADASYN, the ML models were trained and evaluated.

Table 5. Results of machine learning models with CNN generated features.

https://doi.org/10.1371/journal.pone.0296107.t005

Results of cross-validation

A 5-fold cross-validation was undertaken to further confirm the efficacy of the suggested technique. The findings are shown in Table 6. Notably, the suggested model achieves an average accuracy of 99.27%, as well as average precision, recall, and F1 score values of 99.96%, 99.96%, and 99.97%.
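
A minimal sketch of this 5-fold validation step using scikit-learn is shown below; it reuses the ensemble and variable names assumed in the earlier sketch, and the scoring choices are assumptions rather than the study’s exact protocol.

from sklearn.model_selection import cross_validate

# 5-fold cross-validation of the voting ensemble on the prepared data (assumed variable names)
cv_results = cross_validate(
    ensemble, X_features, y_labels, cv=5,
    scoring=["accuracy", "precision", "recall", "f1"],
)
for name, scores in cv_results.items():
    if name.startswith("test_"):
        print(name, round(scores.mean(), 4))  # average of each metric across the five folds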

Table 6. Significance of proposed methodology using k-fold validation.

https://doi.org/10.1371/journal.pone.0296107.t006

Discussions

The comparison of classifiers with the original dataset, the SVM-imputed dataset, the ADASYN-upsampled and SVM-imputed dataset, and the CNN-generated features dataset is depicted in Fig 4. When applied to ADASYN-balanced data, the voting classifier demonstrates its superiority over all other classifiers. RF, KNN, and LR perform better than the other individual classifiers in every scenario. These findings underscore the significance of selecting the right combination of ML approaches for the effective operation of an ensemble ML model. In the context of analyzing imbalanced data, employing data balancing techniques like ADASYN significantly enhances classifier performance.

This highlights the importance of utilizing the statistical technique ADASYN to balance the data before training, as it plays a pivotal role in enhancing classifier performance. It becomes apparent that classifiers may not achieve their optimal performance when dealing with imbalanced classes within the original dataset. Consequently, when coupled with the SVM imputer and the ADASYN technique for cervical cancer detection, the proposed model demonstrates improved generalization and outperforms other models when appropriately configured.

ADASYN and the SVM imputer are very useful techniques for improving the performance of models on datasets with class imbalance problems. When experiments are performed using CNN features, the classifiers show significant improvement. The proposed voting classifier outperformed all others with a 99.99% score for accuracy, precision, recall, and F1 score. Ensembling the RF, KNN, and LR models offers a strategic advantage by combining their diverse learning approaches to improve predictive performance. This ensemble leverages the strengths of each model: RF’s robustness, KNN’s pattern recognition, and LR’s probabilistic interpretation, thereby enhancing overall accuracy, reducing overfitting, improving robustness against outliers, and providing a more comprehensive analysis of the cervical cancer dataset. The ensemble, facilitated by a voting classifier, fosters a collective decision-making process, resulting in a more robust and accurate predictive model for cervical cancer detection.

Comparative analysis with cutting-edge methods

To assess the efficacy of the suggested method, a performance comparison with existing models specialised in cervical cancer diagnosis is performed. This comparison includes a selection of recent studies that serve as reference points. One study [48], using a cancer detection model based on Recursive Feature Elimination (RFE) and DT with SMOTETomek, obtains an accuracy of 98.82%, precision of 87.53%, recall of 100%, and an F1 score of 93.33%. Another study [51] uses 10 features for the same task and obtains an accuracy score of 97.72%. Furthermore, the studies [52] and [53], which applied MLP and SVM respectively, reported accuracy rates of 98.10% and 99%.

Despite the high accuracy reported in previous research works, the suggested model outperforms them, as shown in Fig 5. The suggested method outperforms previous approaches due to three main factors: managing missing data, employing an ensemble voting classifier, and adding CNN-generated features. The novel mix of strategies, which includes correcting missing data, applying ensemble learning, and handling class imbalance, is critical to the observed accuracy gains. While some earlier techniques may have failed to resolve the problem of missing data directly, this work uses an SVM imputation strategy in conjunction with ADASYN up-sampled features. In addition, the suggested technique makes use of a stacked ensemble voting classifier, which combines the results of three independent models. This collaborative approach is advantageous.

Conclusions

Cervical cancer poses a significant threat to women’s health, particularly in developing countries, where it ranks as a leading cause of mortality. Timely detection and treatment, guided by skilled medical professionals, are paramount in mitigating its devastating impact. Pap smear images have emerged as valuable diagnostic tools for identifying this form of cancer. However, numerous datasets designed for automated cervical cancer detection present a common challenge: missing values. These gaps in data can substantially hinder the performance of machine learning models, necessitating innovative solutions.

In response to these challenges, this study introduces an automated system tailored for cervical cancer prediction. This system demonstrates remarkable proficiency in managing missing values through the utilization of ADASYN features, ultimately achieving exceptional levels of accuracy. The cornerstone of the proposed approach is a stacked ensemble voting classifier model, strategically combining the predictive capabilities of three distinct machine learning models. Furthermore, SVM Imputer and ADASYN up-sampled features are integrated into the proposed framework to effectively address concerns related to missing values. The inclusion of CNN-generated features further bolsters the model’s robustness.

Notably, the outcomes of this study reveal the exceptional performance of the proposed model, boasting remarkable metrics such as 99.99% accuracy, 99.99% precision, 99.99% recall, and a 99.99% F1 score. To comprehensively assess the proposed model, a comparative analysis is conducted against various machine learning algorithms under four distinct scenarios: using the original dataset, employing SVM imputation, incorporating ADASYN features, and harnessing CNN-generated features. These comparative evaluations underscore the superior efficacy of the proposed model when compared to existing state-of-the-art approaches. The research has the potential to significantly benefit medical practitioners by enabling earlier cervical cancer detection and improving patient care. Future work aims to develop stacked ensembles of machine and deep learning models for enhanced performance on higher-dimensional datasets.

References

  1. 1. Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: a cancer journal for clinicians. 2018;68(6):394–424. pmid:30207593
  2. 2. Arbyn M, Weiderpass E, Bruni L, de Sanjosé S, Saraiya M, Ferlay J, et al. Estimates of incidence and mortality of cervical cancer in 2018: a worldwide analysis. The Lancet Global Health. 2020;8(2):e191–e203. pmid:31812369
  3. 3. Pal A, Kundu R. Human papillomavirus E6 and E7: the cervical cancer hallmarks and targets for therapy. Frontiers in microbiology. 2020;10:3116. pmid:32038557
  4. 4. Dong N, Zhao L, Wu CH, Chang JF. Inception v3 based cervical cell classification combined with artificially extracted features. Applied Soft Computing. 2020;93:106311.
  5. 5. Zhang T, Luo Ym, Li P, Liu Pz, Du Yz, Sun P, et al. Cervical precancerous lesions classification using pre-trained densely connected convolutional networks with colposcopy images. Biomedical signal processing and control. 2020;55:101566.
  6. 6. Zhang S, Xu H, Zhang L, Qiao Y. Cervical cancer: Epidemiology, risk factors and screening. Chinese Journal of Cancer Research. 2020;32(6):720. pmid:33446995
  7. 7. Bedell SL, Goldstein LS, Goldstein AR, Goldstein AT. Cervical cancer screening: past, present, and future. Sexual medicine reviews. 2020;8(1):28–37. pmid:31791846
  8. 8. Jalil AT, Karevskiy A. The cervical cancer (CC) epidemiology and human papillomavirus (HPV) in the middle east. International Journal of Environment, Engineering and Education. 2020;2(2):7–12.
  9. 9. Kashyap N, Krishnan N, Kaur S, Ghai S. Risk factors of cervical cancer: a case-control study. Asia-Pacific journal of oncology nursing. 2019;6(3):308–314. pmid:31259228
  10. 10. Davies-Oliveira J, Smith M, Grover S, Canfell K, Crosbie E. Eliminating cervical cancer: progress and challenges for high-income countries. Clinical Oncology. 2021;33(9):550–559. pmid:34315640
  11. 11. Liang LA, Einzmann T, Franzen A, Schwarzer K, Schauberger G, Schriefer D, et al. Cervical cancer screening: comparison of conventional Pap smear test, liquid-based cytology, and human papillomavirus testing as stand-alone or cotesting strategies. Cancer Epidemiology, Biomarkers & Prevention. 2021;30(3):474–484. pmid:33187968
  12. 12. O’Laughlin DJ, Strelow B, Fellows N, Kelsey E, Peters S, Stevens J, et al. Addressing anxiety and fear during the female pelvic examination. Journal of Primary Care & Community Health. 2021;12:2150132721992195. pmid:33525968
  13. 13. Guimarãaes YM, Godoy LR, Longatto-Filho A, Reis Rd. Management of early-stage cervical cancer: a literature review. Cancers. 2022;14(3):575.
  14. 14. Maver P, Poljak M. Primary HPV-based cervical cancer screening in Europe: implementation status, challenges, and future plans. Clinical microbiology and infection. 2020;26(5):579–583. pmid:31539637
  15. 15. Aggarwal K, Mijwil MM, Al-Mistarehi AH, Alomari S, Gök M, Alaabdin AMZ, et al. Has the future started? The current growth of artificial intelligence, machine learning, and deep learning. Iraqi Journal for Computer Science and Mathematics. 2022;3(1):115–123.
  16. 16. Richens JG, Lee CM, Johri S. Improving the accuracy of medical diagnosis with causal machine learning. Nature communications. 2020;11(1):3923. pmid:32782264
  17. 17. Bhavani C, Govardhan A. Cervical cancer prediction using stacked ensemble algorithm with SMOTE and RFERF. Materials Today: Proceedings. 2023;80:3451–3457.
  18. 18. Karamti H, Alharthi R, Anizi AA, Alhebshi RM, Eshmawi A, Alsubai S, et al. Improving Prediction of Cervical Cancer Using KNN Imputed SMOTE Features and Multi-Model Ensemble Learning Approach. Cancers. 2023;15(17):4412. pmid:37686692
  19. 19. Li X, Du M, Zuo S, Zhou M, Peng Q, Chen Z, et al. Deep convolutional neural networks using an active learning strategy for cervical cancer screening and diagnosis. Frontiers in Bioinformatics. 2023;3:1101667. pmid:36969799
  20. 20. Nithya B, Ilango V. Evaluation of machine learning based optimized feature selection approaches and classification methods for cervical cancer prediction. SN Applied Sciences. 2019;1:1–16.
  21. 21. Akter L, Islam MM, Al-Rakhami MS, Haque MR. Prediction of cervical cancer from behavior risk using machine learning techniques. SN Computer Science. 2021;2:1–10.
  22. 22. Islam MM, Haque MR, Iqbal H, Hasan MM, Hasan M, Kabir MN. Breast cancer prediction: a comparative study using machine learning techniques. SN Computer Science. 2020;1:1–14.
  23. 23. Srinivasulu A, Ramanjaneyulu K, Neelaveni R, Karanam SR, Majji S, Jothilingam M, et al. Advanced lung cancer prediction based on blockchain material using extended CNN. Appl Nanosci. 2021;13:1–13.
  24. 24. Foersch S, Glasner C, Woerl AC, Eckstein M, Wagner DC, Schulz S, et al. Multistain deep learning for prediction of prognosis and therapy response in colorectal cancer. Nature medicine. 2023;29(2):430–439. pmid:36624314
  25. 25. Zhuang Y, Jiang N, Xu Y. Progressive distributed and parallel similarity retrieval of large CT image sequences in mobile telemedicine networks. Wireless Communications and Mobile Computing. 2022;2022:1–13.
  26. 26. Lu S, Yang B, Xiao Y, Liu S, Liu M, Yin L, et al. Iterative reconstruction of low-dose CT based on differential sparse. Biomedical Signal Processing and Control. 2023;79:104204.
  27. 27. Lu S, Liu S, Hou P, Yang B, Liu M, Yin L, et al. Soft Tissue Feature Tracking Based on DeepMatching Network. CMES-Computer Modeling in Engineering & Sciences. 2023;136(1).
  28. 28. Puente-Castro A, Fernandez-Blanco E, Pazos A, Munteanu CR. Automatic assessment of Alzheimer’s disease diagnosis based on deep learning techniques. Computers in biology and medicine. 2020;120:103764. pmid:32421658
  29. 29. Aggarwal R, Sounderajah V, Martin G, Ting DS, Karthikesalingam A, King D, et al. Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis. NPJ digital medicine. 2021;4(1):65. pmid:33828217
  30. 30. Wang S, Yang DM, Rong R, Zhan X, Xiao G. Pathology image analysis using segmentation deep learning algorithms. The American journal of pathology. 2019;189(9):1686–1698. pmid:31199919
  31. 31. Lu S, Yang J, Yang B, Yin Z, Liu M, Yin L, et al. Analysis and Design of Surgical Instrument Localization Algorithm. CMES-Computer Modeling in Engineering & Sciences. 2023;137(1).
  32. 32. Wang W, Qi F, Wipf D, Cai C, Yu T, Li Y, et al. Sparse Bayesian Learning for End-to-End EEG Decoding. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2023;. pmid:37506000
  33. 33. Yi X, Guan X, Chen C, Zhang Y, Zhang Z, Li M, et al. Adrenal incidentaloma: machine learning-based quantitative texture analysis of unenhanced CT can effectively differentiate sPHEO from lipid-poor adrenal adenoma. Journal of Cancer. 2018;9(19):3577. pmid:30310515
  34. 34. He B, Lu Q, Lang J, Yu H, Peng C, Bing P, et al. A new method for CTC images recognition based on machine learning. Frontiers in Bioengineering and Biotechnology. 2020;8:897. pmid:32850745
  35. 35. Lin Q, Xiongbo G, Zhang W, Cai L, Yang R, Chen H, et al. A Novel Approach of Surface Texture Mapping for Cone-beam Computed Tomography in Image-guided Surgical Navigation. IEEE Journal of Biomedical and Health Informatics. 2023;. pmid:37490371
  36. 36. Kalbhor M, Shinde SV, Jude H. Cervical cancer diagnosis based on cytology pap smear image classification using fractional coefficient and machine learning classifiers. TELKOMNIKA (Telecommunication Computing Electronics and Control). 2022;20(5):1091–1102.
  37. 37. Lavanya Devi N, Thirumurugan P. Cervical cancer classification from pap smear images using modified fuzzy C means, PCA, and KNN. IETE Journal of Research. 2022;68(3):1591–1598.
  38. 38. Alquran H, Mustafa WA, Qasmieh IA, Yacob YM, Alsalatie M, Al-Issa Y, et al. Cervical cancer classification using combined machine learning and deep learning approach. Comput Mater Contin. 2022;72(3):5117–5134.
  39. 39. Kalbhor M, Shinde S, Popescu DE, Hemanth DJ. Hybridization of Deep Learning Pre-Trained Models with Machine Learning Classifiers and Fuzzy Min–Max Neural Network for Cervical Cancer Diagnosis. Diagnostics. 2023;13(7):1363. pmid:37046581
  40. 40. Hao S, Jiali P, Xiaomin Z, Xiaoqin W, Lina L, Xin Q, et al. Group identity modulates bidding behavior in repeated lottery contest: neural signatures from event-related potentials and electroencephalography oscillations. Frontiers in Neuroscience. 2023;17:1184601. pmid:37425015
  41. 41. Zhang R, Li L, Zhang Q, Zhang J, Xu L, Zhang B, et al. Differential Feature Awareness Network within Antagonistic Learning for Infrared-Visible Object Detection. IEEE Transactions on Circuits and Systems for Video Technology. 2023;PP:1–1.
  42. 42. Ma CY, Zhao J, Gan GH, He XL, Xu XT, Qin SB, et al. Establishment of a prediction model for severe acute radiation enteritis associated with cervical cancer radiotherapy. World Journal of Gastroenterology. 2023;29(8):1344. pmid:36925455
  43. 43. Chang QQ, Chen CY, Chen Z, Chang S. LncRNA PVT1 promotes proliferation and invasion through enhancing Smad3 expression by sponging miR-140-5p in cervical cancer. Radiology and Oncology. 2019;53(4):443. pmid:31626590
  44. 44. Li M, Xiao Y, Liu M, Ning Q, Xiang Z, Zheng X, et al. MiR-26a-5p regulates proliferation, apoptosis, migration and invasion via inhibiting hydroxysteroid dehydrogenase like-2 in cervical cancer cell. BMC cancer. 2022;22(1):876. pmid:35948893
  45. 45. Xie X, Wang X, Liang Y, Yang J, Wu Y, Li L, et al. Evaluating cancer-related biomarkers based on pathological images: a systematic review. Frontiers in Oncology. 2021;11:763527. pmid:34900711
  46. 46. Chen S, Chen Y, Yu L, Hu X. Overexpression of SOCS4 inhibits proliferation and migration of cervical cancer cells by regulating JAK1/STAT3 signaling pathway. European Journal of Gynaecological Oncology. 2021;42(3):554–560.
  47. 47. García JC, Ríos-Pena L, Rodríguez MCR, Maraver FM, Jiménez IR. Development and internal validation of a multivariable prediction model for the quality of life of cervical cancer survivors. Journal of Obstetrics and Gynaecology Research. 2023;. pmid:37435890
  48. 48. Tanimu JJ, Hamada M, Hassan M, Kakudi H, Abiodun JO. A machine learning method for classification of cervical cancer. Electronics. 2022;11(3):463.
  49. 49. Quinlan S, Afli H, O’Reilly R. A Comparative Analysis of Classification Techniques for Cervical Cancer Utilising At Risk Factors and Screening Test Results. In: AICS; 2019. p. 400–411.
  50. 50. Abdoh SF, Rizka MA, Maghraby FA. Cervical cancer diagnosis using random forest classifier with SMOTE and feature reduction techniques. IEEE Access. 2018;6:59475–59485.
  51. 51. Ijaz MF, Attique M, Son Y. Data-driven cervical cancer prediction model with outlier detection and over-sampling methods. Sensors. 2020;20(10):2809. pmid:32429090
  52. 52. Jahan S, Islam MS, Islam L, Rashme TY, Prova AA, Paul BK, et al. Automated invasive cervical cancer disease detection at early stage through suitable machine learning model. SN Applied Sciences. 2021;3:1–17.
  53. 53. Al Mudawi N, Alazeb A. A model for predicting cervical cancer using machine learning algorithms. Sensors. 2022;22(11):4132. pmid:35684753
  54. 54. Fernandes K, Cardoso JS, Fernandes J. Transfer learning with partial observability applied to cervical cancer screening. In: Pattern Recognition and Image Analysis: 8th Iberian Conference, IbPRIA 2017, Faro, Portugal, June 20-23, 2017, Proceedings 8. Springer; 2017. p. 243–250.
  55. 55. Mallinson H, Gammerman A. Imputation using support vector machines. University of London Egham, UK: Department of Computer Science Royal Holloway. 2003;.
  56. 56. Rendon E, Alejo R, Castorena C, Isidro-Ortega FJ, Granda-Gutierrez EE. Data sampling methods to deal with the big data multi-class imbalance problem. Applied Sciences. 2020;10(4):1276.
  57. 57. Brandt J, Lanzen E. A comparative review of SMOTE and ADASYN in imbalanced data classification. DIVA. 2021;.
  58. 58. Sagi O, Rokach L. Approximating XGBoost with an interpretable decision tree. Information Sciences. 2021;572:522–542.
  59. 59. Schonlau M, Zou RY. The random forest algorithm for statistical learning. The Stata Journal. 2020;20(1):3–29.
  60. 60. Liu Y, Gao Y, Yin W. An improved analysis of stochastic gradient descent with momentum. Advances in Neural Information Processing Systems. 2020;33:18261–18271.
  61. 61. Dann E, Henderson NC, Teichmann SA, Morgan MD, Marioni JC. Differential abundance testing on single-cell data using k-nearest neighbor graphs. Nature Biotechnology. 2022;40(2):245–253. pmid:34594043
  62. 62. Shipe ME, Deppen SA, Farjah F, Grogan EL. Developing prediction models for clinical use using logistic regression: an overview. Journal of thoracic disease. 2019;11(Suppl 4):S574. pmid:31032076
  63. 63. Sharaff A, Gupta H. Extra-tree classifier with metaheuristics approach for email classification. In: Advances in Computer Communication and Computational Sciences: Proceedings of IC4S 2018. Springer; 2019. p. 189–197.