Abstract
Lung cancer remains a leading cause of cancer-related mortality worldwide, with early and accurate diagnosis posing a critical challenge for improving patient outcomes. Gene expression data provide crucial insights for lung cancer classification by revealing underlying biological mechanisms. However, the high dimensionality of such data presents challenges, including computational complexity and overfitting risks. This study proposes a hybrid feature extraction framework combining Principal Component Analysis (PCA) and Mutual Information (MI) to address these issues. PCA reduces dimensionality by capturing key variance patterns, while MI selects features highly relevant to the target class, ensuring an informative and concise feature set. Gene expression datasets from The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) were integrated, focusing on common genes. The hybrid PCA-MI framework was applied to rank genes, and the selected features were used to train a Convolutional Neural Network (CNN) for lung cancer classification. The genes ranked by the hybrid model were further analysed using protein-protein interaction (PPI) networks to identify hub genes, enhancing biological interpretability. The proposed framework was benchmarked against ten other feature extraction methods, including Lasso, Random Forest, Autoencoder, and PCA alone. The CNN classifier achieved superior performance with the PCA-MI features, attaining 98% accuracy and 98% precision. Training and validation curves demonstrated stable learning behaviour, and confusion matrix analysis confirmed robust predictions. Hub gene identification through PPI analysis validated the biological significance of the ranked genes. This study presents a robust framework for lung cancer classification by leveraging the strengths of PCA and MI, integrating deep learning and PPI analysis to address high-dimensional data challenges, and setting a foundation for future research in multi-omics data integration and enhanced diagnostic strategies.
Citation: Shah SNA, Issar K, Parveen R (2026) A hybrid feature extraction framework combining PCA and mutual information for gene expression based lung cancer classification. PLoS One 21(2): e0342160. https://doi.org/10.1371/journal.pone.0342160
Editor: Suyan Tian, The First Hospital of Jilin University, CHINA
Received: October 8, 2025; Accepted: January 19, 2026; Published: February 5, 2026
Copyright: © 2026 Shah et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data and code that support the findings of this study are available on https://github.com/SyeddNaseer/PCA_MI_GeneFeatureExtraction.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Abbreviations: PCA, Principal Component Analysis; MI, Mutual Information; TCGA, The Cancer Genome Atlas; ICGC, International Cancer Genome Consortium; CNN, Convolutional Neural Network; NSCLC, Non-Small Cell Lung Cancer; SCLC, Small Cell Lung Cancer; PPI, protein-protein interaction; MSE, Mean Squared Error; Lasso, Least Absolute Shrinkage and Selection Operator; ANOVA, Analysis of Variance
1. Introduction
Lung cancer is one of the most common and deadliest cancers in the world: about 1.8 million people die from it annually, accounting for nearly 18% of all cancer deaths. It is also among the most frequently diagnosed cancers each year and is prevalent in high-income as well as low-income countries [1]. Lung cancer is primarily classified into two major types: Small Cell Lung Cancer (SCLC) and Non-Small Cell Lung Cancer (NSCLC). Each category has different biological features, treatment responses, and survival characteristics, so accurate classification is crucial for proper treatment planning and favourable patient outcomes. NSCLC constitutes about 85% of all lung cancers and is further categorised into three main classes based on differences in histopathological and molecular profiles. Adenocarcinoma constitutes around 40% of lung cancers and is the most common type of NSCLC [2]. It commonly occurs at the periphery of the lung, more often in nonsmokers and younger patients. Adenocarcinoma grows relatively slowly compared with other carcinoma forms, and histological examination often shows glandular structures. Squamous cell carcinoma accounts for approximately 25–30% of NSCLC cases and usually develops in the central regions of the lungs, most often in the larger bronchi [3]. It is strongly associated with a history of smoking, typically exhibits keratinisation and intercellular bridges on histopathological examination, and tends to invade the surrounding tissues aggressively. Large cell carcinoma accounts for approximately 10–15% of NSCLC. It can appear in any lung location as a solid mass and is generally diagnosed once the disease is advanced [4]. It consists of large undifferentiated cells that resemble neither adenocarcinoma nor squamous cell carcinoma; it is particularly aggressive and tends to metastasise early. The three major types of NSCLC, adenocarcinoma, squamous cell carcinoma, and large cell carcinoma, are shown in Fig 1, illustrating their different growth patterns and histopathological features. SCLC accounts for approximately 15% of lung cancer diagnoses; it grows aggressively and spreads rapidly to distant sites.
SCLC derives from neuroendocrine cells within the epithelium of the bronchial tree and occurs almost exclusively in patients with a history of heavy smoking. The cells are small, round, and densely packed, hence the histological term “oat cell” carcinoma. Because of its aggressive nature, SCLC is treated under different protocols from NSCLC, relying on chemotherapy and radiation therapy rather than surgical resection [5,6]. The high mortality and poor prognosis of lung cancer have necessitated the development of improved diagnostic and classification tools. Traditional classification methods have been based on histopathological examination; although effective, they can be supplemented and enhanced by molecular approaches such as gene expression analysis [7]. Expression profiling allows thousands of genes to be studied simultaneously, providing a better understanding of the molecular underpinnings of lung cancer subtypes and enabling more accurate classification; it may also inform treatment decisions, identify novel therapeutic targets, and improve early detection strategies [8]. However, gene expression datasets are challenging to analyse, primarily because of their high dimensionality. Each sample’s expression profile comprises thousands of gene measurements, yet the number of samples is much smaller, generally several hundred in cancer studies. This imbalance between the number of features (genes) and samples leads machine learning models to learn noise rather than meaningful patterns, so they cannot generalise well to new data. Successful feature extraction therefore means reducing dimensionality without losing informative signals about the biological processes underlying lung cancer [9]. Feature extraction is an essential preprocessing procedure for gene expression data, since it identifies relevant features in the raw data and transforms them into a manageable, informative subset. Selecting appropriate features can improve the efficiency of computationally expensive classification algorithms while also improving accuracy by focusing on the data most relevant for distinguishing the various classes or subtypes of lung cancer. Conversely, if features are not adequately selected, models become cluttered with redundant and unimportant features, which generate noise, reduce interpretability, and hide meaningful biological patterns [10]. Selecting informative features is thus crucial in high-dimensional datasets such as gene expression data for improving the reliability and accuracy of classification models. We propose a hybrid feature extraction framework combining PCA and mutual information to overcome the dimensionality challenges in lung cancer gene expression data. PCA is a well-established dimensionality reduction technique that converts high-dimensional data into a lower-dimensional space by identifying orthogonal components that capture maximum variance. By retaining only a subset of these principal components, PCA reduces noise and overall data complexity. However, it ignores the connection between the features and the target labels; the resulting components are therefore not necessarily ordered by relevance to the classification task [11].
We further improve feature relevance through mutual information, which ranks features by their mutual dependence with the target class. Because MI captures nonlinear dependencies, it selects the features that best distinguish the types of lung cancer. Combining PCA and MI thus exploits each technique’s strengths: PCA reduces dimensionality as much as possible by capturing the main variance, while MI selects the features with the highest relevance, ensuring an informative feature set for classification. The underlying motivation of this hybrid approach is to optimise lung cancer classification accuracy through the right balance of variance capture from PCA and feature relevance from MI [12,13]. The framework is intended to improve classification accuracy, reduce noise, and enhance the interpretability of gene expression data. We test the proposed hybrid method with a CNN classifier and compare its performance against models based on individual feature extraction techniques. Hub genes were further identified through protein-protein interaction analysis to assess biological relevance. Our results show that combining PCA and MI yields a more robust feature set for classification tasks, making hybrid feature extraction methods potentially useful for gene expression analysis in complex diseases like lung cancer.
2. Related work
PCA is one of the most essential tools for dimensionality reduction in gene expression studies, which involve thousands of gene features but relatively small sample sizes. PCA captures essential variance patterns effectively to reduce complexity and has multiple applications in bioinformatics, mainly for noise removal and visualisation. However, because PCA focuses on variance, it is inherently unconcerned with the relevance of features to the target classification, which is critical in cancer studies. Recent studies have therefore complemented PCA with relevance-based techniques such as mutual information, which chooses the features most informative about the class label, balancing variance representation with classification-relevant information [14,15]. The limitations of single-method feature selection have motivated researchers to explore hybrid approaches, which combine filter-based and wrapper-based methods to address the redundancy and noise problems common in gene expression data. For example, hybrid methods combining MI with genetic algorithms have been applied to gene selection and optimised for cancer classification; the resulting classifiers achieved both high precision and computational efficiency. Hybrid methodologies usually outperform single-method techniques, offering concise yet highly informative feature sets [16,17], thus enhancing classification accuracy and controlling the risk of overfitting. With the advent of deep learning, hybrid feature selection has become especially valuable for preparing datasets for models like CNNs, which excel at finding patterns in complex data. In cancer classification tasks, CNNs perform well when supported by a sound feature selection process. It has recently been shown, for example, that integrating MI-based feature selection with machine learning models yields significant accuracy gains on gene expression data, mainly because hybrid approaches guarantee the selection of features that not only reduce dimensionality but are also highly discriminative for the classification task [15,17]. Comparative studies have shown that hybrid feature selection approaches such as PCA-MI outperform single-method approaches like Lasso and ElasticNet. Such studies confirm the improvement in classification accuracy and the stabilisation of learning behaviour due to PCA-MI hybrids. Experimental comparisons between PCA-MI hybrids and techniques such as Random Forest and ElasticNet have shown significant improvements in precision and accuracy, particularly when implemented alongside deep models such as CNNs [18].
3. Materials and methods
This study proposes a hybrid PCA-MI framework for feature extraction and classification of lung cancer gene expression data. It includes data preprocessing, dimensionality reduction using PCA, and feature selection based on mutual information. The reduced feature space is fed to a CNN model, which is further optimised for classification, and performance is compared across various feature extraction techniques. A comprehensive PPI network analysis was conducted to identify hub genes, focusing on the biological significance of the identified genes’ functions. The methodology is detailed in Fig 2, which provides an overview of the workflow.
3.1. Dataset collection and preparation
This study utilises gene expression data derived from two widely acknowledged public repositories: TCGA and ICGC. The TCGA dataset comprises 1,153 samples, of which 541 are adenocarcinoma samples, 502 are squamous cell carcinoma samples, and 110 are normal samples. A Python script was used to download the data and verify that it transferred without errors. Similarly, we obtained the ICGC dataset of 543 samples, of which 488 are adenocarcinoma samples and 55 are normal. These two independently collected datasets provide complementary information for developing a robust and comprehensive benchmark dataset [19].
3.1.1. Data preprocessing.
To prepare the data for analysis, we applied Z-score normalisation to standardise gene expression values. This normalisation ensured consistency across samples from different repositories by transforming each gene’s expression values to have a mean of zero and a standard deviation of one. The formula for Z-score normalisation [20] is

$$Z = \frac{X - \mu}{\sigma}$$

where $X$ is the value being normalised, $\mu$ is the mean of the dataset, and $\sigma$ is the standard deviation of the dataset. This step minimised biases or variations stemming from differences in sequencing platforms or data processing protocols, making the datasets more compatible.
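As an illustration, per-gene Z-score normalisation can be performed with scikit-learn’s StandardScaler. This is a minimal sketch with a toy matrix, not the study’s actual script:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy expression matrix: rows are samples, columns are genes.
expression = np.array([[5.1, 200.0],
                       [4.9, 180.0],
                       [5.3, 220.0]])

# StandardScaler applies Z = (X - mu) / sigma column-wise, so every
# gene ends up with mean 0 and standard deviation 1.
normalised = StandardScaler().fit_transform(expression)
print(normalised.mean(axis=0))  # ~[0, 0]
print(normalised.std(axis=0))   # ~[1, 1]
```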
3.1.2. Merging.
After standardisation, we combined the TCGA and ICGC datasets into a single benchmark set, retaining only the genes present in both sources so that expression values remain comparable between the two. Missing values were imputed with the mean after merging: for any gene lacking expression values in given samples, the missing entries were replaced by that gene’s average expression across the remaining samples. This ensured no gaps within the dataset, permitting uninterrupted analysis [21].
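A pandas sketch of this merge-and-impute step might look as follows; the file names and the samples-in-rows orientation are illustrative assumptions, not the study’s exact code:

```python
import pandas as pd

# Hypothetical inputs: samples in rows, genes in columns.
tcga = pd.read_csv("tcga_expression.csv", index_col=0)
icgc = pd.read_csv("icgc_expression.csv", index_col=0)

# Keep only the genes measured in both repositories.
common_genes = tcga.columns.intersection(icgc.columns)
merged = pd.concat([tcga[common_genes], icgc[common_genes]], axis=0)

# Mean imputation: replace a gene's missing entries with its average
# expression across the remaining samples.
merged = merged.fillna(merged.mean())
```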
3.1.3. Class balancing.
The dataset showed class imbalance, with notably fewer normal samples, so the Synthetic Minority Over-sampling Technique (SMOTE) was applied to rectify this problem. SMOTE is a popular technique for generating synthetic samples to balance the class distribution. In this study, the SMOTE algorithm generates synthetic samples for the minority class, the normal samples: it randomly selects a minority class sample together with one of its nearest neighbours and interpolates between the two to create a new synthetic sample [22].
Each synthetic sample is generated as

$$x_{\text{new}} = x_i + \delta \cdot (x_{zi} - x_i)$$

where $\delta$ is a random number between 0 and 1, so the generated samples differ from one another, and $x_i$ and $x_{zi}$ are the feature vectors of the minority sample and its selected neighbour. These artificial samples enriched the minority class, balancing the data and improving the classification model’s performance on minority classes [23].
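A minimal sketch of this balancing step with imbalanced-learn’s SMOTE, on toy data standing in for the expression matrix:

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
# Toy imbalanced data: 90 tumour samples (1) vs 10 normal samples (0).
X = np.vstack([rng.normal(0, 1, (90, 5)), rng.normal(3, 1, (10, 5))])
y = np.array([1] * 90 + [0] * 10)

# SMOTE interpolates each minority sample with one of its nearest
# neighbours: x_new = x_i + delta * (x_zi - x_i), delta ~ U(0, 1).
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y), "->", Counter(y_bal))  # {1: 90, 0: 10} -> {1: 90, 0: 90}
```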
3.1.4. Labelling.
The samples were labelled for classification into their categories: adenocarcinoma, squamous cell carcinoma, and normal. This labelling provided an organised foundation for analysis, both for tumour vs normal classification and for the finer subtype classification within lung cancer. These labelled and pre-processed data thus furnished a solid foundation for downstream machine-learning tasks and model development [24].
3.2. Problem of high dimensionality and the need for feature extraction
Gene expression profiles have high dimensionality, with thousands of features (genes) and very few samples for analysis. This imbalance creates several issues: a risk of overfitting, where the model learns noise rather than underlying patterns and generalises poorly to new data, and computational inefficiency that makes some models difficult to train properly [25]. To overcome these problems, our study uses a hybrid feature extraction technique based on PCA and mutual information, retaining only helpful features while reducing noise and computational load.
3.3. Proposed hybrid feature extraction method
3.3.1. Principal component analysis.
PCA is a dimensionality reduction technique that maps a dataset from a high-dimensional space into a lower-dimensional one by identifying principal components, the directions holding the maximum variance within the data. Working with these principal components, PCA effectively decreases the dimensionality of the dataset while minimising noise and improving computational efficiency without sacrificing the structure of the original data. In this study, PCA is applied to the gene expression profiles, keeping only the components that retain over 95% of the total variance in the dataset [26]. These principal components are uncorrelated linear combinations of the original features, which helps reduce redundancy and improves the dataset’s overall quality by filtering out less informative variables. After PCA, the components were ranked systematically according to their contribution to the overall variance, allowing us to choose only the top-ranked components for further analysis. This care in the process ensured that the most informative aspects of the gene expression data were preserved, providing a robust foundation for subsequent classification tasks [27].
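In scikit-learn, retaining the components that explain 95% of the variance can be expressed directly; the sketch below uses random data as a stand-in for the standardised expression matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(200, 1000))  # stand-in for standardised expression data

# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches that fraction (here 95%).
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

# Components come out ranked by explained variance, highest first.
print(X_pca.shape)
print(pca.explained_variance_ratio_[:5])
```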
3.3.2. Mutual information.
MI is a key metric in the analysis of gene expression data that quantifies the dependency between an individual feature and the target variable, which in this study is the class label assigned to the different types of lung cancer. MI enables relevance-based selection, identifying the features that share the most information with the class labels used for classification [28,29]. This is very relevant for classifying lung cancers, as it makes a substantial difference in identifying the subtypes that drive diagnosis and treatment strategy. In this work, MI was computed between each PCA-transformed component and the corresponding lung cancer class labels. The analysis focused on finding the components with the highest MI scores, as these were considered to have the greatest predictive capability for classifying the various types of lung cancer. This ensures that the resulting feature set is not only representative of the underlying data structure but also highly relevant for accurate classification of lung cancer cases, thus enhancing the overall effectiveness of the classification model [30].
3.3.3. Integration of PCA and MI.
The components preserved from PCA were further ranked by their MI scores, an essential step in tuning the features for classification. This demonstrates the merit of combining dimensionality reduction through PCA with relevance-based feature selection through MI. First, PCA reduces the dimensionality of the high-dimensional gene expression dataset, converting the original features into a reduced set of uncorrelated principal components. This reduction not only minimised the complexity of the dataset but also helped filter out noise and redundancy, conserving the critical structure of the data and capturing the significant variance. Next, the retained components’ MI scores, a measure of each component’s dependency on the lung cancer class labels, were computed [31]. The components were then ranked by their MI scores to highlight those most informative for the classification problem at hand, and only the most highly ranked components, those with the highest MI scores, were selected for the final feature set. The result is a compact feature set, produced by dimensionality reduction with PCA, that is highly relevant for classifying lung cancer subtypes. In essence, the integration of PCA and MI produces a more robust feature set that better optimises the model’s ability to classify the various classes of lung cancer while retaining the most informative characteristics of the underlying data [32].
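A minimal sketch of this two-step selection, assuming scikit-learn; the function name pca_mi_select and the parameters var_kept and top_k are illustrative placeholders, not the study’s published configuration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_classif

def pca_mi_select(X, y, var_kept=0.95, top_k=50):
    """Reduce with PCA, then keep the components most informative about y."""
    pca = PCA(n_components=var_kept)
    Z = pca.fit_transform(X)                        # variance-ranked components
    mi = mutual_info_classif(Z, y, random_state=0)  # MI of each component with labels
    order = np.argsort(mi)[::-1][:min(top_k, Z.shape[1])]
    return Z[:, order], order, mi[order]

# Usage on a standardised matrix X_scaled with class labels y:
# Z_sel, kept_idx, mi_scores = pca_mi_select(X_scaled, y)
```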
3.4. CNN classifier for lung cancer classification
CNNs have become a vital tool for examining gene expression data in complex tasks such as lung cancer classification. Gene expression profiles generally span a vast number of genes and are therefore high-dimensional, which is known to invite overfitting and to limit interpretability. A CNN addresses both issues by automatically learning to extract essential features through convolutional layers while capturing intricate relationships between genes. This approach is helpful because it does not demand extensive manual feature engineering; the CNN can instead identify patterns in the data that may indicate disease-specific gene expression signatures [33,34]. The convolutional layers of the CNN recognise expression patterns, while pooling layers downsample the data to emphasise critical features and reduce complexity, making the analysis efficient yet precise. The CNN also uses non-linear activation functions like ReLU, thereby capturing complex and highly non-linear relationships between genes, which often mark disease states. Dropout layers help prevent overfitting by randomly turning off some neurons during training, making the model more robust and better at generalising to unseen data. The architecture’s convolutional, pooling, and dense layers provide an all-inclusive representation of high-dimensional gene data for multiclass classification in lung cancer; the network outputs probabilities across the various cancer types, distinguishing between the subtypes of lung cancer based on even subtle distinctions in gene expression patterns. The strength of CNNs applied to a feature-reduced dataset, for instance one produced by hybrid methods such as PCA and mutual information filtering, is that they focus only on the features of highest biological importance, thus enhancing classification accuracy. CNNs for lung cancer classification provide an effective instrument where traditional methods fail, potentially uncovering key biomarkers and pathways useful in precision medicine and the development of novel therapeutic strategies [35].
3.5. CNN model architecture
The proposed CNN architecture is designed to classify lung cancer subtypes effectively, as shown in Fig 3. It integrates hybrid PCA-MI-based feature extraction with a multi-layered neural network structure to capture intricate patterns in the data. The architecture comprises input, convolutional, pooling, fully connected, and output layers, optimised through extensive hyperparameter tuning and evaluated using robust performance metrics.
Input Layer: The input layer of the CNN accepts the final feature set produced by the hybrid PCA-MI method. Every sample is represented as a feature vector consisting of the selected components.
Convolutional Layers: The architecture consists of several convolution layers operating at different filter sizes to capture different patterns from the feature set. A ReLU activation function is applied for every convolutional layer so that non-linearity can be injected into the model. Max-pooling layers follow convolutional layers to downsample feature maps that reduce both the dimensionality and computation required for the forward and backward pass.
Fully Connected Layers: The fully connected layers placed after the convolutional layers aggregate the features extracted by the preceding layers and make predictions based on the learned patterns. Dropout layers were used to prevent overfitting, with a dropout rate between 0.3 and 0.5 [33,36].
Output Layer: The final layer of the network is a softmax layer with as many units as lung cancer classes (adenocarcinoma, squamous cell carcinoma, large cell carcinoma, and small cell lung cancer). This layer outputs class probabilities for each sample.
3.5.1. Training parameters.
The Adam optimiser is used with an initial learning rate of 0.001. Categorical cross-entropy loss measures the difference between predicted and known class labels. Early stopping based on validation loss was used to avoid overfitting during training [37].
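A minimal Keras sketch consistent with the architecture and training parameters described above; the exact layer counts and filter sizes were tuned by grid search (Section 3.5.2), so the values here are illustrative, not the study’s final configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

def build_cnn(n_features, n_classes):
    """1D CNN over the PCA-MI feature vector, input shape (features, 1)."""
    model = models.Sequential([
        layers.Input(shape=(n_features, 1)),
        layers.Conv1D(64, kernel_size=3, activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(128, kernel_size=3, activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.4),  # dropout in the 0.3-0.5 range used here
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True)
# model.fit(X_train, y_train_onehot, validation_split=0.2,
#           epochs=100, batch_size=32, callbacks=[early_stop])
```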
3.5.2. Hyperparameter tuning.
Key hyperparameters, including the number of convolutional layers, filter sizes, learning rate, batch size, and dropout rate, were tuned using grid search accompanied by cross-validation. This iterative process helped identify the best model configuration.
3.5.3. Evaluation metrics.
The model’s performance was evaluated with a range of metrics, such as accuracy, precision, recall, F1-score, and the area under the ROC curve (AUC). These metrics comprehensively assessed the classification performance across different lung cancer classifications [38].
3.6. Comparative analysis with individual feature extraction techniques
We compared our proposed PCA-MI hybrid feature extraction method against ten established techniques, each applied independently to the pre-processed dataset. These techniques provide other means of dimensionality reduction and feature selection, giving insight into how variations in method impact CNN performance in classifying lung cancer from gene expression data.
3.6.1. AutoEncoder.
An autoencoder is a feedforward neural network designed for unsupervised learning. It aims to discover compressed and meaningful representations of the data, which makes autoencoders useful for dimensionality reduction and feature extraction, since they capture crucial information within the data while discarding irrelevant details and noise [39]. An autoencoder operates in two primary stages: encoding and decoding. The encoding stage compresses the input data into a lower-dimensional latent representation, the “bottleneck,” which captures the core informative aspects of the input. Mathematically, the encoder transforms the input $x$ as

$$h = f(W_e x + b_e)$$

where $W_e$ is the encoder’s weight matrix, $b_e$ is the bias, and $f$ is a non-linear activation function, commonly ReLU or sigmoid. In the decoding stage, the network reconstructs the input data from the compressed representation:

$$\hat{x} = g(W_d h + b_d)$$

where $W_d$ and $b_d$ are the decoder’s weight matrix and bias term, respectively. The decoder output $\hat{x}$ is the reconstruction of the original input $x$; training aims to make $\hat{x}$ as close as possible to $x$ by minimising the reconstruction error. The most common loss function for this task is the Mean Squared Error (MSE), which quantifies the difference between each input feature and its reconstructed value [40,41]:

$$L_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2$$

where $n$ is the number of features. The autoencoder learns its parameters during training to minimise this loss, enabling it to learn compact and informative input representations. This is why autoencoders are effective for analysing complex data such as gene expression profiles, which contain underlying nonlinear relationships critical to understanding gene interactions and significant biological patterns, while reducing noise in such data.
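A minimal Keras sketch of such an autoencoder; the single hidden layer and latent_dim=64 are illustrative assumptions, since the comparison models’ exact architectures are not specified here:

```python
from tensorflow.keras import layers, models

def build_autoencoder(n_features, latent_dim=64):
    """Encoder compresses to a latent bottleneck; decoder reconstructs the input."""
    inputs = layers.Input(shape=(n_features,))
    latent = layers.Dense(latent_dim, activation="relu")(inputs)     # h = f(We x + be)
    outputs = layers.Dense(n_features, activation="linear")(latent)  # x_hat = g(Wd h + bd)
    autoencoder = models.Model(inputs, outputs)
    encoder = models.Model(inputs, latent)
    autoencoder.compile(optimizer="adam", loss="mse")  # minimise reconstruction MSE
    return autoencoder, encoder

# autoencoder, encoder = build_autoencoder(n_features=5000)
# autoencoder.fit(X_scaled, X_scaled, epochs=50, batch_size=32)
# compressed = encoder.predict(X_scaled)
```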
3.6.2. PCA.
PCA is a commonly applied linear dimensionality reduction method. It transforms high-dimensional data into a smaller collection of uncorrelated variables known as principal components. The principal components are linear combinations of the original features, ordered by the amount of variance they explain. The first principal component captures the direction of maximum variance, and each subsequent component captures the greatest possible variance in the remaining orthogonal directions [42]. PCA therefore captures the most meaningful variance in the data with only a few principal components, reducing dimensionality while preserving the most significant patterns. Mathematically, PCA computes the covariance matrix of the data and then finds its eigenvectors (the principal components) and their eigenvalues. Eigenvectors give the directions of maximum variance, and eigenvalues give the magnitude of variance along those directions. If $X$ represents the original (mean-centred) matrix of $n$ samples and $p$ features, the covariance matrix $C$ can be calculated as

$$C = \frac{1}{n-1} X^{T} X$$

where $X^{T}$ is the transpose of $X$. The eigenvectors of $C$, denoted $v_1, v_2, \ldots, v_p$, are the principal components, and the associated eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ represent the variance each component explains. Sorting the eigenvectors in descending order of their eigenvalues enables us to retain only the components with the highest variance, thereby achieving dimensionality reduction. For gene expression data, PCA is beneficial, as it reduces the large feature space to a manageable size, capturing the main variations in gene expression patterns while filtering out noise and less relevant fluctuations [43,44]. This reduction simplifies data processing and improves the efficacy and performance of downstream machine learning models by focusing on the most informative aspects of the data.
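The eigendecomposition view can be illustrated directly in NumPy; a sketch on toy data, complementary to the scikit-learn usage shown earlier:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))          # toy data: 100 samples, 20 features
Xc = X - X.mean(axis=0)                 # centre each feature

C = Xc.T @ Xc / (Xc.shape[0] - 1)       # covariance matrix C = X^T X / (n-1)
eigvals, eigvecs = np.linalg.eigh(C)    # eigh: C is symmetric

order = np.argsort(eigvals)[::-1]       # sort by descending eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 5
Z = Xc @ eigvecs[:, :k]                 # project onto the top-k principal components
explained = eigvals[:k].sum() / eigvals.sum()
print(Z.shape, round(float(explained), 3))
```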
3.6.3. Mutual Information.
MI is a statistical method that quantifies the dependency between two variables, providing insight into how much knowing one variable reduces uncertainty about the other. It is valuable in feature selection, as it helps determine the relevance of each feature with respect to a target class by measuring how much information the feature contributes toward predicting the target. In the context of gene expression data, MI is beneficial because certain genes have strong associations with specific cancer types, making it essential to identify those genes that carry the most predictive information [45]. MI between two discrete random variables $X$ (e.g., a gene’s expression level) and $Y$ (e.g., cancer type) is defined as

$$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$

where $p(x,y)$ represents the joint probability distribution of $X$ and $Y$, and $p(x)$ and $p(y)$ are the marginal probability distributions of $X$ and $Y$, respectively. The MI score $I(X;Y)$ is higher when $X$ and $Y$ have a strong dependency, meaning that knowing the value of $X$ significantly reduces uncertainty about $Y$ and vice versa. In feature selection for classification, this score can rank features by their relevance to the target class, allowing us to prioritise features (genes) that provide the most discriminative information for classification tasks [46,47]. Focusing on features with high MI scores can improve model performance by selecting a subset of features that maximise information gain, which is particularly advantageous when dealing with high-dimensional gene expression data.
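For the discrete formula above, scikit-learn’s mutual_info_score computes MI directly from the empirical joint distribution; the toy bins below are illustrative:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Discretised expression of one gene (e.g., low/medium/high bins) vs cancer type.
gene_bins = np.array([0, 0, 1, 2, 2, 1, 0, 2, 2, 1])
labels    = np.array([0, 0, 0, 1, 1, 0, 0, 1, 1, 0])

# mutual_info_score evaluates I(X;Y) = sum p(x,y) log(p(x,y) / (p(x) p(y)))
# from the observed co-occurrence counts (result is in nats).
print(mutual_info_score(gene_bins, labels))
```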
3.6.4. AutoEncoder and Mutual Information.
This hybrid approach is a feature selection technique that harnesses the strengths of autoencoders together with mutual information to create a concise yet highly informative feature set. It is beneficial in applications such as gene expression analysis, where dimensionality is high and choosing the most informative features plays a crucial role in effective classification. The autoencoder is an artificial neural network trained in an unsupervised manner to learn a compressed representation of the data. It has two parts: an encoder, which maps the high-dimensional input to a lower-dimensional latent space and captures the essence of the data’s structure, and a decoder, which reconstructs the original input from this compressed form. By minimising the reconstruction error, the autoencoder filters out noise while preserving the main patterns, resulting in a condensed representation that retains the essential information of the original dataset [48]. Once the representation is compressed, mutual information is used to identify the features of this compressed representation most relevant to the target variable. Mutual information measures the dependency between two variables, that is, how much knowing the value of one reduces uncertainty about the other. In this scenario, MI scores help pick the features of the compressed representation most useful for predicting the target class: features with higher MI scores are retained, while low-MI features are discarded, yielding a compact, predictive final feature list. AE-MI thus combines unsupervised feature reduction via autoencoders with supervised feature relevance estimation via mutual information [49]. The resulting feature set captures the critical data patterns and is optimised for the specific classification task, improving the efficiency and accuracy of subsequent machine learning models. This hybrid approach is beneficial for challenging datasets, like gene expression data, where discovering a targeted subset of informative genes may lead to better classification and understanding of underlying biological patterns.
3.6.5. Lasso (Least Absolute Shrinkage and Selection Operator).
Lasso is a linear regression method incorporating L1 regularisation, which enforces sparsity on the model’s coefficients and makes it valuable for feature selection. Unlike ordinary linear regression, which fits the model by minimising the residual sum of squares alone, Lasso adds a penalty based on the absolute sum of the coefficients, driving some of them to exactly zero [50]. This sparsity is one of Lasso’s defining properties. In gene expression data, for instance, many features may have at most a weak relationship with the target variable (e.g., a particular disease class). By shrinking the coefficients of irrelevant or redundant features to zero, Lasso reduces the dimensionality of the dataset while retaining the features with the strongest predictive power. The Lasso objective is

$$\min_{\beta} \; \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} |\beta_j|$$

where the first term is the Mean Squared Error (MSE) between the observed target values $y_i$ and the predicted values $\hat{y}_i$, with $n$ the number of samples. The second term, $\alpha \sum_{j} |\beta_j|$, represents the L1 regularisation penalty, with $\alpha$ a regularisation parameter that determines the strength of this penalty and $\beta_j$ denoting the coefficients of the features. When $\alpha$ is set to zero, Lasso regression behaves like ordinary linear regression without any penalty [51]. However, as $\alpha$ increases, the penalty term grows, leading to more coefficients being shrunk toward zero. This process continues until only the most significant features, those with the largest effects on the target variable, retain non-zero coefficients. The choice of $\alpha$ is thus crucial: a smaller $\alpha$ allows more features to be retained, while a larger $\alpha$ forces more coefficients to zero, leaving only the most relevant features. This tuning parameter is often determined by cross-validation to achieve optimal feature selection and predictive accuracy. By reducing the number of active (non-zero) coefficients, Lasso regression simplifies the model and enhances interpretability, as it isolates the features that contribute most meaningfully to the model’s predictions [52,53]. This makes Lasso regression particularly advantageous for gene expression data analysis, where thousands of genes may be analysed, yet only a subset might be truly relevant to the classification or prediction task. Through its regularisation mechanism, Lasso regression aids in identifying these essential genes, making it a highly effective approach for dimensionality reduction in genomics and other fields with high-dimensional data.
3.6.6. Random forest.
Random Forest is an ensemble learning technique that combines the outputs of multiple decision trees to produce a more robust and accurate prediction. Compared to a single decision tree, which is prone to overfitting and sensitive to data variations, Random Forest builds a “forest” of diverse decision trees, each trained on a randomly sampled subset of the original data and a random selection of features. By averaging the predictions of the individual trees, Random Forest reduces variance and generalises better, which is especially useful for high-dimensional and complex datasets. In a Random Forest, every decision tree is built recursively by splitting nodes on feature values, forming branches that lead to the predictions in the leaf nodes [54]. At each stage, the model picks the splits that most reduce the impurity, or disorder, of the target variable; the most popular impurity measures are Gini impurity and entropy. Features whose splits substantially reduce node impurity across the trees receive higher importance scores, and the importance of each feature is estimated as the average reduction in impurity caused by splits on that feature over all trees. Feature importance in Random Forests is computed by aggregating the impurity decreases from all splits on a feature across all trees. If a feature is used to split a node $t$ in tree $T$, the decrease in impurity for that split is defined as

$$\Delta I(t) = I(t) - \frac{n_{L}}{n_{t}} I(t_{L}) - \frac{n_{R}}{n_{t}} I(t_{R})$$

where $I(t)$ is the impurity of the node before the split, $I(t_L)$ and $I(t_R)$ are the impurities of the left and right child nodes after the split, $n_t$ is the total number of samples in node $t$, and $n_L$ and $n_R$ are the numbers of samples in the left and right child nodes. The total importance score for a feature $j$ is obtained by summing the impurity decreases $\Delta I(t)$ over all nodes where $j$ was used to split, across all trees in the forest. This ranking of features by importance helps prioritize variables most relevant to the target variable, allowing Random Forest to serve as both a predictive model and a feature selection method. Random Forest is particularly advantageous for complex datasets like gene expression data, where interactions among thousands of genes can influence the outcome [55,56]. By ranking genes based on importance scores, Random Forest helps identify those most associated with the target variable, making it a valuable tool in genomics, where identifying influential genes can aid in understanding biological pathways or diagnosing diseases.
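A short scikit-learn sketch of importance-based ranking; the sample and feature counts are placeholders standing in for the expression data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for expression data: 300 samples, 200 genes, 3 classes.
X, y = make_classification(n_samples=300, n_features=200, n_informative=15,
                           n_classes=3, random_state=0)

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# feature_importances_ aggregates the impurity decrease of every split
# on each feature across all trees (mean decrease in impurity).
top = np.argsort(forest.feature_importances_)[::-1][:10]
print("top features:", top)
```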
3.6.7. ANOVA (Analysis of Variance).
ANOVA is a statistical technique that tests whether there is a significant difference in the means of a feature across multiple classes by comparing the variation between groups to the variation within groups. In gene expression analysis, ANOVA is particularly useful for identifying genes that show distinct differences in expression between groups, such as cancerous and non-cancerous samples. The test calculates two main types of variance: the between-group variance (which measures the difference between group means and the overall mean) and the within-group variance (which reflects the variation of individual observations within each group from their respective group mean) [57]. The ratio of these variances, known as the F-ratio, is given by

$$F = \frac{\sum_{g=1}^{k} n_g (\bar{x}_g - \bar{x})^2 / (k-1)}{\sum_{g=1}^{k} \sum_{i=1}^{n_g} (x_{gi} - \bar{x}_g)^2 / (N-k)}$$

where $k$ is the number of groups, $n_g$ is the number of observations in group $g$, $\bar{x}_g$ is the mean of group $g$, $\bar{x}$ is the overall mean, and $N$ is the total number of observations [58]. A large F-ratio indicates that the between-group variance is significantly higher than the within-group variance. This suggests that the feature in question has a distinct distribution across different classes, which is essential for identifying informative genes in complex datasets like gene expression profiles.
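In scikit-learn, the one-way ANOVA F-ratio is available as f_classif and is commonly paired with SelectKBest; a minimal sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=100, n_informative=8,
                           random_state=0)

# f_classif computes the one-way ANOVA F-ratio for each feature;
# SelectKBest keeps the k features with the largest F values.
selector = SelectKBest(score_func=f_classif, k=20).fit(X, y)
X_anova = selector.transform(X)
print(X_anova.shape, selector.scores_[:5])
```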
3.6.8. KL divergence.
KL divergence, also termed Kullback-Leibler divergence, measures how one probability distribution deviates from a reference distribution; hence, it is an essential measure for distinguishing features with distributional differences across classes in the context of feature selection. In gene expression analysis, features (genes) whose expression distributions differ between the groups being examined, such as cancer versus non-cancer samples, may reflect differences in biological processes or disease-specific expression patterns [59]. Mathematically, the KL divergence from a distribution $P$ (the true distribution) to a distribution $Q$ (the reference distribution) is expressed as

$$D_{KL}(P \parallel Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$$

where $P(i)$ and $Q(i)$ represent the probability of outcome $i$ under distributions $P$ and $Q$, respectively. The sum extends over all possible outcomes, and KL divergence provides a non-symmetric measure of the information lost when $Q$ is used to approximate $P$ [60,61]. For gene expression data, features with high KL divergence values have distributions that diverge significantly between classes, making them informative candidates for classification tasks or biological interpretation.
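One histogram-based way to estimate this per gene, using SciPy’s entropy (which computes the KL divergence when given two distributions); the bin count and smoothing constant are illustrative choices:

```python
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(0)
tumour = rng.normal(2.0, 1.0, 500)   # one gene's expression in tumour samples
normal = rng.normal(0.0, 1.0, 500)   # the same gene in normal samples

# Estimate P and Q over shared histogram bins, then compute
# D_KL(P || Q) = sum P(i) * log(P(i) / Q(i)) via entropy(p, q).
bins = np.histogram_bin_edges(np.concatenate([tumour, normal]), bins=20)
p, _ = np.histogram(tumour, bins=bins)
q, _ = np.histogram(normal, bins=bins)
p = p.astype(float) + 1e-9          # smooth to avoid division by zero
q = q.astype(float) + 1e-9
p /= p.sum()
q /= q.sum()
print(entropy(p, q))                 # larger value = more divergent distributions
```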
3.6.9. Variance threshold.
Variance Threshold is a straightforward feature selection technique that removes features with low variance, on the assumption that low-variance features contribute minimal information for distinguishing between classes. In gene expression data, for example, features (genes) that show little variation across samples are likely to be uninformative, as they remain constant or nearly constant regardless of sample type [62]. Removing these features simplifies the dataset, reduces noise, and can improve model efficiency without sacrificing classification performance. Mathematically, for a feature with $n$ observations $x_1, x_2, \ldots, x_n$, the variance $\sigma^2$ is calculated as

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

where $\bar{x}$ is the mean of the feature across all samples. If $\sigma^2$ is below a predefined threshold, the feature is considered low variance and is removed from the dataset [63,64]. This technique effectively filters out features that do not contribute meaningful variance, retaining only those that show sufficient variability across samples for downstream analysis.
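A minimal scikit-learn sketch; the threshold of 0.1 is an illustrative value, not the study’s setting:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
X[:, :10] *= 0.01          # make the first ten features nearly constant

# Features whose variance falls below the threshold are dropped.
selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)
print(X.shape, "->", X_reduced.shape)  # (100, 50) -> (100, 40)
```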
3.6.10. Select from model.
Select From Model is a feature selection method that leverages feature importance scores from model-based estimators, such as Lasso or Random Forest, to retain only the most predictive features for a given dataset. This approach is highly adaptable, as it allows the selection of various estimators suited to different data types and classification tasks. For instance, Lasso regression uses L1 regularization, which assigns some coefficients a zero value, effectively removing non-informative features. In contrast, ensemble methods like Random Forest evaluate feature importance by calculating the decrease in node impurity, enabling the identification of features that are strongly correlated with the target variable [65]. By focusing on the most predictive features, SelectFromModel reduces dimensionality, simplifies data representation, and enhances interpretability without sacrificing model accuracy. Mathematically, SelectFromModel relies on feature importance scores from a chosen estimator. For example, in Random Forest, the importance $I_j$ of a feature $j$ is often calculated as the mean reduction in Gini impurity or entropy across all trees:

$$I_j = \frac{1}{T} \sum_{t=1}^{T} \Delta i_t(j)$$

where $T$ is the total number of trees and $\Delta i_t(j)$ represents the reduction in impurity when feature $j$ is used for splitting in tree $t$. For feature selection, SelectFromModel ranks features by $I_j$ and discards those with importance scores below a specified threshold [66]. In applications like gene expression analysis, where many features may be irrelevant, SelectFromModel identifies a compact, informative subset of features, reducing computational burden and improving model performance.
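A minimal scikit-learn sketch; the "median" threshold is one common illustrative choice, roughly discarding the less important half of the features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=300, n_features=200, n_informative=12,
                           random_state=0)

# Keep only features whose importance reaches the chosen threshold;
# threshold="median" keeps roughly half of the features.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=300, random_state=0),
    threshold="median",
).fit(X, y)
print(X.shape, "->", selector.transform(X).shape)
```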
3.6.11. PCA and MI.
The PCA-MI hybrid approach is a powerful feature selection technique that combines PCA with MI to produce a feature set optimized for machine learning models, such as CNNs. This approach leverages PCA’s ability to capture the main patterns of variance in the data and MI’s focus on selecting features that are most relevant to the target variable [42]. The result is a compact and informative feature set that retains essential global and class-specific information, especially for high-dimensional data like gene expression profiles. In the first step, PCA is applied to the dataset to reduce its dimensionality while retaining the directions of largest variance. If $X$ represents the original dataset with $n$ samples and $p$ features, PCA works by computing the covariance matrix $C$ of $X$:

$$C = \frac{1}{n-1} X^{T} X$$

where $X^{T}$ is the transpose of $X$. By calculating the eigenvalues and eigenvectors of $C$, PCA identifies the principal components, the eigenvectors associated with the largest eigenvalues. These principal components are ordered by the amount of variance they explain, with the first principal component capturing the most variance, followed by the second, and so on. If we denote the eigenvectors by $v_1, v_2, \ldots, v_p$ and their corresponding eigenvalues by $\lambda_1, \lambda_2, \ldots, \lambda_p$, the principal components are chosen as the eigenvectors with the largest eigenvalues, representing the directions of maximum variance. For dimensionality reduction, we retain only the top $k$ principal components, reducing the dataset to a new matrix $Z$ of reduced dimensions, where $k$ is selected to capture a high percentage (e.g., 95%) of the total variance, ensuring that the most significant patterns are retained. After PCA reduces the feature space, MI is applied to refine the feature set based on its relevance to the target variable. Mutual Information measures the dependency between each feature and the target, quantifying how much information about the target variable $Y$ is gained by knowing the feature. For each component $z_j$ in $Z$ and target $Y$, MI is calculated as

$$I(z_j; Y) = \sum_{z} \sum_{y} p(z, y) \log \frac{p(z, y)}{p(z)\,p(y)}$$

Components in $Z$ with high MI scores are retained, as they provide the most helpful information for predicting the target class, while those with low scores are discarded. This selection process results in a refined feature set that combines PCA’s ability to capture overall data structure with MI’s emphasis on class-specific relevance. The PCA-MI hybrid approach optimizes the feature space for the CNN model by retaining both global patterns and class-specific information, which can significantly improve classification performance. Reducing dimensionality with PCA and refining relevant features with MI enables the CNN to focus on informative features, resulting in more efficient training and improved predictive accuracy, particularly in complex, high-dimensional datasets like gene expression profiles for cancer classification. Each technique was applied independently to the dataset, and the features extracted by each method were used to train a CNN classifier. Performance was measured using accuracy, precision, recall, and AUC scores, providing a comprehensive comparison [67]. The results demonstrated the PCA-MI hybrid method’s superior performance, showcasing its ability to balance dimensionality reduction and feature relevance for effective lung cancer classification.
3.7. Protein-protein interaction analysis and Hub gene identification
To explore the biological significance of the key genes identified through our hybrid PCA-MI feature extraction framework, we conducted a detailed PPI analysis along with hub gene identification. These essential genes, derived from the hybrid framework’s dimensionality reduction and relevance selection, were subjected to further analysis using the STRING database (version 12.0, available at https://string-db.org/). This database provides a platform for constructing high-confidence interaction networks by mapping genes to known and predicted protein-protein interactions. We focused on the human-specific protein interaction network, applying a stringent confidence score cut-off of ≥0.7. This threshold ensured the inclusion of reliable and biologically meaningful interactions, minimizing noise [68]. The resulting network comprised nodes (representing genes or proteins) and edges (denoting their interactions), effectively capturing the complex relationships among the significant genes. Once the network was constructed, it was exported from STRING and further analysed using Cytoscape (version 3.10.3), a widely used tool for visualising and analysing molecular interaction networks. Within Cytoscape, we employed the CytoHubba plugin to identify the central genes, or “hubs,” in the PPI network. CytoHubba applies various centrality measures to rank nodes based on their importance or influence in the network. For our analysis, we used degree, betweenness, and closeness centrality measures to evaluate each gene’s connectivity and regulatory potential [69]. The top 20 genes with the highest degree of connectivity were identified as hub genes, representing pivotal regulators within the network. These hub genes are proposed to play critical roles in lung cancer progression due to their extensive interactions and likely influence on key biological processes. Their identification provides valuable insights into the molecular mechanisms underlying lung cancer and highlights potential targets for therapeutic intervention or biomarker development. This integrative approach strengthens the biological relevance of the hybrid PCA-MI framework and underscores its utility in cancer research [70,71].
4. Results and discussion
4.1. Dataset
The dataset used for this experiment was curated by merging gene expression data from the TCGA and ICGC repositories. The final benchmark dataset comprises samples from both sources, consisting of diverse lung cancer subtypes and normal samples. TCGA contributed 1,153 samples comprising 541 adenocarcinoma samples, 502 squamous cell carcinoma samples and 110 normal samples, while ICGC contributed 543 samples comprising 488 adenocarcinoma samples and 55 normal samples. Using data from both prominent sources gives a broad dataset with high biological variability, a basis for robust classification models [72].
4.1.1. Data summary and statistics.
We applied Z-score normalization to both datasets to ensure the compatibility of TCGA and ICGC samples. After normalization, we merged the two datasets based on common genes, keeping just those found in both datasets to form a coherent dataset in which gene expression patterns across the different cancer types could be analysed reliably [73]. We then used mean imputation to fill any missing values so that the dataset would have no gaps. Once pre-processing was done, SMOTE was used to address class imbalance by generating synthetic samples for the minority class, the normal samples. This improves the classification model’s reliability, especially for poorly represented classes [74].
During data preparation, the preprocessing and class balancing steps reduced potential biases and platform-dependent variations. Z-score normalization harmonised the data distributions of the TCGA and ICGC samples, reducing differences arising from different sequencing protocols. Mean imputation removed missing values without interrupting the analysis. Class balancing with SMOTE eliminated the initially imbalanced nature of the dataset, offering a balance between normal and cancerous samples. By increasing the number of underrepresented samples, this methodology prevented bias towards the majority classes and supported the robustness of the subsequent classification analysis [74].
4.2 Proposed hybrid model
The model proposes a novel hybrid approach for feature selection, combining PCA and Mutual Information to optimize the process, with a CNN trained on lung cancer gene expression data for classification. The approach takes advantage of the complementary strengths of PCA and MI: PCA reduces dimensionality while MI evaluates feature relevance. Together, these two techniques yield a lean, information-rich feature set that makes the CNN model better at correctly classifying lung cancer samples. Gene expression data are typically high-dimensional, with many redundant and irrelevant features of no obvious relevance to classification. This can degrade a model’s performance through increased computational load and a greater chance of overfitting, where noise rather than meaningful patterns is imprinted on the model [42,75]. Traditional feature selection methods are effective, but they focus either on reducing dimensionality, like PCA, or on capturing feature relevance, like MI, separately. The hybrid approach, combining PCA and MI, gives a comprehensive feature selection process that balances dimensionality reduction with the selection of features most relevant for distinguishing cancer and non-cancer cases. The hybrid approach begins by using PCA to capture the principal components explaining more than 95% of the variance in the data. This significantly reduces the feature space while preserving the main underlying patterns. MI is then applied to rank the PCA features by their mutual dependence with the target labels, so the retained features contribute meaningfully to the classification task. This two-step approach reduces dimensionality robustly without impairing classification potential. Classification accuracy and computational efficiency improved significantly with the PCA-MI hybrid model: the CNN trained on PCA-MI selected features achieved high predictive performance without increased computational cost by reducing the feature set to the most relevant components. The hybrid also outperforms individual feature selection methods; the results indicate higher accuracy and, importantly, greater stability across validation sets compared with the individual approaches. This establishes the PCA-MI hybrid model as an efficient and effective feature selection approach for high-dimensional gene expression data: it uses PCA and MI to develop a feature set that simultaneously reduces dimensionality and improves classification accuracy by retaining only the most informative features [76,77]. The results of a CNN classifier trained on the optimized feature set provide strong evidence of the utility of this hybrid model in diagnosing lung cancer.
4.3. Algorithm description for hybrid PCA-MI feature selection
Step 1: Data Collection & Merging
Download TCGA and ICGC datasets
Merge datasets based on common genes
Step 2: Preprocessing
Impute missing values in merged dataset
Apply SMOTE to balance class distribution in the target variable
Step 3: Feature Preparation
Separate features (X) and target variable (y)
Encode target variable labels as integers
Convert target labels to one-hot encoding for CNN compatibility
Step 4: Data Splitting & Scaling
Split data into training and testing sets
Standardise features using a scaler
Step 5: Hybrid Feature Extraction
Apply PCA to reduce dimensionality, retaining the most informative components
Use Mutual Information to select top features from the PCA output
Save list of selected features for reference
Step 6: Data Reshaping
Reshape features to format compatible with CNN (samples, features, 1)
Step 7: Model Building
Define CNN architecture with appropriate convolutional, pooling, and dense layers
Compile model for multi-class classification with loss and metrics
Step 8: Training
Train model on training set with early stopping to prevent overfitting
Step 9: Evaluation
Evaluate model performance on test set
Calculate accuracy, precision, recall, and F1 score
Save performance metrics to a file named model_metrics.csv in the output directory.
Step 10: Visualization
Generate and save a confusion matrix to confusion_matrix.png in the output directory
Plot and save training and validation curves for accuracy and loss to accuracy_loss_curves.png
Plot and save the ROC curve to roc_curve.png for classification performance assessment.
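To make Steps 6–9 concrete, the sketch below builds and trains a small 1-D CNN in Keras. The layer configuration is an illustrative assumption rather than the published architecture, but the Adam optimizer with a 0.001 learning rate, dropout in the 0.3–0.5 range, and early stopping match the training setup reported later in this section.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

n_features, n_classes = 50, 3                # e.g., PCA-MI features; A/N/S
X_train = np.random.rand(300, n_features, 1).astype("float32")  # Step 6 shape
y_train = tf.keras.utils.to_categorical(
    np.random.randint(0, n_classes, 300), n_classes)            # one-hot labels

# Step 7: a small 1-D CNN for multi-class classification.
model = models.Sequential([
    layers.Input(shape=(n_features, 1)),
    layers.Conv1D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(64, kernel_size=3, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Step 8: train with early stopping to curb overfitting.
stop = callbacks.EarlyStopping(monitor="val_loss", patience=5,
                               restore_best_weights=True)
hist = model.fit(X_train, y_train, validation_split=0.2, epochs=50,
                 batch_size=32, callbacks=[stop], verbose=0)

# Step 9: evaluate (here reusing the training data purely for brevity).
loss, acc = model.evaluate(X_train, y_train, verbose=0)
```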
4.4 Evaluation metrics
We evaluated the performance of our hybrid PCA-MI model using a set of classification metrics: accuracy, recall, precision, F1-score, and the area under the Receiver Operating Characteristic curve (AUC). These metrics capture different aspects of the model's predictive ability, reflecting the need for both sensitivity and specificity. They were chosen to give an all-round view of model performance, balancing overall accuracy against measures that account for false positives and false negatives. This matters most for false negatives in cancer diagnosis, where misclassifying a cancerous case as non-cancerous can have severe consequences [78]. We therefore used metrics that measure the model's ability to classify positive cancer cases correctly, to classify negative cases correctly as non-cancer, and to balance precision with recall.
Accuracy: Accuracy evaluates the percentage of correctly classified instances (both cancerous and non-cancerous) out of the total samples.
In binary classification, accuracy provides a quick overview of the model's correctness. However, it may not fully capture performance when the dataset classes are imbalanced (e.g., many more non-cancerous than cancerous samples).
Precision: Precision quantifies the number of true positive classifications among all samples predicted as positive by the model.
High precision indicates fewer false positives, meaning the model is less likely to incorrectly classify non-cancerous samples as cancerous. This is particularly relevant for avoiding overdiagnosis and unnecessary follow-up procedures.
Recall (Sensitivity): Recall (also known as sensitivity) is the proportion of actual positive cases that the model correctly identifies as positive.
Recall is crucial as it represents the model’s ability to detect cancerous cases. High recall minimizes false negatives, reducing the chance of missing cancer diagnoses, which is vital in clinical applications.
F1-Score: The F1-score is the harmonic mean of precision and recall, providing a single metric that balances both. It is especially informative when there is a trade-off between precision and recall.
The F1-score is beneficial when precision and recall are equally important, emphasising a balance between them. For example, in cases where it’s crucial to detect as many cancer cases as possible without significantly increasing false positives, the F1-score becomes a meaningful metric.
Area Under the ROC Curve (AUC): The AUC measures the area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate at different threshold levels. The AUC is a reliable metric for assessing the model's ability to differentiate between classes across thresholds: an AUC of 1 indicates perfect classification, while an AUC of 0.5 implies the model performs no better than random guessing [78]. Because it aggregates performance across all classification thresholds, the AUC is particularly useful for assessing the reliability of models in medical diagnosis, where choosing an appropriate decision threshold is critical.
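For reference, these metrics follow their standard definitions in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN); the equations below are the textbook forms rather than anything specific to this study.

```latex
\begin{aligned}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision}  = \frac{TP}{TP + FP}, \qquad
\text{Recall}     = \frac{TP}{TP + FN},\\[6pt]
\text{F1-score}  &= 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}
                            {\text{Precision} + \text{Recall}}, \qquad
\text{AUC}        = \int_{0}^{1} \text{TPR}\, d(\text{FPR}).
\end{aligned}
```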
4.5 CNN model performance on PCA-MI selected features
Our hybrid PCA-MI method was applied for feature selection, and the resulting feature subset was used to train and test a CNN classifier. This section provides a detailed performance analysis of the CNN across classification metrics, demonstrating that PCA-MI is an effective feature selection technique. Accuracy, precision, recall, F1-score, and area under the ROC curve (AUC) were used to validate the CNN model, giving an overall view of its ability to distinguish cancerous from non-cancerous samples. The CNN achieved 98% accuracy on the PCA-MI selected features, reflecting its robustness in identifying both cancerous and non-cancerous samples [79]. This suggests that PCA-MI successfully retained the information the CNN needs, enabling good generalization over the data [80]. Precision was likewise 98%: the model predicts cancerous samples correctly, which is important in clinical diagnostics, where false positives must be kept to an absolute minimum to avoid unnecessary interventions. The CNN also attained an F1-score of 97%, balancing true-positive detection against false positives and false negatives. These results demonstrate the strength of PCA-MI in boosting feature relevance and overall model performance [81].
4.6 Impact of PCA-MI feature selection on CNN’s performance
The PCA-MI hybrid feature selection technique significantly improved the CNN's classification performance, particularly its ability to differentiate accurately between cancer classes. It reduced the dimensionality while preserving the features most significant for prediction, allowing the CNN to focus on the high-quality information needed for class separation. The high recall and AUC values indicate that the model effectively identifies cancerous samples, which is crucial for correct diagnosis and for eliminating false positives. PCA-MI feature selection yielded fewer features, reducing the overall complexity of the CNN model [82,83]. The smaller data volume enabled faster training, with the model converging to similar performance in less time. The reduced feature set also simplifies the model architecture, making it less prone to overfitting and improving generalizability. For real-world clinical applications, models must be both precise and efficient; the hybrid PCA-MI approach balances these requirements by removing irrelevant features and focusing on the most informative ones. The resulting CNN model offers a practical solution, improving precision and recall, decreasing the chances of false positives and negatives, and maintaining clinically relevant classifications with high reliability [84].
The confusion matrix in Fig 4A shows the performance of the CNN model in lung cancer subtype classification. All classes were classified with high accuracy, demonstrating that the hybrid PCA-MI feature selection technique is robust for this complex task. For the Adenocarcinoma (A) class, the model correctly classified 301 of 314 samples (approximately 96.9% accuracy), with only minor misclassifications: 4 samples incorrectly labelled as Normal (N) and 9 as Squamous (S). The Normal (N) class achieved a perfect classification rate, with all 306 samples correctly identified, showcasing the model's sensitivity and specificity for non-cancerous cases. For the Squamous (S) class, the model correctly classified 297 of 307 samples, with 9 samples misclassified as A and 1 as N [85]. These results show that the model distinguishes cancerous from non-cancerous samples well, even between similar cancer categories. Fig 4B presents the ROC curves for classifying normal tissue versus the lung cancer subtypes. The AUC exceeds 0.98 for every class, with an overall AUC of 0.99, demonstrating the model's clear-cut ability to discriminate between true and false positives and further consolidating the power of hybrid PCA-MI feature extraction in optimizing the input features fed to the CNN. The training versus validation accuracy curves in Fig 4C reflect the learning of the CNN model at every training step. Across epochs, accuracy on both the training and validation sets increased consistently, producing a monotonic learning curve [86]. The small gap between training and validation accuracy indicates that overfitting was well controlled; stability was provided by appropriate dropout rates (0.3 to 0.5) and by the PCA-MI features, which cut redundancy in the input data and allowed better pattern recognition. The training and validation loss curves in Fig 4D decrease smoothly and largely in parallel over successive epochs. The sharp drop in training loss during the first epochs reflects strong feature learning, and the validation loss tracks a similar trend [87]. The close agreement between the two loss curves indicates robust generalization rather than overfitting, and the low final loss values confirm the suitability of the chosen training parameters, particularly the Adam optimizer with a learning rate of 0.001. Together, the high accuracy in the confusion matrix and the strong ROC curves show that the CNN model is highly robust and reliable for lung cancer classification, differentiating both cancerous from non-cancerous cases and among cancer subtypes, while the stability of the training and validation metrics reflects a well-optimized and generalizable training process [88].
4.7 Comparative analysis with individual feature extraction techniques
This section compares the proposed PCA-MI hybrid feature selection approach with several individual feature extraction techniques and discusses in detail how each affects the performance of the CNN classifier. This analysis is essential to demonstrate the effectiveness of the PCA-MI hybrid model relative to commonly applied methods. As benchmarks, we analyse the performance of the CNN model with the PCA-MI hybrid and ten established feature extraction techniques: Autoencoder, PCA, Mutual Information, Autoencoder combined with Mutual Information, Lasso, Random Forest, ANOVA, KL Divergence, Variance Threshold, and SelectFromModel [79]. Each technique offers a different perspective on dimensionality reduction and feature selection, emphasizing a particular aspect of feature relevance, redundancy, or predictiveness.
4.7.1 CNN performance metrics using each feature extraction technique.
We trained the CNN model with each extraction technique and report results for all of the metrics above: accuracy, precision, recall, F1-score, and area under the ROC curve (AUC). These metrics provide a holistic view of the model's classification performance, reflecting its ability to identify cancerous samples accurately and minimize misclassifications. The results show that the PCA-MI hybrid model consistently outperformed the individual feature extraction methods across most evaluation metrics, demonstrating its advantage in balancing dimensionality reduction with feature relevance [79]. Table 1 shows that the PCA-MI hybrid surpassed the individual feature extraction techniques on all metrics evaluated. This superiority likely arises because PCA retains the most significant components while MI retains the features most relevant for prediction; PCA-MI therefore reduces the computational load while improving classification accuracy and reliability [79]. It also enhances recall and F1-score, indicating precise detection of cancer cases without loss of specificity, which is essential in medical applications. The individual techniques, although effective in some respects, either struggled to reduce dimensionality, as with SelectFromModel and Lasso, or failed to capture relevant features due to redundancy, as with Autoencoder and Random Forest.
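Such a benchmark can be assembled compactly; the sketch below cross-validates a few stand-in selectors from scikit-learn, substituting logistic regression for the CNN for brevity. The selector settings (for example, k = 50) are illustrative assumptions and do not reproduce the exact configurations behind Table 1.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (SelectFromModel, SelectKBest,
                                       f_classif, mutual_info_classif)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic high-dimensional data standing in for the expression matrix.
X, y = make_classification(n_samples=300, n_features=400, n_informative=20,
                           n_classes=3, random_state=0)

selectors = {
    "PCA (95% variance)":   PCA(n_components=0.95),
    "MI (top 50)":          SelectKBest(mutual_info_classif, k=50),
    "ANOVA (top 50)":       SelectKBest(f_classif, k=50),
    "SelectFromModel (RF)": SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0)),
}

for name, sel in selectors.items():
    pipe = make_pipeline(StandardScaler(), sel,
                         LogisticRegression(max_iter=1000))
    acc = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    print(f"{name:>22}: {acc.mean():.3f} +/- {acc.std():.3f}")
```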
The proposed hybrid method achieves the highest accuracy [79]. Fig 5A depicts the accuracy obtained by each feature extraction method as a bar chart; the proposed hybrid method's accuracy is notably higher than that of the next-best techniques, MI (0.93) and Autoencoder (0.92). This underscores the value of fusing dimensionality reduction with feature selection to improve the classification model. Fig 5B shows the full set of metrics for the proposed model [83]. The hybrid PCA-MI approach also facilitated the retrieval of specific significant genes with high feature importance scores. The key genes identified include CYP51A1 (score: 0.459), TNMD (score: 0.776), and NFYA (score: 0.306), among others. These genes are central to the classification process and relate to the biological underpinnings of lung cancer diagnosis. The hybrid approach demonstrates a unique advantage in prioritizing the genes that contribute most to predictive accuracy: by leveraging PCA to reduce dimensionality and MI to retain relevant features, the method minimizes redundancy and noise while preserving key biomarkers [89,90]. This is especially vital in gene expression datasets, where understanding the biological relevance of features is critical for translational research. Fig 6 presents the gene scores in more detail, showing that PCA-MI can separate the most predictive genes from less relevant ones. Such insights can guide further biological validation and the exploration of the identified genes as potential biomarkers for lung cancer [91].
Following the identification of significant genes with the PCA-MI hybrid feature extraction framework, an in-depth Protein-Protein Interaction (PPI) analysis and hub gene identification were performed to establish the biological significance of these key genes in lung cancer. PPI network analysis offers insight into the molecular interactions among the proteins encoded by the most significant genes selected during feature extraction. These genes were mapped onto known and predicted protein-protein interactions within a human-specific network using the STRING database (version 12.0), applying a stringent confidence score cutoff of ≥0.7 [68,92]. This ensured that only high-confidence interactions were captured while allowing enough entries for biological meaningfulness. The network was exported into Cytoscape (version 3.10.3) for visualization and further analysis. Fig 7 shows the PPI network, with proteins as nodes and interactions as edges. The dense interconnections indicate the central involvement of specific genes in the pathophysiology of lung cancer. To identify the most central and influential genes, we used the CytoHubba plugin for Cytoscape; using centrality measures such as degree, betweenness, and closeness, we ranked the genes with the highest connectivity and hence the greatest likely regulatory importance [69].
Degree centrality measures how many direct connections a node has; betweenness centrality identifies nodes that act as bridges linking clusters; and closeness centrality assesses how near a gene or node is to all other elements in the network. The top 20 hub genes identified in this analysis were considered critical regulators within the network and hence of potential importance in lung cancer progression. They include the key regulators BRCA1, CD4, RAD51, and CFTR, which significantly affect cellular signalling, extracellular matrix remodelling, and tumour progression. Fig 8 presents the results of the hub gene analysis [93]. Part A shows the top 20 hub genes selected by the CytoHubba plugin, highlighting their status as central nodes in the PPI network; gene nodes are colour-coded in a gradient from red to yellow, with darker red indicating greater connectivity. Part B ranks the hub genes by degree centrality, a quantitative representation of their strength in the network. This dual representation of gene importance helps clarify their roles in the molecular landscape of lung cancer. Together, the PPI network analysis and hub gene identification underscore the biological significance of the PCA-MI framework [94].
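The hub-ranking idea can be sketched outside Cytoscape with networkx, here on a toy edge list standing in for a STRING export; the analysis reported above was performed in Cytoscape with the CytoHubba plugin.

```python
import networkx as nx

# Toy edge list standing in for a STRING TSV export of protein pairs.
edges = [("BRCA1", "RAD51"), ("BRCA1", "CD4"), ("BRCA1", "CFTR"),
         ("RAD51", "CFTR"), ("RAD51", "CD4")]
G = nx.Graph(edges)

# The three centrality measures used for hub ranking.
deg = nx.degree_centrality(G)        # direct connections
btw = nx.betweenness_centrality(G)   # bridging between clusters
clo = nx.closeness_centrality(G)     # proximity to all other nodes

# Rank by degree centrality and report the top hubs (top 20 in the study).
for gene in sorted(deg, key=deg.get, reverse=True)[:20]:
    print(f"{gene}: degree={deg[gene]:.2f} "
          f"betweenness={btw[gene]:.2f} closeness={clo[gene]:.2f}")
```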
The identified hub genes could serve as novel biomarkers for diagnosing lung cancer or tracking its progression, and may be considered critical therapeutic targets. This approach validates the significance of the hybrid feature extraction method and bridges computational and biological insights, providing a pathway for translating these findings into clinical applications. The analysis also strengthens the framework as a whole, suggesting it can be extended to other cancers and to other high-dimensional datasets.
5. Limitations and future work
This hybrid feature extraction strategy shows significant promise, especially in the context of gene expression data, where high dimensionality poses a considerable challenge. However, this work has limitations. While the feature extraction techniques we employed enhance accuracy, they also add complexity, which may limit the generalisability of the final model. Additionally, this study focuses specifically on lung cancer classification using gene expression profiles that may vary across populations and environments, potentially affecting the model's adaptability. Although we demonstrated the effectiveness of our approach, real-world validation in clinical settings would be necessary to confirm its practical applicability, a step that remains beyond the current scope of this study.
5.1. Future prospects
One potential avenue is integrating multi-omics data such as proteomics, metabolomics, and clinical data alongside gene expression data. This could provide a more holistic view of cancer progression and help identify specific and reliable biomarkers across different data types. Combining various data types may also allow us to capture biological complexity better and lead to more personalized diagnostic and treatment options. Another important future direction is validating this approach with larger, more diverse datasets that include a more comprehensive range of patient demographics and environmental factors. This would help assess the model’s robustness and adaptability to real-world clinical scenarios. The hope is that such models will help support personalized medicine, allowing healthcare providers to tailor treatments and improve patient outcomes based on individual genetic profiles.
6. Conclusion
This study addresses a critical challenge in lung cancer diagnosis with a proposed hybrid feature extraction framework that effectively manages the dimensionality of gene expression data, a significant obstacle to accurate classification. Lung cancer is one of the deadliest cancers globally, and early, accurate detection is crucial for improving survival rates. While gene expression data capture the biological mechanisms underlying lung cancer, the high dimensionality of these datasets often increases computational demands and the risk of overfitting. To overcome these challenges, we developed a hybrid approach combining PCA and MI for feature extraction. PCA reduces dimensionality by retaining components that explain over 95% of the variance, focusing on critical patterns while filtering out noise. Concurrently, MI identifies features highly relevant to the target class, ensuring the feature set is concise and biologically informative. Using this approach, we created a benchmark dataset by merging gene expression data from TCGA and ICGC based on shared genes, providing a robust basis for our classification model. A CNN trained on the PCA-MI reduced dataset demonstrated high classification performance, achieving 98% accuracy and precision, underscoring its effectiveness in distinguishing lung cancer samples. Comparative analysis with ten other feature extraction methods, including Lasso, Random Forest, and others, confirmed the superiority of the PCA-MI hybrid approach. Training and validation curves highlighted stable learning behaviour, and confusion matrix analysis validated the model's predictive accuracy. Additionally, genes ranked by the PCA-MI framework were analysed using PPI networks, identifying 20 hub genes, such as BRCA1, CD4, RAD51, and CFTR, proposed to play pivotal roles in lung cancer biology. These findings reinforce the biological relevance of the selected features, bridging computational analysis with biological insights. This hybrid framework demonstrates the potential to form the foundation for advanced cancer diagnostic tools, particularly in multi-omics data integration, where managing large, complex datasets is critical. Future research could explore its application to other cancer types and the integration of additional data sources, such as proteomics and metabolomics, to further improve diagnostic accuracy and provide deeper biological insights.
References
- 1. Torre LA, Siegel RL, Jemal A. Lung Cancer Statistics. Adv Exp Med Biol. 2016;893:1–19.
- 2. Liu S-YM, Zheng M-M, Pan Y, Liu S-Y, Li Y, Wu Y-L. Emerging evidence and treatment paradigm of non-small cell lung cancer. J Hematol Oncol. 2023;16(1):40. pmid:37069698
- 3. Herbst RS, Morgensztern D, Boshoff C. The biology and management of non-small cell lung cancer. Nature. 2018;553(7689):446–54. pmid:29364287
- 4. Durra H, Flieder DB. Peripheral Squamous Cell Carcinoma of the Lung. Pathology Case Reviews. 2012;17(5):211–6.
- 5. Gazdar AF, Bunn PA, Minna JD. Small-cell lung cancer: what we know, what we need to know and the path forward. Nat Rev Cancer. 2017;17(12):725–37. pmid:29077690
- 6. Travis WD, Brambilla E, Nicholson AG, Yatabe Y, Austin JHM, Beasley MB, et al. The 2015 World Health Organization Classification of Lung Tumors: Impact of Genetic, Clinical and Radiologic Advances Since the 2004 Classification. J Thorac Oncol. 2015;10(9):1243–60. pmid:26291008
- 7. Inamura K. Lung Cancer: Understanding Its Molecular Pathology and the 2015 WHO Classification. Front Oncol. 2017;7:193. pmid:28894699
- 8. Beer DG, Kardia SLR, Huang C-C, Giordano TJ, Levin AM, Misek DE, et al. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat Med. 2002;8(8):816–24. pmid:12118244
- 9. Dey TK, Mandal S, Mukherjee S. Gene expression data classification using topology and machine learning models. BMC Bioinformatics. 2022;22(Suppl 10):627. pmid:35596135
- 10. Özcan Şimşek NÖ, Özgür A, Gürgen F. A novel gene selection method for gene expression data for the task of cancer type classification. Biol Direct. 2021;16(1):7. pmid:33557857
- 11. Rouhi A, Nezamabadi-Pour H. Feature selection in high-dimensional data. Optimization, learning, and control for interdependent complex networks. 2020:85–128.
- 12. Bharti KK, Singh PK. Hybrid dimension reduction by integrating feature selection with feature extraction method for text clustering. Expert Systems with Applications. 2015;42(6):3105–14.
- 13. Lu H, Chen J, Yan K, Jin Q, Xue Y, Gao Z. A hybrid feature selection algorithm for gene expression data classification. Neurocomputing. 2017;256:56–62.
- 14. Almazrua H, Alshamlan H. A Comprehensive Survey of Recent Hybrid Feature Selection Methods in Cancer Microarray Gene Expression Data. IEEE Access. 2022;10:71427–49.
- 15. Ali W, Saeed F. Hybrid Filter and Genetic Algorithm-Based Feature Selection for Improving Cancer Classification in High-Dimensional Microarray Data. Processes. 2023;11(2):562.
- 16. Almugren N, Alshamlan H. A Survey on Hybrid Feature Selection Methods in Microarray Gene Expression Data for Cancer Classification. IEEE Access. 2019;7:78533–48.
- 17. Nagpal A, Singh V. Feature selection from high dimensional data based on iterative qualitative mutual information. Journal of Intelligent & Fuzzy Systems. 2019;36(6):5845–56.
- 18. Alhenawi E, Al-Sayyed R, Hudaib A, Mirjalili S. Feature selection methods on gene expression microarray data for cancer classification: A systematic review. Comput Biol Med. 2022;140:105051. pmid:34839186
- 19. Liu S, Yao W. Prediction of lung cancer using gene expression and deep learning with KL divergence gene selection. BMC Bioinformatics. 2022;23(1):175. pmid:35549644
- 20. Steinhoff C, Vingron M. Normalization and quantification of differential expression in gene expression microarrays. Brief Bioinform. 2006;7(2):166–77. pmid:16772260
- 21. Subramanian I, Verma S, Kumar S, Jere A, Anamika K. Multi-omics Data Integration, Interpretation, and Its Application. Bioinform Biol Insights. 2020;14. pmid:32076369
- 22. Fernández A, et al. SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. Journal of Artificial Intelligence Research. 2018;61:863–905.
- 23. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic Minority Over-sampling Technique. Jair. 2002;16:321–57.
- 24. Hou J, Aerts J, den Hamer B, van Ijcken W, den Bakker M, Riegman P, et al. Gene expression-based classification of non-small cell lung carcinomas and survival prediction. PLoS One. 2010;5(4):e10312. pmid:20421987
- 25. Stephen D, et al. Feature selection/dimensionality reduction. In: Machine learning for healthcare systems. River Publishers; 2023. p. 169–85.
- 26. Ringnér M. What is principal component analysis?. Nat Biotechnol. 2008;26(3):303–4. pmid:18327243
- 27. Lever J, Krzywinski M, Altman N. Principal component analysis. Nat Methods. 2017;14(7):641–2.
- 28. Tabassum N, Kamal MAS, Akhand MAH, Yamada K. Cancer Classification from Gene Expression Using Ensemble Learning with an Influential Feature Selection Technique. BioMedInformatics. 2024;4(2):1275–88.
- 29. Vergara JR, Estévez PA. A review of feature selection methods based on mutual information. Neural Comput & Applic. 2013;24(1):175–86.
- 30. Abdelwahab O, Awad N, Elserafy M, Badr E. A feature selection-based framework to identify biomarkers for cancer diagnosis: A focus on lung adenocarcinoma. PLoS One. 2022;17(9):e0269126. pmid:36067196
- 31. Yu H, Zhan S, Liu S, Guo L, Huang R. Advancing Precision in Lung Cancer Subtyping: Integration of Machine Learning Feature Selection with MLP. In: 2024 7th International Conference on Advanced Algorithms and Control Engineering (ICAACE), 2024. 514–8.
- 32. Patel T, Nayak V. Hybrid Approach for Feature Extraction of Lung Cancer Detection. In: 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), 2018.
- 33. Mostavi M, Chiu Y-C, Huang Y, Chen Y. Convolutional neural network models for cancer type prediction based on gene expression. BMC Med Genomics. 2020;13(Suppl 5):44. pmid:32241303
- 34. Gunavathi C, Sivasubramanian K, Keerthika P, Paramasivam C. A review on convolutional neural network based deep learning methods in gene expression data for disease diagnosis. Materials Today: Proceedings. 2021;45:2282–5.
- 35. Mathema VB, Sen P, Lamichhane S, Orešič M, Khoomrung S. Deep learning facilitates multi-data type analysis and predictive biomarker discovery in cancer precision medicine. Comput Struct Biotechnol J. 2023;21:1372–82. pmid:36817954
- 36. Almarzouki HZ. Deep-Learning-Based Cancer Profiles Classification Using Gene Expression Data Profile. J Healthc Eng. 2022;2022:4715998. pmid:35035840
- 37. Zhang Z. Improved Adam Optimizer for Deep Neural Networks. In: 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS), 2018. 1–2.
- 38. Sun Y, et al. Completely automated CNN architecture design based on blocks. IEEE Transactions on Neural Networks and Learning Systems. 2019;31(4):1242–54.
- 39. Pinaya WHL, et al. Autoencoders. In: Machine learning. Elsevier; 2020. p. 193–208.
- 40. Bank D, Koenigstein N, Giryes R. Autoencoders. Machine learning for data science handbook: data mining and knowledge discovery handbook, 2023:353–74.
- 41. Xie R, Wen J, Quitadamo A, Cheng J, Shi X. A deep auto-encoder model for gene expression prediction. BMC Genomics. 2017;18(Suppl 9):845. pmid:29219072
- 42. Kurita T. Principal component analysis (PCA). Computer vision: a reference guide. 2019:1–4.
- 43. Maćkiewicz A, Ratajczak W. Principal components analysis (PCA). Computers & Geosciences. 1993;19(3):303–42.
- 44. Yeung KY, Ruzzo WL. Principal component analysis for clustering gene expression data. Bioinformatics. 2001;17(9):763–74. pmid:11590094
- 45. Liu H, Sun J, Liu L, Zhang H. Feature selection with dynamic mutual information. Pattern Recognition. 2009;42(7):1330–9.
- 46. Amiri F, Rezaei Yousefi M, Lucas C, Shakery A, Yazdani N. Mutual information-based feature selection for intrusion detection systems. Journal of Network and Computer Applications. 2011;34(4):1184–99.
- 47. Vanitha CDA, Devaraj D, Venkatesulu M. Gene Expression Data Classification Using Support Vector Machine and Mutual Information-based Gene Selection. Procedia Computer Science. 2015;47:13–21.
- 48. Zhai J, Zhang S, Chen J, He Q. Autoencoder and Its Various Variants. In: 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2018. 415–9.
- 49. Noshad Z, Bouyer A, Noshad M. Mutual information-based recommender system using autoencoder. Applied Soft Computing. 2021;109:107547.
- 50. Fonti V, Belitser E. Feature selection using lasso. In: VU Amsterdam research paper in business analytics, 2017.
- 51. Muthukrishnan R, Rohini R. LASSO: A feature selection technique in predictive modeling for machine learning. In: 2016 IEEE International Conference on Advances in Computer Applications (ICACA), 2016. 18–20.
- 52. Kim Y, Kim J. Gradient LASSO for feature selection. In: Proceedings of the twenty-first international conference on Machine learning, 2004.
- 53. Ghosh D, Chinnaiyan AM. Classification and selection of biomarkers in genomic data using LASSO. J Biomed Biotechnol. 2005;2005(2):147–54. pmid:16046820
- 54. Rigatti SJ. Random Forest. J Insur Med. 2017;47(1):31–9. pmid:28836909
- 55. Breiman L. Random Forests. Machine Learning. 2001;45(1):5–32.
- 56. Ram M, Najafi A, Shakeri MT. Classification and Biomarker Genes Selection for Cancer Gene Expression Data Using Random Forest. Iran J Pathol. 2017;12(4):339–47. pmid:29563929
- 57. Nandhini B, Josephine RM. Annova test using SPSS software to find out the morphological leaf traits of five different genera. 2023.
- 58. Vaidya M, Kulkarni P. A review on gene selection for cancer classification from microarray data.
- 59. Hershey JR, Olsen PA. Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models. In: 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07, 2007. p. IV-317–IV-320.
- 60. Bu Y, Zou S, Liang Y, Veeravalli VV. Estimation of KL Divergence: Optimal Minimax Rate. IEEE Trans Inform Theory. 2018;64(4):2648–74.
- 61. Noda T, Yano Y, Doki S, Okuma S. Adaptive Emotion Recognition in Speech by Feature Selection Based on KL-divergence. In: 2006 IEEE International Conference on Systems, Man and Cybernetics, 2006. 1921–6.
- 62. Clarkson V, Kootsookos PJ, Quinn BG. Analysis of the variance threshold of Kay’s weighted linear predictor frequency estimator. IEEE Trans Signal Process. 1994;42(9):2370–9.
- 63. Cuturi M, d'Aspremont A. Mean reversion with a variance threshold. In: Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.
- 64. Al Fatih Abil Fida M, Ahmad T, Ntahobari M. Variance Threshold as Early Screening to Boruta Feature Selection for Intrusion Detection System. In: 2021 13th International Conference on Information & Communication Technology and System (ICTS), 2021. 46–50.
- 65. Li J, et al. Feature selection: A data perspective. ACM Comput Surv. 2017;50(6):1–45.
- 66. Raafi’udin R, et al. Feature selection model development on near-infrared spectroscopy data. Journal of Spectroscopy. 2023.
- 67. Asghari S, Nematzadeh H, Akbari E, Motameni H. Mutual information-based filter hybrid feature selection method for medical datasets using feature clustering. Multimed Tools Appl. 2023;82(27):42617–39.
- 68. Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, et al. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 2019;47(D1):D607–13. pmid:30476243
- 69. Chin C-H, Chen S-H, Wu H-H, Ho C-W, Ko M-T, Lin C-Y. cytoHubba: identifying hub objects and sub-networks from complex interactome. BMC Syst Biol. 2014;8 Suppl 4(Suppl 4):S11. pmid:25521941
- 70. Wang Y, Zhou Z, Chen L, Li Y, Zhou Z, Chu X. Identification of key genes and biological pathways in lung adenocarcinoma via bioinformatics analysis. Mol Cell Biochem. 2021;476(2):931–9. pmid:33130972
- 71. Shah SNA, Parveen R. Lung cancer biomarker identification from differential expression analysis using RNA-seq data for designing multitargeted drugs. In: Biology and Life Sciences Forum, 2024.
- 72. Yasrebi H, Sperisen P, Praz V, Bucher P. Can survival prediction be improved by merging gene expression data sets? PLoS One. 2009;4(10):e7431. pmid:19851466
- 73. Euachongprasit W, Ratanamahatana CA. Efficient multimedia time series data retrieval under uniform scaling and normalisation. In: Advances in Information Retrieval: 30th European Conference on IR Research, ECIR 2008, Glasgow, UK, March 30-April 3, 2008. Proceedings, 2008.
- 74. Jeatrakul P, Wong KW, Fung CC. Classification of imbalanced data by combining the complementary neural network and SMOTE algorithm. In: Neural Information Processing. Models and Applications: 17th International Conference, ICONIP 2010, Sydney, Australia, November 22-25, 2010, Proceedings, Part II, 2010.
- 75. Chudong T, Xuhua S. Mutual information based PCA algorithm with application in process monitoring. CIESC Journal. 2015;66(10):4101.
- 76. Kraskov A, Stögbauer H, Grassberger P. Estimating mutual information. Phys Rev E Stat Nonlin Soft Matter Phys. 2004;69(6 Pt 2):066138. pmid:15244698
- 77. Taguchi Y, Murakami Y. Principal component analysis based feature extraction approach to identify circulating microRNA biomarkers. PLoS One. 2013;8(6):e66714. pmid:23874370
- 78. Naidu G, Zuva T, Sibanda EM. A review of evaluation metrics in machine learning algorithms. In: Computer Science On-line Conference, 2023.
- 79. Guyon I, Elisseeff A. An introduction to variable and feature selection. Journal of Machine Learning Research. 2003;3(Mar):1157–82.
- 80. AL-Bermany HM, AL-Rashid SZ. Microarray Gene Expression Data for Detection Alzheimer’s Disease Using k-means and Deep Learning. In: 2021 7th International Engineering Conference “Research & Innovation amid Global Pandemic" (IEC), 2021. 13–9.
- 81. Hossin M, Sulaiman MN. A Review on Evaluation Metrics for Data Classification Evaluations. IJDKP. 2015;5(2):01–11.
- 82. Guyon I, Weston J, Barnhill S, Vapnik V. Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning. 2002;46(1–3):389–422.
- 83. Chandrashekar G, Sahin F. A survey on feature selection methods. Computers & Electrical Engineering. 2014;40(1):16–28.
- 84. Jović A, Brkić K, Bogunović N. A review of feature selection methods with applications. In: 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2015.
- 85. Townsend JT. Theoretical analysis of an alphabetic confusion matrix. Perception & Psychophysics. 1971;9(1):40–50.
- 86. Qasem SN, Saeed F. Hybrid Feature Selection and Ensemble Learning Methods for Gene Selection and Cancer Classification. IJACSA. 2021;12(2).
- 87. Srivastava N, et al. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research. 2014;15(1):1929–58.
- 88. Sun R. Optimization for deep learning: theory and algorithms. arXiv preprint. 2019.
- 89. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80.
- 90. Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23(19):2507–17. pmid:17720704
- 91. Jolliffe IT, Cadima J. Principal component analysis: a review and recent developments. Philos Trans A Math Phys Eng Sci. 2016;374(2065):20150202. pmid:26953178
- 92. Nan KS, Karuppanan K, Kumar S. Identification of common key genes and pathways between Covid-19 and lung cancer by using protein-protein interaction network analysis. bioRxiv. 2021.
- 93. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498–504. pmid:14597658
- 94. Shi Y, Li Y, Yan C, Su H, Ying K. Identification of key genes and evaluation of clinical outcomes in lung squamous cell carcinoma using integrated bioinformatics analysis. Oncol Lett. 2019;18(6):5859–70. pmid:31788059