Abstract
High-dimensional data classification remains challenging for machine learning models due to sparsity and overfitting caused by the "curse of dimensionality". As the number of features increases, data points become sparse, hindering generalization in classification and leading to higher computational costs and reduced accuracy. To address these issues, we propose an ensemble classifier based on random subspaces implemented in the Spark framework. The proposed framework comprises three key stages. First, the high-dimensional data is normalised through min-max normalisation. Second, the master node partitions the data using improved deep fuzzy clustering (IDFC), while the slave node applies support vector machine-modified recursive feature elimination (SVM-MRFE) for efficient feature selection, followed by feature fusion. Finally, we introduce an improved subspace-based ensemble classifier (ISSBEC) that comprises a feature-fusion-based random subspace (FF-RSS), mixed-space enhancement (MSE), and multiple base classifiers. The efficacy of the ISSBEC classifier was evaluated using a set of performance metrics and compared with state-of-the-art methods. Experimental results demonstrate that the proposed approach improves both accuracy and robustness, offering a scalable solution to the limitations of high-dimensional datasets.
Citation: Bhimineni VC, Senapati R (2026) Random subspace-based ensemble classifier for high-dimensional data Using SPARK. PLoS One 21(3): e0342408. https://doi.org/10.1371/journal.pone.0342408
Editor: Razieh Sheikhpour, Ardakan University, IRAN, ISLAMIC REPUBLIC OF
Received: September 1, 2025; Accepted: January 22, 2026; Published: March 11, 2026
Copyright: © 2026 Bhimineni, Senapati. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: https://schlieplab.org/Static/Supplements/CompCancer/datasets.htm.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
High-dimensional data refers to a dataset with a large number of features relative to the number of samples, which can degrade generalization and model performance [1]. High-dimensional data arises in various fields, including biometrics, healthcare, e-commerce, network security, and industrial applications, and requires comprehensive management methodologies. The curse of dimensionality [2], feature identification [3,4], model complexity, and overfitting [5,6] are among the issues in managing high-dimensional data. Several dimensionality reduction techniques and feature selection approaches have been proposed, forming the foundation for high-dimensional classification. Feature extraction converts the original features into a lower-dimensional space, generating new variables, often using methods such as principal component analysis (PCA) [7], LDA [8], t-SNE [9], LASSO [10,11], feature selection (FS) based on mutual information [12,13], and recursive feature elimination (RFE) [14,15], among others.
Classification of high-dimensional data is crucial in numerous real-world applications, including text classification, pattern recognition, and image recognition. However, noise and feature redundancy in such data can severely hinder the effectiveness of the classification approaches. Therefore, it is essential to develop robust feature selection and transformation techniques that generate concise, discriminative, and computationally efficient feature representations. Feature selection strategies for high-dimensional data classification are categorised into filter-based and wrapper-based methods [16]. Wrapper-based approaches achieve better classification accuracy than filter-based approaches. However, these methods are algorithm-specific and computationally expensive [17]. Filter-based methods achieve lower classification accuracy but are computationally less expensive [18]. In wrapper-based methods, the feature space is reduced by eliminating the irrelevant features [19,20]. These methods effectively leverage a feature-relevance strategy to avoid irrelevant searches during the FS process.
To overcome these limitations, researchers have proposed various hybrid feature selection techniques to enhance performance on high-dimensional data. These include dual-phase hybrid FS approaches for high-dimensional medical datasets such as the Maximum Pattern Recognition–Multi-objective Discrete Evolution Strategy (MPR-MDES) [21], a hybrid FS employing a multi-objective genetic algorithm (MOGA) [22], optimized genetic algorithm-based FS [23], adaptive pyramid particle swarm optimisation (APPSO) [24], and binary enhanced golden jackal optimisation (BEGJO) [25], which leverages copula entropy for increased accuracy. By incorporating filter and wrapper methods, these hybrid techniques improve performance. Still, they require adjusting multiple parameters and have a high computational cost, limiting their applicability to distributed or large-scale settings.
The efficacy of FS approaches is further improved by other methods, such as the dynamic crow search algorithm (DCSA) [26], an adaptive mechanism-based Grey Wolf Optimizer for FS [27], feature filter and group evolution hybrid feature selection (FG-HFS) [28], and a decomposition-based multi-population evolutionary algorithm for feature selection (MPEA-FS) [29]. Random subspace and random projection can improve nearest-neighbour classifier performance in the high-dimensional feature space [30]. It has been observed that data-handling methods remain deficient when utilising FS strategies in existing research. Therefore, further research on the FS approaches for high-dimensional classification is necessary. Machine learning (ML) algorithms, especially ensemble approaches [31], have effectively handled high-dimensional data. Methods such as random forests (RF) [32,33] and gradient boosting machines (GBM) aggregate predictions from multiple models to improve classification accuracy and robustness. These approaches effectively address the challenges of high-dimensional spaces.
Existing approaches face challenges, including class imbalance, redundancy, and computational inefficiency. Therefore, further investigation is needed into the development of scalable, distributed, and adaptive feature selection models. To address these limitations, this study introduces an ensemble classifier framework based on Apache Spark. The framework integrates IDFC for data partitioning, the SVM-MRFE approach for feature selection, and the ISSBEC classifier to facilitate precise, reliable, and scalable classification of high-dimensional data.
The specific contributions of this study are as follows.
- A novel data partitioning technique, IDFC, is implemented in the master node of the Spark framework for efficient high-dimensional data management, which is designed to capture complex, nonlinear relationships between data points and outperform conventional clustering methods when applied to high-dimensional feature space.
- We propose the SVM-MRFE method at the slave node of the Spark framework for feature selection to minimize dimensionality in high-dimensional data. It removes less significant features while preserving the most important features for classification based on their rank.
- We propose the ISSBEC architecture for classifying high-dimensional data, which is organized into n number of blocks, each containing three essential components: FF-RSS, MSE, and a base classifier. It effectively classifies high-dimensional data and enhances data representation through a multi-block structure. It integrates various base classifiers for efficient processing to improve accuracy and robustness.
- We propose an improved k-nearest neighbour (IKNN) as one of the base classifiers for enhanced generalization and reduced overfitting, achieving more accurate predictions in high-dimensional data classification.
- We conducted comparison experiments across various high-dimensional binary-class datasets to demonstrate the superiority of our method over other methods.
The remainder of the paper is organized as follows. Literature Review Section provides a summary of the state-of-the-art literature related to conventional high-dimensional data classification methods. Proposed Methodology Section provides a detailed implementation of the proposed ISSBEC architecture for high-dimensional data classification using Apache Spark. Result Section discusses experimental findings and highlights the efficacy of the proposed ISSBEC approach. Finally, Conclusion Section concludes the paper with a summary and review of the key findings.
Literature review
This section reviews existing research on high-dimensional classification and surveys recent enhancements in ML-based approaches. These approaches play a significant role in high-dimensional classification, as they enhance overall data analysis, improve accuracy, and offer practical data-handling methods. By integrating advanced FS and dimensionality reduction techniques, ML approaches overcome various challenges in high-dimensional classification.
Several recent approaches have explored enhancements in high-dimensional data classification. For example, the sparse kernel K-means (SKKM) approach [34] enhances clustering performance for high-dimensional data by selecting relevant features while penalizing redundant ones. Likewise, a depth-based nearest-neighbour approach [35] achieves effective high-dimensional classification, overcoming the limitations of the traditional k-nearest-neighbour method by carefully identifying low-dimensional features in high-dimensional data using information-gain-based subspace clustering (IGSC). Furthermore, the enriched RF (ERF) [36] uses weighted random sampling to prioritise informative features, thereby improving accuracy and computational efficiency, and leave-one-out cross-validation (LOOCV) to reduce complexity while maintaining accuracy. To balance computational speed and accuracy in feature selection, hybrid feature selection frameworks have also been examined. The Hybrid Dimensionality Reduction Forest with Pruning Framework (HDRFPF) [37] addresses issues in high-dimensional data classification such as information loss, noise, and redundant-feature vulnerability by incorporating bagging with tree-based FS for efficient feature splitting and diverse training subsets. A hybrid feature selection algorithm based on improved interaction information [38] and a multitasking-based particle swarm optimisation (PSO) approach [39] were introduced to strengthen feature relevance for high-dimensional data classification. These hybrid and evolutionary methods demonstrate that mutual information, clustering, and adaptive weighting can improve model discrimination in high-dimensional settings. Metaheuristic algorithms have also been applied to feature selection for high-dimensional data; for instance, the dynamic crow search algorithm (DCSA) [26] was proposed to improve the categorisation accuracy of high-dimensional biomedical data.
The Feature filter and group evolution hybrid feature selection (FG-HFS) [28] uses spectral clustering to place features into groups based on their relationships, and symmetric uncertainty to eliminate features that do not belong to these groups. Similarly, a High-Dimensional Ensemble Learning Classification (HDELC) algorithm [31] produced a feature space reconstruction matrix that optimizes feature selection and reconstruction for high-dimensional data. This optimal feature space improves the representativeness of the ensemble model.
Recent research has also examined ensemble and semi-supervised frameworks for high-dimensional classification. The feature selection-based semi-supervised classifier ensemble (FSCE) [1] and adaptive semi-supervised classifier ensemble (ASCE) frameworks enhance the adaptability of weighted classifier ensembles. The semi-supervised random subspace classifier ensemble (SSRS) and adaptive semi-supervised random subspace classifier ensemble (ASSRS) approaches [40] are used to reduce the data and feature dimensions of complex datasets, identify subspaces, label samples, and assign classifier weights, thereby minimizing the sample size required for data-driven predictors. Meanwhile, the envelope rotation forest [41] addresses inadequate separability, limited diversity, and increased sensitivity; the adaptive classifier ensemble learning method (AdaSPEL) [42] employs a local space perceptron; and the Classifier Ensemble Method Based on Subspace Enhancement (CESE) [43] features a sophisticated subspace enhancement stage for efficient feature selection and transformation, creating various effective feature subspaces, and incorporates an MSE to develop diverse feature representations through multiscale rotation and fusion. The random subspace and random projection techniques for ensembling nearest-neighbour classifiers [30] in the high-dimensional feature space were compared with traditional nearest-neighbour approaches, and the methods were tested on microarray, image, and chemoinformatics data. The random projection method performs significantly better than random subspaces for most datasets. However, these methods are limited to binary classification and require improvement for multi-class problems. Recent advancements have also been achieved in adaptive subspace clustering and Spark-based ensemble systems.
A novel adaptive multi-view subspace clustering method [44] addresses the challenges of using high-dimensional multi-view data, specifically the presence of irrelevant features, and assigns weights to data views based on the compactness of the clusters. Recent studies on multimodal learning [45] focus on integrating heterogeneous data to improve robustness. Artificial intelligence and multimodal learning analytics showed that feature fusion enhances the interpretation of challenging data.
Randomized optimization approaches [46] limit the number of variables to random subspaces by employing various data-adaptation policies. Recently, it has been demonstrated that combining SPARK, ML, and DL [47–49] provides a feasible solution for accurate, scalable classification. Based on deep subspace sequential clustering, a new online anomaly detection model, NADHS [50], was introduced for high-dimensional real-time data. For large-scale data clustering, a distributed subspace ensemble approach called the subspace cluster ensemble (SSCE) [51], which employs random partitioning and ball fusion, was introduced. To mitigate class imbalance, an enhanced broad learning system with adaptive locality preservation (IBLS-ALP) [52] was proposed as an incremental adaptive subspace ensemble designed to preserve the local characteristics of small-sample datasets. ADMTSK [53] is an adaptive TSK fuzzy system that uses the Dombi T-norm to reason in high dimensions. The multilayer jointly evolving and compressing fuzzy neural network (MECFNN) [54] is a multilayer fuzzy ensemble with self-adaptive compression, developed for high-dimensional classification. To learn self-representation matrices for multi-view data end-to-end, the Multi-view Deep Subspace Clustering Network (MvDSCN) [55] uses a dual network structure made up of a diversity network (Dnet) and a universality network (Unet). Deep convolutional autoencoders are used to create a latent space in which Unet finds a common matrix that applies to all views, while Dnet concentrates on view-specific matrices. The model successfully captures nonlinear and high-order relationships between various perspectives by using the Hilbert-Schmidt independence criterion (HSIC) as a diversity regularizer. The authors of [56] created an effective, scalable feature extraction technique based on Apache Spark that quickly and efficiently extracted highly significant features from millions of genome sequences.
Five stages of feature extraction were carried out by their method: sequence length, nucleotide base frequency, nucleotide base pattern organisation, nucleotide base distribution, and sequence entropy. This procedure produced a 14-dimensional fixed-length numeric vector that uniquely represents every genome sequence. Deep learning ensemble methods are also used in medical prognostication; for example, Nested Ensemble Deep Learning for Gynaecological Cancer Prediction (NEDL-GCP) [57] predicts cancer risk. In contrast to that deep framework, our ISSBEC method performs ensemble classification on distributed, high-dimensional data scaled using Spark. A comparison of FF-RSS and MSE with recent approaches is presented in Table 1. These findings suggest that current research on subspace and ensemble approaches is encouraging but has significant limitations due to data imbalance, redundancy, and inadequate distributed-system optimisation. The current study addresses these challenges using Spark, which combines clustering, feature selection, and ensemble learning into a single framework. The proposed Spark-based ISSBEC classifier addresses the above limitations: Spark performs effective data partitioning using IDFC and FS with SVM-MRFE, while ISSBEC employs FF-RSS for subspace generation, MSE, and various base classifiers for high-dimensional classification. A summary of the notations employed in this study is provided in Table 2.
Proposed methodology
Handling high-dimensional data is significantly challenging due to the "curse of dimensionality," which exponentially increases the volume of the feature space, making it difficult to analyze. These challenges may lead to model overfitting, decreased efficacy, and increased computational cost. To address the challenges of high-dimensional data classification, this study proposes an ISSBEC framework leveraging the distributed computing capabilities of Spark, enabling efficient processing of high-dimensional datasets, as illustrated in Fig 1. The proposed framework consists of three stages, namely 1) pre-processing, 2) Spark framework processing, and 3) classification. All stages are structured to facilitate distributed, parallel, and scalable processing of high-dimensional data.
Preprocessing by Min-Max normalization
This phase ensures that raw data is useful for analysis by enhancing data quality and improving model performance, thereby ensuring reliable results. Effective preprocessing is critical for high-dimensional data due to the difficulties of handling large feature sets. Min-max normalization is a key preprocessing technique that addresses issues arising from differing scales and ranges across features in high-dimensional datasets. This method rescales the feature values to a standard range, typically [0, 1], ensuring consistency and comparability across features. The process involves transforming a vector of scores \(S = \{s_1, s_2, \ldots, s_n\}\), where \(s_q\) represents the score at index q and n signifies the overall number of scores, into normalized scores \(s'_q\) using Eq 1 [58]:

\[ s'_q = \frac{s_q - \min(i)}{\max(i) - \min(i)} \qquad (1) \]

where max(i) and min(i) denote the maximum and minimum feature values of the raw scores in the dataset, respectively. This process preserves the original distribution of the data while transforming all features to a common scale, effectively preparing high-dimensional data for classification.
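As a rough illustration (not the paper's implementation), the per-feature rescaling of Eq 1 can be written in a few lines of NumPy, computing the min and max per column:

```python
import numpy as np

def min_max_normalize(X, eps=1e-12):
    """Rescale each feature (column) of X to the [0, 1] range.

    eps guards against division by zero for constant features.
    """
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)  # min(i) per feature
    col_max = X.max(axis=0)  # max(i) per feature
    return (X - col_min) / np.maximum(col_max - col_min, eps)

X = np.array([[2.0, 100.0],
              [4.0, 300.0],
              [6.0, 500.0]])
X_norm = min_max_normalize(X)  # each column now spans exactly [0, 1]
```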
Spark framework processing
The Spark framework [59] is used for robust distributed computing and efficient in-memory data processing. It contains two main components: 1) the master node, responsible for data partitioning using IDFC, and 2) the slave node, responsible for FS using the SVM-MRFE method, a modified version of the conventional SVM-RFE approach that systematically removes less important features while retaining those that contribute most to the classification task, followed by feature fusion. This pipeline is illustrated in Fig 1, and Algorithm 1 depicts the entire Spark framework process.
Data partitioning in master node.
In the Spark framework, the master node is the central coordinator for executing the Spark distributed process, allocating resources, and ensuring seamless collaboration among the nodes. In this phase, the master node uses IDFC to partition the normalized data. IDFC enhances data partitioning by integrating a hybrid kernel function that blends Gaussian and exponential kernels, allowing it to effectively capture complex nonlinear manifold structures within high-dimensional data. This hybrid formulation enriches the similarity representation between samples, enabling the clustering mechanism to model both local smoothness and global variations more accurately.
The parameters for this process include \(c\), which denotes the number of clusters, and \(b\), which denotes the batch size. In this case, the number of clusters was set to three, denoted as \(C_1\), \(C_2\), and \(C_3\), with each cluster containing data points. The input normalized data \(D\) is partitioned into batches \(\{B_1, B_2, \ldots, B_M\}\), where each batch is denoted by \(B_m\), with \(m = 1, 2, 3, \ldots, M\).
A deep neural network can essentially map the data into any desired distribution. Overfitting can be avoided by using an autoencoder in deep clustering methods [60,61]. The autoencoder recovers the original data by decoding them with another neural network [62].
Algorithm 1. Spark Framework Process
Input: Normalized dataset (D)
Output: Fused feature set (F)
Master Node:
1: Partition the data using IDFC.
2: Send the partitioned data to the slave nodes for further processing.
Slave Node:
3: For each partition (B_m):
• Perform feature selection using SVM-MRFE.
• Store the selected features (F_m).
4: End For
5: Generate the fused feature set (F) from the selected features (F_m) obtained in Step 3.
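The master/slave flow of Algorithm 1 can be sketched in plain Python; the clustering and ranking steps below are simple placeholders standing in for IDFC and SVM-MRFE, and in the real system each partition would be processed by a Spark slave node (e.g. via `mapPartitions`):

```python
import numpy as np

def partition_data(X, n_clusters=3):
    """Master-node placeholder: assign rows to n_clusters partitions.

    Round-robin assignment stands in for IDFC clustering; in a real
    deployment the partitions would be distributed to Spark slave nodes.
    """
    labels = np.arange(len(X)) % n_clusters
    return [X[labels == c] for c in range(n_clusters)]

def select_features(Xp, k=2):
    """Slave-node placeholder: keep the k highest-variance columns.

    Stands in for SVM-MRFE ranking on one partition.
    """
    return set(int(i) for i in np.argsort(Xp.var(axis=0))[::-1][:k])

X = np.random.default_rng(1).normal(size=(60, 5))
partitions = partition_data(X)                       # Steps 1-2 (master)
selected = [select_features(p) for p in partitions]  # Step 3 (slaves)
fused = sorted(set.union(*selected))                 # Step 5: fused feature set
```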
The loss function for the autoencoder is expressed in Eq 2:

\[ L_{AE} = L_{rec} + Z(W) \qquad (2) \]

where \(\|\cdot\|\) indicates the Euclidean norm used in the reconstruction term, \(Z(W)\) represents a regularization term on the network weights, and \(L_{rec} = \sum_i \|x_i - \hat{x}_i\|^2\) specifies the reconstruction term between each input \(x_i\) and its decoded output \(\hat{x}_i\).
Let the set of hidden representations be \(X = \{x_1, x_2, \ldots, x_n\}\), where each \(x_i\) represents a hidden feature vector. The fuzzy membership \(u_{im}\) of each data point \(x_i\) to cluster \(m\) is computed using Eq 3, in which fuz is the fuzzifier parameter and μ balances the distance within and between clusters. Pseudo-labels are extracted from the fuzzy memberships \(u_{im}\), and the target \(t_m\) for the \(m\)-th batch is expressed in Eq 4.
The Kullback-Leibler (KL) divergence loss function is mathematically formulated in Eq 5. We minimize this loss function to enhance the accuracy of the fuzzy membership predictions.
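A minimal version of a KL-divergence loss between the sharpened targets and the predicted fuzzy memberships (assuming both are row-wise probability distributions; this is an illustrative stand-in for Eq 5, not the paper's exact objective) might look like:

```python
import numpy as np

def kl_divergence_loss(target, pred, eps=1e-12):
    """KL(target || pred), summed over clusters, averaged over samples.

    target: (n, c) sharpened pseudo-label distributions (rows sum to 1)
    pred:   (n, c) fuzzy membership matrix (rows sum to 1)
    eps clips values away from zero so the logarithm stays finite.
    """
    target = np.clip(target, eps, 1.0)
    pred = np.clip(pred, eps, 1.0)
    return float(np.mean(np.sum(target * np.log(target / pred), axis=1)))

memberships = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
loss_same = kl_divergence_loss(memberships, memberships)  # 0.0 for identical inputs
```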
Eq 6 defines the affinity between data points \(x_i\) and \(x_j\), with a higher affinity corresponding to smaller distances in the feature space. The overall loss function is formulated in Eq 7, where the pseudo-label is denoted as \(\hat{y}\), the fixed kernel function in the deep fuzzy clustering (DFC) method is indicated as ω, the hyperparameters are specified as \(\lambda_1\) and \(\lambda_2\), the affinity hyperparameter that controls the scale of the affinity is indicated as δ, and the label is specified as \(y\).
Traditional DFC [59,62] has a single fixed kernel, rendering it incapable of capturing non-linear features in high-dimensional spaces. The IDFC technique exploits the reproducing kernel Hilbert space (RKHS) property by combining Gaussian and exponential kernels (hybrid kernel), enabling complex, nonlinear similarities in the input space to be represented as linear features in a higher-dimensional feature space.
The proposed hybrid kernel function is formulated in Eq 9:

\[ K_{hyb}(J, J') = \alpha\, K_G(J, J') + (1 - \alpha)\, K_E(J, J') \qquad (9) \]

In Eq 9, the Gaussian kernel [63] is indicated as \(K_G\), which is an example of a radial basis function, and the exponential kernel is specified as \(K_E\), which is closely related to the Gaussian kernel:

\[ K_G(J, J') = \exp\!\left(-\frac{\|J - J'\|^2}{2\sigma^2}\right) \qquad (10) \]

\[ K_E(J, J') = \exp\!\left(-\frac{\|J - J'\|}{2\sigma^2}\right) \qquad (11) \]

In Eqs 10 and 11, \(\|\cdot\|\) indicates the Euclidean norm, σ is a parameter that controls the kernel's width, and \(J\) and \(J'\) are vectors in the input space.
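As a sketch, assuming the hybrid kernel is an α-weighted convex combination of the Gaussian and exponential kernels with a shared width σ (the exact parameterisation in the paper may differ):

```python
import numpy as np

def gaussian_kernel(j1, j2, sigma=1.0):
    """Radial-basis kernel on the squared Euclidean distance."""
    d2 = np.sum((np.asarray(j1, float) - np.asarray(j2, float)) ** 2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def exponential_kernel(j1, j2, sigma=1.0):
    """Like the Gaussian kernel but on the plain Euclidean distance."""
    d = np.linalg.norm(np.asarray(j1, float) - np.asarray(j2, float))
    return np.exp(-d / (2.0 * sigma ** 2))

def hybrid_kernel(j1, j2, alpha=0.5, sigma=1.0):
    """Convex combination of the two kernels, weighted by alpha."""
    return (alpha * gaussian_kernel(j1, j2, sigma)
            + (1.0 - alpha) * exponential_kernel(j1, j2, sigma))

k_same = hybrid_kernel([0.0, 0.0], [0.0, 0.0])  # identical points give 1.0
```

Both component kernels are positive semi-definite, so any convex combination of them is as well, which is what the Mercer-condition argument in the text relies on.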
The convergence of the proposed IDFC aligns with the properties of the fuzzy C-means (FCM) objective function, which is bounded below by zero and monotonically reduced at each step. Since the hybrid kernel is positive semi-definite, adjusting the fuzzy memberships \(u_{im}\) and cluster centres \(c_m\) always yields an objective function value \(J_{obj}\) that does not increase. It can be confirmed that IDFC is guaranteed to converge to a local optimum of \(J_{obj}\) while adhering to the constraint \(\sum_m u_{im} = 1\). Under high sparsity, the hybrid Gaussian-exponential kernel is employed to avoid the instability of regular fuzzy C-means in high-dimensional feature spaces. It avoids violating the Mercer condition and stabilises membership updates by smoothing sparse similarities. This analysis confirms that IDFC shares the convergence properties of fuzzy C-means and exhibits enhanced optimality on sparse data due to the adaptive weighting factor α in the kernel.
The IDFC-generated partitioned data is denoted as \(B_m\), with each partition being processed by a slave node. Consequently, the hybrid kernel function enhances the effectiveness of the fuzzy clustering process while reducing computational complexity compared with the traditional DFC method. This advancement makes IDFC more efficient than the conventional DFC method for data partitioning.
Feature processing in slave node
The slave node is responsible for feature selection on the partitioned data and for feature fusion within each partition. The FS is performed using the SVM-MRFE method, which enhances the conventional SVM-RFE. After the FS process, the remaining features are fused within the slave node. This process combines features to generate a new set that enhances the classifier’s ability to differentiate between classes by leveraging multiple feature sets. The feature selection and feature fusion processes are as follows.
Feature selection by modified recursive feature elimination: In the Spark framework, FS plays a vital role in high-dimensional classification. Conventional SVM-RFE [64] has been widely used for FS. However, it has certain limitations, especially when dealing with datasets containing many correlated features. To address this issue, the SVM-MRFE approach is proposed in this research. SVM-MRFE uses the discriminative power of an SVM to evaluate the importance of features based on their contribution to the decision boundary. Unlike traditional RFE, which may struggle with noisy or redundant features, SVM-MRFE accurately identifies the most informative features, removing less relevant attributes while retaining those that maximally enhance class separability. The final selected feature set improves classification accuracy, reduces overfitting, and lowers computational cost by focusing the model on the most discriminative dimensions of the data.
The conventional SVM-RFE method follows a straightforward sequence of steps for FS:
- Training the SVM Classifier: The SVM model is trained on the partitioned data (\(B_m\)) obtained from the master node to create a classification model.
- The Ranking Criterion for all Features: The computed significance of each feature was evaluated based on the weights or support vectors generated by the SVM classifier.
- Remove Features with the Smallest Ranking Values: Features that contribute least to the model's performance are removed iteratively, focusing on those with the lowest ranking scores.
However, SVM-RFE may not be the most effective method for datasets with strong feature correlations and can be computationally expensive when applied to high-dimensional data, since it relies solely on margin-based ranking and can be influenced by correlated features and local optima. To address these issues, the hybrid feature selection approach SVM-MRFE was introduced as an enhanced FS method that outperforms conventional SVM-RFE. This method integrates a modified Fisher score with SVM-RFE to balance feature evaluation by accounting for both statistical separability and the classifier's contribution. The hybrid ranking score of each feature is determined using Eq 12:

\[ R(f) = \eta\, F_{mod}(f) + (1 - \eta)\, |w_f| \qquad (12) \]

where η controls the contribution balance (set to 0.5 in this study), \(F_{mod}(f)\) is the modified Fisher score, and \(w_f\) is the SVM weight for feature \(f\). Using backward elimination, SVM-MRFE ranks features and eliminates those with low scores. This hybridisation combines the SVM's margin optimisation with Fisher-type separability, ensuring stability, generalisation, and reduced susceptibility to local optima in high-dimensional data. Algorithm 1 describes the synchronization between the master and slave nodes and illustrates the overall data flow of the hybrid feature selection and feature fusion within the Spark system.
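A hedged sketch of the Eq 12 ranking follows: the conventional Fisher score stands in for the paper's modified variant, the SVM weights are supplied externally rather than fitted, and the min-max scaling of the two terms is an assumption added so the criteria are comparable:

```python
import numpy as np

def fisher_score(X, y):
    """Conventional Fisher score per feature: between-class scatter
    over within-class scatter (stands in for the modified variant).
    """
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / np.maximum(den, 1e-12)

def hybrid_rank(X, y, svm_weights, eta=0.5):
    """Eta-weighted blend of separability and margin importance."""
    def scale(v):
        return (v - v.min()) / max(v.max() - v.min(), 1e-12)
    return eta * scale(fisher_score(X, y)) + (1 - eta) * scale(np.abs(svm_weights))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
y = np.array([0] * 20 + [1] * 20)
X[y == 1, 0] += 3.0                 # make feature 0 highly discriminative
w = np.array([2.0, 0.1, 0.1, 0.1])  # pretend SVM weights (hypothetical)
scores = hybrid_rank(X, y, w)
worst = int(np.argmin(scores))      # backward elimination would drop this one
```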
The optimised Fisher score used in SVM-MRFE is derived from established principles of discriminant analysis, where the ratio of between-class to within-class variance quantifies the discriminatory efficacy of a given feature. By combining this score with SVM’s margin-based weights, SVM-MRFE creates a criterion that maximises both hyperplane margin and statistical separability. The proposed SVM-MRFE method improves on the conventional SVM-RFE method through the following steps:
- Train the SVM Classifier: Similar to SVM-RFE, the SVM model is initially trained on the partitioned data (\(B_m\)) obtained from the master node.
- The conventional Fisher score [65] and the modified Fisher score are expressed in Eq 13 and Eq 14, respectively. The conventional Fisher score of feature \(f\) takes the standard discriminant-analysis form

\[ F(f) = \frac{\sum_{v=1}^{V} n_v \left(\mu_v^f - \mu^f\right)^2}{\sum_{v=1}^{V} n_v \left(\sigma_v^f\right)^2} \qquad (13) \]

In the above equations, \(\mu_v^f\) indicates the local mean of the \(f\)-th feature in class \(v\), \(\mu\) denotes the mean vector, and \(\mu^f\) and \(\sigma^f\) denote the mean and standard deviation of the \(f\)-th feature across the entire dataset; SE specifies the whole-data standard error and \(\bar{x}\) indicates the whole-data sample mean. The feature of the partitioned data is indicated as f, the total number of classes is denoted as V, and the partitioned data mean is denoted as \(\mu_m\).
A modified version of the Fisher ranking, defined by a convex, monotonically decreasing curve related to within-class variance, is used in the SVM-MRFE procedure to remove the least important features. Every iteration recomputes the SVM margin via a convex quadratic program, ensuring that the optimization always remains within a convex feasible region and therefore eventually converges to a stable subset of features. Compared with traditional RFE, which can oscillate due to substantial differences among highly correlated features, SVM-MRFE provides better convergence because the modified Fisher score's separability criterion introduces a strong convexity component. To avoid overfitting under high sparsity, one can exploit the regularity of the Fisher denominator, and the final feature subset is comparable to the solution with minimal empirical risk. Therefore, this process attains asymptotic optimality in maximising the margin.
- Evaluation of the ranking criteria for all features: Using the modified Fisher score, each feature is ranked according to its relevance and contribution to the classification task.
- Remove Features with the Smallest Ranking Values: Features with the lowest ranking values are eliminated, as in the SVM-RFE process, but with improved criteria derived from the Modified Fisher score.
This enhanced method highlights the most significant features and simplifies the model, improving its interpretability and efficacy. Consequently, the SVM-MRFE method optimizes FS for high-dimensional data, retaining only the most relevant features for the subsequent classification stage, thereby improving both performance and computational efficiency compared with the conventional SVM-RFE method. The selected features (\(F_m\)) from different partitions are fused to form the combined feature set \(F\). Feature fusion is a method used to combine various feature sets to create a more informative and comprehensive representation of data for ML tasks. Although feature selection reduces dimensionality by retaining only the most informative features, redundancy may still exist among the selected features. Feature fusion combines these selected features into a compact form, reducing redundancy and improving the stability and robustness of the classifier. This step can enhance generalization by capturing the most discriminative information in a smaller, fused feature space, leading to improved classification performance while maintaining computational efficiency. By fusing only the relevant features, the process avoids unnecessary information loss that would occur if fusion were applied to the entire high-dimensional feature set.
High-dimensional data classification by ISSBEC architecture.
The final classification phase is performed using the ISSBEC architecture, which employs a multi-block structure to improve performance via diverse classifiers and feature-space optimization methods. Each block contains FF-RSS, MSE, and a base classifier as illustrated in Fig 1. Algorithm 2 depicts the entire ISSBEC process and the description of each component as follows:
Algorithm 2. ISSBEC process
Input: Fused feature set
Output: Final ensemble prediction
1: for i = 1, 2, ..., n do
2:   Generate the subspaces from the fused feature set.
3:   Select the number of features (M) for each rotation subspace.
4:   Randomly divide the subspace features into l disjoint subsets (j = 1, 2, ..., l).
5:   for j = 1, 2, ..., l do
   • Construct the j-th subspace dataset for training.
   • Select 75% of the samples by bootstrapping to obtain a new training set.
   • Apply PCA to the bootstrapped set to obtain the coefficient matrix, i.e., coefficients = PCA.fit(training set).
6:   end for
7:   Design the rotation matrix from the PCA coefficients.
8:   Rearrange the rotation matrix with respect to the original feature order.
9:   Retrieve the new training set using the rotation matrix.
10:  Apply the base classifier to the rotated training set.
11: end for (all subspaces)
12: Ensemble all base-classifier predictions by majority voting.
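Steps 4-9 of Algorithm 2 can be sketched in Python. This is a minimal sketch under stated assumptions: the rotation matrix is assembled block-diagonally from per-group PCA loadings (in the style of rotation-forest construction), which may differ in detail from the authors' implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

def rotation_subspace(X, l=3, sample_frac=0.75, seed=0):
    """Split the features into l disjoint groups, fit PCA on a 75% bootstrap
    sample of each group, and assemble a block-diagonal rotation matrix R
    that is applied to the whole training set (steps 4-9 of Algorithm 2)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    groups = np.array_split(rng.permutation(d), l)  # l disjoint feature groups
    R = np.zeros((d, d))
    for g in groups:
        idx = rng.choice(n, size=int(sample_frac * n), replace=True)  # bootstrap
        # PCA loadings of the group form one diagonal block of R
        R[np.ix_(g, g)] = PCA().fit(X[idx][:, g]).components_.T
    return X @ R, R
```

A base classifier would then be trained on the rotated set `X @ R`, one rotation per block.
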
Feature Fusion based Random Subspace (FF-RSS).
Random Subspace [66] is an ensemble learning technique that builds multiple classifiers on randomly selected subspaces of the feature space. This method enhances overall performance by increasing the diversity of the classifiers, each constructed exclusively from the features of its respective subspace. FF-RSS operates on a fused feature space, generated from the Spark partitions by integrating the distributed feature sets into a single space that carries both inter- and intra-partition feature information. From this space, FF-RSS generates a random subspace for each learner, computed using Eq 17. Here, a random weight vector and a randomly chosen sub-feature set define each subspace, and λ controls the degree of overlap between different feature subspaces. This design provides diversity across subspaces through random subspace selection while exploiting global information in the data space.
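Since Eq 17 itself is not reproduced here, the following sketch only illustrates the overlap-controlled sampling idea: each learner draws a subspace of m features, of which a fraction λ comes from a shared pool. The weighting of Eq 17 and the exact role of λ are assumptions of this sketch.

```python
import numpy as np

def ff_rss_subspaces(d, n_learners, m, lam=0.3, seed=0):
    """Draw one random feature subspace of size m per learner. A shared pool
    of int(lam * m) features appears in every subspace, so lam controls the
    degree of overlap between subspaces (illustrative reading of Eq 17)."""
    rng = np.random.default_rng(seed)
    shared = rng.choice(d, size=int(lam * m), replace=False)
    pool = np.setdiff1d(np.arange(d), shared)
    subspaces = []
    for _ in range(n_learners):
        own = rng.choice(pool, size=m - len(shared), replace=False)
        subspaces.append(np.sort(np.concatenate([shared, own])))
    return subspaces
```
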
Mixed space enhancement.
MSE [43] is an advanced feature-enhancement method that combines multi-scale rotation reconstruction with subspace features to produce a diverse and robust feature set. By randomly selecting the number of features in each rotation subspace across different scales for each base classifier, MSE increases the diversity of the ensemble system. This stage utilises multi-scale rotation reconstruction to produce diverse mixed-enhanced features. To further enhance diversity, MSE refines the random subspaces by gathering orthogonally transformed versions of their elements. MSE is defined as the enhanced version of the random subspace, as represented in Eq 18, in which a random rotation transformation is combined with the original subspace through a mixing coefficient. This process balances the preservation of essential features with improved diversity through random variation.
The subspace features obtained using FF-RSS offer a more compact and effective representation than the original feature space. This multi-scale feature fusion mechanism enhances feature expression and ensures high scalability, as rotation-based feature fusion can be replaced with other operations if needed. The mixed enhanced features obtained from the MSE component in each block are then passed to that block's base classifier.
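A minimal sketch of the enhancement in Eq 18, under the assumption that it takes the form of a convex mixture between the subspace features and a random orthogonal rotation of them (the symbol names and the mixture form are assumptions of this sketch):

```python
import numpy as np

def mixed_space_enhance(Xs, beta=0.5, seed=0):
    """Mix subspace features Xs with a randomly rotated copy of themselves.
    beta plays the role of the mixing coefficient: beta = 0 preserves the
    original features, larger beta injects more rotated (diverse) content."""
    rng = np.random.default_rng(seed)
    d = Xs.shape[1]
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal rotation
    return (1 - beta) * Xs + beta * (Xs @ Q)
```
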
Importance of base classifiers.
The base classifier is an ML algorithm that produces predictions independently and forms the foundation of more complex classification strategies, particularly ensemble learning. In the proposed ISSBEC classifier, the mixed enhanced features generated by the MSE component of each block are fed into that block's base classifier to produce predictions. We used three base classifiers in the proposed ISSBEC classifier, discussed below.
Random Forest (RF): The RF classifier [67,68] is a highly effective supervised learning model used for both regression and classification tasks. For high-dimensional data, it reduces overfitting by combining the results of multiple trees. It uses ensemble learning, combining predictions from multiple decision trees (DTs) to improve overall model performance. In the ISSBEC architecture, the RF classifier uses a diverse ensemble of DTs to balance bias and variance, producing its predicted outcome from the mixed enhanced features supplied by MSE.
Support Vector Machine (SVM): The SVM [69] is an effective two-class classifier for high-dimensional data that constructs an optimal separating hyperplane maximising the margin between classes. SVMs handle sparse data well and can identify complex, non-linear decision boundaries, making them particularly suitable for high-dimensional datasets. In the wrapper framework, the SVM evaluates feature subsets by their predictive performance, ensuring that the selected features are directly optimized for classification. Although they can encode complex relationships, non-linear kernels and metaheuristic feature selectors significantly raise processing costs and decrease interpretability. SVMs can handle both linear and non-linear decision boundaries, with the Gaussian RBF kernel used in the non-linear case to map features into a high-dimensional space where linear separation is possible. The leave-one-out cross-validation method was used to assess the SVM model and estimate its misclassification rate: a single sample is removed from the training set, the model is trained on the remaining data, and the excluded sample is tested against the resulting hyperplane. By repeating this process for each sample, the total number of misclassifications estimates the model's risk. In summary, the SVM classifier builds an optimal separating hyperplane from the mixed enhanced features supplied by MSE and uses cross-validation to assess performance, yielding its predicted outcome.
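The leave-one-out procedure described above can be written compactly with scikit-learn. The synthetic data below merely stands in for a gene-expression matrix; it is not from the paper's datasets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

def loocv_error(X, y):
    """Leave-one-out estimate of the misclassification rate: each sample is
    held out once, the RBF-kernel SVM is trained on the rest, and the
    held-out sample is tested on the resulting hyperplane."""
    scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=LeaveOneOut())
    return 1.0 - scores.mean()

# Illustrative synthetic data (40 samples, 20 features).
X_demo, y_demo = make_classification(n_samples=40, n_features=20, random_state=0)
err = loocv_error(X_demo, y_demo)
```
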
Improved K-Nearest Neighbours (IKNN): In the ISSBEC architecture, the IKNN classifier is a critical component that effectively classifies high-dimensional data, making predictions from the mixed enhanced features produced by MSE. This classifier extends the conventional KNN algorithm [70] with several enhancements that significantly improve its performance and generalization capabilities. The basic KNN algorithm computes the distances between a new instance and all training instances using a specified distance metric; the Euclidean distance is commonly used for this calculation, as formulated in Eq 19.
However, Euclidean distance has limitations, such as treating all features equally essential and being sensitive to feature scaling. The IKNN classifier uses Weighted Minkowski Distance to address these issues. Unlike Euclidean distance, which does not account for the varying significance of different features, Weighted Minkowski Distance allows assigning different weights to features based on their relevance. This approach enables the IKNN classifier to better generalize from the training data to new, unseen instances, mitigate overfitting by considering feature weights, and achieve more accurate classifications by reflecting the true importance of the features in the distance metric. The mathematical formulation of the Weighted Minkowski Distance is expressed in Eq 20.
where the weight of each feature is determined using Eq 21 and the value parameter is determined using Eq 22.
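Assuming Eq 20 takes the standard weighted Minkowski form, d(a, b) = (Σⱼ wⱼ |aⱼ − bⱼ|ᵖ)^(1/p), it can be implemented directly (the feature weights wⱼ and the value parameter p would come from Eqs 21 and 22, which are not reproduced here):

```python
import numpy as np

def weighted_minkowski(a, b, w, p=2.0):
    """Weighted Minkowski distance: (sum_j w_j * |a_j - b_j|**p) ** (1/p).
    With w_j = 1 and p = 2 this reduces to the Euclidean distance of Eq 19."""
    a, b, w = map(np.asarray, (a, b, w))
    return float((w * np.abs(a - b) ** p).sum() ** (1.0 / p))
```

With unit weights and p = 2 the classic 3-4-5 triangle gives a distance of 5, which provides a quick sanity check.
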
Algorithm 3. IKNN algorithm
Input: Features from MSE
Output: Class of the sample
1: Find the value parameter using Eq 22.
2: Find the weight of each feature using Eq 21.
3: Evaluate the Weighted Minkowski distance using Eq 20.
4: Find the average distance of each class.
5: Select the nearest neighbour based on the average distance.
6: Return the predicted outcome.
Here, the weight of a feature reflects the difference between the average accuracy of the conventional KNN (k = 3, 5, 7) and its average accuracy without that feature set. After computing these weighted distances, the IKNN classifier sorts them in ascending order and selects the k training instances closest to the new instance; the classification decision is a majority vote among the k nearest neighbours. Thus, the IKNN classifier of the ISSBEC architecture enhances the conventional KNN approach by using the Weighted Minkowski distance to improve generalization, reduce overfitting, and achieve more accurate predictions in high-dimensional classification. By introducing this weighted distance, IKNN strengthens the generalization ability of the ISSBEC architecture, ensuring that it can better handle the complexities of high-dimensional data and achieve more accurate and reliable predictions. Algorithm 3 depicts the entire IKNN process. Finally, the predicted outcomes from Block-1, Block-2, ..., Block-n are combined by majority voting to produce the final classification outcome. Although traditional ensemble fusion methods such as weighted voting, stacking, and probability-based fusion can improve predictive performance, they introduce notable drawbacks in high-dimensional environments: they are computationally expensive, prone to overfitting, and increase communication overhead in Spark-based systems. In contrast, majority voting is simple, robust, and computationally efficient, requiring no extra training. Its low communication cost and stability make it particularly advantageous for scalable ensemble learning in high-dimensional settings. In this study, majority voting yielded a predictable, interpretable, and manageable fusion strategy that conserves model diversity and remains accurate for binary data; the framework was also tested on multi-class data.
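The final fusion step is plain majority voting over the block predictions, which can be expressed in a few lines:

```python
from collections import Counter

def majority_vote(block_predictions):
    """Combine per-block predictions (one list of labels per base classifier,
    aligned by sample) by majority voting over each sample."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*block_predictions)]
```

For example, with three blocks predicting `[1, 0, 1]`, `[1, 1, 0]`, and `[0, 1, 1]` for three samples, the ensemble output is `[1, 1, 1]`.
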
Results
The proposed high-dimensional data classification framework was implemented using Python 3.7 on a 12th-generation Intel Core i5-1135G7 processor clocked at 2.4 GHz with 16 GB of RAM. To verify the effectiveness of the proposed model, we tested five datasets: Alizadeh-2000-v1 [71], Armstrong-2002-v1 [72], Chowdary-2006 [73], Golub-1999-v1 [74], and Gordon-2002 [75]. The datasets are described in Table 3. All experiments used an 80:20 train-test split across the benchmark datasets, with model hyperparameters determined via 5-fold cross-validation on the training set. The averaged results from each run form the basis of the reported metrics (accuracy, sensitivity, specificity, precision, F-measure, false negative rate, false positive rate, Matthews correlation coefficient, and negative predictive value), all calculated on the test set.
A comprehensive analysis was conducted to evaluate the performance of the Spark-based ISSBEC approach against traditional methods. This analysis utilised an extensive set of metrics, including sensitivity, NPV, specificity, F-measure, FNR, precision, FPR, MCC, and accuracy, to provide a thorough assessment of the method's effectiveness. These evaluation metrics are defined mathematically in Eqs 23 to 31.
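All of these metrics follow from the four confusion-matrix counts. The definitions below are the standard ones for binary classification, which Eqs 23 to 31 are assumed to follow:

```python
def metrics_from_confusion(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    acc  = (tp + tn) / (tp + tn + fp + fn)
    sens = tp / (tp + fn)                     # sensitivity / recall
    spec = tn / (tn + fp)                     # specificity
    prec = tp / (tp + fp)                     # precision
    npv  = tn / (tn + fn)                     # negative predictive value
    f1   = 2 * prec * sens / (prec + sens)    # F-measure
    fnr  = fn / (fn + tp)                     # false negative rate
    fpr  = fp / (fp + tn)                     # false positive rate
    mcc  = (tp * tn - fp * fn) / (
        ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5)
    return dict(accuracy=acc, sensitivity=sens, specificity=spec,
                precision=prec, npv=npv, f_measure=f1,
                fnr=fnr, fpr=fpr, mcc=mcc)
```
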
Additionally, we included an ablation study, ROC curve analysis, and the Wilcoxon test. We comprehensively evaluated the efficacy of our proposed model using various metrics and conventional methods. The primary objective was to enhance the classification accuracy of the model.
Table 4 presents a comparative analysis of the ISSBEC strategy against several established methods, including EfficientNet, HDELC [31], CESE [43], KNN, LinkNet, SVM, RF, and MvDSCN [55]. The analysis focuses on key performance metrics such as accuracy, sensitivity, specificity, precision, F-measure, Matthews correlation coefficient (MCC), NPV, false positive rate (FPR), and false negative rate (FNR) across the benchmark datasets. With consistent results across all datasets and metrics, the ISSBEC classifier is the most reliable. ISSBEC achieves accuracy values above 0.95 on all datasets, demonstrating its capability for reliable high-dimensional classification. The sensitivity and specificity outcomes demonstrate the strength of ISSBEC: specificity scores above 0.9 highlight its ability to reduce false positives, while sensitivity scores above 0.9 indicate a minimal rate of false negatives. The F-measure shows that ISSBEC provides a well-balanced combination of recall and precision suitable for typical classification tasks. By achieving strong scores on all datasets, the MCC confirms the robustness of the ISSBEC approach, and its high NPV values demonstrate its efficacy in identifying true negatives. Across the error metrics, ISSBEC consistently achieves the lowest FNR and FPR: its FNR is the lowest among all methods, highlighting its ability to reduce missed positive cases, while its FPR remains below 0.05, indicating a nominal rate of false positives. Overall, Table 4 shows that ISSBEC outperforms the other classification methods across all datasets and metrics, establishing it as the most reliable approach. Its low error rates, balanced specificity and sensitivity, and high accuracy indicate its efficiency on the benchmark datasets; it does not overfit the training data and performs consistently over folds and repetitions.
Overfitting is controlled by ensemble diversity (via FF-RSS and MSE) and feature selection (via SVM-MRFE). This study proposes IDFC for data grouping, SVM-MRFE for feature selection, and ISSBEC for classification. An ablation study evaluated the impact of these approaches on the proposed model's performance, as shown in Table 5, and found that the proposed approach achieved better results across metrics for all datasets. The sensitivity of the ISSBEC framework to its two most important hyperparameters, the subspace dimensionality ratio γ in FF-RSS and the hybrid kernel weighting coefficient α in IDFC, was analysed to evaluate its robustness and reliability. The number of clusters (C) in IDFC was fixed at C = 3, the point at which all Spark nodes experienced balanced data distribution and stable partitioning. All subsequent studies were conducted with C = 3, as selecting C = 2-4 during the evaluation phase yielded negligible performance differences across the three values, while higher cluster counts incurred longer computation times without meaningful gains in accuracy. The sensitivity analysis varied γ and α, as represented in Fig 2, from which it is observed that the accuracy remains within ±2% for every parameter change. This demonstrates that the framework's performance remains steady and consistent even when parameters and datasets change, confirming ISSBEC's scalability and resilience.
The ROC analysis of the ISSBEC model was compared with EfficientNet, HDELC [31], CESE [43], KNN, LinkNet, SVM, and RF for high-dimensional data classification in the Spark framework. This comparison across the benchmark datasets is illustrated in Figs 3, 4, 5, 6, and 7, which show the performance of each model. From these figures, it is evident that the ISSBEC strategy achieved superior AUC performance relative to the traditional approaches. This consistent superiority in AUC across the benchmark datasets underscores the effectiveness of the ISSBEC strategy for high-dimensional data classification. The achievement is attributed to the strategic integration of IDFC for effective data partitioning, SVM-MRFE for effective feature selection, and an improved ensemble model combining SVM, RF, and IKNN, which together enhance the model's classification capability and yield improved ROC performance.
The Friedman test is a non-parametric statistical test used to compare three or more related groups evaluated on the same datasets. It ranks the performance of each method within each dataset and determines whether the differences in ranks are statistically significant; a p-value below 0.05 indicates a significant difference between methods. Table 6 presents the results of the Friedman test comparing the performance of ISSBEC with various classifiers across the five datasets. The p-values indicate that, in most cases, ISSBEC performs significantly differently from the other models: the comparison with EfficientNet yields p-values ranging from 0.044 to 0.077, and with HDELC from 0.020 to 0.073. Other models, such as CESE, KNN, LinkNet, SVM, RF, and MvDSCN, also show varying levels of significance across datasets, with several p-values below the 0.05 threshold, suggesting statistically significant differences. Overall, the Friedman test confirms that ISSBEC consistently exhibits distinct and superior performance compared to the majority of existing methods across all datasets.
The Wilcoxon test results in Table 7 show how the performance of ISSBEC compares statistically with other classifiers across the five datasets. Using a significance threshold of p < 0.1, several models display statistically significant differences when compared with ISSBEC. ISSBEC shows significant differences against CESE on Alizadeh2000-v1 (p = 0.051) and SVM on Alizadeh2000-v1 (p = 0.054). On Armstrong2002-v1, significant differences are observed for HDELC (p = 0.070), KNN (p = 0.075), and SVM (p = 0.092), while RF (p = 0.107) is not significant. Across the other datasets, multiple models, including LinkNet and MvDSCN, record p-values close to or below the 0.1 threshold, indicating consistent statistical separation between ISSBEC and competing methods. Overall, the Wilcoxon analysis confirms that ISSBEC's performance differs significantly from most existing classifiers on various datasets, supporting its robustness and superiority.
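Both tests are available in SciPy and operate on paired per-dataset scores. The accuracy values below are illustrative placeholders, not the paper's reported results:

```python
from scipy.stats import friedmanchisquare, wilcoxon

# Illustrative per-dataset accuracies for three methods (NOT the paper's values).
issbec = [0.96, 0.97, 0.95, 0.96, 0.97]
hdelc  = [0.89, 0.90, 0.88, 0.91, 0.89]
cese   = [0.91, 0.92, 0.90, 0.92, 0.91]

# Friedman: compares three or more related groups via within-dataset ranks.
stat_f, p_f = friedmanchisquare(issbec, hdelc, cese)

# Wilcoxon signed-rank: pairwise comparison of two methods on the same datasets.
stat_w, p_w = wilcoxon(issbec, hdelc)
```
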
All the benchmark datasets have binary class labels, so the model was additionally tested on two multi-class datasets with 3 and 10 classes, as experimented in [43]. These results are shown in Table 8, from which it is observed that the ISSBEC approach also shows improvement on multi-class datasets. In future work, we will extend the model to handle multi-class labels as well as extremely high-dimensional data.
Complexity analysis
This section presents a detailed theoretical analysis of the time complexity of the proposed Spark-based and non-Spark-based ISSBEC for the high-dimensional data classification problem. The model encompasses data preprocessing, IDFC clustering, SVM-MRFE feature selection, feature fusion, subspace transformation (FF-RSS + MSE), base-learner classification (SVM, RF, IKNN), and ensemble voting.
The non-Spark (single-machine) version performs all computations sequentially. Core procedures such as IDFC and SVM-MRFE have quadratic time complexity in the data size and feature dimension, leading to infeasibly long execution times even for moderately sized high-dimensional datasets. In contrast, the Spark implementation distributes the work over P partitions, yielding near-linear speedups in the processing modules: significant performance improvements are achieved without any degradation in accuracy by parallelising feature selection, clustering, and subspace generation. The stage-wise complexity analysis of both variants is given in Table 9, where n denotes the total number of data samples, d the number of features, k the number of clusters in IDFC, m the number of features per subspace, T the number of decision trees in the random forest, l the number of subspace rotations in MSE, and C the number of classifiers.
Discussion
The proposed ISSBEC framework integrates IDFC partitioning and SVM-MRFE feature selection to address the challenges of high-dimensional data classification. The ablation analysis clearly shows that substituting either component with conventional methods leads to substantial performance degradation, confirming that both IDFC and SVM-MRFE contribute uniquely to the overall effectiveness of the system. The experimental results across all five benchmark datasets (Alizadeh2000-v1, Armstrong2002-v1, Chowdary2006, Golub1999, and Gordon2002) consistently demonstrate the superior performance of the proposed ISSBEC classifier compared to existing models such as KNN, SVM, RF, EfficientNet, HDELC, and CESE, which show varying degrees of effectiveness but struggle to maintain robustness and stability across high-dimensional datasets. Specifically, ISSBEC records the highest accuracy, ranging from 0.956 to 0.970, significantly outperforming the compared models, which generally achieve accuracies between 0.74 and 0.89. Moreover, ISSBEC's precision (0.904–0.994) and F-measure (0.928–0.975) indicate that the proposed framework achieves both high true-positive rates and excellent classification balance. Despite these strengths, its performance on multimodal data remains unexplored. Nevertheless, these findings reinforce the importance of combining advanced clustering with robust feature selection in ensemble-based learning.
The model proposed in this paper confirms its suitability for binary classification. However, we also verified the approach on multi-class datasets and found only a minor difference when comparing against baselines such as CESE in Table 8. Hence, in future work we would like to extend the framework to multi-class and multi-modal high-dimensional data classification with improved accuracy and lower complexity.
Conclusion
A novel high-dimensional classification approach is proposed in this study by combining the Spark framework with the ISSBEC architecture. The approach is divided into three main stages: preprocessing, processing through the Spark framework, and classification. During preprocessing, the high-dimensional data were min-max normalised to standardise them. The Spark framework was used to handle the high-dimensional data, with the master node performing IDFC-based data partitioning and the slave nodes performing feature selection with SVM-MRFE, followed by feature fusion. The classification stage employs the ISSBEC approach, which comprises several blocks, each featuring FF-RSS, MSE, and a base classifier; we used RF in Block-1, SVM in Block-2, and IKNN in Block-3. The proposed approach was evaluated using several performance metrics and compared with existing methods, demonstrating significant improvements in classification performance and efficacy for high-dimensional data. In this study, the model was designed for small-scale, high-dimensional data with continuous features and binary class labels; it may be suitably extended to classify multi-class and multi-modal high-dimensional datasets.
References
- 1. Yu Z, Zhang Y, You J, Chen CLP, Wong H-S, Han G, et al. Adaptive semi-supervised classifier ensemble for high dimensional data classification. IEEE Trans Cybern. 2019;49(2):366–79. pmid:29989979
- 2. Bellman R. Adaptive processes-A guided tour. Princeton University. 1961.
- 3. Le P, Gong X, Ung L, Yang H, Keenan BP, Zhang L, et al. A robust ensemble feature selection approach to prioritize genes associated with survival outcome in high-dimensional gene expression data. Front Syst Biol. 2024;4:1355595. pmid:39897528
- 4. Wang C, Hu Q, Wang X, Chen D, Qian Y, Dong Z. Feature selection based on neighborhood discrimination index. IEEE Trans Neural Netw Learn Syst. 2018;29(7):2986–99. pmid:28650830
- 5. Zhao B. Analysis challenges for high dimensional data. London, Ontario: The University of Western Ontario. 2018.
- 6. Nie J, Qin Z, Liu W. High-dimensional overdispersed generalized factor model with application to single-cell sequencing data analysis. Statistics in Medicine. 2024.
- 7. Jolliffe IT. Principal component analysis for special types of data. Springer; 2002.
- 8. Fisher RA. The use of multiple measurements in taxonomic problems. Annals of Eugenics. 1936;7(2):179–88.
- 9. Van der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of Machine Learning Research. 2008;9(11).
- 10. Tibshirani R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology. 1996;58(1):267–88.
- 11. Peng H, Long F, Ding C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell. 2005;27(8):1226–38. pmid:16119262
- 12. Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Machine Learning. 2002;46(1–3):389–422.
- 13. Hastie T, Tibshirani R, Wainwright M. Statistical learning with sparsity. 2015.
- 14. Huang L, Wang Y, Liu Y. An efficient recursive feature elimination algorithm based on random forests for high-dimensional data. Applied Intelligence. 2022;52(7):7624–38.
- 15. Estévez PA, Tesmer M, Perez CA, Zurada JM. Normalized mutual information feature selection. IEEE Trans Neural Netw. 2009;20(2):189–201. pmid:19150792
- 16. Brankovic A, Falsone A, Prandini M, Piroddi L. A Feature selection and classification algorithm based on randomized extraction of model populations. IEEE Trans Cybern. 2018;48(4):1151–62. pmid:28371789
- 17. Xue B, Zhang M, Browne WN, Yao X. A Survey on Evolutionary Computation Approaches to Feature Selection. IEEE Trans Evol Computat. 2016;20(4):606–26.
- 18. Mafarja M, Aljarah I, Faris H, Hammouri AI, Al-Zoubi AM, Mirjalili S. Binary grasshopper optimisation algorithm approaches for feature selection problems. Expert Systems with Applications. 2019;117:267–86.
- 19. Huang Z, Yang C, Zhou X, Huang T. A hybrid feature selection method based on binary state transition algorithm and relieff. IEEE J Biomed Health Inform. 2019;23(5):1888–98. pmid:30281502
- 20. Hsu H-H, Hsieh C-W, Lu M-D. Hybrid feature selection by combining filters and wrappers. Expert Systems with Applications. 2011;38(7):8144–50.
- 21. Nematzadeh H, García-Nieto J, Aldana-Montes JF, Navas-Delgado I. Pattern recognition frequency-based feature selection with multi-objective discrete evolution strategy for high-dimensional medical datasets. Expert Systems with Applications. 2024;249:123521.
- 22. Bohrer J da S, Dorn M. Enhancing classification with hybrid feature selection: a multi-objective genetic algorithm for high-dimensional data. Expert Systems with Applications. 2024;255:124518.
- 23. Feng G. Feature selection algorithm based on optimized genetic algorithm and the application in high-dimensional data processing. PLoS One. 2024;19(5):e0303088. pmid:38723061
- 24. Jin X, Wei B, Deng L, Yang S, Zheng J, Wang F. An adaptive pyramid PSO for high-dimensional feature selection. Expert Systems with Applications. 2024;257:125084.
- 25. Askr H, Abdel-Salam M, Hassanien AE. Copula entropy-based golden jackal optimization algorithm for high-dimensional feature selection problems. Expert Systems with Applications. 2024;238:121582.
- 26. Jiang H, Yang Y, Wan Q, Dong Y. Feature selection based on dynamic crow search algorithm for high-dimensional data classification. Expert Systems with Applications. 2024;250:123871.
- 27. Li G, Cui Y, Su J. Adaptive mechanism-based grey wolf optimizer for feature selection in high-dimensional classification. PLoS One. 2025;20(5):e0318903. pmid:40378158
- 28. Xu Z, Yang F, Tang C, Wang H, Wang S, Sun J, et al. FG-HFS: a feature filter and group evolution hybrid feature selection algorithm for high-dimensional gene expression data. Expert Systems with Applications. 2024;245:123069.
- 29. Li W, Chai Z. MPEA-FS: a decomposition-based multi-population evolutionary algorithm for high-dimensional feature selection. Expert Systems with Applications. 2024;247:123296.
- 30. Deegalla S, Walgama K, Papapetrou P, Boström H. Random subspace and random projection nearest neighbor ensembles for high dimensional data. Expert Systems with Applications. 2022;191:116078.
- 31. Zhao M, Ye N. High-dimensional ensemble learning classification: an ensemble learning classification algorithm based on high-dimensional feature space reconstruction. Applied Sciences. 2024;14(5):1956.
- 32. Pes B. Learning from high-dimensional and class-imbalanced datasets using random forests. Information. 2021;12(8):286.
- 33. Wang Q, Nguyen T-T, Huang JZ, Nguyen TT. An efficient random forests algorithm for high dimensional data classification. Adv Data Anal Classif. 2018;12(4):953–72.
- 34. Guan X, Terada Y. Sparse kernel k-means for high-dimensional data. Pattern Recognition. 2023;144:109873.
- 35. Harikumar S, Aravindakshan Savithri A, Kaimal R. A depth-based nearest neighbor algorithm for high-dimensional data classification. Turk J Elec Eng & Comp Sci. 2019;27(6):4082–101.
- 36. Ghosh D, Cabrera J. Enriched random forest for high dimensional genomic data. IEEE/ACM Trans Comput Biol Bioinform. 2022;19(5):2817–28. pmid:34129502
- 37. Chen W, Xu Y, Yu Z, Cao W, Chen CLP, Han G. Hybrid dimensionality reduction forest With pruning for high-dimensional data classification. IEEE Access. 2020;8:40138–50.
- 38. Nakariyakul S. A solution to the high-dimensional classification problem using an improved hybrid feature selection algorithm guided by interaction information. IEEE Access. 2020;8:145909–17.
- 39. Chen K, Xue B, Zhang M, Zhou F. Evolutionary multitasking for feature selection in high-dimensional classification via particle swarm optimization. IEEE Trans Evol Computat. 2022;26(3):446–60.
- 40. Niu X, Ma W. Semi-supervised classifier ensemble model for high-dimensional data. Information Sciences. 2023;643:119203.
- 41. Ma J, Cheng H, Chen H, Zhang Y, Li Y, Shen Y, et al. Envelope rotation forest: A novel ensemble learning method for classification. Neurocomputing. 2025;618:129059.
- 42. Xu Y, Yu Z, Cao W, Chen CLP, You J. Adaptive classifier ensemble method based on spatial perception for high-dimensional data classification. IEEE Trans Knowl Data Eng. 2021;33(7):2847–62.
- 43. Xu Y, Yu Z, Cao W, Chen CLP. A novel classifier ensemble method based on subspace enhancement for high-dimensional data classification. IEEE Trans Knowl Data Eng. 2023;35(1):16–30.
- 44. Yan F, Wang X, Zeng Z, Hong C. Adaptive multi-view subspace clustering for high-dimensional data. Pattern Recognition Letters. 2020;130:299–305.
- 45. Mohammadi M, Tajik E, Martinez-Maldonado R, Sadiq S, Tomaszewski W, Khosravi H. Artificial intelligence in multimodal learning analytics: a systematic literature review. Computers and Education: Artificial Intelligence. 2025;8:100426.
- 46. Lacotte J, Pilanci M. Adaptive and oblivious randomized subspace methods for high-dimensional optimization: sharp analysis and lower bounds. IEEE Trans Inform Theory. 2022;68(5):3281–303.
- 47. Rajendra Kumar P, Chakrabarti P, Chakrabarti T, Unhelkar B, Margala M. Heart disease prediction using spark architecture with fused feature set and hybrid Squeezenet-Linknet model. Biomedical Signal Processing and Control. 2025;100:107070.
- 48. Kanchanamala P, Alphonse AS, Reddy PVB. Heart disease prediction using hybrid optimization enabled deep learning network with spark architecture. Biomedical Signal Processing and Control. 2023;84:104707.
- 49. Abirami S, Prasanna Venkatesan GKD. Deep learning and spark architecture based intelligent brain tumor MRI image severity classification. Biomedical Signal Processing and Control. 2022;76:103644.
- 50. Ma Q, Li Q, Liu X, Imran Khoshnobish T, Bai M, Wang X, et al. NADHS: online anomaly detection for high-dimensional data streams. IEEE Trans Instrum Meas. 2025;74:1–16.
- 51. Mahmud MS, Huang JZ, González-Almagro G, García S. Determination of the Number of Clusters in High-Dimensional Data With Subspace Clusters. IEEE Trans Big Data. 2025;11(6):3240–54.
- 52. Li G, Yu Z, Yang K, Fan Z, Philip Chen CL. Incremental Semisupervised Learning With Adaptive Locality Preservation for High-Dimensional Data. IEEE Trans Artif Intell. 2025;6(11):2990–3004.
- 53. Xue G, Hu L, Wang J, Ablameyko S. ADMTSK: A High-Dimensional Takagi–Sugeno–Kang Fuzzy System Based on Adaptive Dombi T-Norm. IEEE Trans Fuzzy Syst. 2025;33(6):1767–80.
- 54. Gu X, Ni Q, Shen Q. Multilayer Evolving Fuzzy Neural Networks With Self-Adaptive Dimensionality Compression for High-Dimensional Data Classification. IEEE Trans Fuzzy Syst. 2024;32(11):6314–28.
- 55. Zhu P, Yao X, Wang Y, Hui B, Du D, Hu Q. Multiview Deep Subspace Clustering Networks. IEEE Trans Cybern. 2024;54(7):4280–93. pmid:38517724
- 56. Dwivedi R, Tiwari A, Bharill N, Ratnaparkhe M, Mogre P, Gadge P, et al. A novel apache spark-based 14-dimensional scalable feature extraction approach for the clustering of genomics data. J Supercomput. 2023;80(3):3554–88.
- 57. Berahmand K, Zhou X, Li Y, Gururajan R, Barua PD, Acharya UR, et al. NEDL-GCP: a nested ensemble deep learning model for Gynecological cancer risk prediction. Array. 2025;27:100468.
- 58. Bhanja S, Das A. Impact of data normalization on deep neural network for time series forecasting. arXiv preprint. 2018. https://arxiv.org/abs/1812.05519
- 59. Kantapalli B, Markapudi BR. SSPO-DQN spark: shuffled student psychology optimization based deep Q network with spark architecture for big data classification. Wireless Netw. 2022;29(1):369–85.
- 60. Song C, Liu F, Huang Y, Wang L, Tan T. Auto-encoder based data clustering. In: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications: 18th Iberoamerican Congress, CIARP 2013, Havana, Cuba, November 20–23, 2013, Proceedings, Part I. 2013. p. 117–24.
- 61. Song C, Huang Y, Liu F, Wang Z, Wang L. Deep auto-encoder based clustering. Intelligent Data Analysis: An International Journal. 2014;18(6 Suppl):S65–76.
- 62. Feng Q, Chen L, Chen CLP, Guo L. Deep fuzzy clustering - a representation learning approach. IEEE Trans Fuzzy Syst. 2020:1.
- 63. Smits GF, Jordaan EM. Improved SVM regression using mixtures of kernels. In: Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN’02 (Cat. No.02CH37290). p. 2785–90. https://doi.org/10.1109/ijcnn.2002.1007589
- 64. Zeng X, Chen Y-W, Tao C. Feature selection using recursive feature elimination for handwritten digit recognition. In: 2009 Fifth International Conference on Intelligent Information Hiding and Multimedia Signal Processing. 2009. p. 1205–8. https://doi.org/10.1109/iih-msp.2009.145
- 65. Gu Q, Li Z, Han J. Generalized Fisher score for feature selection. arXiv preprint. 2012. https://arxiv.org/abs/1202.3725
- 66. Ho TK. The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Machine Intell. 1998;20(8):832–44.
- 67. Breiman L. Random forests. Machine Learning. 2001;45(1):5–32.
- 68. Biau G. Analysis of a random forests model. The Journal of Machine Learning Research. 2012;13(1):1063–95.
- 69. Ghaddar B, Naoum-Sawaya J. High dimensional data classification and feature selection using support vector machines. European Journal of Operational Research. 2018;265(3):993–1004.
- 70. Almomany A, Ayyad WR, Jarrah A. Optimized implementation of an improved KNN classification algorithm using Intel FPGA platform: Covid-19 case study. Journal of King Saud University - Computer and Information Sciences. 2022;34(6):3815–27.
- 71. Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000;403(6769):503–11. pmid:10676951
- 72. Armstrong SA, Staunton JE, Silverman LB, Pieters R, den Boer ML, Minden MD, et al. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat Genet. 2002;30(1):41–7. pmid:11731795
- 73. Chowdary D, Lathrop J, Skelton J, Curtin K, Briggs T, Zhang Y, et al. Prognostic gene expression signatures can be measured in tissues collected in RNAlater preservative. J Mol Diagn. 2006;8(1):31–9. pmid:16436632
- 74. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286(5439):531–7. pmid:10521349
- 75. Gordon GJ, Jensen RV, Hsiao L-L, Gullans SR, Blumenstock JE, Ramaswamy S, et al. Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res. 2002;62(17):4963–7. pmid:12208747
- 76. Bredel M, Bredel C, Juric D, Harsh GR, Vogel H, Recht LD, et al. Functional network analysis reveals extended gliomagenesis pathway maps and three novel MYC-interacting genes in human gliomas. Cancer Res. 2005;65(19):8679–89. pmid:16204036
- 77. Su AI, Welsh JB, Sapinoso LM, Kern SG, Dimitrov P, Lapp H, et al. Molecular classification of human carcinomas by use of gene expression signatures. Cancer Res. 2001;61(20):7388–93. pmid:11606367