Peer Review History

Original Submission: September 1, 2025
Decision Letter - Razieh Sheikhpour, Editor

Dear Dr. Senapati,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Dec 27 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Razieh Sheikhpour

Academic Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS One has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, we expect all author-generated code to be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. When completing the data availability statement of the submission form, you indicated that you will make your data available on acceptance. We strongly recommend all authors decide on a data sharing plan before acceptance, as the process can be lengthy and hold up publication timelines. Please note that, though access restrictions are acceptable now, your entire data will need to be made freely accessible if your manuscript is accepted for publication. This policy applies to all data except where public deposition would breach compliance with the protocol approved by your research ethics board. If you are unable to adhere to our open data policy, please kindly revise your statement to explain your reasoning and we will seek the editor's input on an exemption. Please be assured that, once you have provided your new statement, the assessment of your exemption will not hold up the peer review process.

If the reviewer comments include a recommendation to cite specific previously published works, please review and evaluate these publications to determine whether they are relevant and should be cited. There is no requirement to cite these works unless the editor has indicated otherwise.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

**********

Reviewer #1: This manuscript presents ISSBEC, a Spark-based ensemble classifier for high-dimensional data classification, incorporating min-max normalization, improved deep fuzzy clustering (IDFC) for partitioning, SVM-modified recursive feature elimination (SVM-MRFE) for selection, and an improved subspace-based ensemble with feature fusion random subspace (FF-RSS), mixed space enhancement (MSE), and diverse base classifiers to tackle the curse of dimensionality. Evaluated on datasets via metrics like accuracy and robustness, it claims superiority over state-of-the-art methods. While offering a scalable solution for overfitting and sparsity in big data environments, the work requires major revisions to address unclear novelty amid recent Spark-ensemble hybrids, incomplete methodological proofs, narrow experimental scope, and significant presentation flaws for PLOS ONE publication.

The manuscript is riddled with grammatical errors, repetitive phrasing (e.g., "high-dimensional data" overused without variation), and unclear sections with abrupt transitions, severely affecting readability; a thorough professional edit is mandatory to enhance clarity and academic tone.

The related work section lacks depth, ignoring 2024-2025 advancements like Spark-based ensemble for high-dimensional anomaly detection and adaptive subspace ensembles for imbalanced data; include a comparison table to highlight unique FF-RSS and MSE contributions.

The IDFC partitioning and SVM-MRFE selection are innovative but lack theoretical justification; provide proofs on convergence and optimality under high sparsity, comparing to standard fuzzy C-means or RFE variants.

FF-RSS and MSE mechanisms are vaguely described, risking instability in distributed environments; elaborate with detailed equations, pseudocode, and analyses of fusion impacts on ensemble diversity.

Parameter choices for clustering depth and subspace sizes are not analyzed; conduct sensitivity studies to show effects on performance metrics across varying dimensions.

Illustrative examples are absent; add toy datasets demonstrating partitioning, feature fusion, and ensemble steps to clarify the workflow.

Experimental datasets are standard but lack diversity (e.g., no real-time streaming or multi-modal data); test on broader benchmarks from recent Spark ML surveys to validate scalability.

Baseline comparisons are limited; incorporate recent methods like deep ensemble subspace clustering, with statistical tests (e.g., Friedman) over multiple runs for rigor.

The literature citations are inadequate, and related work on machine learning should be discussed, for example:

1. Artificial intelligence in multimodal learning analytics: a systematic literature review

2. NEDL-GCP: A nested ensemble deep learning model for Gynaecological cancer risk prediction

Reviewer #2: The manuscript presents an ensemble learning framework based on random subspaces integrated with feature selection and feature fusion using the Spark platform. The topic is relevant and potentially valuable for high-dimensional data analysis. However, the manuscript, in its current form, has several critical weaknesses in methodological clarity, experimental completeness, and presentation structure. These issues prevent the reader from fully understanding and evaluating the novelty and validity of the proposed approach.

Strengths:

- The paper addresses a significant problem: classification of high-dimensional data and scalability using distributed computing.

- The proposed integration of Spark with ensemble learning and subspace-based methods could be promising if properly justified and evaluated.

- The manuscript is written in generally clear language and follows a standard scientific format.

Major Weaknesses:

- Ambiguity in Feature Selection (SVM-MRFE):

The description of the hybrid feature selection method is unclear. It is not explained how the Modified Fisher Score and SVM-RFE are combined or how the ranking is performed. The choice of a backward elimination strategy (wrapper-based) introduces risks of local optima and poor search diversity.

Additionally, the use of a linear SVM as the wrapper classifier is questionable for non-linear data. The authors should justify this design choice and discuss alternatives such as metaheuristic feature selection algorithms or kernel-based SVMs.

- Unclear Purpose of Feature Fusion:

The rationale for performing feature fusion after feature selection is not convincing. Once the optimal subset of features has been selected, further dimensionality reduction using PCA may result in information loss. The authors should explain the purpose of this step, quantify the retained variance, and demonstrate its impact on performance.

- Weak Methodology Organization:

The methodology section is fragmented and repetitive. A clearer flow is needed—each stage should be fully described once, in logical sequence, rather than being introduced briefly and elaborated later.

- Simplistic Ensemble Voting Strategy:

The use of majority voting for combining classifiers is too simplistic and lacks justification. More robust approaches (e.g., weighted voting, stacking, or probability-based fusion) could lead to better performance, especially for multi-class scenarios. The authors should explain why this basic method was chosen and discuss its limitations.

- Incomplete Evaluation on Multi-Class Data:

The experiments are primarily limited to binary datasets. The multi-class evaluation is minimal, with only two datasets and incomplete comparisons. To substantiate the generalization capability of the proposed model, more multi-class datasets and comparative studies are required.

- Lack of Evaluation Procedure Details:

The paper does not specify how results were obtained (training/test split, cross-validation strategy, or prevention of overfitting). It is unclear whether reported metrics correspond to test or training results. A detailed evaluation methodology must be provided for reproducibility.

- Model Complexity vs. Performance Trade-off:

The proposed system is highly complex, combining several components (IDFC, SVM-MRFE, Feature Fusion, ISSBEC). However, Table 8 indicates only modest performance improvements over simpler methods. The authors should justify whether this level of complexity is necessary.

- Absence of a Discussion Section:

The manuscript lacks a Discussion section, which is essential to interpret results, highlight contributions, discuss limitations, and relate findings to existing literature.

Minor Comments:

- Improve figure references and algorithm readability (some symbols are not defined in context).

- Clarify parameter settings and computational complexity in the experimental section.

- Ensure consistency in notation (e.g., SVM-MRFE and MRFE are used interchangeably).

- Revise grammatical issues in long sentences for better readability.

**********

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Reviewer #1: No

Reviewer #2: Yes: Amin Hashemi

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

To ensure your figures meet our technical requirements, please review our figure guidelines: https://journals.plos.org/plosone/s/figures

You may also use PLOS’s free figure tool, NAAS, to help you prepare publication quality figures: https://journals.plos.org/plosone/s/figures#loc-tools-for-figure-preparation.

NAAS will assess whether your figures meet our technical requirements by comparing each figure against our figure specifications.

Revision 1

Reviewer #1:

Query 1. The manuscript is riddled with grammatical errors, repetitive phrasing (e.g., "high-dimensional data" overused without variation), and unclear sections with abrupt transitions, severely affecting readability; a thorough professional edit is mandatory to enhance clarity and academic tone.

Reply: As suggested, the manuscript has been thoroughly revised for grammar, clarity, and academic tone.

Query 2. The related work section lacks depth, ignoring 2024-2025 advancements like Spark-based ensembles for high-dimensional anomaly detection and adaptive subspace ensembles for imbalanced data; include a comparison table to highlight unique FF-RSS and MSE contributions.

Reply: As per the suggestion, the literature review section has been improved. Further, a comparison of the proposed approach with the recent state-of-the-art is presented in Table 1 on Page 5.

Query 3. The IDFC partitioning and SVM-MRFE selection are innovative but lack theoretical justification; provide proofs on convergence and optimality under high sparsity, comparing to standard fuzzy C-means or RFE variants.

Reply: As per the reviewer's suggestion, the rationale for using IDFC has been added to the first paragraph of the section “Data partitioning in Master Node” (Page 7), and the rationale for using SVM-MRFE to the subsection “Feature Selection by Modified Recursive Feature Elimination” (Pages 9-10). The convergence justification is provided at the end of “Data partitioning in Master Node” (Page 9).

Query 4. FF-RSS and MSE mechanisms are vaguely described, risking instability in distributed environments; elaborate with detailed equations, pseudocode, and analyses of fusion impacts on ensemble diversity.

Reply: As per the reviewers' suggestion, we have updated the FF-RSS and MSE subsections, available on Pages 12 and 13.

Query 5. Parameter choices for clustering depth and subspace sizes are not analysed; conduct sensitivity studies to show effects on performance metrics across varying dimensions.

Illustrative examples are absent; add toy datasets demonstrating partitioning, feature fusion, and ensemble steps to clarify the workflow.

Reply: The justification of the parameter choice for clustering depth is now presented in the revised manuscript (Pages 16-18). Further, as advised, we have conducted a sensitivity analysis, shown in Figure 2 on Page 18.

Query 6. Experimental datasets are standard but lack diversity (e.g., no real-time streaming or multi-modal data); test on broader benchmarks from recent Spark ML surveys to validate scalability.

Reply: The proposed model has also been tested on multi-class datasets. The experimental results with baseline comparisons are presented in Table 8 of the revised manuscript.

Query 7. Baseline comparisons are limited; incorporate recent methods like deep ensemble subspace clustering, with statistical tests (e.g., Friedman) over multiple runs for rigour.

Reply: As per the reviewer's suggestion, we have included recent baseline models in the comparison with our proposed model; see Table 4 on Page 17.

Query 8. The Literature citation is not adequate, and the related work to machine learning should be discussed “1.Artificial intelligence in multimodal learning analytics: a systematic literature review, 2.NEDL-GCP: A nested ensemble deep learning model for Gynaecological cancer risk prediction”.

Reply: As per the recommendation, the literature section has been updated accordingly.

Reviewer #2:

Query 1. The description of the hybrid feature selection method is unclear. It is not explained how the Modified Fisher Score and SVM-RFE are combined or how the ranking is performed. The choice of a backward elimination strategy (wrapper-based) introduces risks of local optima and poor search diversity.

Reply: The proposed work uses the SVM-MRFE approach for selecting the optimal features. SVM-MRFE is an improved version of the conventional SVM-RFE approach: it ranks each feature using the modified Fisher score given in Eq. (14), and the feature with the smallest ranking value is eliminated at each iteration. The detailed description has been added in the section “Feature Selection by Modified Recursive Feature Elimination” (Pages 9-10 of the revised manuscript).
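For readers unfamiliar with the recursive elimination loop described in this reply, the following is a minimal illustrative sketch. It is not the authors' SVM-MRFE: Eq. (14) is not reproduced in this history, so a standard two-class Fisher score stands in for the modified score, and the loop simply drops the lowest-ranked feature each iteration.

```python
import numpy as np

def fisher_scores(X, y):
    """Standard two-class Fisher score per feature: (mu1 - mu2)^2 / (var1 + var2).
    A stand-in for the paper's modified Fisher score (Eq. (14))."""
    X1, X2 = X[y == 0], X[y == 1]
    num = (X1.mean(axis=0) - X2.mean(axis=0)) ** 2
    den = X1.var(axis=0) + X2.var(axis=0) + 1e-12
    return num / den

def recursive_elimination(X, y, n_keep):
    """Backward elimination: repeatedly drop the feature with the smallest
    ranking value until n_keep features remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        scores = fisher_scores(X[:, remaining], y)
        worst = int(np.argmin(scores))  # smallest ranking value is eliminated
        remaining.pop(worst)
    return remaining

# Tiny demo: feature 0 separates the classes, feature 1 is pure noise.
rng = np.random.default_rng(0)
X = np.column_stack([
    np.r_[np.zeros(20), np.ones(20)] + 0.05 * rng.standard_normal(40),
    rng.standard_normal(40),
])
y = np.r_[np.zeros(20, int), np.ones(20, int)]
print(recursive_elimination(X, y, n_keep=1))  # -> [0]
```

The actual SVM-MRFE additionally retrains an SVM wrapper per iteration; this sketch only shows the ranking-and-elimination skeleton.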

Query 2. Additionally, the use of a linear SVM as the wrapper classifier is questionable for non-linear data. The authors should justify this design choice and discuss alternatives such as metaheuristic feature selection algorithms or kernel-based SVMs.

Reply: As per the suggestion, the rationale for choosing the SVM classifier has been discussed theoretically in the section “Importance of Base Classifiers” (Page 13). Further, each component is validated through the ablation analysis in Table 5 (Page 18), where a variant that omits SVM-MRFE and uses conventional RFE shows lower performance than the proposed ISSBEC approach.

Query 3. The rationale for performing feature fusion after feature selection is not convincing. Once the optimal subset of features has been selected, further dimensionality reduction using PCA may result in information loss. The authors should explain the purpose of this step, quantify the retained variance, and demonstrate its impact on performance.

Reply: Feature selection reduces dimensionality by retaining only the most informative features, but redundancy may still exist among those selected. Feature fusion combines the selected features into a compact form, reducing redundancy and improving the stability and robustness of the classifier; this can enhance generalization and classification performance. The importance of applying feature fusion after feature selection is now discussed theoretically on Page 11, and an ablation analysis has been conducted to confirm this.
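The reviewer asked the authors to quantify retained variance after PCA-based fusion. As a hedged illustration of how that quantification works in general (not the authors' exact pipeline), the sketch below projects redundant selected features onto the fewest principal components that keep a target fraction of the variance, and reports the fraction actually retained.

```python
import numpy as np

def pca_fuse(X, var_target=0.95):
    """Project centered data onto the smallest number of principal components
    retaining at least `var_target` of total variance. Returns the fused
    features and the variance fraction actually retained."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratios = (s ** 2) / np.sum(s ** 2)          # per-component variance share
    cum = np.cumsum(ratios)
    k = int(np.searchsorted(cum, var_target) + 1)  # first k reaching target
    return Xc @ Vt[:k].T, float(cum[k - 1])

# Demo: six "selected" features that are noisy mixtures of only three signals,
# i.e. redundant in exactly the way the reply describes.
rng = np.random.default_rng(1)
base = rng.standard_normal((100, 3))
X = base @ rng.standard_normal((3, 6)) + 0.01 * rng.standard_normal((100, 6))
Z, retained = pca_fuse(X, 0.95)
print(Z.shape[1], round(retained, 3))  # few components, >= 95% variance kept
```

Reporting `retained` alongside the fused dimensionality is one concrete way to address the reviewer's information-loss concern.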

Query 4. The methodology section is fragmented and repetitive. A clearer flow is needed—each stage should be fully described once, in logical sequence, rather than being introduced briefly and elaborated later.

Reply: The methodology section has been reorganized accordingly in the revised manuscript.

Query 5. The use of majority voting for combining classifiers is too simplistic and lacks justification. More robust approaches (e.g., weighted voting, stacking, or probability-based fusion) could lead to better performance, especially for multi-class scenarios. The authors should explain why this basic method was chosen and discuss its limitations.

Reply: As per the reviewer’s suggestion, the rationale for choosing the majority voting approach has been added at the end of the section “Importance of Base Classifiers” (Pages 13-15 of the revised manuscript). Unlike ensemble fusion methods such as weighted voting, stacking, or probability-based fusion, majority voting is simple and robust, requires no extra training, and is advantageous in high-dimensional data settings.
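The plurality rule the reply defends can be sketched in a few lines; this is a generic illustration, not the authors' code, and the deterministic smallest-label tie-break is an assumption of this sketch.

```python
import numpy as np
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label predictions (n_classifiers x n_samples)
    by plurality; ties are broken by the smallest label for determinism."""
    predictions = np.asarray(predictions)
    fused = []
    for col in predictions.T:                    # one column per sample
        counts = Counter(col)
        # max by (count, -label): highest count wins, then smallest label
        label, _ = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
        fused.append(label)
    return np.array(fused)

# Three base classifiers disagree on sample 2; the majority label wins.
votes = [[0, 1, 2],
         [0, 1, 2],
         [0, 0, 1]]
print(majority_vote(votes).tolist())  # -> [0, 1, 2]
```

Its simplicity is exactly the trade-off the reviewer raised: no per-classifier weights are learned, which is cheap and robust but ignores differing base-classifier reliability.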

Query 6. The experiments are primarily limited to binary datasets. The multi-class evaluation is minimal, with only two datasets and incomplete comparisons. To substantiate the generalization capability of the proposed model, more multi-class datasets and comparative studies are required.

Reply: The proposed model confirms its suitability for binary classification. We also verified the approach on multi-class datasets and found only a minor difference against some baselines such as CESE (Table 8). We plan to extend the framework to multi-class classification with improved accuracy and lower complexity in future work.

Query 7. The paper does not specify how results were obtained (training/test split, cross-validation strategy, or prevention of overfitting). It is unclear whether reported metrics correspond to test or training results. A detailed evaluation methodology must be provided for reproducibility.

Reply: As per the reviewer's suggestion, the results section has been updated with corresponding information, and it can be found on Page 15 of the revised manuscript.

Query 8. The proposed system is highly complex, combining several components (IDFC, SVM-MRFE, Feature Fusion, ISSBEC). However, Table 8 indicates only modest performance improvements over simpler methods. The authors should justify whether this level of complexity is necessary.

Reply: Table 8 is now Table 9 in the revised manuscript. Through this table we clearly demonstrate the complexity comparison between the non-Spark and Spark-based frameworks. The detailed description is presented in the Complexity Analysis section (Pages 20-21).

Query 9. The manuscript lacks a Discussion section, which is essential to interpret results, highlight contributions, discuss limitations, and relate findings to existing literature.

Reply: As per the reviewer’s suggestion, the Discussion section has been included in the revised manuscript, which can be found on Page 21.

Query 10. Improve figure references and algorithm readability (some symbols are not defined in context).

Reply: As per the reviewer’s suggestion, we have improved the figure and algorithm references, and all variables are now defined in context in the revised manuscript.

Query 11. Clarify parameter settings and computational complexity in the experimental section.

Reply: The results section has been suitably revised. Further, the updated computational complexity analysis can be found on Pages 20-21.

Query 12. Ensure consistency in notation (e.g., SVM-MRFE and MRFE are used interchangeably).

Reply: As per the reviewer's suggestion, the terminologies have been used consistently in the revised manuscript.

Query 13. Revise grammatical issues in long sentences for better readability.

Reply: As per the reviewers’ direction, the manuscript has been revised accordingly.

Attachments
Submitted filename: Response to Reviewers.pdf
Decision Letter - Razieh Sheikhpour, Editor

Random Subspace-Based Ensemble Classifier for High-Dimensional Data Using SPARK

PONE-D-25-47457R1

Dear Dr. Senapati,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager and clicking the ‘Update My Information’ link at the top of the page. For questions related to billing, please contact billing support.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Razieh Sheikhpour

Academic Editor

PLOS One

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #3: Yes

**********

Reviewer #3: This is the revised version of the manuscript that was previously submitted by the authors. After thorough consideration of the changes made in this version, I have concluded that the revisions have addressed the concerns and suggestions raised during the initial review process. The improvements made enhance the overall quality and clarity of the manuscript, making it a valuable contribution to the field. Therefore, I would like to recommend this paper for publication.

**********

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Reviewer #3: No

**********

Formally Accepted
Acceptance Letter - Razieh Sheikhpour, Editor

PONE-D-25-47457R1

PLOS One

Dear Dr. Senapati,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS One. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

You will receive further instructions from the production team, including instructions on how to review your proof when it is ready. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few days to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Razieh Sheikhpour

Academic Editor

PLOS One

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio .