Peer Review History
| Original Submission: November 18, 2024 |
|---|
Dear Dr. Gevaert,

We are pleased to inform you that your manuscript 'Reliability-Enhanced Data Cleaning in Biomedical Machine Learning Using Inductive Conformal Prediction' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be coordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Anders Wallqvist
Academic Editor
PLOS Computational Biology

Mark Alber
Section Editor
PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:
Please note here if the review is uploaded as an attachment.

Reviewer #1: This manuscript describes a method to clean noisy training data for training a classifier. The method was tested on three biomedical datasets of different modalities. The results show that the method improved the classifier's performance in terms of AUROC and AUPRC compared to applying no cleaning.
There are some minor issues that should be addressed before the manuscript can be published:

1. The assumptions behind the method should be elaborated further instead of claiming that it has been "validated" in previous work (lines 213, 272). Some assumptions were discussed in Section 4.1, but those concern ICP, not the shrunken centroid used to estimate reliability. A centroid-based method seems to assume that each class distribution is spherical or elliptical.

2. It is not clear why different visualizations were provided for different datasets. Fig 8 nicely shows how many data points the cleaning method corrected compared to the number of wrongly labeled data points, but this is shown only for DILI, not for the other two datasets.

3. There are some minor English problems in lines 568 and 578, Section 4.3. It is models that overfit data, not data that overfit; similarly, it is models that are biased, not data. Data may not be representative of the whole sample space. Those sentences should be revised.

4. Given that the classification performance did not suffer even though the method made plenty of incorrect corrections, this raises the question of whether the datasets and tasks are so "easy" that even a strong cleaning baseline would lead to an improvement. A comparison to a strong baseline (say, one of the previous works) may help answer whether this is the case.

Reviewer #2: This study introduces a novel approach to data cleaning for machine learning in the biomedical field, utilizing the inductive conformal prediction (ICP) method. The method employs a calibration set to identify and correct mislabeled data and outliers within large, noisy datasets. The approach was validated across three biomedical tasks: the filtering of drug-induced liver injury (DILI) literature, the prediction of ICU admission for patients with SARS-CoV-2, and the classification of breast cancer subtypes using RNA-seq data.
The outcomes of this study demonstrate significant enhancements in classification accuracy, AUROC, and F1 scores, even in the presence of substantial noise. The method provides a pragmatic solution for enhancing model performance without requiring extensive low-noise data or strict assumptions. Overall, this is a solid paper with great potential. With minor refinements, it could better showcase the novelty and practical implications of the proposed method.

1. While the results are comprehensive, some figures (e.g., Figures 2–6) are dense and could be streamlined. Highlighting key trends or conclusions directly in the figure captions would help readers focus on the most important findings.

2. Although data and code availability is mentioned, providing more detailed replication instructions, such as preprocessing steps or hyperparameter-tuning details, would enhance usability for researchers.

3. Certain parts of the text, especially in the abstract and introduction, are overly verbose and could be simplified to make the paper more accessible.

4. While the paper is technical, it should provide brief, reader-friendly explanations of terms like "nonconformity measures" and "calibration sets" to make it accessible to interdisciplinary audiences.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code — e.g. participant privacy or use of data from a third party — those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Wenhao Ouyang
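The reviews above refer to a "calibration set" and "nonconformity measures" without defining them. As a purely illustrative sketch (not the authors' implementation), inductive conformal prediction can flag a suspect label as follows; here the nonconformity score is a plain distance to the class centroid, a simplification of the shrunken-centroid score Reviewer #1 discusses, and all names and data are invented for illustration:

```python
# Illustrative ICP sketch for label reliability. NOT the authors' code:
# the nonconformity measure is a plain centroid distance, and the data
# are synthetic. Low p-value for a label => the label looks unreliable.
import numpy as np

def class_centroids(X, y):
    """Mean feature vector per class (basis for the toy nonconformity score)."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def nonconformity(x, centroid):
    """How 'strange' x looks for a class: Euclidean distance to its centroid."""
    return np.linalg.norm(x - centroid)

def icp_p_value(x, label, X_cal, y_cal, centroids):
    """Fraction of same-label calibration points at least as nonconforming
    as x (with the usual +1 smoothing). Small p-value => likely mislabeled."""
    cal_scores = [nonconformity(xi, centroids[label])
                  for xi in X_cal[y_cal == label]]
    score = nonconformity(x, centroids[label])
    return (sum(s >= score for s in cal_scores) + 1) / (len(cal_scores) + 1)

# Toy data: two well-separated classes around (0, 0) and (8, 8).
rng = np.random.default_rng(0)
X_cal = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
y_cal = np.array([0] * 50 + [1] * 50)
cents = class_centroids(X_cal, y_cal)

# A point near class 0 that is (wrongly) labeled 1 gets a tiny p-value
# for label 1 and a large p-value for label 0, suggesting a relabel.
x = np.array([0.1, 0.2])
p_wrong = icp_p_value(x, 1, X_cal, y_cal, cents)
p_right = icp_p_value(x, 0, X_cal, y_cal, cents)
```

In this sketch, cleaning would mean flagging (or relabeling) training points whose assigned label has a p-value below some threshold; the shrunken-centroid variant additionally shrinks per-class centroids toward the overall centroid to damp noisy features.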
| Formally Accepted |
PCOMPBIOL-D-24-02002
Reliability-Enhanced Data Cleaning in Biomedical Machine Learning Using Inductive Conformal Prediction

Dear Dr. Gevaert,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon receive a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Anita Estes
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom
ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
Open letter on the publication of peer review reports
PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.
We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.
Learn more at ASAPbio.