Peer Review History

Original Submission: March 21, 2021
Decision Letter - Saeed Mian Qaisar, Editor

PONE-D-21-09268

Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling)

PLOS ONE

Dear Dr. Lötsch,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jul 16 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Saeed Mian Qaisar, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

  1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

  2. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

  3. Please remove your figures from within your manuscript file, leaving only the individual TIFF/EPS image files, uploaded separately. These will be automatically included in the reviewers’ PDF.

  4. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

Dear Authors,

Reviewers have now commented on your paper. They are advising that you revise your manuscript. If you are prepared to undertake the work required, I would be pleased to reconsider my decision.

The reviewer comments can be found at the end of this email or can be accessed online.

While revising your paper please also consider the following points.

1. It is recommended to add a workflow diagram at the beginning of the "Methods" section that shows the different processing stages in the system.

2. It is recommended that the authors submit the studied data sets as a supplementary zip file in .csv or .xls format when submitting the revised version of the paper.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors present their proposal of a downsampling method of large biomedical data sets. The method, experiments and results are described clearly.

I have a few comments and questions that should be answered in the revised version. The datasets used are rather similar in the number of features and in the basic characteristics of individual examples.

How would the method work for a larger number of features (e.g. more than 50)?

Is feature selection necessary before starting the method?

What procedure would the authors recommend for outlier handling? In many cases outliers represent rare cases and not noise. Thus they should not be removed from the data set.

It would be welcome if the authors formulate a recommendation, for which type of data the proposed method is suitable.

Is it applicable to time series or signals in which we search for certain patterns?

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Revision 1

Dear Professor Qaisar,

Thank you very much for having handled our manuscript PONE-D-21-09268 entitled “Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling)” and considering it for publication in PLOS ONE after minor revision.

We would like to thank the reviewers for their helpful comments, which we have addressed as follows. Specifically, the reviewers’ comments are given in black Calibri font and our responses are given below each point in red Times Roman font.

Comments to the Author

Reviewer #1: The authors present their proposal of a downsampling method of large biomedical data sets. The method, experiments and results are described clearly.

I have a few comments and questions that should be answered in the revised version. The datasets used are rather similar in the number of features and in the basic characteristics of individual examples.

How would the method work for a larger number of features (e.g. more than 50)?

We have added a higher-dimensional data set with d = 184 features as data set #6, described in the methods section as “The sixth data set was again of biomedical origin and consisted of a "miRNA and chronic pain" data set containing measurements of 184 microRNAs in 94 instances grouped into d = 5 classes of sizes [14, 24, 18, 17, 21]. The data set is freely available under the Creative Commons license CC BY 4.0 at https://data.mendeley.com/datasets/37fnjc4yhm/2 (downloaded on June 14, 2021). Of its two versions, the "Profiling raw data.xlsx" file dated February 19, 2020 was selected [18].”

Is feature selection necessary before starting the method?

We would like to thank the reviewer for this comment. At an earlier stage of the project, we had thought along the same lines and implemented a feature selection option in the R library. At later stages, we hesitated to keep it in the final version and removed it. The reviewer's comment encouraged us to add it back as an option in the library, and we are grateful for that. The option is now described in the "Implementation" chapter of the "Methods" section as “Optionally, the library also provides (vii) a fast PCA-based feature selection ("PCAimportance") to exclude variables from the assessment of similarity between the distributions of the downsampled data and the original data, to avoid unnecessarily optimizing the sample for irrelevant variables. This feature is disabled by default. The standard "prcomp" method implemented in base R [2] is used, and variable selection is based on the loadings on relevant principal components according to the Kaiser-Guttman criterion, i.e., on PCs with eigenvalues > 1. Specifically, the relevant variables are selected as suggested in the R library "factoextra" (https://cran.r-project.org/package=factoextra [29, 30]) based on the expected value if the contributions were uniform, which is given as 100/length(contrib) with "contrib" denoting the list of contribution magnitudes of each variable to a given PC.”

In the discussion, we comment on this option and on the reasons why we first hesitated to include it: “In addition, the R implementation provides feature selection via PCA with the idea that only relevant features should contribute to the selection of a representative data subset among many possible uniformly drawn subsets. If there is doubt that PCA projection provides an adequate perspective on the data set to select its relevant variables, and alternatives such as independent component analysis or other feature selection methods are preferable, it is recommended that feature selection be addressed only at a later stage of data analysis; therefore, the library’s feature is disabled by default. In addition, applying complex statistical procedures to the entire data set could run counter to the goal of the present method, which is to shrink a data set before any further data processing because its size exceeds computer capacity. In the present experiments, the use of feature selection produced no clear improvements, so the corresponding tests are not reported in detail.”
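For readers of this review history, the selection rule quoted above can be illustrated with a minimal sketch. It is written in Python rather than R and is not the library's code. The function name `select_variables`, the toy eigenvalues and loadings, and the choice to keep a variable as soon as it exceeds the threshold on at least one retained PC are all assumptions made for illustration. A variable's contribution to a PC is taken here as its squared loading expressed as a percentage of the sum of squared loadings.

```python
def select_variables(eigenvalues, loadings, names):
    """Keep variables that contribute more than the uniform expectation
    to at least one PC retained by the Kaiser-Guttman criterion.

    eigenvalues[k] is the eigenvalue of PC k; loadings[k][j] is the
    loading of variable j on PC k. Both are assumed to come from a PCA
    that was computed elsewhere (hypothetical inputs).
    """
    expected = 100.0 / len(names)  # contribution (%) if all were equal
    selected = set()
    for eigenvalue, pc in zip(eigenvalues, loadings):
        if eigenvalue <= 1.0:  # Kaiser-Guttman: use only eigenvalues > 1
            continue
        total = sum(w * w for w in pc)
        for name, w in zip(names, pc):
            contribution = 100.0 * w * w / total
            if contribution > expected:
                selected.add(name)
    return [name for name in names if name in selected]


# Toy example with made-up numbers: four variables, three PCs.
names = ["a", "b", "c", "d"]
eigenvalues = [2.5, 1.2, 0.3]
loadings = [
    [0.8, 0.6, 0.0, 0.0],   # PC1: dominated by a and b
    [0.1, 0.0, 0.99, 0.1],  # PC2: dominated by c
    [0.0, 0.0, 0.0, 1.0],   # PC3: eigenvalue <= 1, so it is ignored
]
kept = select_variables(eigenvalues, loadings, names)
```

With four variables the uniform expectation is 25 %, so in this toy example "d" is dropped because its only large contribution is to a PC that the eigenvalue criterion discards.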

A new version of the R library including this feature has been uploaded to CRAN.

What procedure would the authors recommend for outlier handling? In many cases outliers represent rare cases and not noise. Thus they should not be removed from the data set.

We thank the reviewer for the comment, as it points out a detail to which we had not given the necessary attention. Since we were asked to make a minor revision, we felt it was inappropriate to change the structure of the report and have included the response to the reviewer's question in the discussion, including the insertion of an additional Figure 3: “Small classes or outliers in a data set are a special case. If the class information is present, the method samples the defined fraction of the data from each class, as demonstrated with data set #1, which contains a moderately small class of 10 % of the cases. However, prior class information is often incomplete or non-existent, and detecting a class structure in a data set is only the task of later data analysis with unsupervised methods. To further discuss the behavior of the proposed downsampling method in such settings, the resampling experiment of data set #1 was repeated with the class information omitted, i.e., all instances were assigned to a single class #1. The results indicated that small hidden classes were likely to be adequately represented in the downsampled data (Figure 3 A and D). In another experiment, outliers were added as 10 consecutive integer numbers starting at x = 15, which were far to the right of the maximum of the data at x = 7.169319. Application of the downsampling algorithm resulted in consistent omission of these added outliers (Figure 3 E), while in the experiments where the first random sample was always taken without further control, some outliers were captured (Figure 3 B). This could be reproduced with the advanced downsampling method when the outliers were assigned to a separate class (Figure 3 C and F). So, if outliers are suspected, this procedure can be used. Of course, the researcher must choose a valid outlier definition, which might only be possible after adequate data transformation. However, these considerations are standard in data science and are not part of the present proposal for improving uniform downsampling.”
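The behavior described in this response can be sketched in a few lines. The following Python sketch is only an illustration under stated assumptions, not the opdisDownsampling implementation: it handles a single numeric variable, it uses the Kolmogorov-Smirnov distance between empirical distributions as the similarity measure (the response does not name the measure used), and the function names, the number of trials, and the toy data are hypothetical. Within each class it draws several uniform subsets and keeps the one closest to the full class data, so outliers placed in their own class are always represented.

```python
import random
from bisect import bisect_right
from collections import defaultdict


def ks_distance(sample, reference):
    """Largest gap between the empirical CDFs of two 1-D samples."""
    a, b = sorted(sample), sorted(reference)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def best_uniform_subset(values, size, trials, rng):
    """Draw `trials` uniform subsets; keep the one closest to `values`."""
    best, best_d = None, float("inf")
    for _ in range(trials):
        candidate = rng.sample(values, size)
        d = ks_distance(candidate, values)
        if d < best_d:
            best, best_d = candidate, d
    return best


def downsample(values, classes, fraction, trials=50, seed=0):
    """Sample the given fraction from every class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for v, c in zip(values, classes):
        by_class[c].append(v)
    result = {}
    for c in sorted(by_class):
        members = by_class[c]
        size = max(1, round(fraction * len(members)))
        result[c] = best_uniform_subset(members, size, trials, rng)
    return result


# Toy data (made up): 200 bulk values plus 10 consecutive integers
# starting at 15 as outliers, assigned to a class of their own.
data_rng = random.Random(1)
bulk = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
outliers = [float(x) for x in range(15, 25)]
values = bulk + outliers
classes = ["bulk"] * 200 + ["outlier"] * 10
subset = downsample(values, classes, fraction=0.2)
```

If all instances were instead assigned to a single class, a subset chosen for its similarity to the bulk distribution would have no guarantee of containing any of the ten outliers, which is in line with the omission reported for Figure 3 E.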

It would be welcome if the authors formulate a recommendation, for which type of data the proposed method is suitable. Is it applicable to time series or signals in which we search for certain patterns?

We do not think that time series should be downsampled using our method. To respond to the reviewer’s questions, we have added to the discussion: “However, these considerations are standard in data science and are not part of the present proposal for improving uniform downsampling. In fact, if downsampling, and especially uniform downsampling in which every data point has the same chance of being drawn, is contraindicated, the presented method is obviously not applicable either. The proposed method is best suited for independent, numerical, tabular data. For sequential data, such as continuous Markov processes, reducing the number of data points can lead to information loss, since the respective values depend on their predecessors.”

Again, we would like to thank you for having processed the manuscript and you and the reviewers for many helpful suggestions to improve it. We hope that the manuscript now meets the criteria for publication in PLoS One and are very much looking forward to hearing back from you.

Sincerely yours

Jörn Lötsch and Sebastian Malkusch

Attachment
Submitted filename: ReplyLetter.docx

Decision Letter - Saeed Mian Qaisar, Editor

Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling)

PONE-D-21-09268R1

Dear Dr. Lötsch,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Saeed Mian Qaisar, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Dear Authors,

I am pleased to tell you that your work has now been accepted for publication in the PLOS ONE Journal.

Thank you for submitting your work to this journal.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The manuscript has been significantly improved and all questions and comments adequately addressed. I have no additional comments or suggestions.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Formally Accepted
Acceptance Letter - Saeed Mian Qaisar, Editor

PONE-D-21-09268R1

Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling)

Dear Dr. Lötsch:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Saeed Mian Qaisar

Academic Editor

PLOS ONE

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.