Peer Review History
| Original Submission, October 24, 2023 |
|---|
|
PONE-D-23-34859
MMM and MMMSynth: Clustering of heterogeneous tabular data, and synthetic data generation
PLOS ONE

Dear Dr. Siddharthan,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Mar 01 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:
If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,
Mohd Amril Nurman Mohd Nazir
Academic Editor
PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

2. Please update your submission to use the PLOS LaTeX template. The template and more information on our requirements for LaTeX submissions can be found at http://journals.plos.org/plosone/s/latex.

3. Thank you for uploading your study's underlying data set. Unfortunately, the repository you have noted in your Data Availability statement does not qualify as an acceptable data repository according to PLOS's standards. At this time, please upload the minimal data set necessary to replicate your study's findings to a stable, public repository (such as figshare or Dryad) and provide us with the relevant URLs, DOIs, or accession numbers that may be used to access these data. For a list of recommended repositories and additional information on PLOS standards for data deposition, please see https://journals.plos.org/plosone/s/recommended-repositories.

4. We notice that there is an MIT license on your data. We would encourage you to consider using a license that is no more restrictive than CC BY, in line with PLOS’ recommendation on licensing (http://journals.plos.org/plosone/s/licenses-and-copyright).

5. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

Please make all the revisions as suggested by reviewers.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?
The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #1: Partly
Reviewer #2: Yes
Reviewer #3: Yes
**********

2. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes
**********

3. Have the authors made all data underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file).
The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.
Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes
**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?
PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes
**********

5. Review Comments to the Author
Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The present manuscript by Kumari and co-authors introduces a new clustering method (named MMM) and a new method for the generation of synthetic data (named MMMSynth). The manuscript presents the mathematical formulations for their methods introducing the reader to the underlying concept of the proposed method. The performance (from the perspective of two metrics) of the proposed clustering method has been evaluated against other five clustering algorithms on three synthetically generated datasets with varying values of standard deviations (decreasing and increasing the difficulty of clustering).
Moreover, the authors have validated their method (in three variants, with two approximations for the number of clusters and the true number of clusters) in comparison to five other clustering algorithms (that have been given the true number of clusters) on multiple datasets that have binary (seven datasets) and multiclass (eight datasets) labels from the perspective of a performance metric.

Major points:

Section 3.1.1: As you mention, the increase of the δσ on the x-axis will result in more similar clusters, nevertheless, all methods are becoming worse for Figure 2 panels A and D with higher values of δσ (that indicate more similar clusters). This is not discussed in the manuscript. The performance metric choice changing across different data sets (Figure 2 panel A vs panels B,C,D) is again not discussed. I would recommend either keeping ARI across all as the normalized accuracy appears just in Figure 2A or extending all analyses to include both metrics.

Section 2.6: Multiple perspectives are offered from multiple metrics, such as Adjusted Mutual Information/V-Measure, Purity, Fowlkes-Mallows. And even from internal metrics such as Silhouette, Davies-Bouldin, Calinski-Harabasz. This section has two sentences at this moment, I would recommend extending and discussing the choice of metrics as the normalized accuracy is used in a single analysis.

Overall, the figures are hardly discussed/interpreted in the manuscript, they are only referenced. I would recommend extending the manuscript with an interpretation of the results.

Figures 2 and 3 could be improved, as the same methods are used, you could have the same legends (the addition of sklearn in the names of Figure 3 is not relevant as they are exactly the same methods as in figure 2). From Figure 2 to Figure 3, the name "MMM, true nClust" has been changed into "MMM (fixed)", making the manuscript harder to read. As a question, why is Figure 3 lacking the MMM, HMβ option? This is not discussed either.
Figure 3 could have “ARI” added as the x-axis label.

Regarding the claims in section 4. Discussion, that the proposed methods outperform others:
• In Figure 3 panel A, GMM actually shows a better performance than the proposed method (even when the proposed method is given the true number of clusters) for all datasets.
• The results from Figure 4 do not exactly show this. TVAE might even outperform the proposed method on average for RandomForest and GC is a close contender as well. By averaging the performances from Figure 4, panel A, it seems that TVAE has overall a higher value. For panel B, the proposed method will probably have a slightly higher value but not by much, I would estimate around 3%.

I would recommend more analyses, as at this point, the proposed generation method does not seem to bring a considerable improvement in comparison to TVAE.

Minor points:

Section 3.1: As far as I understand δσ will result in more dispersed clusters as it changes the standard deviation of the cluster. Thus, by similar do you mean that they will have more similar densities?

Section 2.8: You specify that the first list of datasets have binary output variables. But the abalone dataset has an integer value in the number of rings which is not binary. At least that is the case of the abalone dataset shown in the link.

Section 2.3: The mathematical formulations seem sound, although I have a question, shouldn’t it be μn in equation (5)?

Section 2.3: In the first paragraph, there is a period added after reference [13], while the sentence continues after it.

Section 2.7.1: In the second paragraph, “available” written as “availbale”.

Section 4: The GC acronym has been defined above, yet it is not used in the 4. Discussion section.

I would recommend extending the github code. It is hard to use with the only output available as a file of labels.
A plot (2D/3D through PCA) of the result of clustering a synthetically generated dataset in comparison to another clustering method (such as K-Means), would be a helpful addition. At this point, validation of the code requires additional code to be written by others.

Reviewer #2: MMM and MMMSynth: Clustering of heterogeneous tabular data, and synthetic data generation

The manuscript ‘MMM and MMMSynth: Clustering of heterogeneous tabular data, and synthetic data generation’ is recommended for minor correction. The manuscript requires to address in the following aspects:

1. Introduction section is not followed the standard. This should be rewritten.
2. Section 2.3 is not clear, not identified why the equations are require for this method?
3. In section 4, equation 14 is your own expression, If not please use the proper reference. Please use references for other section when use equations.
4. TI method, abbreviation?
5. The manuscript needs to proofread by Authorised affiliations.
6. Conclusion needs to rewrite has a lack of summarise of findings.
7. You may include similar additional references:
i) Review on the Evaluation and Development of Artificial Intelligence for COVID-19 Containment https://doi.org/10.3390/s23010527;
ii) Bio-activity prediction of drug candidate compounds targeting SARS-Cov-2 using machine learning approaches, https://doi.org/10.1371/journal.pone.0288053 ;
iii) Evaluating the Brexit and COVID-19’s influence on the UK economy: A data analysis https://doi.org/10.1371/journal.pone.0287342;
8. I recommend to use the comparison with existing models/methods.

Reviewer #3: Summary: In this work, authors proposed novel methods to cluster heterogeneous tabular data and generate synthetic datasets. Based on the likelihood for clustering heterogeneous datasets, the proposed method employed EM algorithm to derive accurate clustering results. This work would be interesting for related research field and the manuscript is well-written as well.
However, there are few comments to enhance the quality of the manuscript. Please check out the following comments:

Major:

1. It would be recommended to introduce related works (or literature review) in the introduction section. Although the introduction section well describes the background of the proposed work, if the related works or publications are introduced, it helps potential readers in making a follow-up study (or extension) of the proposed work.

2. Brief description of the proposed work may not appropriate in the introduction section. “Here we propose an algorithm, which we call the Madras Mixture Model (MMM),……..Eq (1). ……….. . Our performance in many cases approaches the quality of prediction from training on real data.” It would be recommended to make a new section such as “overview” to describe the compact explanation of the proposed work.

3. In the section 2.1: “If there are missing data, they should first be interpolated or imputed via a suitable method” It would be good to provide more detailed description of interpolation or imputation methods if there are missing data. If the proposed method cannot handle the datasets including missing information, please clearly discuss its limitation in the discussion section.

4. In the sections 2.2 & 2.3: Please provide more explanation why you employ “Dirichlet prior” for the discrete data and “normal-gamma prior” for continuous data.

5. In the figure 3 A: Overall performance of MMM is not comparable to scikitlearn_gmm for most cases and scikitlearn_birch for yeast, statlog, ecoli. It would be good, if you can describe the reasons (or acceptable explanations) for inferior performance. Additionally, there is no results of MMM for ecoli. Do you have any reason for skipping the specific result? If it is, please clearly describe why it cannot the ARI for ecoli.

6. It would be recommended to present a brief result on how the proposed method can accurately predict the true number of clusters.
For instance, given different datasets, you can set x-axis as the true number of clusters and y-axis as the estimated number of clusters through the proposed method.
**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.
If you choose “no”, your identity will remain anonymous but your review may still be made public.
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: Yes: Eugen-Richard Ardelean
Reviewer #2: No
Reviewer #3: No
**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.
|
| Revision 1 |
|
MMM and MMMSynth: Clustering of heterogeneous tabular data, and synthetic data generation
PONE-D-23-34859R1

Dear Dr. Siddharthan,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager at Editorial Manager® and clicking the ‘Update My Information' link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Zeyar Aung
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.
Reviewer #1: All comments have been addressed
Reviewer #3: All comments have been addressed
**********

2. Is the manuscript technically sound, and do the data support the conclusions?
The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #1: Yes
Reviewer #3: Yes
**********

3. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: Yes
Reviewer #3: Yes
**********

4. Have the authors made all data underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.
Reviewer #1: Yes
Reviewer #3: Yes
**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?
PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous.
Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
Reviewer #1: Yes
Reviewer #3: Yes
**********

6. Review Comments to the Author
Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: My comments have been addressed. Nevertheless, I would like to mention the following:

1. If you check the ClustBench documentation, the datasets you have chosen (ecoli, yeast, wine, … - for example in Fig4A) are actually from UCI as well:
“A selection of 8 high-dimensional datasets available through the UCI (University of California, Irvine) Machine Learning Repository [12]. Some of them were considered for benchmark purposes in, amongst others, [30]. They are also listed in the sipu battery. However, their original purpose is for testing classification, not clustering algorithms. Most clustering algorithms find them problematic; due to their being high-dimensional, it is difficult to verify the sensibleness of the reference labels.”

2. The writing style could be improved.

Reviewer #3: In this revision, all raised issues are well addressed and properly discussed. One minor recommendation is that, if you give a tutorial page (or instruction) in the GitHub, it would increase the usability of the proposed method.
**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.
If you choose “no”, your identity will remain anonymous but your review may still be made public.
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: Yes: Eugen-Richard Ardelean
Reviewer #3: No
**********
|
| Formally Accepted |
|
PONE-D-23-34859R1
PLOS ONE

Dear Dr. Siddharthan,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:
* All references, tables, and figures are properly cited
* All relevant supporting information is included in the manuscript submission,
* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. Zeyar Aung
Academic Editor
PLOS ONE
|
Open letter on the publication of peer review reports
PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.
We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.
Learn more at ASAPbio.