Peer Review History

Original Submission - November 20, 2019
Decision Letter - Roberta Sinatra, Editor

PONE-D-19-32031

Where is your field going? A Machine Learning approach to study the relative motion of the domains of Physics

PLOS ONE

Dear Dr. Napoletano,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Both Reviewers agree that the manuscript is of high quality, but point out some issues especially regarding null model and presentation of results, which should be addressed in revision. They also offer some suggestions to improve the overall quality of the manuscript. 

We would appreciate receiving your revised manuscript by Apr 04 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as a separate file and labeled 'Response to Reviewers'.
  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as a separate file and labeled 'Revised Manuscript with Track Changes'.
  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as a separate file and labeled 'Manuscript'.

Please note that, while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Roberta Sinatra

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, please address these additional requirements.

Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The paper is well-written and original, and I enjoyed reading it. There are very few minor areas of improvement that I will suggest, but I do not consider them a requirement for publication; the paper is good enough as it is and I only put these forward in the hope of making it stronger. To be clear: if you agree with my suggestions do implement them, but do not consider them binding. The suggestions are not in order of importance.

a) Figure 1 does not print well in black and white. You might consider using colors that print very differently in black and white for adjacent clusters, to increase contrast.

b) Most readers that are familiar with Word2Vec are probably familiar with t-SNE, but Dynamic t-SNE is less known. A few lines in the Methods to explain the difference between t-SNE and Dynamic t-SNE might make the paper more readable. (A detailed explanation is not necessary, since you cited the appropriate papers).

c) Lines 139-140: I am not sure that it is common knowledge that AUC ROC is always 0.5 for random guessing, even in your case (i.e. the presence of unbalanced classes). Maybe mention it to increase readability?

d) Lines 142-143. I would not use the wording "a reasonable high number of positive(s) have been identified", since there is no alternative baseline model to compare the F-score to. I appreciate the novelty of the method, which is a proof of concept, and this means you do not have anything else to compare it to. So I think a baseline other than the random model you already included is not necessary.

e) Lines 174-176. Minor suggestion (since these results are not the main point of the paper). You are open to the objection that, depending on what the distribution of context similarity variation over time is, using units of standard deviation does not prove statistical significance. Since you have the full distribution of variations available and enough computing power, why not use the percentiles of that distribution instead of its standard deviation? This would allow you to compute the p-value and therefore gauge the significance immediately. With this method you could also use a Bonferroni test to numerically prove that the results in Figure 4 (and lines 183-186) are statistically significant. Alternatively, you could say more about the properties of the distribution of context similarity variation, and argue that the standard deviation is a good metric to use in this case.

f) Lines 324-326: You suggest that there is a threshold amount of data below which the algorithm is not reliable. What do you estimate this threshold to be? If you have a good argument, how do you estimate the threshold? These considerations could be very useful in the interest of reproducibility and for people who want to use Word2Vec in new contexts.

g) I recommend trying to publish the code for your analysis open-source, in the interest of reproducibility.

Again, I liked the paper very much. I gave a "minor revision" suggestion so that you get a chance to fix the typos. I am then inclined to accept the manuscript independently of whether you implement any of the suggestions above. I look forward to reading your future works on the topic and I hope my notes were useful to you.

---

English language suggestions for improvement:

a) Lines 131-138 could use some clarification. Especially point 2 (line 136): "We classify the set couples of PACS in two separate classes according to their possible co-occurrence." I had to re-read it a couple of times and use context to understand that the two classes are "couples that exist in the test set" and "couples that do not exist in the test set". Why not write it explicitly?

b) Lines 256-259: maybe add a couple of PACS as an example to let the reader understand what they are. (I appreciate that there are many examples throughout the text, but I went to check this section, to understand what a PACS was, before encountering any example).

c) Line 266: Very minor style note. I would use "published" instead of "realized".

Typos:

d) line 308: 'Stochastic' instead of the typo 'stocastic'.

e) Lines 142-143. "a reasonable high number of positive". It should be "positives", plural, and "reasonably", adverb.

Reviewer #2: The paper introduces a methodology that uses NLP techniques to track the similarity between research topics within physics and its dynamics over time. Moreover, it shows the effectiveness of this methodology in mapping the landscape of research topics, forecasting their combinations, and estimating the impact of milestones. The methodology is scientifically sound, the paper is clear and well written, and the results are interesting and compelling, so I certainly advocate its acceptance. In what follows I list some minor comments that the authors may consider before proceeding with the publication, sorted in order of importance:

- In section 2.2, the authors use the newly introduced context similarity to predict the appearance of new couples of PACS. I understand their method and results, but I think it would be interesting to discount their findings with a null model that takes into account the different growth of fields. Indeed, the random guess to which the authors compare their results considers all PACS as equally sized: given a set of fields, all possible combinations are considered equivalent. In fact, as a result of the well-known inflation of science, the size of research topics labelled with PACS is growing in time and, in general, different fields can grow at different rates. For this reason, I suggest the implementation of a very basic null model that computes the probability of having a new combination of PACS as the result of their relative growth with respect to the rest of the fields.

- Among references [15-19] I suggest adding Gerlach et al. "A network approach to topic models", Science advances, 2018 as an alternative to topic modeling based on NLP

- In Fig. 1, it would be interesting to highlight the starting and ending point of the embedded dynamics of each PACS to know the temporal direction of their evolution. This can be made simply by using differently colored dots to pinpoint start and end of each trajectory.

- At page 7, line 209 the sentence should read "... the fact that the NUMBER OF articles available is..."

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

Revision 1

Dear Editor,

First of all, we would like to thank the referees for their comments, which gave us the opportunity to make our paper more readable and more scientifically robust. We have incorporated practically all suggestions into the revised version of our manuscript, and we believe this has resulted in a much improved work.

Please find our point-by-point response below.

Best regards,

The authors

Reviewer 1:

a) Figure 1 does not print well in black and white. You might consider using colors that print very differently in black and white for adjacent clusters, to increase contrast.

We have tried different combinations of colors to achieve this goal; however, we found that to obtain a good image in black and white we had to use too many light colors, which were difficult to read in the colored version. For this reason, we have decided to keep the original figure. You can find some of our attempts at the end of this letter.

b) Most readers that are familiar with Word2Vec are probably familiar with t-SNE, but Dynamic t-SNE is less known. A few lines in the Methods to explain the difference between t-SNE and Dynamic t-SNE might make the paper more readable. (A detailed explanation is not necessary, since you cited the appropriate papers).

We have added two sentences to clarify this point (lines 83-87 of the revised version).

c) Lines 139-140: I am not sure that it is common knowledge that AUC ROC is always 0.5 for random guessing, even in your case (i.e. the presence of unbalanced classes). Maybe mention it to increase readability?

We have clarified this point further (lines 167-169) and, as explicitly requested by Reviewer 2, added a null model to control our results with respect to PACS of different and time-dependent sizes.
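The point at issue here, that a random guesser's ROC AUC sits at 0.5 regardless of class balance, can be illustrated with a short simulation. This is a sketch on synthetic labels and scores (the 5% positive rate, sample size, and seed are arbitrary placeholders, not the paper's data), using the Mann-Whitney interpretation of the AUC:

```python
import numpy as np

def roc_auc(labels, scores):
    """ROC AUC via its Mann-Whitney interpretation: the probability that
    a randomly drawn positive scores higher than a randomly drawn negative
    (ties count as one half)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

rng = np.random.default_rng(0)
labels = (rng.random(20_000) < 0.05).astype(int)  # heavily unbalanced: ~5% positives
scores = rng.random(20_000)                       # a classifier that guesses at random
auc = roc_auc(labels, scores)                     # close to 0.5 despite the imbalance
```

Because the AUC is a rank statistic over positive-negative pairs, the class imbalance cancels out of the comparison, which is why 0.5 remains the random baseline here.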

d) Lines 142-143. I would not use the wording "a reasonable high number of positive(s) have been identified", since there is no alternative baseline model to compare the F-score to. I appreciate the novelty of the method, which is a proof of concept, and this means you do not have anything else to compare it to. So I think a baseline other than the random model you already included is not necessary.

We agree, and we have removed this sentence.

e) Lines 174-176. Minor suggestion (since these results are not the main point of the paper). You are open to the objection that, depending on what the distribution of context similarity variation over time is, using units of standard deviation does not prove statistical significance. Since you have the full distribution of variations available and enough computing power, why not use the percentiles of that distribution instead of its standard deviation? This would allow you to compute the p-value and therefore gauge the significance immediately. With this method you could also use a Bonferroni test to numerically prove that the results in Figure 4 (and lines 183-186) are statistically significant. Alternatively, you could say more about the properties of the distribution of context similarity variation, and argue that the standard deviation is a good metric to use in this case.

We thank the referee for pointing this out. We have checked that the distribution of context similarity variation does not follow a Gaussian distribution. For this reason, we have changed Figures 4, 5, and 6 to show the percentiles of the distribution, which, as suggested, represent a more appropriate statistical benchmark. We point out, however, that the main scientific conclusions are not affected.
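The percentile approach amounts to computing an empirical p-value directly from the observed distribution of variations. A minimal sketch follows; the Laplace-distributed sample, the observed value, and the number of tests are all hypothetical stand-ins for the paper's actual distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for the empirical distribution of context-similarity variations;
# a heavy-tailed Laplace sample mimics a clearly non-Gaussian shape.
variations = rng.laplace(0.0, 0.05, size=100_000)

def empirical_p_value(dist, observed):
    """One-sided empirical p-value: the fraction of the distribution
    at least as extreme as the observed variation."""
    return float(np.mean(dist >= observed))

observed = 0.4                     # hypothetical variation for one milestone
p = empirical_p_value(variations, observed)
n_tests = 50                       # hypothetical number of simultaneous comparisons
significant = p < 0.05 / n_tests   # Bonferroni-corrected threshold
```

The advantage over standard-deviation units is that no distributional assumption is needed: the p-value is read off the empirical percentiles themselves.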

f) Lines 324-326: You suggest that there is a threshold amount of data below which the algorithm is not reliable. What do you estimate this threshold to be? If you have a good argument, how do you estimate the threshold? These considerations could be very useful in the interest of reproducibility and for people who want to use Word2Vec in new contexts.

Unfortunately, there is no clear-cut recipe to estimate the threshold as a function of the various data features and the final aim of the training. We have added a short paragraph to comment on this point (lines 367-373).

g) I recommend trying to publish the code for your analysis open-source, in the interest of reproducibility.

We fully agree on this point. We have prepared a GitHub repository with our code and a data sample to reproduce our results: https://github.com/Andrea-Napoletano/WyFiG.

Again, I liked the paper very much. I gave a "minor revision" suggestion so that you get a chance to fix the typos. I am then inclined to accept the manuscript independently of whether you implement any of the suggestions above. I look forward to reading your future works on the topic and I hope my notes were useful to you.

Thank you very much for your interest in our work and for your suggestions, which we believe have greatly improved the paper.

---

English language suggestions for improvement:

a) Lines 131-138 could use some clarification. Especially point 2 (line 136): "We classify the set couples of PACS in two separate classes according to their possible co-occurrence." I had to re-read it a couple of times and use context to understand that the two classes are "couples that exist in the test set" and "couples that do not exist in the test set". Why not write it explicitly?

We have clarified this point (see lines 141-144).

b) Lines 256-259: maybe add a couple of PACS as an example to let the reader understand what they are. (I appreciate that there are many examples throughout the text, but I went to check this section, to understand what a PACS was, before encountering any example).

We have added an example (see line 284). We have also provided the list of PACS as supplementary information and corrected a reference in the text, since the website originally referenced is no longer available.

c) Line 266: Very minor style note. I would use "published" instead of "realized".

We have corrected the text accordingly.

Typos:

d) line 308: 'Stochastic' instead of the typo 'stocastic'.

We have corrected the typo.

e) Lines 142-143. "a reasonable high number of positive". It should be "positives", plural, and "reasonably", adverb.

We have removed the sentence.

Reviewer #2:

The paper introduces a methodology that uses NLP techniques to track the similarity between research topics within physics and its dynamics over time. Moreover, it shows the effectiveness of this methodology in mapping the landscape of research topics, forecasting their combinations, and estimating the impact of milestones. The methodology is scientifically sound, the paper is clear and well written, and the results are interesting and compelling, so I certainly advocate its acceptance. In what follows I list some minor comments that the authors may consider before proceeding with the publication, sorted in order of importance:

- In section 2.2, the authors use the newly introduced context similarity to predict the appearance of new couples of PACS. I understand their method and results, but I think it would be interesting to discount their findings with a null model that takes into account the different growth of fields. Indeed, the random guess to which the authors compare their results considers all PACS as equally sized: given a set of fields, all possible combinations are considered equivalent. In fact, as a result of the well-known inflation of science, the size of research topics labelled with PACS is growing in time and, in general, different fields can grow at different rates. For this reason, I suggest the implementation of a very basic null model that computes the probability of having a new combination of PACS as the result of their relative growth with respect to the rest of the fields.

We thank the referee for this suggestion, which substantially improves the robustness of our results. Following their suggestion, we have implemented a null model that takes into account the relative growth of each field. Using the curveball algorithm (Ref. 27), we randomized the article-PACS network without changing the degrees of either the papers or the PACS, thus preserving the time evolution of their respective sizes. We modified Figure 2 accordingly. The performance of this null model is above a random guess (AUC 0.5), but it is still outperformed by the context similarity.
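For context, the degree-preserving trade step at the heart of the curveball randomization can be sketched as follows. This is a minimal illustration assuming articles are represented as sets of PACS codes; the toy articles and codes below are placeholders, not our dataset, and Ref. 27 gives the actual algorithm and its mixing properties:

```python
import random

def curveball(articles, n_steps=1_000, seed=42):
    """Degree-preserving randomization of a bipartite article-PACS network.
    Each trade keeps every article's number of PACS and every PACS code's
    total frequency unchanged."""
    rng = random.Random(seed)
    rows = [set(a) for a in articles]
    for _ in range(n_steps):
        i, j = rng.sample(range(len(rows)), 2)   # pick two distinct articles
        shared = rows[i] & rows[j]               # shared codes never move
        only_i = list(rows[i] - shared)
        only_j = list(rows[j] - shared)
        pool = only_i + only_j
        rng.shuffle(pool)                        # trade the non-shared codes
        rows[i] = shared | set(pool[:len(only_i)])
        rows[j] = shared | set(pool[len(only_i):])
    return rows

# Toy article-to-PACS assignments (hypothetical codes).
articles = [{"03.67", "05.30"}, {"03.67", "74.20"}, {"05.30", "74.20", "89.75"}]
randomized = curveball(articles)
```

Because each trade only redistributes the non-shared codes between two articles, both marginals of the bipartite network are conserved, which is what lets the null model respect the time-dependent sizes of the fields.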

- Among references [15-19] I suggest adding Gerlach et al. "A network approach to topic models", Science advances, 2018 as an alternative to topic modeling based on NLP

We have added the suggested reference and a short sentence commenting on it.

- In Fig. 1, it would be interesting to highlight the starting and ending point of the embedded dynamics of each PACS to know the temporal direction of their evolution. This can be made simply by using differently colored dots to pinpoint start and end of each trajectory.

We agree with this request; however, when trying to highlight the start and end points of each trajectory, we observed that the figure loses its readability because of the high number of trajectories: adding so many start and end points to the plot creates overlaps among trajectories, and this good intent results in a very confusing plot. For this reason, we decided to keep the original figure in the paper, while uploading as supporting material an interactive plot in .html format that addresses this point.

- At page 7, line 209 the sentence should read "... the fact that the NUMBER OF articles available is..."

We have corrected the typo.

Attachments

Submitted filename: Response to reviewersFinal.docx
Decision Letter - Roberta Sinatra, Editor

Where is your field going? A Machine Learning approach to study the relative motion of the domains of Physics

PONE-D-19-32031R1

Dear Dr. Napoletano,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.

Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.

Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

With kind regards,

Roberta Sinatra

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

The authors have thoroughly and successfully addressed all the minor comments raised by the Reviewers. I recommend the paper for publication.

Reviewers' comments:

Formally Accepted
Acceptance Letter - Roberta Sinatra, Editor

PONE-D-19-32031R1

Where is your field going? A Machine Learning approach to study the relative motion of the domains of Physics

Dear Dr. Napoletano:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

For any other questions or concerns, please email plosone@plos.org.

Thank you for submitting your work to PLOS ONE.

With kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Prof. Roberta Sinatra

Academic Editor

PLOS ONE

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.