Peer Review History

Original Submission: September 16, 2020
Decision Letter - Nicolas Fiorini, Editor

PONE-D-20-29224

Biomedical named entity recognition in low resource settings

PLOS ONE

Dear Dr. Gao,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

In particular, the reviewers highlighted a bias in the experimental setup, which does not properly simulate low-resource settings. There are a few other important remarks, especially with regard to code availability, as PLOS ONE dedicates great effort to data availability. Finally, the manuscript would benefit from a few grammar/typo corrections.

Please submit your revised manuscript by Dec 05 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Nicolas Fiorini

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for providing links to the data used in the experiment in your Data Availability Statement. We ask that you also include these links within your Methods section.

3. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

4. Please include your tables as part of your main manuscript and remove the individual files. Please note that supplementary tables should be uploaded as separate "supporting information" files.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This paper investigated the cumulative effect of pre-training and semi-supervised self-training on multiple datasets. The pre-training was executed on three datasets: SemMed 100k, SemMed 1M, and MedMentions. The authors showed that the self-training method with pre-trained parameters can boost performance. Although the performance improvements from pre-training and semi-supervised self-training were previously known for the NER task, this paper contributes an analysis of their cumulative effect in low-resource settings. However, this paper constructed a low-resource setting by reducing the number of training samples drawn from a large corpus. Thus, it is important to evaluate the proposed approach on a real application with a small number of NER annotations.

Major:

1. The proposed approach was tested on applications with large NER datasets, such as disease names, chemicals, and genes. Because the UMLS semantic types used for pre-training include entities such as chemicals and phenomena or processes, the pre-training datasets already included many similar entities. The authors need to present the domain similarity between the tested datasets and the pre-training datasets to show that the experimental settings are truly low-resource.

2. The authors mentioned that CRISPR and Covid-19 are examples of application areas with low resources. Thus, it is important to test the proposed approach on a real low-resource dataset, where the corpus is small and similar pre-training datasets are not available.

3. What was the criterion for setting the confidence threshold to 99.75%? When set to a lower threshold, the model is likely to propagate errors rather than knowledge. How can we find the proper threshold?

Minor:

- Please check for grammar errors. For example: "We note the this", "NER ragging", "naturaly language processing", and so on.

Reviewer #2: I would like to thank the authors for this well-written and easy-to-read article. The article proposes that transfer learning and self-training can be beneficial for biomedical NER tasks with small training sets. Through various ablation studies they show the benefit of transfer learning and self-training individually, and also delve into some of the unexpected results in the Discussion. I really like the paper, but unfortunately don’t feel that the paper’s conclusions on “low-resource settings” are warranted. While I accept that pretraining and self-training improve performance, there need to be better comparisons to show it in context with other methods. I describe my concerns below.

Major

- I don’t accept that the datasets used for testing and the methods for downsampling them actually constitute low-resource settings. The entity data used for pretraining is all based on UMLS terms, which are a superset of all the downstream datasets evaluated. A truly low-resource setting would be one with only a small number of relevant annotations for the task in hand. These can often be found in languages other than English or in more obscure biomedical NER tasks, compared to the comparatively mainstream tasks examined. Further to that, the concept of transfer learning involves training on one different task first and then training on the target task. The prior task in this work is predicting UMLS terms, which is a superset of the different biomedical entities in the eventual target tasks. The results in Table 3 show this, as the classifiers can predict many of the entities (even with poor F1) with no fine-tuning on the task. So again, this transfer learning primarily works because the prior task is so similar to the final target task. This is an interesting result, but it should be framed appropriately.

- SemRep (MetaMap) and scispaCy aren’t great baselines. Both of them seem to predict UMLS terms, so they haven’t been trained specifically for the tasks. For fuller context, we really need better comparisons. You either need to show how other good NER tools behave with such small training sets, or use your methods for one of the smaller tasks with the complete dataset + pretraining + self-training and compare it to other tools.

- The statement about code availability is not reasonable, and code should be made available during peer review.

- Setting aside the statements about low-resource settings, the main conclusion seems to be that transfer learning from UMLS and self-training often give a boost for biomedical NER where the entities are a subset of UMLS and in short supply. This is a good contribution and is well written, but the scope and conclusions need to be reframed accordingly.

Minor

- The mentions of CRISPR & COVID as potential low-resource settings seem click-baity. They also don’t relate to any of the datasets used and should be removed, or else explained and better supported.

- The paper makes reference to the SemRep tool for entity extraction. I believe they really mean MetaMap, which is the underlying entity extraction tool. SemRep extracts relations between these entities.

- For the results presented in Table 5, we really need to know what fraction of the complete dataset is being used for fine-tuning. This would enable clearer comparison with the Fully Supervised set. Are 2000 samples most of each dataset, or a small fraction?

- In Figure 2, “active learning iteration” is a confusing axis label. Perhaps “Iteration of self-training”?

- The paper is very well-written and easy to follow. There are a number of tiny mistakes listed below:

- “we generate psuedo-labels” -> “we generate pseudo-labels”

- “a wide range of naturaly language processing” -> “a wide range of natural language processing”

- “showed that pretraining an BiLSTM-CRF” -> “showed that pretraining a BiLSTM-CRF”

- “labels for NER ragging” -> “labels for NER tagging”

- “SciScpacy” -> “scispaCy”

- “both transer and semi-supervised” -> “both transfer and semi-supervised”

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Revision 1

Please see reviewer response file.

Attachments
Attachment
Submitted filename: reviewer_response.docx
Decision Letter - Nicolas Fiorini, Editor

PONE-D-20-29224R1

A pre-training and self-training approach for biomedical named entity recognition

PLOS ONE

Dear Dr. Gao,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

==============================

First of all, thank you for having revised your manuscript in such depth. This has been appreciated by the reviewers who now believe the manuscript is a lot more suitable for publication. There are a few remaining minor remarks that would be nice to be addressed though. Thanks again for your work and contribution to PLOS ONE.

==============================

Please submit your revised manuscript by Jan 28 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Nicolas Fiorini

Academic Editor

PLOS ONE


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #2: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: In the first review, I had four comments. In this revision, the authors addressed those comments.

Reviewer #2: I would like to thank the authors for their substantial revisions and responses to our comments. I feel the paper has been dramatically improved and that the problem reframing seems a lot more suitable. I have a few remaining issues and a few minor things that I hope can be addressed. Overall, I think this is excellent work and has made me think a lot about self-training and methods of transfer learning.

- At the moment, the first result (Table 4) reads like an odd experiment. The hypothesis seems to be that pretraining alone might provide okay performance for problems with no training samples. Could you move that justification (which seems to be in the final sentence of the paragraph discussing Table 4) further up, so it's clear why you're showing those results? At first read, I couldn't remember why you were showing results for a model without any fine-tuning, which felt odd, and I had to stop and read around a bit.

- For the MetaMap/scispaCy baselines, do you filter for entities of the relevant type for each dataset? They are presumably predicting all UMLS terms, and I believe you should normalize them to UMLS entities, which could then be filtered by type. As an example, do you only include MetaMap predictions of UMLS terms that are of type "gene/protein" in the BC2GM dataset?

- I wonder why you've put the TAC SRIE result in the Discussion section. It adds nice weight to your argument about NER outside of UMLS. It seems like it would be a nice final result, but really that's up to you.

And here are a few small tweaks.

- In the related work, I'm not sure I would describe a BiLSTM-CRF as simple. I get that it is comparatively simple next to a language model like BERT, but there are much simpler NER frameworks, so that probably needs a rephrase.

- Page 15, 18, 21 & Table 5: "Medmentions" -> "MedMentions"

- Page 22, 23, Table 8 caption & Table 10: "Semmed" -> "SemMed"

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Revision 2

Please see attached cover letter and reviewer response document.

Attachments
Attachment
Submitted filename: Plos_NER_rev2.docx
Decision Letter - Nicolas Fiorini, Editor

A pre-training and self-training approach for biomedical named entity recognition

PONE-D-20-29224R2

Dear Dr. Gao,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Nicolas Fiorini

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Thank you for substantially improving the manuscript and thoroughly responding to reviewers throughout the process.

Formally Accepted
Acceptance Letter - Nicolas Fiorini, Editor

PONE-D-20-29224R2

A pre-training and self-training approach for biomedical named entity recognition

Dear Dr. Gao:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Nicolas Fiorini

Academic Editor

PLOS ONE

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.