Peer Review History
Original Submission: August 31, 2020
PONE-D-20-27304
Using lexical language models to detect borrowings in monolingual wordlists
PLOS ONE

Dear Dr. Miller,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

All three reviewers have some good suggestions that you should take into account. For Reviewer 1 it was a stumbling block that the paper seems to try to model a native speaker's ability to identify loanwords. Given that it doesn't do a good job at that, the verdict was a rejection. But it seems that the paper is really about the extent to which it is possible for a computer to identify loanwords given information about the target language only. If you clearly spell out that focus and downplay the importance of the discussion about what native speakers can and cannot do, then you might avoid some confusion.

Please submit your revised manuscript by Nov 15, 2020, 11:59 PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:
If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,
Søren Wichmann, PhD
Academic Editor
PLOS ONE

Journal Requirements: When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

2. Please note that in order to use the direct billing option the corresponding author must be affiliated with the chosen institute. Please either amend your manuscript or remove this option (via Edit Submission).

3. We note you have included a table to which you do not refer in the text of your manuscript. Please ensure that you refer to Tables 5, 7, 8, 9 and 10 in your text; if accepted, production will need these references to link the reader to the tables.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes.
The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly
Reviewer #2: Yes
Reviewer #3: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No
Reviewer #2: Yes
Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The article is a rather mechanical application of several machine-learning algorithms, two of them severely outdated and none of them state-of-the-art, to what is essentially a non-task.
The motivation for the experiments---that speakers of different languages (cited publications refer to Russian and Korean in particular) are good at identifying borrowings---is way too slim. Firstly, this only applies to very recent borrowings (Russian, just like English, is saturated with older borrowings, which are undetectable by native speakers). Secondly, this does not provide a cross-lingual baseline against which to compare the performance of ML algorithms. Thirdly, as the authors point out themselves, this is not how borrowings are detected in the historical-comparative literature (where the principle of irregular sound correspondences is the only one of real standing; of course, this principle is rather hard to automate because one has to establish the correspondences manually to begin with).

It is hard to grasp what exactly the study is trying to show. It could have been construed as an attempt to model the discriminative abilities of native speakers, in which case the focus should have been on how the neural net discriminates between native and borrowed words (cf. the abundant recent literature on what BERT might know about syntax, etc.). Instead, all the models are treated as black boxes, and the analysis boils down to identifying situations in which they perform better or worse. This might have been of interest had the proposed method been of practical use. Some possible applications are listed in the conclusion ("studies in which borrowed words or sentences need to be identified in large amounts of data"; "[work] on code switching, where multilingual language users switch between different varieties based on sociolinguistic contexts"); however, the claim that the proposed methodology can help there itself needs to be tested against appropriate baselines and competing approaches.

Reviewer #2: The authors introduce a new approach to automatic loanword detection in the field of computational historical linguistics (CHL).
While most of the existing methods aim at identifying loanwords in multilingual wordlists, the attempt of this paper is to identify borrowings using a monolingual approach. The authors use the WOLD database, which is the only database containing loanword information along with information about the donor language and loaned status.

What I really liked about the work is that the three methods build on one another: the Bag of Sounds model using an SVM is the simplest model, integrating only the phonology of the words without considering the order and frequency of the sounds; the Markov Model is a tri-gram model relying on the two previous sounds in the word; the recurrent neural network also relies on the phonotactics of the word, taking all previous sounds into account.

The authors perform two experiments, one on artificially seeded borrowings and one on the “real” WOLD data. Since the promising results on the simulated data could not be obtained by the experiments using the WOLD data, the authors made additional experiments in order to explain the performance of the methods, which was achieved. Although the results and the performance are not satisfying, the three introduced methods along with the lexical language models open a new perspective for further research; especially the recurrent neural network serves as a basis for improvements and further explorations in the field of automatic loanword detection.

Two things I really liked about the manuscript are the detailed explanation of the methods and the representation of the data, which help the reader to get clear insights into the evaluation of the different methods. The statistical evaluations are carried out and explained in detail. The authors made the effort to perform additional analyses to explain the performance of the methods in the two experiments. The statistical evaluations are carried out in a rigorous way, with some explanations of the performance and the results.
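For illustration, the simplest of the three models the reviewer describes, the "bag of sounds" representation, can be sketched as follows. The sound inventory and the example word are hypothetical; this is not the authors' implementation, only a minimal sketch of the idea that each word is reduced to a binary vector over the language's sound inventory, discarding order and frequency.

```python
def bag_of_sounds(word, inventory):
    """Binary presence vector over a fixed sound inventory.

    `word` is a sequence of sounds (e.g. IPA segments); order and
    repetition are deliberately discarded, only presence is recorded.
    """
    sounds = set(word)
    return [1 if s in sounds else 0 for s in inventory]

# hypothetical inventory and word, for illustration only
inventory = ["a", "e", "i", "k", "m", "n", "t"]
vec = bag_of_sounds(["k", "a", "n", "a"], inventory)
# the repeated "a" and the sound order leave no trace in the vector
```

Vectors of this kind would then be fed to a linear classifier such as an SVM, which can only exploit which sounds occur in a word, never their arrangement.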
The process of nativization plays a crucial role in the motivation of the approach introduced in the first chapters; however, it was not revisited in more detail in the conclusion. The detection of loanwords using the proposed methods depends heavily on the data and the adjustment status of the word in the recipient language. The methods might not identify older loanwords or words loaned from related languages, which show no clear differences in the phonology. This issue was not discussed in detail in the conclusion, but could be one of the reasons for the poor performance of the methods. In addition, the automatically derived IPA transcriptions of the words from the WOLD database could lead to noise in the analyses, depending on the correctness of the transcription. However, since no other database is available containing loanword annotations along with the donor language and loanword status, compromises need to be made.

The data is completely available online. Within the data, the additional German wordlist used for the artificially seeded borrowings is not identifiable at first sight. I would encourage the authors to provide the list in a format like CSV and place it where it can be found at first sight in the Python package. Additionally, as a small note, on p. 6 the authors wrote that the German word list and the software package are available on GitHub. However, everything is uploaded on osf.io. For consistency reasons the authors could correct this.

Spelling comment: I encourage the authors to check the formula for the softmax activation on p. 9. To my understanding a comma is missing in the brackets of the formula.

Reviewer #3: Summary: This manuscript introduces three methods to identify borrowed words based on monolingual wordlists. The author evaluates the performances of these methods by setting up various scenarios. Although the methods work well in the case of artificially seeded data, their performances are not all satisfactory.
Nevertheless, the author points out that a high proportion of borrowing and the existence of a dominant donor language in a language are beneficial to the task. Also, a more promising method for borrowing detection is recommended for future study.

Strength: This paper uses a dataset including a large number of languages, so that we can learn the characteristics of languages that are suitable for the examined methods. It is interesting to see that the phonological and phonetic features are applied to build up the lexical models. In terms of the Markov model and neural network model, it is also interesting to see your way of using entropy differences to classify borrowed words. The author provides useful suggestions on future work and extracts meaningful information despite the unsatisfying results of the models.

Weakness: Generally, the structure of the paper is fine, but sometimes I was surprised to see some contents that don’t belong to a section. I also saw some duplicated and redundant information. Besides, it’d be better to clearly indicate the meaning of the notations in the formulas. Below are specific comments referring to specific sections.

--Abstract
You mentioned all necessary information in the abstract. However, it is not always clear and I had to spend some time looking for the information I needed. For instance, what you did and the result of this study are not clearly specified. In my opinion, using phrases like “in this study, we did this…” and “the result shows that…” could help the reader catch the necessary information more easily and quickly before reading the paper.

--Introduction
You mention a lot in the introduction about how well a native speaker can identify borrowings in his/her own language, and it seems that this is related to your motivation for the study. But you didn’t address this later in the rest of the paper. Hence, I'm confused about the actual motivation or the problem you try to address in the paper.
Also, it is a little weird that you include results and some discussion content at the end of the introduction, making this part read like an abstract.

--Materials
It’d be easier to understand the WOLD dataset if you introduced more about the data format and showed some examples. I had to search for WOLD online to understand what the data looks like. Meanwhile, adding some phonetic transcription examples would be more interesting and make it easier to understand what you did. Line 134, citation [36]: there is a typo in the reference for this citation on page 32/39.

--Markov Model
Page 8/39: it might be better to explain the notation. I had to spend some time guessing what it meant. The same on page 9, line 202. Any reference for this statement? Lines 232 to 238: maybe you want to sum up the three methods, but it seems a little redundant as they have been introduced previously. Try to re-organise it and avoid duplicated content.

--Result
In general, you wrote the result for each experiment separately, which is clear and good. However, at the beginning of each section, you introduced the detail of each experiment, and the introduction of the experiments should not be part of the result section. The introduction of the experiments should be somewhere else, as I only expect actual results like numbers, tables, figures and relevant explanations of them in this section. Another option is to write each experiment totally separately, meaning that you write the introduction, result, and discussion of an experiment together. The structure of the whole paper would then be clearer and more logical. Lines 351 to 353: any reason you chose these three characteristics? Besides, have you considered whether they are independent of each other? Page 17/39, Table 4: are all these correlations significant? Maybe you can also include p-values. Page 18/39, Table 5: it is a nice table showing regression model R-squared values.
But it’d be better to involve the information in the table when you discuss the influence of the three factors, instead of just putting it here.

--Discussion
You had to introduce each experiment again in this section, and it is redundant. As I suggested above for the result section, write about the experiments one by one; then you don’t have to constantly describe the experiments again and again. Lines 502-507: maybe you can try the frequency of these unique sounds.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: No
Reviewer #3: Yes: Liqin Zhang

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.
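To make the method under review concrete: the entropy-based detection idea the reviewers discuss (a Markov model trained on the language's own wordlist, with high-entropy words flagged as likely borrowings) can be sketched as follows. The wordlist, smoothing constants, and inventory size are illustrative assumptions, not the authors' actual data or implementation.

```python
import math
from collections import Counter

def train_trigrams(words):
    """Count sound trigrams and their bigram contexts, padding each
    word with boundary symbols ("#" start, "$" end)."""
    tri, bi = Counter(), Counter()
    for w in words:
        padded = ["#", "#"] + list(w) + ["$"]
        for i in range(len(padded) - 2):
            tri[tuple(padded[i:i + 3])] += 1
            bi[tuple(padded[i:i + 2])] += 1
    return tri, bi

def entropy(word, tri, bi, alpha=1.0, vocab=30):
    """Average negative log2 probability per transition, with Laplace
    smoothing (`alpha`, `vocab` are illustrative hyperparameters)."""
    padded = ["#", "#"] + list(word) + ["$"]
    n = len(padded) - 2
    total = 0.0
    for i in range(n):
        t = tuple(padded[i:i + 3])
        p = (tri[t] + alpha) / (bi[t[:2]] + alpha * vocab)
        total += -math.log2(p)
    return total / n

# hypothetical "native" training wordlist
native = ["mata", "kana", "tami", "nika", "mana", "kata"]
tri, bi = train_trigrams(native)

# a word assembled from familiar native trigrams scores lower entropy
# than one whose sounds and transitions were never seen in training
assert entropy("kana", tri, bi) < entropy("xyzzy", tri, bi)
```

A word would then be classified as a borrowing when its entropy under such a model exceeds some threshold; the point the reviews raise is that well-nativized or closely related loans will not stand out by this measure.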
Revision 1
Using lexical language models to detect borrowings in monolingual wordlists
PONE-D-20-27304R1

Dear Dr. Tresoldi,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance.

To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Søren Wichmann, PhD
Academic Editor
PLOS ONE

Additional Editor Comments (optional): The revisions look fine and should be satisfactory for the reviewers, so a second round of reviewing is not necessary. I noted a couple of typos/stylistic issues that you could fix: "depends from the initial contact situation" -> "depends on the initial contact situation", and "(b) the more borrowings go back" -> "(b) when more borrowings go back" [or something like that].
Formally Accepted
PONE-D-20-27304R1
Using lexical language models to detect borrowings in monolingual wordlists

Dear Dr. Tresoldi:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. Søren Wichmann
Academic Editor
PLOS ONE
Open letter on the publication of peer review reports
PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.
We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.
Learn more at ASAPbio.